Meeting Context Provided by Dr. Shiffrin
There have been exciting and rapidly evolving developments in
computational mining of electronic text databases. Such databases have an
explicit knowledge structure-- e.g. the meaning of the sentences and the
intended communication-- but also contain an enormous amount of implicit
knowledge, such as the meaning of words, the rules of syntax and grammar, the
underlying topics of discourse, the relational structure of the topics and
contents, and much more. Computational algorithms are being developed to extract
various parts of this implicit structure, for various goals and purposes. One
purpose is the mapping of science, or other domains of knowledge. The extracted
maps usually exist in a high dimensional space, and much work is needed to
display the results in a useful way. One approach involves interactive focusing
of such maps on lower dimensional representations and local regions of interest.
This is especially useful when databases exist longitudinally and the evolution
of the underlying structure is of critical interest (e.g. which science topics
are growing, which are declining, and how their interconnectivity changes).
The technical term for such algorithms is 'unsupervised
learning' and there are two main branches to consider-- what might be called
descriptive and what might be called model based. Descriptive models make few
assumptions about the processes that generate the database, and search for a
reasonably low-dimensional (e.g. 300-dimensional) representation of the extremely
high-dimensional database, a representation that nonetheless captures as much as
possible of the underlying structure. An example is latent semantic analysis (LSA,
as developed by Tom Landauer, Sue Dumais, and their colleagues), a form of
factor analysis that uses singular value decomposition to produce the lower
dimensional representation. It has been used, for example, to produce a
300-dimensional space in which the words in the database with similar meaning are
close together. In such an approach the dimensions of the resultant space are
not usually interpretable, so the space can be thought of as a simplified
description of the structure implicit in the database. Nonetheless, the approach
has proved useful in many domains, for many purposes.
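The idea behind this kind of dimension reduction can be illustrated with a
minimal sketch. Everything in it is an assumption made for illustration: the
toy corpus, the choice of 2 dimensions rather than 300, the raw counts (no
term weighting), and the helper names. To keep it dependency-free, the top
singular vectors are found by power iteration on the word-by-word Gram matrix
instead of a library SVD routine; the retained subspace is the same one a
truncated SVD would give.

```python
import math
import random

# Hypothetical toy corpus. "gene" and "protein" never share a document,
# but both co-occur with "cell".
docs = [
    "gene cell", "protein cell", "gene cell",
    "star galaxy", "planet galaxy", "star galaxy", "star galaxy",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-by-document count matrix A (rows: words, columns: documents).
A = [[d.split().count(w) for d in docs] for w in vocab]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpairs(G, k, iters=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration plus deflation."""
    rng = random.Random(seed)
    G = [row[:] for row in G]  # work on a copy
    n = len(G)
    pairs = []
    for _ in range(k):
        v = normalize([rng.random() + 0.1 for _ in range(n)])
        for _ in range(iters):
            v = normalize(matvec(G, v))
        lam = sum(a * b for a, b in zip(v, matvec(G, v)))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

# Left singular vectors of A are eigenvectors of A A^T;
# singular values are the square roots of its eigenvalues.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]
k = 2  # number of retained dimensions (300 in the text, 2 for this toy case)
pairs = top_eigenpairs(gram, k)

# Each word gets k coordinates: singular value times its singular-vector entry.
word_vectors = {
    w: [math.sqrt(lam) * v[i] for lam, v in pairs] for i, w in enumerate(vocab)
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(word_vectors["gene"], word_vectors["protein"]))
print(cosine(word_vectors["gene"], word_vectors["star"]))
```

In this toy case the first similarity should come out close to 1 and the
second close to 0: "gene" and "protein" land near each other in the reduced
space even though they never appear in the same document, because both
co-occur with "cell".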
The model based approaches are sometimes called 'generative'--
they employ a model to predict how the words in the database are generated, and
estimate the (perhaps hundreds of thousands of) parameters of the model so as to
maximize the probability of the observed database. Examples are the 'topics
models' developed by Mark Steyvers, Tom Griffiths and their colleagues. In the
simplest version, each sub-part of the database is supposed to be generated with
a probabilistic choice of topics (from say 300 topics), and each such chosen
topic is supposed to generate the words for that sub-part by a choice from the
probability distribution of words for that topic across the vocabulary. The huge
number of parameters is estimated with Markov Chain Monte Carlo and Expectation
Maximization. In this approach, the resultant topics fit together
semantically and make psychological sense to observers.
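The generative story in its simplest version can be sketched in a few lines.
This is only an illustration under stated assumptions: the two topics, their
word probabilities, and the function names are hypothetical, and the
parameters are fixed by hand rather than estimated. The second function
computes the quantity that estimation would try to maximize, namely the log
probability of the observed words.

```python
import math
import random

# Hypothetical toy parameters: each topic is a probability distribution over words.
TOPICS = [
    {"gene": 0.5, "protein": 0.3, "cell": 0.2},    # a "biology" topic
    {"star": 0.5, "galaxy": 0.3, "planet": 0.2},   # an "astronomy" topic
]

def generate_document(topic_weights, n_words, rng):
    """Generate each word by choosing a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(TOPICS)), weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        probs = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

def log_likelihood(words, topic_weights):
    """Log probability of the words, summing over each word's unobserved topic."""
    total = 0.0
    for w in words:
        p = sum(tw * topic.get(w, 0.0) for tw, topic in zip(topic_weights, TOPICS))
        total += math.log(p)
    return total

rng = random.Random(0)
doc = generate_document([0.9, 0.1], 8, rng)  # a sub-part that is mostly "biology"
print(doc)
print(log_likelihood(doc, [0.9, 0.1]))
print(log_likelihood(doc, [0.1, 0.9]))
```

Estimation runs this story in reverse: given only the words, it searches for
the topic weights and per-topic word distributions under which the observed
database is most probable.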
These and other similar techniques have enormous potential for
uncovering implicit structure (generally speaking, for many purposes), and for
mapping science in particular.