
Meeting Context Provided by Dr. Shiffrin

There have been exciting and rapidly evolving developments in computational mining of electronic text databases. Such databases have an explicit knowledge structure (e.g., the meaning of the sentences and the intended communication) but also contain an enormous amount of implicit knowledge, such as the meaning of words, the rules of syntax and grammar, the underlying topics of discourse, the relational structure of the topics and contents, and much more. Computational algorithms are being developed to extract various parts of this implicit structure, for various goals and purposes. One purpose is the mapping of science, or of other domains of knowledge. The extracted maps usually exist in a high dimensional space, and much work is needed to display the results in a useful way. One approach involves interactively focusing such maps on lower dimensional representations and local regions of interest. This is especially useful when a database spans a period of time and the evolution of the underlying structure is of critical interest (e.g., which science topics are growing and which are declining, and how their interconnectivity changes).

The technical term for such algorithms is 'unsupervised learning', and there are two main branches to consider: what might be called descriptive and what might be called model based. Descriptive models make few assumptions about the processes that generate the database, and search for a reasonably low dimensional (e.g., 300 dimensional) representation of the extremely high dimensional database, one that nonetheless captures as much as possible of the underlying structure. An example is latent semantic analysis (LSA, as developed by Tom Landauer, Sue Dumais, and their colleagues), a form of factor analysis that uses singular value decomposition to produce the lower dimensional representation. It has been used, for example, to produce a 300 dimensional space in which words in the database with similar meanings are close together. In such an approach the dimensions of the resultant space are not usually interpretable, so the space can be thought of as a simplified description of the structure implicit in the database. Nonetheless, the approach has proved useful in many domains, for many purposes.
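The idea behind the singular value decomposition step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the five words, the tiny word-by-document count matrix, the helper names lsa_word_vectors and cosine, and the choice of 2 dimensions rather than 300. Practical LSA systems commonly also reweight the counts before the decomposition, which this sketch omits.

```python
import numpy as np


def lsa_word_vectors(counts, k):
    """Return k-dimensional word vectors from a word-by-document count matrix."""
    # Thin SVD: counts = u @ diag(s) @ vt, singular values in descending order.
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    # Keep the k largest singular directions and scale each by its singular value.
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy data: rows are words, columns are documents.
words = ["neuron", "synapse", "memory", "market", "price"]
counts = np.array([
    [2, 1, 0, 0],   # neuron
    [1, 2, 0, 0],   # synapse
    [1, 1, 0, 0],   # memory
    [0, 0, 2, 1],   # market
    [0, 0, 1, 2],   # price
], dtype=float)

vectors = lsa_word_vectors(counts, k=2)

# Words that occur in the same documents end up close together in the reduced space.
sim_related = cosine(vectors[0], vectors[1])    # neuron vs. synapse
sim_unrelated = cosine(vectors[0], vectors[3])  # neuron vs. market
print(f"neuron/synapse: {sim_related:.3f}  neuron/market: {sim_unrelated:.3f}")
```

In this toy case the two retained dimensions happen to line up with the two groups of documents. With a real database and hundreds of dimensions, the individual axes generally have no such reading, which is the sense in which the dimensions are not usually interpretable.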

The model based approaches are sometimes called 'generative': they employ a model to predict how the words in the database are generated, and estimate the (perhaps hundreds of thousands of) parameters of the model so as to maximize the probability of the observed database. Examples are the 'topic models' developed by Mark Steyvers, Tom Griffiths, and their colleagues. In the simplest version, each sub-part of the database is assumed to be generated by a probabilistic choice of topics (from, say, 300 topics), and each chosen topic is assumed to generate the words for that sub-part by a choice from that topic's probability distribution over the words in the vocabulary. The huge number of parameters is estimated with methods such as Markov chain Monte Carlo and expectation maximization. In this approach, the resultant topics fit together semantically and make psychological sense to observers.
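The generative assumption in the simplest version can be sketched as follows. This is a hedged illustration only: the six-word vocabulary, the two topics, their probabilities, and the function name generate_document are all hypothetical. It shows only the forward direction, from topics to words. Fitting a topic model runs the other way, estimating the topic and word distributions from the observed text, and is not shown here.

```python
import numpy as np


def generate_document(rng, topic_weights, topic_word_probs, vocab, n_words):
    """Sample a document: for each word, pick a topic, then pick a word from that topic."""
    words = []
    for _ in range(n_words):
        # Probabilistic choice of topic for this word position.
        topic = rng.choice(len(topic_weights), p=topic_weights)
        # Choice of word from that topic's distribution over the vocabulary.
        word_index = rng.choice(len(vocab), p=topic_word_probs[topic])
        words.append(vocab[word_index])
    return words


# Hypothetical vocabulary and two hypothetical topics (each row sums to 1).
vocab = ["neuron", "synapse", "memory", "market", "price", "trade"]
topic_word_probs = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "brain" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # an "economy" topic
])

rng = np.random.default_rng(0)

# A sub-part of the database that is mostly about the first topic.
mixed_doc = generate_document(rng, np.array([0.9, 0.1]), topic_word_probs, vocab, 20)

# A sub-part generated entirely from the first topic.
pure_doc = generate_document(rng, np.array([1.0, 0.0]), topic_word_probs, vocab, 20)

print(" ".join(mixed_doc))
print(" ".join(pure_doc))
```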

These and other similar techniques have enormous potential for uncovering implicit structure (generally speaking, for many purposes), and for mapping science in particular.


© 2009 American Psychological Association