| Karin M. Verspoor |
I am a computational linguist, which means that I work on software that tries to understand text at some level. My interests are primarily in the interaction of linguistic processing with world knowledge (usually represented in some sort of hierarchical, ontological structure), and the implications of this for the representation of linguistic knowledge, specifically at the lexical level.
I work at the Los Alamos National Laboratory, in the Knowledge and Information Systems Science team in the CCS-3 group of the Computer, Computational and Statistical Sciences division.
| Publications (follow link) |
| My current projects |
The project encompasses a range of research that generally addresses information extraction from text in various ways. The codebase that I am developing in this context addresses several application areas, in addition to exploring some more fundamental issues in natural language processing and semantics.
- Word Sense Disambiguation using Semantics-enhanced Language Models (with Shou-de Lin)
We have developed a language model-based approach to word sense disambiguation, in which the standard word-oriented n-gram language model is extended to incorporate semantic features of the words (specifically, their possible WordNet senses). In combination with a learning algorithm such as Expectation Maximization, it is possible to learn such a "semantics-enhanced" language model from an untagged corpus. We have shown that this approach rivals state-of-the-art performance on (unsupervised) word sense disambiguation (publication pending).
- Keyterm extraction from text
I am working on techniques for extracting keyterms from texts. At the most basic level, this involves using standard statistical techniques such as Term Frequency Inverse Document Frequency (TF*IDF) to identify key words in a document. This has been applied for our customers to support quick assessment of the relevance of a document to a user, and as a foundational step in doing knowledge engineering and terminology management for a domain of interest.
My research in this area explores semantic grounding of phrases. With Chris Weaver of New Mexico State University, I am exploring the use of latent semantic analysis (LSA) as a mechanism for establishing semantic contexts for multi-word expressions. These contexts characterize the meaning of phrases in terms of their similarity (in LSA terms) to individual words -- so we essentially define the phrase in terms of a collection of related words. This is work in progress.
- Development of standards for interoperability of text analysis systems
Through efforts to develop solutions for customers that exploit research in text processing being done across several national laboratories, I have led efforts to adopt common architectural standards, to support the exchange and reuse of text analysis modules. We adopted IBM's Unstructured Information Management Architecture (UIMA), and I developed a common annotation type system that is now deployed and in use across the participants in our customers' projects. I have been migrating my entire codebase to use the UIMA framework. Due to my involvement in this effort, I was asked to serve on the OASIS-Open Technical Committee on Unstructured Information Management, which is working to establish interoperability standards for text analysis systems.
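As an illustration of the TF*IDF keyterm scoring mentioned under keyterm extraction above, here is a minimal sketch in Python. It is not the code used in the project: the function names (`tokenize`, `tfidf_keyterms`), the simple lowercase tokenizer, and the particular weighting (relative term frequency times log(N/df), one common TF*IDF variant) are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keyterms(documents, top_k=3):
    """Rank the terms of each document by a simple TF*IDF score.

    Assumed weighting: (count / document length) * log(N / df),
    where N is the number of documents and df is the number of
    documents containing the term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for stability.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results


# Toy corpus, invented for this example.
docs = [
    "the protein binds the receptor",
    "the gene encodes the protein",
    "the ontology describes the gene function",
]
keyterms = tfidf_keyterms(docs, top_k=3)
```

A term that occurs in every document (here "the") gets an inverse document frequency of log(1) = 0 and drops to the bottom of the ranking, which is why this kind of score surfaces words that are distinctive for a particular document.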
This project is described in more detail on Mike Wall's website, or on the LANL-internal project page.
My research here started with BioCreative work (described below) to try to extract protein function from biological journal publications, but evolved into developing techniques for exploiting the structure of the Gene Ontology to support automated function prediction. In the process, I have become quite familiar with knowledge resources for bioinformatics, including the Unified Medical Language System (UMLS), the National Cancer Institute Thesaurus (NCIT), the annotations available in UniProt, EBI's GOA, and ontology efforts like the Gene Ontology and others available in the Open Biological Ontologies (OBO) effort. We have been exploring the use of such resources to support semantic characterization and exploration of biological data sets, in combination with mathematical techniques such as formal concept analysis.
- Gessler, DDG; Joslyn, CA; Verspoor, KM; and Schmidt, SE: (2006) Deconstruction, Reconstruction, and Ontogenesis for Large, Monolithic, Legacy Ontologies in Semantic Web Service Applications, Los Alamos Technical Report 06-5859.
- CA Joslyn; DDG Gessler, SE Schmidt, and KM Verspoor (2006). Distributed Representations of Bio-Ontologies for Semantic Web Services, in: Joint BioLINK 9th Bio-Ontologies Meeting. (JBB 06) (LAUR 06-5490)
- Maguitman, A., Rechtsteiner, A., Verspoor, K., Strauss, C.E., Rocha, L. (2006) Large-Scale Testing Of Bibliome Informatics Using Pfam Protein Families. Pacific Symposium of Biocomputing 11:76-87 (LAUR 05-6756)
- Verspoor, Karin, JD Cohn, SM Mniszewski, and CA Joslyn (2006). A Categorization Approach to Automated Ontological Function Annotation, Protein Science, v. 15, pp. 1544-1549. (LAUR 05-8673)
- Verspoor, Karin, JD Cohn, SM Mniszewski, and CA Joslyn. (2005) POSOLE: Automated Ontological Annotation for Function Prediction, in: Proc. Automated Function Prediction SIG, ISMB 05 (LAUR 05-4778)
- Verspoor, Karin, JD Cohn, SM Mniszewski, and CA Joslyn (2004). Nearest Neighbor Categorization for Function Prediction. In Proceedings 5th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP5) (LAUR 04-7477)
- KM Verspoor, CA Joslyn, JA Ambrosiano, A Backer, O Bodenreider, L Hirschman, P Karp, H Kelly, S Loranger, M Musen, R Sriram, C Wroe (2005). Knowledge Integration for Biothreat Response, LAUR 05-0907.
This project also sponsored the following additional work on biological language processing:
- Verspoor, Karin (2005). Towards a Semantic Lexicon for Biological Language Processing. Comparative and Functional Genomics, vol. 6, issue 1-2, p. 61-66. DOI: 10.1002/cfg.451 (originally in Proceedings of the BioLINK workshop at ISMB'04, July 2004; local preprint)
We submitted runs for Task 2 of BioCreative, the Critical Assessment of Information Extraction systems in Biology. The goal of Task 2 is to automatically annotate a given protein with a node in the Gene Ontology (GO) on the basis of the information contained in a document, and to select evidence text for that annotation from the document. Our solution is built around pseudo-distance based categorization into the GO developed by Cliff Joslyn and Sue Mniszewski (see Joslyn, Cliff, Susan Mniszewski, Andy Fulmer, and Gary Heaton (2003), "Measures on Ontological Spaces of Biological Function", poster at the 2003 Pacific Symposium on Biocomputing (PSB 03)). It draws heavily on work by Tiago Simas and Luis Rocha on proximity measurements (see Rocha [2002], "Semi-metric Behavior in Document Networks and its Application to Recommendation Systems", In: Soft Computing Agents: A New Perspective for Dynamic Information Systems. V. Loia (Ed.) International Series Frontiers in Artificial Intelligence and Applications. IOS Press, pp. 137-163), and it incorporates NLP components such as a morphological normalizer, a statistical term frequency analyzer, and a named entity recognizer.
- Verspoor, Karin, Judith Cohn, Cliff Joslyn, Sue Mniszewski, Andreas Rechtsteiner, Luis M. Rocha, Tiago Simas. Protein Annotation as Term Categorization in the Gene Ontology using Word Proximity Networks. BMC Bioinformatics, vol 6, supplement 1. (Los Alamos Unclassified Report 04-3934).
- Verspoor, Karin, Judith Cohn, Cliff Joslyn, Sue Mniszewski, Andreas Rechtsteiner, Luis M. Rocha, Tiago Simas (2004). Protein Annotation as Term Categorization in the Gene Ontology. In the Proceedings of the BioCreative Workshop, March 28-30, 2004. Granada, Spain. (Los Alamos Unclassified Report 04-1460).
- Verspoor, Karin (2003). "Text Mining for Bioinformatics". In Proceedings of the first Rocky Mountain Regional Bioinformatics meeting, Aspen, CO, December 5-7, 2003. (Los Alamos Unclassified Report LAUR 03-8898). Abstract Poster
Natural language processing of biological publications and exploitation of knowledge systems services to discover and represent bionetwork pathways. We worked with Procter & Gamble to extract certain protein/protein and protein/gene interactions from the biological literature using NLP (information extraction) techniques.
| Previous Work |
Before coming to the Lab, I worked for a start-up company called Applied Semantics in Los Angeles (formerly Oingo Corp., and now owned by Google, Inc.). There I worked on word sense disambiguation, including methods for statistical inference of relationships between words and more linguistic approaches to narrow down word meanings in context, and to identify the relationships the words are involved in. Applied Semantics built, and actively made use of, a huge ontology representing words, their meanings, and the relations between them. I worked on text summarization and categorization using the ontology as an important knowledge source.
Previously I was Director of Natural Language Engineering at Webmind, Inc., formerly called Intelligenesis Corporation, which I joined in June 1998. I worked on language understanding and production for WebMind(tm), an artificial intelligence written in Java designed to dynamically analyze and interpret networked data. We were working towards a system capable of deep semantic understanding of texts and intelligent, conversational query answering by building on the general intelligence dynamics of the Webmind architecture (such as probabilistic reasoning, category formation, short-term memory modeling, etc.). The natural language processing done within the module includes techniques derived from information retrieval algorithms, statistical parsing, unification-based analysis, and information extraction, and makes use of resources like WordNet. Webmind no longer exists, but you can get a sense of the ideas behind the system we were building at Ben Goertzel's site.
My first job after my PhD was as a research fellow at the Microsoft Research Institute, located at Macquarie University. I joined the Language Technology Group there in September 1997 as a member of the Dynamic Document Delivery (DDD) and COnnecting Reasoning, Action and Language (CORAL) projects.
My research has centered on issues at the interface of syntax, semantics, and pragmatics, and the use of formal representations of word and world knowledge to facilitate both computational natural language understanding and natural language generation. This research background is directly relevant to the kind of robust, broad-coverage, semantically deep text processing I am currently working on.
In December 1997, I was awarded a Macquarie University Research Grant to pursue multilingual text generation research, in which I investigated extending the use of a phrasal lexicon to text generation in different languages. The results of that work can be seen in the Multilingual Peba system and the Power system.
My PhD research at the Centre for Cognitive Science at the University of Edinburgh focused on lexical semantics and modeling of verbal sense extensions using a constraint-based representational formalism compatible with Head-Driven Phrase Structure Grammar.
| Important Links |
A description of some of my industry research work (December 2001)
My PhD thesis (PDF)
More information about my theses
Useful Links
| Karin Verspoor. Last Updated: February 8, 2007 |