Named Entity Recognition
Student Mentors:
Kevin Small (ksmall at uiuc dot edu)
Jing Jiang (jiang4 at uiuc dot edu)
DSSI Students:
Introduction
Named entity recognition (NER), also known as entity identification or entity extraction, is the task of locating and classifying short spans of text that name predefined types of objects, such as people, locations, and organizations. NER is a subtask of information extraction (IE) and has been studied extensively in the NLP community. It is an important problem because other applications (e.g., relation extraction, Web search, question answering) would benefit from an accurate NER system. However, despite the importance of NER and the significant existing research devoted to it, to the best of our knowledge there is no widely used NER tool that both achieves state-of-the-art performance and provides a flexible user interface. This project aims to build an NER tool based on classification techniques, with a powerful user interface that facilitates algorithm design, experimentation, evaluation, and annotation. The tool should allow a user to rapidly adapt the basic system to various domains and achieve state-of-the-art results.
Architecture
The architecture comprises two fundamental components: a graphical user interface (GUI) and a named entity recognition (NER) system based on machine learning techniques.
GUI
The GUI will be implemented using the Java Swing libraries. There are several subcomponents in the GUI component:
- Learning Architecture: to handle entity tagging using
pre-trained models, learning from new labeled data, etc.
- Data Annotation: to provide the annotation interface.
- Feature Management: to enable different feature configurations.
- Evaluation: to enable evaluation of the system.
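As an illustration of what the Evaluation subcomponent might compute and display, here is a minimal sketch of entity-level precision, recall, and F1 over exact span matches, the style of scoring used in the CoNLL-2003 shared task from the reading list. It assumes predictions and gold labels are encoded as BIO tags (B-PER, I-PER, O, ...); the class and method names are hypothetical and not part of the planned system.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: class and method names are illustrative assumptions.
class NerEvaluationSketch {

    /** Converts a BIO tag sequence into span keys "start:end:type" (end exclusive). */
    static Set<String> spans(List<String> tags) {
        Set<String> result = new HashSet<>();
        int start = -1;
        String type = null; // type of the entity currently open, or null
        // Iterate one position past the end so a trailing entity is closed.
        for (int i = 0; i <= tags.size(); i++) {
            String tag = i < tags.size() ? tags.get(i) : "O";
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    result.add(start + ":" + i + ":" + type);
                    type = null;
                }
                // A B- tag (or a stray I- tag) opens a new entity.
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    start = i;
                    type = tag.substring(2);
                }
            }
        }
        return result;
    }

    /** Returns {precision, recall, F1}; a span counts only if boundaries and type both match. */
    static double[] score(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predictedSpans = spans(predicted);
        Set<String> correct = new HashSet<>(goldSpans);
        correct.retainAll(predictedSpans);
        double p = predictedSpans.isEmpty() ? 0.0 : (double) correct.size() / predictedSpans.size();
        double r = goldSpans.isEmpty() ? 0.0 : (double) correct.size() / goldSpans.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("B-PER", "I-PER", "O", "B-LOC", "O");
        List<String> predicted = Arrays.asList("B-PER", "I-PER", "O", "B-ORG", "O");
        double[] s = score(gold, predicted);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

In this example the person span is matched but the location is mislabeled as an organization, so one of two predicted spans and one of two gold spans are correct, giving precision and recall of 0.5 each.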
NER
The NER component will be implemented in Java and will likely use LBJ as a starting point for the learning system. It includes the following subcomponents:
- Data Management: to use the Java XML libraries to implement a flexible document management system.
- Feature Extraction: to transform the textual data into
feature vector representation for the machine learning system.
- Machine Learning System: to perform learning and inference.
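To make the first two subcomponents concrete, the sketch below reads tokens from an XML document with the standard Java XML libraries and turns one token into a sparse feature representation. Everything specific here is an assumption for illustration: the <sentence>/<token> schema, the class and method names, and the particular features (word identity, capitalization, affixes, neighboring words). The project's actual document format and feature set are not specified above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: the XML schema, names, and feature set are assumptions.
class FeatureExtractionSketch {

    /** Data management: reads the text of every <token> element, in document order. */
    static List<String> readTokens(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("token");
            List<String> tokens = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                tokens.add(nodes.item(i).getTextContent().trim());
            }
            return tokens;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    /** Feature extraction: active binary features for token i (tokens assumed non-empty). */
    static List<String> extract(List<String> tokens, int i) {
        String word = tokens.get(i);
        List<String> features = new ArrayList<>();
        features.add("word=" + word.toLowerCase(Locale.ROOT));
        if (Character.isUpperCase(word.charAt(0))) features.add("isCapitalized");
        if (word.chars().allMatch(Character::isUpperCase)) features.add("isAllCaps");
        if (word.chars().anyMatch(Character::isDigit)) features.add("hasDigit");
        int n = Math.min(3, word.length());
        features.add("prefix=" + word.substring(0, n).toLowerCase(Locale.ROOT));
        features.add("suffix=" + word.substring(word.length() - n).toLowerCase(Locale.ROOT));
        // Context features use sentence-boundary markers at the edges.
        features.add("prev=" + (i > 0 ? tokens.get(i - 1).toLowerCase(Locale.ROOT) : "<s>"));
        features.add("next=" + (i + 1 < tokens.size()
                ? tokens.get(i + 1).toLowerCase(Locale.ROOT) : "</s>"));
        return features;
    }

    public static void main(String[] args) {
        String xml = "<sentence><token>Alice</token><token>visited</token>"
                + "<token>Chicago</token></sentence>";
        List<String> tokens = readTokens(xml);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i) + " -> " + extract(tokens, i));
        }
    }
}
```

Representing features as strings keeps the sketch independent of any particular learner; a real system would map them to indices in a feature vector before handing them to the machine learning component.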
Schedule
General Schedule
- Week 4: Team role assignment, literature survey, architecture
brainstorming
- Week 5: Architecture specification, skeleton system
- Week 6: A functional system including evaluation
- Week 7: Engineering a better system, defining "stretch" goals
- Week 8: Evaluation, formalizing future work
Group Meetings
We will set aside 3-4 hours a week to meet as a team to discuss issues
that affect all aspects of the project.
- Monday morning - Architecture Meeting (to discuss system-wide
changes with the entire group)
- Monday last hour - Architecture "Vote" (to decide on system-wide
changes with the entire group)
- Wednesday last hour - Progress "Reports"
- Friday afternoon - Group-wide Demonstration, Future Work Brainstorming
Individual Roles/Responsibilities
Each team member will have a major and a minor responsibility. A major
responsibility means that you are responsible for generating code for
this part of the system. A minor responsibility means that you are
responsible for knowing how this part of the system works and being
able to add and review code. With this system, everybody will have
designated people to discuss ideas with and review code while
maintaining a notion of primary responsibility. Furthermore, if some
areas require more work than planned, there will be some flexibility
in human resource allocation.
- GUI / Evaluation - The primary responsibility of this team member
is to design and code the machine learning, feature extraction, and
evaluation parts of the GUI. (Jeremy)
- GUI / XML - The primary responsibility of this team member is to
design and code the annotation portion of the GUI. (Jonathan)
- XML / System Integration - The primary responsibility of this team
member is to preprocess the text, write the XML interface, and manage
the data. Also, there is an interest in developing a server mode and
this person would be responsible for developing the communication
interface for this mode. (Thanh)
- Feature Extraction / XML - This team member can be viewed as the
more "pure" feature extraction person. He would be responsible for
moving from text into an appropriate machine learning representation. (Xiaozhong)
- Feature Extraction / Machine Learning - This team member
will be responsible for developing "higher level" features that may
require real-time interaction with the machine learning program. Also,
if there are structured learning constraints within the architecture,
they would fall under this team member's primary responsibilities. (Martin)
- Machine Learning / Evaluation - The primary machine learning
person. One caveat: if the learning component ends up being
straightforward (i.e., we just use LBJ with slight changes), this team
member will likely join the GUI programming effort. (Vikas)
- System Integration - The primary system integration person. There
is no "minor" responsibility in this case, as this team member is
expected to understand all of the subcomponents in order to oversee as
seamless an integration as possible. (Sean)
Reading List
- Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the
CoNLL-2003 Shared Task: Language-Independent Named Entity
Recognition. In Proceedings of CoNLL-2003, Edmonton, Canada,
2003, pp. 142-147.
- Tong Zhang, Fred Damerau and David Johnson. Text
Chunking based on a Generalization of Winnow. Journal of
Machine Learning Research 2 (2002), pp. 615-637.
- Vasin Punyakanok and Dan Roth. The Use of Classifiers
in Sequential Inference. In Advances in Neural Information
Processing Systems, 2001, pp. 995-1001.
- Michael Collins. Discriminative Training Methods for
Hidden Markov Models: Theory and Experiments with Perceptron
Algorithms. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP), Philadelphia, July
2002, pp. 1-8.
- James Mayfield, Paul McNamee and Christine Piatko. Named Entity
Recognition using Hundreds of Thousands of Features. In Proceedings
of CoNLL-2003, Edmonton, Canada, 2003, pp. 184-187.
- Hai Leong Chieu and Hwee Tou Ng. Named Entity
Recognition with a Maximum Entropy Approach. In Proceedings of
CoNLL-2003, Edmonton, Canada, 2003, pp. 160-163.
- The Swing Tutorial
- The Java/XML Tutorial