Named Entity Recognition

Student Mentors:
Kevin Small (ksmall at uiuc dot edu)
Jing Jiang (jiang4 at uiuc dot edu)

DSSI Students:
Vikas Bahirwani
Sean Clarke
Jeremy Grozavescu
Xiaozhong Liu
Thanh-Nhan Nguyen
Martin Paradesi
Jonathan Ray

Introduction

Named entity recognition (NER), also known as entity identification or entity extraction, is the task of locating and classifying small chunks of text that name objects of certain predefined types, such as people, locations, and organizations. NER is a subtask of information extraction (IE) and has been studied extensively in the NLP community. It is an important problem because other applications (e.g. relation extraction, Web search, question answering) would benefit from an accurate NER system. However, despite the importance of NER and the significant existing research devoted to the task, to the best of our knowledge there is no widely used NER tool that both achieves state-of-the-art performance and provides a flexible user interface. This project aims to build an NER tool based on classifier techniques, together with a powerful user interface that facilitates effective algorithm design, experimentation, evaluation, and annotation. The tool should allow a user to rapidly adapt the basic system to various domains and achieve results competitive with the state of the art.

Architecture

The architecture consists of two fundamental components: a graphical user interface (GUI) and a named entity recognition (NER) system based on machine learning techniques.

GUI

The GUI will be implemented using the Java Swing libraries. There are several subcomponents in the GUI component:

  • Learning Architecture: to handle entity tagging using pre-trained models, learning from new labeled data, etc.
  • Data Annotation: to provide the annotation interface.
  • Feature Management: to enable different feature configurations.
  • Evaluation: to enable evaluation of the system.
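
The Evaluation subcomponent needs some way to score system output against annotated data. The plan above does not name the metrics, so the following is only a minimal sketch under the assumption that entity-level precision, recall, and F1 with exact matching are used, which is a common convention for NER. The class name `EntityEvaluator` and the `"start:end:type"` string encoding of an entity are hypothetical choices made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: entity-level scoring with exact span-and-type matching. */
public class EntityEvaluator {

    /**
     * Scores predicted entities against gold entities.
     * Each entity is encoded as "start:end:type" (an assumed encoding).
     * Returns {precision, recall, f1}.
     */
    public static double[] score(Set<String> gold, Set<String> predicted) {
        // An entity counts as correct only if span and type both match exactly.
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);

        double precision = predicted.isEmpty() ? 0.0 : (double) correct.size() / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) correct.size() / gold.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList("0:1:PER", "3:4:ORG"));
        Set<String> predicted = new HashSet<>(Arrays.asList("0:1:PER", "2:3:LOC"));
        double[] s = score(gold, predicted);
        System.out.println("P=" + s[0] + " R=" + s[1] + " F1=" + s[2]);
    }
}
```

Exact matching is the strictest choice; a partial-match or per-type breakdown could be layered on top of the same set-based comparison if the evaluation view needs it.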

NER

The NER component will be implemented in Java and likely use LBJ as a starting point for the learning system. The following subcomponents are included:

  • Data Management: to use the Java XML libraries to implement a flexible document management system.
  • Feature Extraction: to transform the textual data into feature vector representation for the machine learning system.
  • Machine Learning System: to perform learning and inference.
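
To make the Feature Extraction step concrete, here is a minimal sketch of how a token in a sentence might be turned into a set of active features for a classifier. It is an illustration under stated assumptions, not the project's design: the class name `TokenFeatureExtractor`, the particular features (lowercased word, capitalization, numeric check, neighboring words), and the `<S>` / `</S>` boundary markers are all hypothetical, and the sketch does not use LBJ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: maps a token in context to a list of active feature names. */
public class TokenFeatureExtractor {

    /** Returns the features for the token at position i of the sentence. */
    public static List<String> extract(String[] tokens, int i) {
        List<String> features = new ArrayList<>();
        String token = tokens[i];

        // Lexical feature: the word itself, case-normalized.
        features.add("word=" + token.toLowerCase());

        // Orthographic features, which are often informative for names.
        if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
            features.add("isCapitalized");
        }
        if (!token.isEmpty() && token.chars().allMatch(Character::isDigit)) {
            features.add("isNumeric");
        }

        // Context features: neighboring words, with assumed boundary markers.
        features.add("prev=" + (i > 0 ? tokens[i - 1].toLowerCase() : "<S>"));
        features.add("next=" + (i < tokens.length - 1 ? tokens[i + 1].toLowerCase() : "</S>"));
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Kevin", "studies", "at", "UIUC"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + extract(tokens, i));
        }
    }
}
```

Representing features as strings keeps the extractor independent of any particular learner; a later step can map each distinct string to an index to obtain a sparse feature vector.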

Schedule

General Schedule

  • Week 4: Team role assignment, literature survey, architecture brainstorming
  • Week 5: Architecture specification, skeleton system
  • Week 6: A functional system including evaluation
  • Week 7: Engineering a better system, defining "stretch" goals
  • Week 8: Evaluation, formalizing future work

Group Meetings

We will set aside 3-4 hours a week to meet as a team and discuss issues that affect all aspects of the project.

  • Monday morning - Architecture Meeting (to discuss system-wide changes among the entire group)
  • Monday last hour - Architecture "Vote" (to decide on system-wide changes among the entire group)
  • Wednesday last hour - Progress "Reports"
  • Friday afternoon - Group-wide Demonstration, Future Work Brainstorming

Individual Roles/Responsibilities

Each team member will have one major and one minor responsibility. A major responsibility means that you are responsible for generating code for this part of the system. A minor responsibility means that you are responsible for knowing how this part of the system works and for being able to add and review code. With this system, everybody will have designated people with whom to discuss ideas and review code, while a notion of primary responsibility is maintained. Furthermore, if some areas require more work than planned, there will be some flexibility in reallocating people.

  • GUI / Evaluation - The primary responsibility of this team member is to design and code the machine learning, feature extraction, and evaluation parts of the GUI. (Jeremy)
  • GUI / XML - The primary responsibility of this team member is to design and code the annotation portion of the GUI. (Jonathan)
  • XML / System Integration - The primary responsibility of this team member is to preprocess the text, write the XML interface, and manage the data. Also, there is an interest in developing a server mode and this person would be responsible for developing the communication interface for this mode. (Thanh)
  • Feature Extraction / XML - This team member can be viewed as the more "pure" feature extraction person. He would be responsible for moving from text into an appropriate machine learning representation. (Xiaozhong)
  • Feature Extraction / Machine Learning - This team member will be responsible for developing "higher level" features that may require real-time processing with the machine learning program. Also, if there are structured learning constraints within the architecture, they would fall under this team member's primary responsibilities. (Martin)
  • Machine Learning / Evaluation - The primary machine learning person. One caveat: if the learning component ends up being straightforward (i.e. we just use LBJ with slight changes), this team member will likely join the GUI programming effort. (Vikas)
  • System Integration - The primary system integration person. There is no "minor" responsibility in this case, as this team member would be expected to understand all of the subcomponents in order to oversee as seamless an integration as possible. (Sean)

Reading List