Motivation:
In recent years, image collections have grown in both number and size, and they represent a huge amount of important information. Both privately and publicly owned images are available over the Internet for browsing and searching, and the number of users and application areas is increasing. To keep these image collections from becoming hidden treasures, efficient techniques for image retrieval are crucial.
To date, image search provided by major search engines such as Google, MSN, and Yahoo! relies on textual descriptions found on the Web pages that contain the images and on the images' file names. These search engines do not analyze the pixel content of images and hence cannot be used to search unannotated image collections. Given the complex and fragmented nature of networked communities, automatic annotation of images with words is a crucial technology for ensuring the "visibility" of images on the Internet. It has become an imperative but highly challenging task. The ability of computers to annotate images can lead to breakthroughs in a wide range of applications, including Web image search, online picture-sharing communities, and scientific experiments.
Background:
Automatic image annotation (also known as automatic image tagging) is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.
Annotation can be regarded as a type of multi-class image classification with a very large number of classes, as large as the vocabulary size. Typically, machine learning techniques use image analysis in the form of extracted feature vectors, together with the training annotation words, to attempt to apply annotations to new images automatically. The first methods learned the correlations between image features and training annotations. Later techniques used machine translation to try to translate between the textual vocabulary and the 'visual vocabulary', that is, clustered regions known as blobs. Work following these efforts has included classification approaches, relevance models, and so on.
Some annotation engines are online, including the ALIPR.com real-time tagging engine developed by Penn State researchers and Behold, an image search engine that indexes over 1 million images using automatically generated tags. Several major challenges remain. First, the concepts annotated in training data are incomplete (some entities are not labeled in the annotation) and inconsistent (sometimes sky is labeled and sometimes it is not). Second, there is no agreement as to which visual features are useful. Third, it is not known how best to model p(c|dv), the probability of a concept c given the document visuals dv.
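As a point of reference for the third challenge, one common way to frame p(c|dv) is through Bayes' rule. This is only one possible decomposition and not something the proposal commits to. Here d_v stands for the document visuals dv, and the sum runs over every concept c' in the vocabulary:

```latex
p(c \mid d_v) = \frac{p(d_v \mid c)\, p(c)}{\sum_{c'} p(d_v \mid c')\, p(c')}
```

Under this framing the open question becomes how to choose the class-conditional model p(d_v | c) and how to estimate the prior p(c) from incomplete and inconsistent training annotations.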
Proposed Project:
However, all of this work uses datasets in which a small set of words is known, by convention, to annotate each image. These methods would be much more useful if they could be applied to free-form objects, where it is not known which text, if any, refers to the image.
Our research hypothesis is that, for many illustrated digital pages, the language near a picture takes different forms depending on whether or not it refers to what is depicted in the image. This hypothesis, if true, will lead to methods of considerable impact, because it will allow statistical natural language methods, possibly together with image understanding methods, to interpret pictures at least roughly. An attractive feature of the hypothesis is that the changes in form are likely to be independent of subject, at least to first order. The language cues that suggest that a nearby picture does, indeed, depict a horse are unlikely to differ from those that suggest it depicts a monkey.
The summer project consists of the following tasks:
Task 1: Data Collection
The students will collect a large set of illustrated digital pages (e.g., over one million pages) covering a range of physical objects; simple clustering methods can be used to identify the topics of the pages. They will gather these pages through Google's image search and from sources such as Flickr.com and Peekaboom.
To crawl the Web pages, the students need to build a Web crawler. The crawling operations will be provided by the API, which is written in Java. The students will design and develop a crawler and use it to collect a large number of useful Web pages, that is, images with their surrounding text descriptions.
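To make the target output of the crawler concrete, here is a minimal sketch of pulling each image and its surrounding text out of one already-downloaded page. It is written in Python with only the standard library, independently of the Java crawling API mentioned above. The names `ImageContextParser` and `extract_image_contexts` and the word window size are illustrative assumptions, not part of the project.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collect each <img> together with the visible words around it."""

    def __init__(self, window=20):
        super().__init__()
        self.window = window     # words kept on each side (assumed value)
        self.words = []          # visible words seen so far, in page order
        self.images = []         # (src, alt, word position of the image)
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            src = attr.get("src") or ""
            alt = attr.get("alt") or ""
            self.images.append((src, alt, len(self.words)))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; keep everything else as words.
        if self._skip_depth == 0:
            self.words.extend(data.split())

    def records(self):
        out = []
        for src, alt, pos in self.images:
            before = self.words[max(0, pos - self.window):pos]
            after = self.words[pos:pos + self.window]
            out.append({
                "src": src,
                "alt": alt,
                "before": " ".join(before),
                "after": " ".join(after),
            })
        return out


def extract_image_contexts(html, window=20):
    """Return one record per image found in the HTML string."""
    parser = ImageContextParser(window)
    parser.feed(html)
    parser.close()
    return parser.records()
```

Each record keeps the text before and after the image separately, since the research hypothesis concerns the form of the language near a picture. A fixed word window is the simplest choice; a real crawler would likely also use page structure such as captions or enclosing elements.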
Task 2: Visual Feature Description
The students will investigate an image feature pipeline that includes several kinds of image features: color, texture, and shape; local features (e.g., SIFT, salient points) and global features (e.g., color histogram); and features with spatial information (e.g., color correlogram). They can also refer to MPEG-7 for standard color, texture, and shape feature descriptions. The students will learn to build simple image feature extraction and evaluation methods.
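As an illustration of the simplest global feature named above, the following sketch computes a normalized joint RGB color histogram and compares two histograms by histogram intersection. It assumes pixels are already available as (r, g, b) tuples with values from 0 to 255; loading image files is left out, and the function names and the default of 4 bins per channel are assumptions made for the example.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized joint RGB histogram.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    The result has bins_per_channel ** 3 entries that sum to 1
    (or all zeros if there are no pixels).
    """
    if not 1 <= bins_per_channel <= 256:
        raise ValueError("bins_per_channel must be between 1 and 256")
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..n-1.
        ri, gi, bi = r * n // 256, g * n // 256, b * n // 256
        hist[(ri * n + gi) * n + bi] += 1
        count += 1
    if count == 0:
        return hist
    return [h / count for h in hist]


def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A histogram like this discards all spatial layout, which is the gap that features such as the color correlogram are meant to fill.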
Task 3: Joint Probability Model of Picture and Words
We will use a small, fixed set of physical object topics in the first instance. We will investigate different ways of constructing a joint probability model of pictures and words, ranging from more objective descriptions (e.g., tiger in the image) to more subjective descriptions (e.g., autumn in Urbana-Champaign).
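One very simple baseline for such a joint model is a co-occurrence table between visual tokens and words, in the spirit of the early correlation-based methods described in the Background. The sketch below is an assumption-laden illustration, not the model the project will adopt: it assumes each training image has already been reduced to a list of discrete visual tokens (blob identifiers), and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_joint_model(training_pairs):
    """Estimate p(blob, word) from co-occurrence counts.

    training_pairs: list of (blobs, words), one entry per image.
    Each distinct (blob, word) pair is counted once per image.
    """
    joint = Counter()
    for blobs, words in training_pairs:
        for b in set(blobs):
            for w in set(words):
                joint[(b, w)] += 1
    total = sum(joint.values())
    return {pair: n / total for pair, n in joint.items()}


def word_given_blob(joint, blob):
    """p(word | blob), obtained by normalizing one row of the joint table."""
    row = {w: p for (b, w), p in joint.items() if b == blob}
    z = sum(row.values())
    return {w: p / z for w, p in row.items()} if z else {}


def annotate(joint, blobs, top_k=3):
    """Rank words for a new image by averaging p(word | blob) over its blobs."""
    scores = defaultdict(float)
    for b in blobs:
        for w, p in word_given_blob(joint, b).items():
            scores[w] += p / len(blobs)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

For example, with made-up training data:

```python
data = [
    (["stripes", "grass"], ["tiger", "grass"]),
    (["stripes", "water"], ["tiger", "river"]),
    (["grass", "sky"], ["horse", "grass"]),
]
joint = fit_joint_model(data)
print(annotate(joint, ["stripes"], top_k=1))
```

A table of this kind can only capture words tied to visible regions, so it is better suited to the objective end of the range (tiger) than to subjective descriptions.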
Timeline:
Stage 1 (Weeks 3-4): Build tools to crawl a large number of tagged images and Web pages (images with their descriptions).
Stage 2 (Weeks 5-6): Implement image feature extraction code. Compute image features on clusters of computers. Analyze the features and choose the useful ones.
Stage 3 (Weeks 7-8): Combine tagging models.
Group Size: 8 students