In many data analysis applications, we often need to gather data from the Web. As the Web evolves as our ultimate information repository, such Web data gathering has quickly become ubiquitous necessary. For any serious information gathering, we must rely on crawling. Starting from some web pages, a crawler extracts the hyperlinks and adds them to a "to-crawl" list. It then recursively visits the list in some order. This crawling process has two major problems. It is costly as it takes up huge computing resources, Internet bandwidth, and Web server loads. It is also ineffective because a crawler must traverse the Web without much prior knowledge of the Web.
To make crawling more affordable and effective, we propose to build a "Virtual Web". The Virtual Web is a system that provides an enhanced view of the Web on which various client applications can query. The Virtual Web will crawl the physical Web regularly to our local cluster and enhance it with augmented graph structures and metadata. We will then enable application clients to crawl on this Virtual Web.
Our objectives in the Virtual Web are two-fold. It is a research platform to investigate the technical challenges. It also provides an efficient and effective web data gathering service for various data analysis applications.
Motivation: The Virtual Web aims to support various application clients to efficiently and effectively gather web data. It provides a Virtual Graph on which the clients can crawl. The Virtual Graph is a much more connected web graph with many "virtual" nodes and edges added to it. For example, the Virtual Graph contains a virtual node called "airfare site" and connects any airfare websites to this node. When we want to get all the airfare webpages, we can start crawling from this "airfare site" node. This example illustrates that with the help of virtual nodes and edges, two related webpages can be connected, even if there are no physical hyperlinks between them. Hence, what virtual nodes and edges to provide in the Virtual Graph affect how effective we crawl.
Objective: Define a set of virtual nodes and edges for the Virtual Graph and demonstrate its usefulness. This set can be specific for one crawling task or general for any crawling tasks.
Challenges: What virtual nodes and edges to provide so that they are useful for at least one crawling task? For instance, one possible set is entity nodes - see Application task 2 on entities. The virtual nodes can be phone number nodes, email nodes, etc. How to construct them so that the construction is incremental and scalable with respect to the number of webpages in the Virtual Web? How to access them efficiently? How to measure usefulness?
Deliverables: The set of virtual nodes and edges augmented on the Virtual Graph. Documentation that describes how this set is defined and constructed. A crawling task that uses your proposed set of virtual nodes and edges (you should work with the Application teams on developing the crawling task). The webpages obtained from this crawl.
Group Size: 2 people
Timeline:
Motivation: In order to crawl the Virtual Web, users need to build a Virtual Web crawler. A Virtual Web Crawler contains a classifier and a sequence of crawling operations. The classifier indicates which pages should be crawled and the crawling operations specify how the crawler should crawl the Virtual Web. One crawling operation is given a webpage, return all its children links. The crawling operations are already provided by the Virtual Graph API, which is written in java.
We would like to minimize the user effort in building a Virtual Web crawler. One approach is to build the crawler visually. For example, each crawling operation can be considered as a visual block, and we can connect the blocks together to specify a sequence of crawling operations. Yahoo Pipes (http://pipes.yahoo.com/pipes/) is one such example. It lets users create data mashups visually. An immediate benefit is that users do not need to know java (or any other programming language) in order to build a crawler.
Objective: Design and develop a visual crawler programming tool, and demonstrate how you can it to assemble a Virtual Web crawler.
Challenges: How to visualize the building process? How can we make crawler building more intuitive and interactive? How to make visual programming equivalent to traditional programming? What are the visual blocks to support? How to build this programming tool so that it is extensible? For example, given a new crawling operation, we need to create a new visual block for it.
Deliverables: Documentation that describes the tool (design, architecture, implementation, usage). A crawler that is assembled by the tool.
Group Size: 2 people
Timeline:
Motivation: Often times, applications are in need for obtaining data relevant to the application domain. Consider the following two motivating application scenarios:
Objective: Demonstrate that you can collect a comprehensive/high quality set of webpages for at least two interesting domains ( such as the ecommerce domain about cell phones and the academic domain about computer science community mentioned in the two scenarios above). To make this task complete, you should also build a simple application upon your dataset.
Challenges: How can we leverage the interface provided by the Virtual Web in getting data for a specific domain, such that we can avoid looking at all the webpages? How can we judge the quality of the data obtained? Although precision is important, recall is even more critical for such tasks. Is our solution general such that it can be easily adapted for obtaining webpages of other domains as well?
Deliverables: Documentation that describes and explains the strategy used in obtaining the data. The data (webpages) obtained. A simple application built upon the dataset.
Group Size: 2 people
Timeline:
Motivation : In our WISDM project at UIUC, we are studying novel search problems. Our entity search views the Web as a collection as entities (e.g. phones, emails, etc) rather than a collection of documents. Therefore, what entity search engine returns is no longer a list of documents, but rather a list of entities. Such a search scenario has to get webpages about the entity types it supports to begin with. For instance, we may want to build a yellowbook application using entity search, where we can search for phone numbers and email addressses. What we need is all the webpages that contain phone number or email address. It will be great if we can use the Virtual Web to help us quickly locate such webpages.
Objective : Demonstrate that you can get all the webpages containing certain entity types, whose pattern could be described using regular expressions.
Challenges : How can we support efficient regular expression search via the interface provided by the Virtual Web? How can we record the information about entities extracted by the regular expressions? Can we easily support entity search upon the Virtual Web platform?
Deliverables : Documentation that describes and explains the strategy used in obtaining the data. The data (webpages) obtained for a given set of regular expression (e.g. email pattern, phone pattern, etc). A simple search application, which uses the entities, upon the dataset.
Group Size: 2 people
Timeline :