UIUC header
Mias_header2

Virtual Web: What If You Own the Entire Web?

Project Mentors:
Hanna Zhong (hzhong [AT] uiuc [DOT] edu)
Tao Cheng (tcheng3 [AT] cs [DOT] uiuc [DOT] edu)

In many data analysis applications, we often need to gather data from the Web. As the Web evolves as our ultimate information repository, such Web data gathering has quickly become ubiquitous necessary. For any serious information gathering, we must rely on crawling. Starting from some web pages, a crawler extracts the hyperlinks and adds them to a "to-crawl" list. It then recursively visits the list in some order. This crawling process has two major problems. It is costly as it takes up huge computing resources, Internet bandwidth, and Web server loads. It is also ineffective because a crawler must traverse the Web without much prior knowledge of the Web.

To make crawling more affordable and effective, we propose to build a "Virtual Web". The Virtual Web is a system that provides an enhanced view of the Web on which various client applications can query. The Virtual Web will crawl the physical Web regularly to our local cluster and enhance it with augmented graph structures and metadata. We will then enable application clients to crawl on this Virtual Web.

Such "virtual crawling" will be:

  • Efficient: Crawling is done on the local storage; it does not take Internet bandwidth on every repeated access. The cost is amortized by serving multiple clients.
  • Effective: Crawling is done on the "virtual graph," a much more connected web graph with many "virtual" nodes and edges added to the web graph. The virtual graph provides a roadmap of the Web and we can exploit this to develop many interesting and innovative crawling algorithms.
Towards this novel concept, the technical challenges arise in two major categories:

  • Virtual Graph: How can we build and store the virtual graph? How can we maintain and update the graph as the Web is dynamic and changing?
  • Virtual Crawling: How can we support various crawling clients with different schemes? How to concurrently serve a large number of virtual crawlers?

Our objectives in the Virtual Web are two-fold. It is a research platform to investigate the technical challenges. It also provides an efficient and effective web data gathering service for various data analysis applications.

1) System Tasks: S.1 and S.2

Task A.1: Virtual Graph Construction: Discovering and Building Webpage Connections


Motivation: The Virtual Web aims to support various application clients to efficiently and effectively gather web data. It provides a Virtual Graph on which the clients can crawl. The Virtual Graph is a much more connected web graph with many "virtual" nodes and edges added to it. For example, the Virtual Graph contains a virtual node called "airfare site" and connects any airfare websites to this node. When we want to get all the airfare webpages, we can start crawling from this "airfare site" node. This example illustrates that with the help of virtual nodes and edges, two related webpages can be connected, even if there are no physical hyperlinks between them. Hence, what virtual nodes and edges to provide in the Virtual Graph affect how effective we crawl.

Objective: Define a set of virtual nodes and edges for the Virtual Graph and demonstrate its usefulness. This set can be specific for one crawling task or general for any crawling tasks.

Challenges: What virtual nodes and edges to provide so that they are useful for at least one crawling task? For instance, one possible set is entity nodes - see Application task 2 on entities. The virtual nodes can be phone number nodes, email nodes, etc. How to construct them so that the construction is incremental and scalable with respect to the number of webpages in the Virtual Web? How to access them efficiently? How to measure usefulness?

Deliverables: The set of virtual nodes and edges augmented on the Virtual Graph. Documentation that describes how this set is defined and constructed. A crawling task that uses your proposed set of virtual nodes and edges (you should work with the Application teams on developing the crawling task). The webpages obtained from this crawl.

Group Size: 2 people


TimeTables

Timeline:

  • Stage 1 (Week 3): Familiar yourselves with the Virtual Web and understand how the Virtual Web crawlers work.
  • Stage 2 (Week 4): Propose a set of virtual nodes and edges. Come up with a concrete approach for discovering .and constructing the set.
  • Stage 3 (Week 5-6): Construct the virtual nodes and edges, and augment them on the Virtual Graph.
  • Stage 4 (Week 7-8): Build a crawler - you should work closely with the Application teams on building this crawler - that uses your proposed set. Run experiments to verify that your proposed set is useful.

Task A.2: Visual Crawler Programming: Minimizing Crawler Construction


Motivation: In order to crawl the Virtual Web, users need to build a Virtual Web crawler. A Virtual Web Crawler contains a classifier and a sequence of crawling operations. The classifier indicates which pages should be crawled and the crawling operations specify how the crawler should crawl the Virtual Web. One crawling operation is given a webpage, return all its children links. The crawling operations are already provided by the Virtual Graph API, which is written in java.

We would like to minimize the user effort in building a Virtual Web crawler. One approach is to build the crawler visually. For example, each crawling operation can be considered as a visual block, and we can connect the blocks together to specify a sequence of crawling operations. Yahoo Pipes (http://pipes.yahoo.com/pipes/) is one such example. It lets users create data mashups visually. An immediate benefit is that users do not need to know java (or any other programming language) in order to build a crawler.

Objective: Design and develop a visual crawler programming tool, and demonstrate how you can it to assemble a Virtual Web crawler.

Challenges: How to visualize the building process? How can we make crawler building more intuitive and interactive? How to make visual programming equivalent to traditional programming? What are the visual blocks to support? How to build this programming tool so that it is extensible? For example, given a new crawling operation, we need to create a new visual block for it.

Deliverables: Documentation that describes the tool (design, architecture, implementation, usage). A crawler that is assembled by the tool.

Group Size: 2 people




Timeline:

  • Stage 1 (Week 3): Familiar yourselves with the Virtual Web and how the Virtual Web crawlers work.
  • Stage 2 (Week 4-5): Define what visual blocks to support. Come up with a concrete approach to do visual crawler programming.
  • Stage 3 (Week 6-7): Implement your approach.
  • Stage 4 (Week 8): Assemble a Virtual Web crawler using your approach.

2) Application Tasks-- A.1 and A.2

Task A.1: Focused Crawl using the Virtual Web:

Obtaining Webpages About a Certain Domain


Motivation: Often times, applications are in need for obtaining data relevant to the application domain. Consider the following two motivating application scenarios:

  • Scenario 1: Assuming we want to build a meta-website about all the cell phone information on the Web. The very first task to conquer is to collect cell phone related webpages. How can we utilize the Virtual Web to build the basis of such an application?
  • Scenario 2: Community information management is a recent research direction. One such example is the DBLife project at wisconsin (http://dblife.cs.wisc.edu/). One big challenge of such an application is how to obtain relevant webpages about the community. Currently obtaining relevant webpages is largely based on a set of human-identified domains, such as dblp, cs department pages, conference pages. Can we develop a more principled way to gather information about a certain community, say about the computer science community, using the Virtual Web?

Objective: Demonstrate that you can collect a comprehensive/high quality set of webpages for at least two interesting domains ( such as the ecommerce domain about cell phones and the academic domain about computer science community mentioned in the two scenarios above). To make this task complete, you should also build a simple application upon your dataset.

Challenges: How can we leverage the interface provided by the Virtual Web in getting data for a specific domain, such that we can avoid looking at all the webpages? How can we judge the quality of the data obtained? Although precision is important, recall is even more critical for such tasks. Is our solution general such that it can be easily adapted for obtaining webpages of other domains as well?

Deliverables: Documentation that describes and explains the strategy used in obtaining the data. The data (webpages) obtained. A simple application built upon the dataset.

Group Size: 2 people

TimeTables

Timeline:

  • Stage 1 (Week 3): Familiar yourselves with the Virtual Web interfaces, etc. Complete a simple crawl task: Find all the webpages that mention "university of illinois" as a phrase and have at least two backlinks (ie. there is at least two other webpages pointing to such pages).
  • Stage 2 (Week 4-5): Select two application domains to focus on. Come up with a concrete approach for getting data for the selected domains. Justify your approach analytically.
  • Stage 3 (Week 6-7): Implement your approach and run experiments to verify your approach empirically. Possible desgin changes could still be made in this stage.
  • Stage 4 (Week 8): Build a search application upon one of the datasets. For instance, a full text search application using Lucene as the underlying platform.

Task A.2: Supporting Entity Search: Obtaining Webpages about Certain Entities using the Virtual Web


Motivation : In our WISDM project at UIUC, we are studying novel search problems. Our entity search views the Web as a collection as entities (e.g. phones, emails, etc) rather than a collection of documents. Therefore, what entity search engine returns is no longer a list of documents, but rather a list of entities. Such a search scenario has to get webpages about the entity types it supports to begin with. For instance, we may want to build a yellowbook application using entity search, where we can search for phone numbers and email addressses. What we need is all the webpages that contain phone number or email address. It will be great if we can use the Virtual Web to help us quickly locate such webpages.

Objective : Demonstrate that you can get all the webpages containing certain entity types, whose pattern could be described using regular expressions.

Challenges : How can we support efficient regular expression search via the interface provided by the Virtual Web? How can we record the information about entities extracted by the regular expressions? Can we easily support entity search upon the Virtual Web platform?

Deliverables : Documentation that describes and explains the strategy used in obtaining the data. The data (webpages) obtained for a given set of regular expression (e.g. email pattern, phone pattern, etc). A simple search application, which uses the entities, upon the dataset.

Group Size: 2 people




Timeline :

  • Stage 1 (Week 3): Familiar yourselves with the Virtual Web interfaces, etc. Complete a simple crawl task: Find all the webpages that mention "university of illinois" as a phrase and have at least two backlinks (ie. there is at least two other webpages pointing to such pages).
  • Stage 2 (Week 4-5): Literature study. Come up with a concrete approach for getting webpages that match certain regular expressions. Justify your approach analytically.
  • Stage 3 (Week 6-7): Implement your approach and run experiments to verify your approach empirically. Possible design changes could still be made in this stage.
  • Stage 4 (Week 8): Build a simple, entity-aware, search application upon the dataset. Candidate search platforms could be Lumur, Lucene, etc.