To make template extraction as palatable as possible, I am going to start by walking through how one discovers the templates used to generate a set of URLs, using Amazon.com as an example. Search for "book" on Amazon.com. You should see a page with a number of results (data records). Hovering over each of the titles shows many similar URLs, some of which I have pasted below:
We can easily see a pattern (the template) in those URLs:

http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>

Our job is to find an algorithm that will allow us to discover that template automatically.

(I should mention that if one continues to other pages in the search results, there are more differences in the URLs, which lead to a different pattern. To keep the discussion simple, I will only use the first page's URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template. I should also mention that the ability to label the fields in the template, as I did in the example above, will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)

We are going to take two approaches to extracting templates from URLs. The first will use traditional string edit distance algorithms, and the second will exploit the fact that there is underlying structure in the URL (i.e. the scheme, authority, path, query, and fragment).
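To make the second, structure-based approach a little more concrete, here is a minimal sketch in Python. It is my own illustration under stated assumptions, not a finished algorithm: the function names (`tokenize`, `merge`, `extract_template`) and the placeholder names (`<seg0>`, `<sr>`, and so on) are hypothetical, and it assumes every URL has the same number of path segments and the same query keys in the same order. It splits each URL into its components, compares them position by position, and replaces anything that varies with a placeholder.

```python
from urllib.parse import urlsplit, parse_qsl


def tokenize(url):
    """Split a URL into (scheme, authority, path segments, query pairs)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    query = parse_qsl(parts.query, keep_blank_values=True)
    return parts.scheme, parts.netloc, segments, query


def merge(values, placeholder):
    """Return the shared value, or the placeholder if the values differ."""
    return values[0] if len(set(values)) == 1 else placeholder


def extract_template(urls):
    """Derive a coarse template from URLs that share the same shape."""
    schemes, hosts, paths, queries = zip(*(tokenize(u) for u in urls))

    # Assumption of this sketch: all URLs have identical structure.
    if len({len(p) for p in paths}) != 1:
        raise ValueError("URLs have different numbers of path segments")
    if len({tuple(k for k, _ in q) for q in queries}) != 1:
        raise ValueError("URLs have different query keys")

    # Compare path segments column by column.
    path = [merge(col, "<seg%d>" % i) for i, col in enumerate(zip(*paths))]

    # Compare query values key by key; placeholders reuse the key name.
    keys = [k for k, _ in queries[0]]
    query = [
        "%s=%s" % (k, merge([q[i][1] for q in queries], "<%s>" % k))
        for i, k in enumerate(keys)
    ]

    template = "%s://%s/%s" % (
        merge(schemes, "<scheme>"),
        merge(hosts, "<host>"),
        "/".join(path),
    )
    return template + ("?" + "&".join(query) if query else "")
```

Fed two made-up URLs of the same shape as the Amazon pattern above, the sketch replaces the title, book id, `ref=...` segment, and `sr` value with placeholders and keeps everything else. Notice that it is coarser than the hand-written template: it cannot tell that `ref=` and `8-` are constant prefixes inside a varying token, which is exactly the kind of detail a character-level edit distance alignment can recover.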
The goal of template extraction is to discover the template(s), if there are any, used to generate a page. There are two major categories of templates: those that are used within a single web page (such as the individual search result entries on a Google search results page) and those that are used across web pages (such as a story template used on CNN.com). In the former case, the goal is to be able to extract the template(s) from any page that has at least two similar data records on it. In the latter case, the goal is to be able to extract the template(s) from any two pages that have similar data records. As more similar records or pages are found, we expect our precision to increase.

By looking at a rendered Google search results page, for example, a human can easily see similar, repeated sections. Although it is a bit more difficult, someone familiar with HTML can look at the source of that same Google page and find the repeated sections. Similarly, there are two general approaches to automatic template discovery: one that relies on the rendered representation of the page and one that relies on the source of the page. A survey of the research literature does not show a distinct advantage of one technique over the other, and in many cases the underlying algorithms are very similar. I am going to focus on the latter approach since that is where my experience lies.
With the recent movement towards mashups, the semantic web, and market intelligence, there is a large need to get at the data and information stored in web pages. Data extraction startups are popping up like weeds (e.g. InfoSquire and QL2). Many of these startups focus on services where you specify which sites you want scraped and they provide you with the resulting data feed. The technologies that they use are primarily rules-based (e.g. regex). Rules-based systems are highly brittle given the dynamic nature of the web. Specifically, there is a high cost to maintaining and monitoring the rules to ensure that they stay up to date with any changes made to the underlying web pages. The ability to automatically generate data extractors with high precision would be a vast improvement over a rules-based system.

Much research has gone into extracting data (either structured or unstructured) from a web page using web wrappers. A web wrapper is a tool for "converting information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing" [W4F]. One particular type of automatically generated web wrapper uses template extraction. Template extraction is the inverse of creating a web page from a template: for a given web page, attempt to deduce the template that was used to generate the page. If a template can be generated for any (template-derived) web page, then the data that populates that template can be easily extracted. Over the next few weeks I am going to focus on machine learning and other automatic web wrapper technologies in a series of postings.
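To illustrate why a known template makes extraction easy, here is a small Python sketch. Everything in it is an assumption made for illustration: the `{field}` placeholder syntax, the function names `template_to_regex` and `extract_records`, and the toy HTML. Field names are assumed to be valid Python identifiers. The regular expression here is derived mechanically from the template rather than written and maintained by hand, which is the difference from the rules-based systems described above.

```python
import re


def template_to_regex(template):
    """Compile a template with {field} placeholders into a regex.

    Literal text is escaped; each placeholder becomes a lazy named group.
    """
    pieces = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", template):
        pieces.append(re.escape(template[pos:m.start()]))
        pieces.append("(?P<%s>.*?)" % m.group(1))
        pos = m.end()
    pieces.append(re.escape(template[pos:]))
    return re.compile("".join(pieces), re.DOTALL)


def extract_records(template, page):
    """Return one dict of field values per record found in the page."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in regex.finditer(page)]


# Hypothetical record template and page, for illustration only.
template = '<li><a href="{url}">{title}</a></li>'
page = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)
records = extract_records(template, page)
```

With the toy inputs above, `records` holds one dictionary per list item, each with a `url` and a `title`. Real pages are far messier (optional fields, nested records, varying attributes), which is why discovering the template is the hard part.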
I'm in the process of updating the blog template. Things will be a little goofy for a day or so.
"Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don't improve it. Trying to improve Software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don't buy a new scale; change your diet. If you want to improve your software, don't test more; develop better."
Steve McConnell, Code Complete
Unless otherwise expressly stated, all original material of whatever nature created by Rob Grzywinski and included in this weblog and any related pages, including the weblog's archives, is licensed under a Creative Commons License.