Woozle Wuzzle
URL Similarity

In order to make template extraction as palatable as possible I am going to start by walking through how one discovers the templates used to generate a set of URLs. I'm going to use Amazon.com as an example.

Search for "book" on Amazon.com. You should see a page with a number of results / data records. Placing your mouse over each of the titles shows many similar urls, some of which I have pasted below:

http://www.amazon.com/Harry-Potter-Half-Blood-Prince-Book/dp/0439785960/ref=pd_bbs_1?ie=UTF8&s=books&qid=1196364228&sr=8-1
http://www.amazon.com/Little-Green-Book-Getting-Your/dp/0131576070/ref=pd_bbs_2?ie=UTF8&s=books&qid=1196364228&sr=8-2
http://www.amazon.com/Learning-Curve-LC97727-Lamaze-Caterpillar/dp/B00009IMD8/ref=pd_bbs_3?ie=UTF8&s=baby-products&qid=1196364228&sr=8-3
http://www.amazon.com/Book-General-Ignorance-John-Mitchinson/dp/0307394913/ref=pd_bbs_sr_4?ie=UTF8&s=books&qid=1196364228&sr=8-4
http://www.amazon.com/The-Daring-Book-for-Girls/dp/B000UZJQNM/ref=sr_1_13?ie=UTF8&s=books&qid=1196364228&sr=8-13
http://www.amazon.com/Inconvenient-Book-Solutions-Biggest-Problems/dp/B000WJVLLG/ref=sr_1_16?ie=UTF8&s=books&qid=1196364228&sr=8-16

We can easily see a pattern (the template) in those urls:

http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>

Our job is to find an algorithm that will allow us to automatically discover that template.

(I should mention that if one continues to other pages in the search results that there are more differences in the URLs that leads to a different pattern. In order to keep the discussion simple, I will only use the first page's URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.

I also should mention that the ability to label the fields in the template as I did in the example above will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)

We are going to take two approaches to extract templates from URLs. The first will use traditional string edit distance algorithms and the second will exploit the fact that there is underlying structure in the url (i.e. the scheme, authority, path, query, and fragment).

Comments
Comment by JOE at April 16, 2008 12:00 AM

Hi!
I'm a university student in China and I got here via Google. Right now I'm working on a task which involves URL similarity calculation and template generating. The information you provided in the article is quite related to what I need but I need more detailed algorithm or approach. I'm wondering if you can provide me more information about this. Either publishing links/articles here or emailing me would be fine. And thanks a lot!

 

Comment by rgrzywinski at April 16, 2008 06:32 AM

Hello Joe! Thanks for visiting and commenting.

Unfortunately life is putting up its usual set of roadblocks and I haven't been able to spend much time on this lately.

I would recommend Bing Liu's book "Web Data Mining" for more information.

Please do come back and share any knowledge that you have. I'm facinated with this type of work!

 

Post a comment













Remember personal info?






Creative Commons License Unless otherwise expressly stated, all original material of whatever nature created by Rob Grzywinski and included in this weblog and any related pages, including the weblog's archives, is licensed under a Creative Commons License.