Woozle Wuzzle
String Edit Distance

One commonly used approach for measuring similarity between strings is the string edit distance. The basic premise of the string edit distance is to find the minimum number of character edit operations (copy, replace, add, and remove) needed to transform one string into the other. For example, the edit distance between parks and spark is two: one edit is needed to add a leading s to parks and a second edit is needed to delete the trailing s. Not suprisingly, string edit distance is used in spell-checking applications.

It is convenient to look at string edit distance as a string alignment problem. The example below should provide some insight into this:

 String 1:     p a r k s 
 String 2:   s p a r k   
 Operation:  i c c c c d 

The operation is the character operation needed to align or transform string 1 into string 2. i is insert, c is copy, r is replace, and d is delete. (You may find other references using s (substitute) instead of r (replace).) The space at the beginning of string 1 or at the end of string 2 is called a gap.

For a given set of strings, there may be many ways to align the strings. For example, the strings abc123 and abcqwertabc123 have two obvious alignments:

 a b c                 1 2 3
 a b c q w e r t a b c 1 2 3

and

                 a b c 1 2 3
 a b c q w e r t a b c 1 2 3

The particular application would determine which alignment is preferred.

Different distance metrics are used to select the desired alignment. For example, the measurement used in the original example of the cost between parks and spark is the Levenshtein distance. In this metric the cost of replacing, adding or deleting a character is one whereas the cost of equal (or copied) characters is zero. (In the conversion of parks to spark, the cost of copying letters p a r k is zero.) The second alignment above (where abc123 has no gap) can be obtained using the Smith-Waterman distance.

For simple strings it is relatively easy to understand how to transform one string into another but with longer strings or very dissimilar strings it is not so clear. For example:

 See Spot run
 We work and play

There are brute force techniques that consider all possible alignments and find the one(s) with the minimum cost. I refer you to a few good references on string edit distance and the associated dynamic programming algorithms for more information.

Comments
Comment by Amzad Chowdhury at April 4, 2008 01:35 PM

Not much details on the topic. If u have details information about string edit distance and related topics,can u please give reference where I can find the information.I am very interested to know about the topic,like how it work..are there any algorithm..and so on.So if u can get in touch with me as soon as possible I will be very happy.Thank you.

 

Comment by rgrzywinski at April 5, 2008 06:59 AM

Hello Amzad! I apologize for the lack of information. I had planned on posting much about the topic, but as it always seems to happen, life has put too many obstacles in the road and I have to focus all of my time elsewhere.

To start your research, you can follow all of the links I provide in the articles. They provide all of the base information that you need to know. A good book on the topic is "Web Data Mining" by Bing Liu.

If you have any specific questions, please don't hesitate to email me. I will try to answer to the best of my ability.

 

Comment by amzad chowdhury at April 25, 2008 01:14 PM

how do i relate string edit distance and directed acyclic graph to find the shortest path?pls reply asap..

 

Comment by rgrzywinski at April 26, 2008 07:39 AM

Sounds like someone's got a school homework assignment to do :) When you work through the answer please post it here for others to learn from.

Best of luck to you!

 

Post a comment













Remember personal info?






Creative Commons License Unless otherwise expressly stated, all original material of whatever nature created by Rob Grzywinski and included in this weblog and any related pages, including the weblog's archives, is licensed under a Creative Commons License.