Remember Levenshtein Distance: the number of changes to turn one string into another? Here’s a naͯve implementation in Clojure:
(defn levenshtein [s t] (let [m (count s) n (count t) d (make-array Integer/TYPE (inc m) (inc n))] (dotimes [i m] (aset-int d i 0 i)) (dotimes [j n] (aset-int d 0 j j)) (doseq [j (range 1 (inc n))] (doseq [i (range 1 (inc m))] (if (= (.charAt s (dec i)) (.charAt t (dec j))) (aset-int d i j (aget d (dec i) (dec j))) (aset-int d i j (min (inc (aget d (dec i) j)) (inc (aget d i (dec j))) (inc (aget d (dec i) (dec j)))))))) (aget d m n))) (assert (= 0 (levenshtein "" ""))) (assert (= 3 (levenshtein "foo" "foobar"))) (assert (= 3 (levenstein "kitten" "sitting"))) (assert (= 3 (levenstein "Saturday" "Sunday")))
I ripped that straight off Wikipedia’s pseudocode. Not functional at all, and probably not particularly efficient either.
But I’ve never seen this: the number of changes to turn one string into another, scaled by the lengths of the strings. Call it the “edit quotient,” computed as the Levenshtein Distance divided by the mean of the lengths of the two strings. The edit quotient of two empty strings is zero.
(defn edit-quotient [s t] (let [sum (+ (count s) (count t))] (if (pos? sum) (/ (levenstein s t) (/ sum 2)) 0))) (assert (= 0 (edit-quotient "" ""))) (assert (= 0 (edit-quotient "foof" ""))) (assert (= 1 (edit-quotient "foo" "bar"))) (assert (= 6/7 (edit-quotient "foof" "bar"))) (assert (= 2/3 (edit-quotient "foo" "faa"))) (assert (= 2/7 (edit-quotient "foof" "foo"))) (assert (= 1/2 (edit-quotient "foobar" "faabir")))
This has some interesting properties. The edit quotient is zero if the two strings are completely identical, one if they are completely different. Values in between zero and one give some idea of how different the strings are.
There are lots of normalized edit distance variants like this that date back to at least over a decade and a half. The earliest I know if this: http://www.computer.org/portal/web/csdl/doi/10.1109/34.232078
I tried to get to a more functional version from the pseudo code on wikipedia some days ago:
https://gist.github.com/761018
Probably a lot slower but a nice exercise.
Aria- Thanks. I knew this had to exist somewhere.