Edit Quotient – Digital Digressions by Stuart Sierra

Remember Levenshtein Distance: the number of changes to turn one string into another? Here’s a naÍ¯ve implementation in Clojure:

(defn levenshtein [s t]
  (let [m (count s)
        n (count t)
        d (make-array Integer/TYPE (inc m) (inc n))]
    (dotimes [i m] (aset-int d i 0 i))
    (dotimes [j n] (aset-int d 0 j j))
    (doseq [j (range 1 (inc n))]
      (doseq [i (range 1 (inc m))]
        (if (= (.charAt s (dec i)) (.charAt t (dec j)))
          (aset-int d i j (aget d (dec i) (dec j)))
          (aset-int d i j (min (inc (aget d (dec i) j))
                               (inc (aget d i (dec j)))
                               (inc (aget d (dec i) (dec j))))))))
    (aget d m n)))

(assert (= 0 (levenshtein "" "")))
(assert (= 3 (levenshtein "foo" "foobar")))
(assert (= 3 (levenstein "kitten" "sitting")))
(assert (= 3 (levenstein "Saturday" "Sunday")))

I ripped that straight off Wikipedia’s pseudocode. Not functional at all, and probably not particularly efficient either.

But I’ve never seen this: the number of changes to turn one string into another, scaled by the lengths of the strings. Call it the “edit quotient,” computed as the Levenshtein Distance divided by the mean of the lengths of the two strings. The edit quotient of two empty strings is zero.

(defn edit-quotient [s t]
  (let [sum (+ (count s) (count t))]
    (if (pos? sum)
      (/ (levenstein s t)
         (/ sum 2))
      0)))

(assert (= 0 (edit-quotient "" "")))
(assert (= 0 (edit-quotient "foof" "")))
(assert (= 1 (edit-quotient "foo" "bar")))
(assert (= 6/7 (edit-quotient "foof" "bar")))
(assert (= 2/3 (edit-quotient "foo" "faa")))
(assert (= 2/7 (edit-quotient "foof" "foo")))
(assert (= 1/2 (edit-quotient "foobar" "faabir")))

This has some interesting properties. The edit quotient is zero if the two strings are completely identical, one if they are completely different. Values in between zero and one give some idea of how different the strings are.

3 Replies to “Edit Quotient”

Aria Haghighi says:

January 9, 2011 at 11:27 pm

There are lots of normalized edit distance variants like this that date back to at least over a decade and a half. The earliest I know if this: http://www.computer.org/portal/web/csdl/doi/10.1109/34.232078
Remco van 't Veer says:

January 10, 2011 at 3:31 am

I tried to get to a more functional version from the pseudo code on wikipedia some days ago:

https://gist.github.com/761018

Probably a lot slower but a nice exercise.
Stuart says:

January 10, 2011 at 8:42 am

Aria- Thanks. I knew this had to exist somewhere.

Comments are closed.