Wagner–Fischer algorithm
Table of Contents
A dynamic programming algorithm that computes string distance; or, more generally, an edit script that transforms one string into another.
1. in Haskell
1.1. distance
{-# LANGUAGE MultiWayIf #-} import Control.Arrow import Control.Monad.State import Data.Array import Data.Foldable import Data.List import Data.List.NonEmpty qualified as NonEmpty import Data.Maybe import Data.Ord import Data.Traversable import Data.Tuple import Test.QuickCheck
dist :: Eq a => [a] -> [a] -> Int dist s t = ds!(m, n) where s' = listArray (1, m) s m = length s t' = listArray (1, n) t n = length t ds = listArray bounds [ if | j == 0 -> i | i == 0 -> j | s'!i == t'!j -> ds!(i-1, j-1) | otherwise -> minimum [ ds!(i-1, j) , ds!(i, j-1) , ds!(i-1, j-1) ] + 1 | (i, j) <- range bounds ] bounds = ((0, 0), (m, n))
This is a really nice example of the sort of thing that you can only get in a lazy language, or at least a language with some facility for laziness.
Let's write some tests to convince ourselves that this does what we think it does. First, string distance is a metric, so it ought to satisfy the metric space axioms.
prop_dist_refl xs = dist xs xs == 0 prop_dist_neq_positive xs ys = xs /= ys ==> dist xs ys > 0 prop_dist_sym xs ys = dist xs ys == dist ys xs prop_dist_triangle xs ys zs = dist xs zs <= dist xs ys + dist ys zs
What else?
Let's consider a simpler kind of string comparison. Given a character \( c \), we'll say that the \( c \)-count distance between two strings \( \sigma, \tau \) is the (unsigned) difference between the number of occurrences of \( c \) in \( \sigma \) and the number of occurrences of \( c \) in \( \tau \). Then we'll say that the count distance between two strings \( \sigma, \tau \) is the sum of the \( c \)-count difference between \( \sigma \) and \( \tau \) for all characters \( c \).
We can pretty easily compute the count distance between strings over an ordered set of characters.
countDist :: Ord a => [a] -> [a] -> Int countDist xs ys = go (multisetOf xs) (multisetOf ys) where go [] ys = sum (snd <$> ys) go xs [] = sum (snd <$> xs) go xs@((x, m) : xs') ys@((y, n) : ys') = case compare x y of LT -> m + go xs' ys EQ -> abs (n - m) + go xs' ys' GT -> n + go xs ys' multisetOf xs = [(NonEmpty.head g, length g) | g <- NonEmpty.group $ sort $ xs]
We could also relax the Ord
constraint to an Eq
constraint, but the resulting algorithm would be more complex and less efficient.
Incidentally, count distance is also a metric.
prop_countDist_refl xs = countDist xs xs == 0 prop_countDist_neq_positive xs ys = xs /= ys ==> countDist xs ys > 0 prop_countDist_sym xs ys = countDist xs ys == countDist ys xs prop_countDist_triangle xs ys zs = countDist xs zs <= countDist xs ys + countDist ys zs
Now, how does count distance relate to edit distance?
Although we have proposed count distance as a metric on strings, it is really a metric on multisets, which can be viewed as equivalence classes of strings. In other words, count distance is invariant on operations that preserve multiset equivalence. The fundamental operations that do not preserve multiset equivalence are:
- adding a character;
- removing a character.
These operations are the basis for the appropriate notion of difference for count distance. The appropriate notion of difference for edit distance is an edit script, which allows these operations, but also allows the operation of replacing a character. Replacement is like a combination of a deletion and an insertion, but “more efficient”; i.e., So we can view count distance as an inefficient kind of edit distance. Therefore, the count difference between two strings should be an upper bound on their edit distance.
prop_length_bound xs ys = dist xs ys <= countDist xs ys
1.2. difference
The difference between two strings is an edit script, which we model as a list of edits.
type Script a = [Edit a]
We will say what an edit is momentarily. For now, it is enough to think of an edit as an action that transforms a
appEdit :: Edit a -> [a] -> (Maybe a, [a])
To apply a script to a string, we simply apply all of its edits in sequence and collect the outputs into a list.
appScript :: Script a -> [a] -> [a] appScript s xs = catMaybes $ flip evalState xs $ for s $ state . appEdit
Now, what is an edit? We define it to be one of the following:
- do nothing;
- delete an element;
- replace an element with a specified new character;
- insert a specified new character.
data Edit a = Pass | Delete | Replace a | Insert a
It might seem odd that “do nothing” is considered an edit, but it simplifies the implementation of appScript
. Without a “do nothing” edit, each edit would have to carry around the position at which it was meant to be applied, and appScript
would have to figure out how to apply an edit at the correct position.
The implementation of appEdit
is straightforward.
appEdit e xs = case e of Pass -> (Just (head xs), tail xs) Delete -> (Nothing, tail xs) Replace x -> (Just x, tail xs) Insert x -> (Just x, xs)
Edit scripts come equipped with a notion of “size”. The size of a script is just the sum of the sizes of its edits.
editSize :: Edit a -> Int scriptSize :: Script a -> Int scriptSize = sum . map editSize
And the size of an edit is 0
for Pass
and 1
for everything else.
editSize Pass = 0 editSize _ = 1
diff :: Eq a => [a] -> [a] -> (Int, Script a) diff s t = reverse <$> ds!(m, n) where s' = listArray (1, m) s m = length s t' = listArray (1, n) t n = length t ds = listArray ((0, 0), (m, n)) [ if | i == 0 && j == 0 -> (0, []) | j == 0 -> (const i *** (Delete :)) (ds!(i-1, 0)) | i == 0 -> (const j *** (Insert (t'!j) :)) (ds!(0, j-1)) | s'!i == t'!j -> (Pass :) <$> ds!(i-1, j-1) | otherwise -> first (+1) $ minimumBy (comparing fst) [ (Delete :) <$> ds!(i-1, j), (Insert (t'!j) :) <$> ds!(i, j-1), (Replace (t'!j) :) <$> ds!(i-1, j-1) ] | (i, j) <- range ((0, 0), (m, n)) ]