Wagner–Fischer algorithm

1. in Haskell
- 1.1. distance
- 1.2. difference

A dynamic programming algorithm that computes string distance; or, more generally, an edit script that transforms one string into another.

1. in Haskell

1.1. distance

{-# LANGUAGE MultiWayIf #-}

import Control.Arrow
import Control.Monad.State
import Data.Array
import Data.Foldable
import Data.List
import Data.List.NonEmpty qualified as NonEmpty
import Data.Maybe
import Data.Ord
import Data.Traversable
import Data.Tuple
import Test.QuickCheck

dist :: Eq a => [a] -> [a] -> Int
dist s t = ds!(m, n) where
  s' = listArray (1, m) s
  m = length s
  t' = listArray (1, n) t
  n = length t

  ds = listArray bounds [
    if | j == 0 -> i
       | i == 0 -> j
       | s'!i == t'!j -> ds!(i-1, j-1)
       | otherwise -> minimum [ ds!(i-1, j)
                              , ds!(i, j-1)
                              , ds!(i-1, j-1)
                              ] + 1
      | (i, j) <- range bounds
    ]
  bounds = ((0, 0), (m, n))

This is a really nice example of the sort of thing that you can only get in a lazy language, or at least a language with some facility for laziness.

Let's write some tests to convince ourselves that this does what we think it does. First, string distance is a metric, so it ought to satisfy the metric space axioms.

prop_dist_refl xs =
  dist xs xs == 0

prop_dist_neq_positive xs ys =
  xs /= ys ==> dist xs ys > 0

prop_dist_sym xs ys =
  dist xs ys == dist ys xs

prop_dist_triangle xs ys zs =
  dist xs zs <= dist xs ys + dist ys zs

What else?

Let's consider a simpler kind of string comparison. Given a character \( c \), we'll say that the \( c \)-count distance between two strings \( \sigma, \tau \) is the (unsigned) difference between the number of occurrences of \( c \) in \( \sigma \) and the number of occurrences of \( c \) in \( \tau \). Then we'll say that the count distance between two strings \( \sigma, \tau \) is the sum of the \( c \)-count difference between \( \sigma \) and \( \tau \) for all characters \( c \).

We can pretty easily compute the count distance between strings over an ordered set of characters.

countDist :: Ord a => [a] -> [a] -> Int
countDist xs ys = go (multisetOf xs) (multisetOf ys) where
  go [] ys = sum (snd <$> ys)
  go xs [] = sum (snd <$> xs)
  go xs@((x, m) : xs') ys@((y, n) : ys') =
    case compare x y of
      LT -> m + go xs' ys
      EQ -> abs (n - m) + go xs' ys'
      GT -> n + go xs ys'
  multisetOf xs = [(NonEmpty.head g, length g) | g <- NonEmpty.group $ sort $ xs]

We could also relax the Ord constraint to an Eq constraint, but the resulting algorithm would be more complex and less efficient.

Incidentally, count distance is also a metric.

prop_countDist_refl xs =
  countDist xs xs == 0

prop_countDist_neq_positive xs ys =
  xs /= ys ==> countDist xs ys > 0

prop_countDist_sym xs ys =
  countDist xs ys == countDist ys xs

prop_countDist_triangle xs ys zs =
  countDist xs zs <= countDist xs ys + countDist ys zs

Now, how does count distance relate to edit distance?

Although we have proposed count distance as a metric on strings, it is really a metric on multisets, which can be viewed as equivalence classes of strings. In other words, count distance is invariant on operations that preserve multiset equivalence. The fundamental operations that do not preserve multiset equivalence are:

adding a character;
removing a character.

These operations are the basis for the appropriate notion of difference for count distance. The appropriate notion of difference for edit distance is an edit script, which allows these operations, but also allows the operation of replacing a character. Replacement is like a combination of a deletion and an insertion, but “more efficient”; i.e., So we can view count distance as an inefficient kind of edit distance. Therefore, the count difference between two strings should be an upper bound on their edit distance.

prop_length_bound xs ys =
  dist xs ys <= countDist xs ys

1.2. difference

The difference between two strings is an edit script, which we model as a list of edits.

type Script a = [Edit a]

We will say what an edit is momentarily. For now, it is enough to think of an edit as an action that transforms a

appEdit :: Edit a -> [a] -> (Maybe a, [a])

To apply a script to a string, we simply apply all of its edits in sequence and collect the outputs into a list.

appScript :: Script a -> [a] -> [a]
appScript s xs = catMaybes $ flip evalState xs $ for s $ state . appEdit

Now, what is an edit? We define it to be one of the following:

do nothing;
delete an element;
replace an element with a specified new character;
insert a specified new character.

data Edit a = Pass | Delete | Replace a | Insert a

It might seem odd that “do nothing” is considered an edit, but it simplifies the implementation of appScript. Without a “do nothing” edit, each edit would have to carry around the position at which it was meant to be applied, and appScript would have to figure out how to apply an edit at the correct position.

The implementation of appEdit is straightforward.

appEdit e xs = case e of
  Pass -> (Just (head xs), tail xs)
  Delete -> (Nothing, tail xs)
  Replace x -> (Just x, tail xs)
  Insert x -> (Just x, xs)

Edit scripts come equipped with a notion of “size”. The size of a script is just the sum of the sizes of its edits.

editSize :: Edit a -> Int

scriptSize :: Script a -> Int
scriptSize = sum . map editSize

And the size of an edit is 0 for Pass and 1 for everything else.

editSize Pass = 0
editSize _ = 1

diff :: Eq a => [a] -> [a] -> (Int, Script a)
diff s t = reverse <$> ds!(m, n) where
  s' = listArray (1, m) s
  m = length s
  t' = listArray (1, n) t
  n = length t

  ds = listArray ((0, 0), (m, n)) [
    if | i == 0 && j == 0 -> (0, [])
       | j == 0 -> (const i *** (Delete :)) (ds!(i-1, 0))
       | i == 0 -> (const j *** (Insert (t'!j) :)) (ds!(0, j-1))
       | s'!i == t'!j -> (Pass :) <$> ds!(i-1, j-1)
       | otherwise -> first (+1) $ minimumBy (comparing fst) [
           (Delete :) <$> ds!(i-1, j),
           (Insert (t'!j) :) <$> ds!(i, j-1),
           (Replace (t'!j) :) <$> ds!(i-1, j-1)
         ]
      | (i, j) <- range ((0, 0), (m, n))
    ]

Wagner–Fischer algorithm

Table of Contents

1. in Haskell

1.1. distance

1.2. difference