Edit distance and edit steps

Recently I was working on a little app for myself to help me keep track of some information on my computer. The app is powered by an NSMetadataQuery (essentially a Spotlight search) that reports back everything it finds.

One of the interesting things about NSMetadataQuery is that after it has done its initial “gathering” of the results, all further updates are reported as a single array of results. Did something get added to the results? Here’s a new array of the state of all results now. Did something get removed? Here’s another array.

This behavior is nice if you’re just throwing the information into a tableView and then calling reloadData(), but it would be really nice if you could compare the two arrays of results (the “before” and “after” arrays) and then perform the necessary insert, remove, move, and reload calls. Not only would this be more efficient (since you don’t have to re-build every cell in the table), but you also get to have control over the animations.

Some naïve approaches would be start going through the two arrays and find out what has been inserted, what has been deleted, and so on. And while this approach will work, it will also result in a whole bunch of extraneous calls. An item may be in both arrays, but you might think it has moved because it’s at a different position in the new array, so you generate a “move” call. However, with this approach it’s really hard to detect that it has only moved because something was inserted before it, which means the move call is ultimately redundant.

We can do better.

Edit Distances

There’s a pretty well-known algorithm out there for determining how similar (or dissimilar) two strings are, called the Levenshtein algorithm. In a nutshell, it tells you how many steps you need to take in order to transform one word into another word. So if you have the word “sword”, it takes two steps to turn it into “words”: one step to delete the “s” from the front, and another step to append the “s” to the end.

Implementing the Levenshtein algorithm is fairly straight-forward for anyone who can readily translate pseudocode into your language of choice. The algorithm is on the Wikipedia page.

Looking at the implementation of the Levenshtein algorithm reveals a way that we can potentially use this: A “string” (from the algorithm’s point-of-view) is nothing more than an indexable collection of characters, and the only processing that happens on the characters themselves is an equality check.

In other words, the Levenshtein algorithm can easily be generalized to work on any CollectionType of Equatables. Suddenly, it looks like just the thing we need to implement more-efficient array diffing.

One downside of the Levenshtein algorithm is that it only returns an Int. It only tells us how many steps we would need, but not what the actual steps are. For that, we’ll turn to a specific implementation of the Levenshtein algorithm, called the Wagner-Fischer algorithm.

Fundamentally, W-F is exactly the same as Levenshtein. The only difference is that instead of being recursive, it uses a doubly-nested for loop, and memoizes previous computations into an n by m matrix. Levenshtein simply re-computes the information it needs. Put in CS terms, we sacrifice the storage efficiency of L (which has no storage/caching) for the computational efficiency of W-F (cache information → fewer computations). I prefer the W-F approach. Memory is pretty cheap, and unless we’re dealing with absurdly large collections, this probably won’t be an issue (but it’s something to be aware of).

However, again we’re stuck with the obvious limitation that we’re still only computing the edit distance (the number of edit steps), and not the steps themselves. However, there’s a key realization here: the number returned by W-F is the number of steps required to transform one collection into another. If we had the actual steps themselves, then steps.count would necessarily be equal to what W-F would typically return. So if we could modify W-F to compute the list of steps instead of the count of steps, then the final array’s .count would still be equal to the result of the original algorithm!

In other words, we want to have a matrix that holds arrays of steps, instead of just integers, and we’re guaranteed to get the same result.

Picking Apart the Algorithm

At first glance, it can be a bit difficult to conceptualize how W-F and L actually work. To understand it, let’s take it apart.

First, W-F and L work on the premise of three basic character transformations:

  1. deletion
  2. insertion
  3. substitution

Each of these operations can be easily understood:

  • transforming “A” into “” (an empty string) requires 1 deletion
  • transforming “AB” into “” requires 2 deletions
  • transforming a string of n characters into the empty string requires n deletions

We can visualize these rules by laying them out in tables:

String Steps to empty string
“” 0
A 1
B 2
C 3
D 4
  • transforming an empty string into “a” requires 1 insertion
  • transforming an empty string into “ab” requires 2 insertions
  • transforming an empty string into a string of n characters requires n insertions

This can also be visualized in a table:

String “” a b c d
Steps from empty string 0 1 2 3 4
  • transforming “A” into “a” requires 1 substitution
  • transforming “AB” into “ab” requires 2 substitutions
  • etc.

This is where things get interesting. We visualize this by combining the previous two tables into a matrix:

“” a b c d

and filling in what we know so far:

“” a b c d
“” 0 1 2 3 4
A 1
B 2
C 3
D 4

Here, let’s stop a minute and examine the matrix to look for some patterns. Running vertically down the left is the word we are starting with. Running horizontally across the top is the word we’re going to. Moving vertically from A to "" is a deletion (because we have to delete characters to get back to the empty string). So we can infer that whenever we move vertically across the matrix, we are deleting characters. Similarly, moving horizontally is an insertion. The only other direction to move is diagonally down from the top left, which is a substitution.

Armed with this, we can now deduce an algorithm. When figuring out what goes in position (i, j) in the matrix, we need to check three different values: the one immediately to the left, the one immediately above, and the one diagonally to the upper left. (We’re filling the matrix from the top-left corner towards the bottom-right corner, so those are the only positions that will be filled)

If the old value and the new value are the same, then we aren’t going to be changing anything, and we want to take the value from the diagonal position ((i-1, j-1)) and use that for (i, j).

If the values are different, then we want to minimize the changes to get to this position. So we’ll find the smallest value from the three possible values; that represents the “shortest path” to get to the previous position. Once we’ve found that, we add “1” to that value to arrive at the current position.

Then rinse and repeat until the matrix is filled. The value in the final bottom-right position is the minimal edit distance.

Edit Steps

That’s how the algorithm works, but like I said, we don’t want the edit distance. We want the edit steps.

Let’s go back to that matrix:

“” a b c d
“” 0 1 2 3 4
A 1
B 2
C 3
D 4

Do you remember what those numbers mean? This will help:

“” a b c d
“” 0 1 insertion 2 insertions 3 insertions 4 insertions
A 1 deletion
B 2 deletions
C 3 deletions
D 4 deletions

Remember: moving horizontally is an insertion. Moving vertically is a deletion. Moving diagonally is a substitution. So what goes in that (A, a) position? First, look at the three squares around it and find the one with the smallest number of edits. It’s the diagonal one, right? So since we’re going to move diagonally, we’re going to append a substitution:

“” a b c d
“” 0 1 insertion 2 insertions 3 insertions 4 insertions
A 1 deletion 1 substitution
B 2 deletions
C 3 deletions
D 4 deletions

We should be able to fill out the rest of the table now (d = deletion, i = insertion, s = substitution):

“” a b c d
“” 0 1i 2i 3i 4i
A 1d 1s 1i, 1s 2i, 1s 3i, 1s
B 2d 1d, 1s 2s 1i, 2s 2i, 2s
C 3d 2d, 1s 1d, 2s 3s 1i, 3s
D 4d 3d, 1s 2d, 2s 1d, 3s 4s

In the final position, we see that the optimal edit steps are 4 substitutions. And if we were also keeping track of what the substitutions actually are, then we could very nicely print out a list of steps like this:

ABCD -> abcd:
    at index 0 replace with a
    at index 1 replace with b
    at index 2 replace with c
    at index 3 replace with d

So, that’s the basics of the algorithm. But we’re not done yet. When it comes to reloading tableViews, we also care about moves.

Fortunately, this is trivial to add to this algorithm. When the algorithm finishes, we know that we’ve already got the fewest number of steps needed to get from one string to the other. So all we need to do is look through the array of steps to see if we ever delete and insert the same value. If we do, remove both steps and replace it with a single move.

That allows us to take something like:

abcdefgh -> agbcdefh:
    insert g at index 1
    delete g at index 6

And turn it into:

abcdefgh -> agbcdefh:
    move g from index 6 to 1

We’re reducing our already-minimal steps, so we know that this is just as efficient as the earlier result.

So, that’s the Wagner-Fischer algorithm, modified to report actual edit steps and not just the edit distance. And like I said earlier, the only comparison you’re doing on these two collections is checking for equality, so it is a very simple process to genericize this to work on any two CollectionTypes of Equatables.

The Next Step

Write the code! I’ve laid out everything you need, and the Wikipedia articles also have very helpful pseudocode.

The Next Next Step

Here’s a fun follow-up to try on your own:

Sometimes, the values you’re modeling have properties that can change over time, but that do not fundamentally affect the equality of the object. A naïve example would be a Tweet model value having an Int property for to represent the number of retweets, and a second Int to represent the number of favorites likes.1 The Int values can change, but the tweet is fundamentally the same tweet.

How would you modify the algorithm to still return the same number of insertions, deletions, and substitutions (and maybe moves), but also tell you which values just need to be reloaded?

1 Yes, I know you wouldn’t actually want to write an app this way, but it’s an easy example to get the point across.