8.2

VI Advanced Topics

28 Algorithms That Exploit State

    28.1 Disjoint Sets Redux

      28.1.1 Optimizations

      28.1.2 Analysis

    28.2 Set Membership by Hashing Redux

      28.2.1 Improving Access Time

      28.2.2 Better Hashing

      28.2.3 Bloom Filters

    28.3 Avoiding Recomputation by Remembering Answers

      28.3.1 An Interesting Numeric Sequence

        28.3.1.1 Using State to Remember Past Answers

        28.3.1.2 From a Tree of Computation to a DAG

        28.3.1.3 The Complexity of Numbers

        28.3.1.4 Abstracting Memoization

      28.3.2 Edit-Distance for Spelling Correction

      28.3.3 Nature as a Fat-Fingered Typist

      28.3.4 Dynamic Programming

        28.3.4.1 Catalan Numbers with Dynamic Programming

        28.3.4.2 Levenshtein Distance and Dynamic Programming

      28.3.5 Contrasting Memoization and Dynamic Programming

28.1 Disjoint Sets Redux

Here’s how we can use this to implement union-find afresh. We will try to keep things as similar to the previous version [Checking Component Connectedness] as possible, to enhance comparison.

First, we have to update the definition of an element, making the parent field be mutable:

data Element:
  | elt(val, ref parent :: Option<Element>)
end

To determine whether two elements are in the same set, we will still rely on fynd. However, as we will soon see, fynd no longer needs to be given the entire set of elements. Because the only reason is-in-same-set consumed that set was to pass it on to fynd, we can remove it from here. Nothing else changes:

fun is-in-same-set(e1 :: Element, e2 :: Element) -> Boolean:
  s1 = fynd(e1)
  s2 = fynd(e2)
  identical(s1, s2)
end

Updating is now the crucial difference: we use mutation to change the value of the parent:

fun update-set-with(child :: Element, parent :: Element):
  child!{parent: some(parent)}
end

In parent: some(parent), the first parent is the name of the field, while the second one is the parameter name. In addition, we must use some to satisfy the option type. Naturally, it is not none because the entire point of this mutation is to change the parent to be the other element, irrespective of what was there before.

Given this definition, union also stays largely unchanged, other than the change to the return type. Previously, it needed to return the updated set of elements; now, because the update is performed by mutation, there is no longer any need to return anything:

fun union(e1 :: Element, e2 :: Element):
  s1 = fynd(e1)
  s2 = fynd(e2)
  if identical(s1, s2):
    s1
  else:
    update-set-with(s1, s2)
  end
end

Finally, fynd. Its implementation is now remarkably simple. There is no longer any need to search through the set. Previously, we had to search because after union operations have occurred, the parent reference might have no longer been valid. Now, any such changes are automatically reflected by mutation. Hence:

fun fynd(e :: Element) -> Element:
  cases (Option) e!parent:
    | none => e
    | some(p) => fynd(p)
  end
end

28.1.1 Optimizations

Look again at fynd. In the some case, the element bound to e is not the set name; that is obtained by recursively traversing parent references. As this value returns, however, we don’t do anything to reflect this new knowledge! Instead, the next time we try to find the parent of this element, we’re going to perform this same recursive traversal all over again.

Using mutation helps address this problem. The idea is as simple as can be: compute the value of the parent, and update it.

fun fynd(e :: Element) -> Element:
  cases (Option) e!parent block:
    | none => e
    | some(p) =>
      new-parent = fynd(p)
      e!{parent: some(new-parent)}
      new-parent
  end
end

Note that this update will apply to every element in the recursive chain to find the set name. Therefore, applying fynd to any of those elements the next time around will benefit from this update. This idea is called path compression.

There is one more interesting idea we can apply. This is to maintain a rank of each element, which is roughly the depth of the tree of elements for which that element is their set name. When we union two elements, we then make the one with larger rank the parent of the one with the smaller rank. This has the effect of avoiding growing very tall paths to set name elements, instead tending towards “bushy” trees. This too reduces the number of parents that must be traversed to find the representative.

28.1.2 Analysis

This optimized union-find data structure has a remarkble analysis. In the worst case, of course, we must traverse the entire chain of parents to find the name element, which takes time proportional to the number of elements in the set. However, once we apply the above optimizations, we never need to traverse that same chain again! In particular, if we conduct an amortized analysis over a sequence of set equality tests after a collection of union operations, we find that the cost for subsequent checks is very small—indeed, about as small a function can get without being constant. The actual analysis is quite sophisticated; it is also one of the most remarkable algorithm analyses in all of computer science.

28.2 Set Membership by Hashing Redux

We have already seen solutions to set membership. First we saw how to represent sets as lists [Representing Sets by Lists], then as (balanced) binary trees [A Fine Balance: Tree Surgery].Don’t confuse this with union-find, which is a different kind of problem on sets [Disjoint Sets Redux]. With this we were able to reduce insertion and membership to logarithmic time in the number of elements. Along the way, we also learned that the essence of using these representations was to reduce any datatype to a comparable, ordered element—for efficiency, usually a number [Converting Values to Ordered Values]—which we called hashing.

Let us now ask whether we can use these numbers in any other way. Suppose our set has only five elements, which map densely to the values between 0 and 4. We can then have a five element list of boolean values, where the boolean at each index of the list indicates whether the element corresponding to that position is in the set or not. Both membership and insertion, however, require traversing potentially the entire list, giving us solutions linear in the number of elements.

That’s not all. Unless we can be certain that there will be only five elements, we can’t be sure to bound the size of the representation. Also, we haven’t yet shown how to actually hash in a way that makes the representation dense; barring that, our space consumption gets much worse, in turn affecting time.

There is, actually, a relatively simple solution to the problem of reducing numbers densely to a range: given the hash, we apply modular arithmetic. That is, if we want to use a list of five elements to represent the set, we simply compute the hash’s modulo five. This gives us an easy solution to that problem.

Except, of course, not quite: two different hashes could easily have the same modulus. That is, suppose we need to record that the set contains the (hash) value 5; the resulting list would be

[list: true, false, false, false, false]

Now suppose we want to ask whether the value 15 is in the set; we cannot tell from this representation whether it’s in the set or not, because we can’t tell whether the true represents 5, 15, 25, or any other value whose modulus 5 is 0. Therefore, we have to record the actual elements in the set; for type-consistency, we should be using an Option:

[list: some(5), none, none, none, none]

Now we can tell that 5 is in the set while 4 is not. However, this now makes it impossible to have both 5 and 10 in the set; therefore, our real representation needs to be a list at each position:

[list: [list: 5], empty, empty, empty, empty]

If we also add 10 to the set, we get:

[list: [list: 5, 10], empty, empty, empty, empty]

and now we can tell that both 5 and 10 are in the set, but 15 is not. These sub-lists are known as buckets.

Good; now we have another way of representing sets so we can check for membership. However, in the worst case one of those lists is going to contain all elements in the set, and we may have to traverse the entire list to find an element in it, which means membership testing will take time linear in the number of elements. Insertion, in turn, takes time proportional to the size of the modulus because we may have to traverse the entire outer list to get to the right sub-list.

Can we improve on this?

28.2.1 Improving Access Time

Given that we currently have no way of ensuring we won’t get hash collisions, for now we’re stuck with a list of elements at each position that could be the size of the set we are trying to represent. Therefore, we can’t get around that (yet). But, we’re currently paying time in the size of the outer list just to insert an element, and surely we can do better than that!

We can, but it requires a different data structure: the array.There are other data structures that will also do better, but the one we’re about to see is important and widely used. You can look up arrays in the Pyret documentation. The key characteristics of an array are:
  • Accessing the nth element of an array takes constant, not linear, time in n. This is sometimes known as random-access, because it takes the same time to access any random element, as opposed to just a known element.

  • Arrays are updated by mutation. Thus, a change to an array is seen by all references to the array.

The former property warrants some discussion: how can an array provide random access whereas a list requires time linear in the index of the element we’re accessing? This is because of a trade-off: a list can be extended indefinitely as the program extends, but an array cannot. An array must declare its size up front, and cannot grow without copying all the elements into a larger array. Therefore, we should only use arrays when we have a clearly identifiable upper-bound on their size (and that bound is not too large, or else we may not even be able to find that much contiguous space in the system). But the problem we’re working on has exactly this characteristic.

So let’s try defining sets afresh. We start with an array of a fixed size, with each element an empty list:

SIZE = 19
v = array-of(empty, SIZE)

We need to use modular arithmetic to find the right bucket:

fun find-bucket(n): num-modulo(n, SIZE) end

With this, we can determine whether an element is in the set:

fun get-bucket(n): array-get-now(v, find-bucket(n)) end
fun is-in(n): get-bucket(n).member(n) end

To actually add an element to the set, we put it in the list associated with the appropriate bucket:

fun set-bucket(n, anew): array-set-now(v, find-bucket(n), anew) end
fun put(n):
  when not(is-in(n)):
    set-bucket(n, link(n, get-bucket(n)))
  end
end

Checking whether the element is already in the bucket is an important part of our complexity argument because we have implicitly assumed there won’t be duplicate elements in buckets.

Exercise

What impact do duplicate elements have on the complexity of operations?

The data structure we have defined above is known as a hash table (which is a slightly confusing name, because it isn’t really a table of hashes, but this is the name used conventionally in computer science).

28.2.2 Better Hashing

Using arrays therefore appears to address one issue: insertion. Finding the relevant bucket takes constant time, linking the new element takes constant time, and so the entire operation takes constant time...except, we have to also check whether the element is already in the bucket, to avoid storing duplicates. We have gotten rid of the traversal through the outer list representing the set, but the member operation on the inner list remains unchanged. In principle it won’t, but in practice we can make it much better.

Note that collisions are virtually inevitable. If we have uniformly distributed data, then collisions show up sooner than we might expect.This follows from the reasoning behind what is known as the birthday problem, commonly presented as how many people need to be in a room before the likelihood that two of them share a birthday exceeds some percentage. For the likelihood to exceed half we need just 23 people! Therefore, it is wise to prepare for the possibility of collisions.

The key is to know something about the distribution of hash values. For instance, if we knew our hash values are all multiples of 10, then using a table size of 10 would be a terrible idea (because all elements would hash to the same bucket, turning our hash table into a list). In practice, it is common to use uncommon prime numbers as the table size, since a random value is unlikely to have it as a divisor. This does not yield a theoretical improvement (unless you can make certain assumptions about the input, or work through the math very carefully), but it works well in practice. In particular, since the typical hashing function uses memory addresses for objects on the heap, and on most systems these addresses are multiples of 4, using a prime like 31 is often a fairly good bet.

28.2.3 Bloom Filters

Another way to improve the space and time complexity is to relax the properties we expect of the operations. Right now, set membership gives perfect answers, in that it answers true exactly when the element being checked was previously inserted into the set. But suppose we’re in a setting where we can accept a more relaxed notion of correctness, where membership tests can “lie” slightly in one direction or the other (but not both, because that makes the representation almost useless). Specifically, let’s say that “no means no” (i.e., if the set representation says the element isn’t present, it really isn’t) but “yes sometimes means no” (i.e., if the set representation says an element is present, sometimes it might not be). In short, if the set says the element isn’t in it, this should be guaranteed; but if the set says the element is present, it may not be. In the latter case, we either need some other—more expensive—technique to determine truth, or we might just not care.

Where is such a data structure of use? Suppose we are building a Web site that uses password-based authentication. Because many passwords have been leaked in well-publicized breaches, it is safe to assume that hackers have them and will guess them. As a result, we want to not allow users to select any of these as passwords. We could use a hash-table to reject precisely the known leaked passwords. But for efficiency, we could use this imperfect hash instead. If it says “no”, then we allow the user to use that password. But if it says “yes”, then either they are using a password that has been leaked, or they have an entirely different password that, purely by accident, has the same hash value, but no matter; we can just disallow that password as well.A related use is for filtering out malicious Web sites. The URL shortening system, bitly, uses it for this purpose.

Another example is in updating databases or memory stores. Suppose we have a database of records, which we update frequently. It is often more efficient to maintain a journal of changes: i.e., a list that sequentially records all the changes that have occurred. At some interval (say overnight), the journal is “flushed”, meaning all these changes are applied to the database proper. But that means every read operation has become highly inefficient, because it has to check the entire journal first (for updates) before accessing the database. Again, here we can use this faulty notion of a hash table: if the hash of the record locator says “no”, then the record certainly hasn’t been modified and we go directly to the database; if it says “yes” then we have to check the journal.

We have already seen a simple example implementation of this idea earlier, when we used a single list (or array) of booleans, with modular arithmetic, to represent the set. When the set said 4 was not present, this was absolutely true; but when it said 5 and 10 are both present, only one of these was present. The advantage was a huge saving in space and time: we needed only one bit per bucket, and did not need to search through a list to answer for membership. The downside, of course, was a hugely inaccurate set data structure, and one with correlated failure tied to the modulus.

There is a simple way to improve this solution: instead of having just one array, have several (but a fixed number of them). When an element is added to the set, it is added to each array; when checking for membership, every array is consulted. The set only answers affirmatively to membership if all the arrays do so.

Naturally, using multiple arrays offers absolutely no advantage if the arrays are all the same size: since both insertion and lookup are deterministic, all will yield the same answer. However, there is a simple antidote to this: use different array sizes. In particular, by using array sizes that are relatively prime to one another, we minimize the odds of a clash (only hashes that are the product of all the array sizes will fool the array).

This data structure, called a Bloom Filter, is a probabilistic data structure. Unlike our earlier set data structure, this one is not guaranteed to always give the right answer; but contrary to the space-time tradeoff, we save both space and time by changing the problem slightly to accept incorrect answers. If we know something about the distribution of hash values, and we have some acceptable bound of error, we can design hash table sizes so that with high probability, the Bloom Filter will lie within the acceptable error bounds.

28.3 Avoiding Recomputation by Remembering Answers

We have on several instances already referred to a space-time tradeoff. The most obvious tradeoff is when a computation “remembers” prior results and, instead of recomputing them, looks them up and returns the answers. This is an instance of the tradeoff because it uses space (to remember prior answers) in place of time (recomputing the answer). Let’s see how we can write such computations.

28.3.1 An Interesting Numeric Sequence

Suppose we want to create properly-parenthesized expressions, and ignore all non-parenthetical symbols. How many ways are there of creating parenthesized expressions given a certain number of opening (equivalently, closing) parentheses?

If we have zero opening parentheses, the only expression we can create is the empty expression. If we have one opening parenthesis, the only one we can construct is “()” (there must be a closing parenthesis since we’re interested only in properly-parenthesized expressions). If we have two opening parentheses, we can construct “(())” and “()()”. Given three, we can construct “((()))”, “(())()”, “()(())”, “()()()”, and “(()())”, for a total of five. And so on. Observe that the solutions at each level use all the possible solutions at one level lower, combined in all the possible ways.

There is actually a famous mathematical sequence that corresponds to the number of such expressions, called the Catalan sequence. It has the property of growing quite large very quickly: starting from the modest origins above, the tenth Catalan number (i.e., tenth element of the Catalan sequence) is 16796. A simple recurrence formula gives us the Catalan number, which we can turn into a simple program:

fun catalan(n):
  if n == 0: 1
  else if n > 0:
    for fold(acc from 0, k from range(0, n)):
      acc + (catalan(k) * catalan(n - 1 - k))
    end
  end
end

This function’s tests look as follows—
<catalan-tests> ::=

check:
  catalan(0) is 1
  catalan(1) is 1
  catalan(2) is 2
  catalan(3) is 5
  catalan(4) is 14
  catalan(5) is 42
  catalan(6) is 132
  catalan(7) is 429
  catalan(8) is 1430
  catalan(9) is 4862
  catalan(10) is 16796
  catalan(11) is 58786
end

but beware! When we time the function’s execution, we find that the first few tests run very quickly, but somewhere between a value of 10 and 20depending on your machine and programming language implementation—you ought to see things start to slow down, first a little, then with extreme effect.

Do Now!

Check at what value you start to observe a significant slowdown on your machine. Plot the graph of running time against input size. What does this suggest?

The reason the Catalan computation takes so long is precisely because of what we alluded to earlier: at each level, we depend on computing the Catalan number of all the smaller levels; this computation in turn needs the numbers of all of its smaller levels; and so on down the road.

Exercise

Map the subcomputations of catalan to see why the computation time explodes as it does. What is the worst-case time complexity of this function?

28.3.1.1 Using State to Remember Past Answers

Therefore, this is clearly a case where trading space for time is likely to be of help. How do we do this? We need a notion of memory that records all previous answers and, on subsequent attempts to compute them, checks whether they are already known and, if so, just returns them instead of recomputing them.

Do Now!

What critical assumption is this based on?

Naturally, this assumes that for a given input, the answer will always be the same. As we have seen, functions with state violate this liberally, so typical stateful functions cannot utilize this optimization. Ironically, we will use state to implement this optimization, so we will have a stateful function that always returns the same answer on a given input—and thereby use state in a stateful function to simulate a stateless one. Groovy, dude!

First, then, we need some representation of memory. We can imagine several, but here’s a simple one:

data MemoryCell:
  | mem(in, out)
end

var memory :: List<MemoryCell> = empty

Now how does catalan need to change? We have to first look for whether the value is already in memory; if it is, we return it without any further computation, but if it isn’t, then we compute the result, store it in memory, and then return it:

fun catalan(n :: Number) -> Number:
  answer = find(lam(elt): elt.in == n end, memory)
  cases (Option) answer block:
    | none =>
      result =
        if n == 0: 1
        else if n > 0:
          for fold(acc from 0, k from range(0, n)):
            acc + (catalan(k) * catalan(n - 1 - k))
          end
        end
      memory := link(mem(n, result), memory)
      result
    | some(v) => v.out
  end
end

And that’s it! Now running our previous tests will reveal that the answer computes much quicker, but in addition we can dare to run bigger computations such as catalan(50).

This process, of converting a function into a version that remembers its past answers, is called memoization.

28.3.1.2 From a Tree of Computation to a DAG

What we have subtly done is to convert a tree of computation into a DAG over the same computation, with equivalent calls being reused. Whereas previously each call was generating lots of recursive calls, which induced still more recursive calls, now we are reusing previous recursive calls—i.e., sharing the results computed earlier. This, in effect, points the recursive call to one that had occurred earlier. Thus, the shape of computation converts from a tree to a DAG of calls.

This has an important complexity benefit. Whereas previously we were performing a super-exponential number of calls, now we perform only one call per input and share all previous calls—thereby reducing catalan(n) to take a number of fresh calls proportional to n. Looking up the result of a previous call takes time proportional to the size of memory (because we’ve represented it as a list; better representations would improve on that), but that only contributes another linear multiplicative factor, reducing the overall complexity to quadratic in the size of the input. This is a dramatic reduction in overall complexity. In contrast, other uses of memoization may result in much less dramatic improvements, turning the use of this technique into a true engineering trade-off.

28.3.1.3 The Complexity of Numbers

As we start to run larger computations, however, we may start to notice that our computations are starting to take longer than linear growth. This is because our numbers are growing arbitrarily large—for instance, catalan(100) is 896519947090131496687170070074100632420837521538745909320and computations on numbers can no longer be constant time, contrary to what we said earlier [The Size of the Input]. Indeed, when working on cryptographic problems, the fact that operations on numbers do not take constant time are absolutely critical to fundamental complexity results (and, for instance, the presumed unbreakability of contemporary cryptography).

28.3.1.4 Abstracting Memoization

Now we’ve achieved the desired complexity improvement, but there is still something unsatisfactory about the structure of our revised definition of catalan: the act of memoization is deeply intertwined with the definition of a Catalan number, even though these should be intellectually distinct. Let’s do that next.

In effect, we want to separate our program into two parts. One part defines a general notion of memoization, while the other defines catalan in terms of this general notion.

What does the former mean? We want to encapsulate the idea of “memory” (since we presumably don’t want this stored in a variable that any old part of the program can modify). This should result in a function that takes the input we want to check; if it is found in the memory we return that answer, otherwise we compute the answer, store it, and return it. To compute the answer, we need a function that determines how to do so. Putting together these pieces:

data MemoryCell:
  | mem(in, out)
end

fun memoize-1<T, U>(f :: (T -> U)) -> (T -> U):

  var memory :: List<MemoryCell> = empty

  lam(n):
    answer = find(lam(elt): elt.in == n end, memory)
    cases (Option) answer block:
      | none =>
        result = f(n)
        memory := link(mem(n, result), memory)
        result
      | some(v) => v.out
    end
  end
end

We use the name memoize-1 to indicate that this is a memoizer for single-argument functions. Observe that the code above is virtually identical to what we had before, except where we had the logic of Catalan number computation, we now have the parameter f determining what to do.

With this, we can now define catalan as follows:

rec catalan :: (Number -> Number) =
  memoize-1(
    lam(n):
      if n == 0: 1
      else if n > 0:
        for fold(acc from 0, k from range(0, n)):
          acc + (catalan(k) * catalan(n - 1 - k))
        end
      end
    end)

Note several things about this definition:
  1. We don’t write fun catalan(...): ...; because the procedure bound to catalan is produced by memoize-1.

  2. Note carefully that the recursive calls to catalan have to be to the function bound to the result of memoization, thereby behaving like an object. Failing to refer to this same shared procedure means the recursive calls will not be memoized, thereby losing the benefit of this process.

  3. We need to use rec for reasons we saw earlier [Streams From Functions].

  4. Each invocation of memoize-1 creates a new table of stored results. Therefore the memoization of different functions will each get their own tables rather than sharing tables, which is a bad idea!

Exercise

Why is sharing memoization tables a bad idea? Be concrete.

28.3.2 Edit-Distance for Spelling Correction

Text editors, word processors, mobile phones, and various other devices now routinely implement spelling correction or offer suggestions on (mis-)spellings. How do they do this? Doing so requires two capabilities: computing the distance between words, and finding words that are nearby according to this metric. In this section we will study the first of these questions. (For the purposes of this discussion, we will not dwell on the exact definition of what a “word” is, and just deal with strings instead. A real system would need to focus on this definition in considerable detail.)

Do Now!

Think about how you might define the “distance between two words”. Does it define a metric space?

Exercise

Will the definition we give below define a metric space over the set of words?

Though there may be several legitimate ways to define distances between words, here we care about the distance in the very specific context of spelling mistakes. Given the distance measure, one use might be to compute the distance of a given word from all the words in a dictionary, and offer the closest word (i.e., the one with the least distance) as a proposed correction.Obviously, we can’t compute the distance to every word in a large dictionary on every single entered word. Making this process efficient constitutes the other half of this problem. Briefly, we need to quickly discard most words as unlikely to be close enough, for which a representation such as a bag-of-words (here, a bag of characters) can greatly help. Given such an intended use, we would like at least the following to hold:
  • That the distance from a word to itself be zero.

  • That the distance from a word to any word other than itself be strictly positive. (Otherwise, given a word that is already in the dictionary, the “correction” might be a different dictionary word.)

  • That the distance between two words be symmetric, i.e., it shouldn’t matter in which order we pass arguments.

Exercise

Observe that we have not included the triangle inequality relative to the properties of a metric. Why not? If we don’t need the triangle inequality, does this let us define more interesting distance functions that are not metrics?

Given a pair of words, the assumption is that we meant to type one but actually typed the other. Here, too, there are several possible definitions, but a popular one considers that there are three ways to be fat-fingered:
  1. we left out a character;

  2. we typed a character twice; or,

  3. we typed one character when we meant another.

In particular, we are interested in the fewest edits of these forms that need to be performed to get from one word to the other. For natural reasons, this notion of distance is called the edit distance or, in honor of its creator, the Levenshtein distance.See more on Wikipedia.

There are several variations of this definition possible. For now, we will consider the simplest one, which assumes that each of these errors has equal cost. For certain input devices, we may want to assign different costs to these mistakes; we might also assign different costs depending on what wrong character was typed (two characters adjacent on a keyboard are much more likely to be a legitimate error than two that are far apart). We will return briefly to some of these considerations later [Nature as a Fat-Fingered Typist].

Under this metric, the distance between “kitten” and “sitting” is 3 because we have to replace “k” with “s”, replace “e” with “i”, and insert “g” (or symmetrically, perform the opposite replacements and delete “g”). Here are more examples:
<levenshtein-tests> ::=

check:
  levenshtein(empty, empty) is 0
  levenshtein([list:"x"], [list: "x"]) is 0
  levenshtein([list: "x"], [list: "y"]) is 1
  # one of about 600
  levenshtein(
    [list: "b", "r", "i", "t", "n", "e", "y"],
    [list: "b", "r", "i", "t", "t", "a", "n", "y"])
    is 3
  # http://en.wikipedia.org/wiki/Levenshtein_distance
  levenshtein(
    [list: "k", "i", "t", "t", "e", "n"],
    [list: "s", "i", "t", "t", "i", "n", "g"])
    is 3
  levenshtein(
    [list: "k", "i", "t", "t", "e", "n"],
    [list: "k", "i", "t", "t", "e", "n"])
    is 0
  # http://en.wikipedia.org/wiki/Levenshtein_distance
  levenshtein(
    [list: "S", "u", "n", "d", "a", "y"],
    [list: "S", "a", "t", "u", "r", "d", "a", "y"])
    is 3
  # http://www.merriampark.com/ld.htm
  levenshtein(
    [list: "g", "u", "m", "b", "o"],
    [list: "g", "a", "m", "b", "o", "l"])
    is 2
  # http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html
  levenshtein(
    [list: "a", "c", "g", "t", "a", "c", "g", "t", "a", "c", "g", "t"],
    [list: "a", "c", "a", "t", "a", "c", "t", "t", "g", "t", "a", "c", "t"])
    is 4
  levenshtein(
    [list: "s", "u", "p", "e", "r", "c", "a", "l", "i",
      "f", "r", "a", "g", "i", "l", "i", "s", "t" ],
    [list: "s", "u", "p", "e", "r", "c", "a", "l", "y",
      "f", "r", "a", "g", "i", "l", "e", "s", "t" ])
    is 2
end

The basic algorithm is in fact very simple:
<levenshtein> ::=

rec levenshtein :: (List<String>, List<String> -> Number) =
  <levenshtein-body>

where, because there are two list inputs, there are four cases, of which two are symmetric:
<levenshtein-body> ::=
If both inputs are empty, the answer is simple:
<levenshtein-both-empty> ::=

if is-empty(s) and is-empty(t): 0

When one is empty, then the edit distance corresponds to the length of the other, which needs to inserted (or deleted) in its entirety (so we charge a cost of one per character):
<levenshtein-one-empty> ::=

else if is-empty(s): t.length()
else if is-empty(t): s.length()

If neither is empty, then each has a first character. If they are the same, then there is no edit cost associated with this character (which we reflect by recurring on the rest of the words without adding to the edit cost). If they are not the same, however, we consider each of the possible edits:
<levenshtein-neither-empty> ::=

else:
  if s.first == t.first:
    levenshtein(s.rest, t.rest)
  else:
    min3(
      1 + levenshtein(s.rest, t),
      1 + levenshtein(s, t.rest),
      1 + levenshtein(s.rest, t.rest))
  end
end

In the first case, we assume s has one too many characters, so we compute the cost as if we’re deleting it and finding the lowest cost for the rest of the strings (but charging one for this deletion); in the second case, we symmetrically assume t has one too many; and in the third case, we assume one character got replaced by another, so we charge one but consider the rest of both words (e.g., assume “s” was typed for “k” and continue with “itten” and “itting”). This uses the following helper function:

fun min3(a :: Number, b :: Number, c :: Number):
  num-min(a, num-min(b, c))
end

This algorithm will indeed pass all the tests we have written above, but with a problem: the running time grows exponentially. That is because, each time we find a mismatch, we recur on three subproblems. In principle, therefore, the algorithm takes time proportional to three to the power of the length of the shorter word. In practice, any prefix that matches causes no branching, so it is mismatches that incur branching (thus, confirming that the distance of a word with itself is zero only takes time linear in the size of the word).

Observe, however, that many of these subproblems are the same. For instance, given “kitten” and “sitting”, the mismatch on the initial character will cause the algorithm to compute the distance of “itten” from “itting” but also “itten” from “sitting” and “kitten” from “itting”. Those latter two distance computations will also involve matching “itten” against “itting”. Thus, again, we want the computation tree to turn into a DAG of expressions that are actually evaluated.

The solution, therefore, is naturally to memoize. First, we need a memoizer that works over two arguments rather than one:

data MemoryCell2<T, U, V>:
  | mem(in-1 :: T, in-2 :: U, out :: V)
end

fun memoize-2<T, U, V>(f :: (T, U -> V)) -> (T, U -> V):

  var memory :: List<MemoryCell2<T, U, V>> = empty

  lam(p, q):
    answer = find(
      lam(elt): (elt.in-1 == p) and (elt.in-2 == q) end,
      memory)
    cases (Option) answer block:
      | none =>
        result = f(p, q)
        memory :=
        link(mem(p, q, result), memory)
        result
      | some(v) => v.out
    end
  end
end

Most of the code is unchanged, except that we store two arguments rather than one, and correspondingly look up both.

With this, we can redefine levenshtein to use memoization:
<levenshtein-memo> ::=

rec levenshtein :: (List<String>, List<String> -> Number) =
  memoize-2(
    lam(s, t):
      if is-empty(s) and is-empty(t): 0
      else if is-empty(s): t.length()
      else if is-empty(t): s.length()
      else:
        if s.first == t.first:
          levenshtein(s.rest, t.rest)
        else:
          min3(
            1 + levenshtein(s.rest, t),
            1 + levenshtein(s, t.rest),
            1 + levenshtein(s.rest, t.rest))
        end
      end
    end)

where the argument to memoize-2 is precisely what we saw earlier as <levenshtein-body> (and now you know why we defined levenshtein slightly oddly, not using fun).

The complexity of this algorithm is still non-trivial. First, let’s introduce the term suffix: the suffix of a string is the rest of the string starting from any point in the string. (Thus “kitten”, “itten”, “ten”, “n”, and “” are all suffixes of “kitten”.) Now, observe that in the worst case, starting with every suffix in the first word, we may need to perform a comparison against every suffix in the second word. Fortunately, for each of these suffixes we perform a constant computation relative to the recursion. Therefore, the overall time complexity of computing the distance between strings of length \(m\) and \(n\) is \(O([m, n \rightarrow m \cdot n])\). (We will return to space consumption later [Contrasting Memoization and Dynamic Programming].)

Exercise

Modify the above algorithm to produce an actual (optimal) sequence of edit operations. This is sometimes known as the traceback.

28.3.3 Nature as a Fat-Fingered Typist

We have talked about how to address mistakes made by humans. However, humans are not the only bad typists: nature is one, too!

When studying living matter we obtain sequences of amino acids and other such chemicals that comprise molecules, such as DNA, that hold important and potentially determinative information about the organism. These sequences consist of similar fragments that we wish to identify because they represent relationships in the organism’s behavior or evolution.This section may need to be skipped in some states and countries. Unfortunately, these sequences are never identical: like all low-level programmers, nature slips up and sometimes makes mistakes in copying (called—wait for it—mutations). Therefore, looking for strict equality would rule out too many sequences that are almost certainly equivalent. Instead, we must perform an alignment step to find these equivalent sequences. As you might have guessed, this process is very much a process of computing an edit distance, and using some threshold to determine whether the edit is small enough.To be precise, we are performing local sequence alignment. This algorithm is named, after its creators, Smith-Waterman, and because it is essentially identical, has the same complexity as the Levenshtein algorithm.

The only difference between traditional presentations of Levenshtein and Smith-Waterman is something we alluded to earlier: why is every edit given a distance of one? Instead, in the Smith-Waterman presentation, we assume that we have a function that gives us the gap score, i.e., the value to assign every character’s alignment, i.e., scores for both matches and edits, with scores driven by biological considerations. Of course, as we have already noted, this need is not peculiar to biology; we could just as well use a “gap score” to reflect the likelihood of a substitution based on keyboard characteristics.

28.3.4 Dynamic Programming

We have used memoization as our canonical means of saving the values of past computations to reuse later. There is another popular technique for doing this called dynamic programming. This technique is closely related to memoization; indeed, it can be viewed as the dual method for achieving the same end. First we will see dynamic programming at work, then discuss how it differs from memoization.

Dynamic programming also proceeds by building up a memory of answers, and looking them up instead of recomputing them. As such, it too is a process for turning a computation’s shape from a tree to a DAG of actual calls. The key difference is that instead of starting with the largest computation and recurring to smaller ones, it starts with the smallest computations and builds outward to larger ones.

We will revisit our previous examples in light of this approach.

28.3.4.1 Catalan Numbers with Dynamic Programming

To begin with, we need to define a data structure to hold answers. Following convention, we will use an array.What happens when we run out of space? We can use the doubling technique we studied for Halloween Analysis.

MAX-CAT = 11

answers :: Array<Option<Number>> = array-of(none, MAX-CAT + 1)

Then, the catalan function simply looks up the answer in this array:

fun catalan(n):
  cases (Option) array-get-now(answers, n):
    | none => raise("looking at uninitialized value")
    | some(v) => v
  end
end

But how do we fill the array? We initialize the one known value, and use the formula to compute the rest in incremental order:

fun fill-catalan(upper):
  array-set-now(answers, 0, some(1))
  when upper > 0:
    for map(n from range(1, upper + 1)):
      block:
        cat-at-n =
          for fold(acc from 0, k from range(0, n)):
            acc + (catalan(k) * catalan(n - 1 - k))
          end
        array-set-now(answers, n, some(cat-at-n))
      end
    end
  end
end

fill-catalan(MAX-CAT)

The resulting program obeys the tests in <catalan-tests>.

Notice that we have had to undo the natural recursive definition—which proceeds from bigger values to smaller ones—to instead use a loop that goes from smaller values to larger ones. In principle, the program has the danger that when we apply catalan to some value, that index of answers will have not yet been initialized, resultingin an error. In fact, however, we know that because we fill all smaller indices in answers before computing the next larger one, we will never actually encounter this error. Note that this requires careful reasoning about our program, which we did not need to perform when using memoization because there we made precisely the recursive call we needed, which either looked up the value or computed it afresh.

28.3.4.2 Levenshtein Distance and Dynamic Programming

Now let’s take on rewriting the Levenshtein distance computation:
<levenshtein-dp> ::=

fun levenshtein(s1 :: List<String>, s2 :: List<String>):
  <levenshtein-dp/1>
end

We will use a table representing the edit distance for each prefix of each word. That is, we will have a two-dimensional table with as many rows as the length of s1 and as many columns as the length of s2. At each position, we will record the edit distance for the prefixes of s1 and s2 up to the indices represented by that position in the table.

Note that index arithmetic will be a constant burden: if a word is of length \(n\), we have to record the edit distance to its \(n + 1\) positions, the extra one corresponding to the empty word. This will hold for both words:
<levenshtein-dp/1> ::=

s1-len = s1.length()
s2-len = s2.length()
answers = array2d(s1-len + 1, s2-len + 1, none)
<levenshtein-dp/2>

Observe that by creating answers inside levenshtein, we can determine the exact size it needs to be based on the inputs, rather than having to over-allocate or dynamically grow the array.

We have initialized the table with none, so we will get an error if we accidentally try to use an uninitialized entry.Which proved to be necessary when writing and debugging this code! It will therefore be convenient to create helper functions that let us pretend the table contains only numbers:
<levenshtein-dp/2> ::=

fun put(s1-idx :: Number, s2-idx :: Number, n :: Number):
  answers.set(s1-idx, s2-idx, some(n))
end
fun lookup(s1-idx :: Number, s2-idx :: Number) -> Number:
  a = answers.get(s1-idx, s2-idx)
  cases (Option) a:
    | none => raise("looking at uninitialized value")
    | some(v) => v
  end
end

Now we have to populate the array. First, we initialize the row representing the edit distances when s2 is empty, and the column where s1 is empty. At \((0, 0)\), the edit distance is zero; at every position thereafter, it is the distance of that position from zero, because that many characters must be added to one or deleted from the other word for the two to coincide:
<levenshtein-dp/3> ::=

for each(s1i from range(0, s1-len + 1)):
  put(s1i, 0, s1i)
end
for each(s2i from range(0, s2-len + 1)):
  put(0, s2i, s2i)
end
<levenshtein-dp/4>

Now we finally get to the heart of the computation. We need to iterate over every character in each word. these characters are at indices 0 to s1-len - 1 and s2-len - 1, which are precisely the ranges of values produced by range(0, s1-len) and range(0, s2-len).
<levenshtein-dp/4> ::=

for each(s1i from range(0, s1-len)):
  for each(s2i from range(0, s2-len)):
  <levenshtein-dp/compute-dist>
  end
end
<levenshtein-dp/get-result>

Note that we’re building our way “out” from small cases to large ones, rather than starting with the large input and working our way “down”, recursively, to small ones.

Do Now!

Is this strictly true?

No, it isn’t. We did first fill in values for the “borders” of the table. This is because doing so in the midst of <levenshtein-dp/compute-dist> would be much more annoying. By initializing all the known values, we keep the core computation cleaner. But it does mean the order in which we fill in the table is fairly complex.

Now, let’s return to computing the distance. For each pair of positions, we want the edit distance between the pair of words up to and including those positions. This distance is given by checking whether the characters at the pair of positions are identical. If they are, then the distance is the same as it was for the previous pair of prefixes; otherwise we have to try the three different kinds of edits:
<levenshtein-dp/compute-dist> ::=

dist =
  if get(s1, s1i) == get(s2, s2i):
    lookup(s1i, s2i)
  else:
    min3(
      1 + lookup(s1i, s2i + 1),
      1 + lookup(s1i + 1, s2i),
      1 + lookup(s1i, s2i))
  end
put(s1i + 1, s2i + 1, dist)

As an aside, this sort of “off-by-one” coordinate arithmetic is traditional when using tabular representations, because we write code in terms of elements that are not inherently present, and therefore have to create a padded table to hold values for the boundary conditions. The alternative would be to allow the table to begin its addressing from -1 so that the main computation looks traditional.

At any rate, when this computation is done, the entire table has been filled with values. We still have to read out the answer, with lies at the end of the table:
<levenshtein-dp/get-result> ::=

lookup(s1-len, s2-len)

Even putting aside the helper functions we wrote to satiate our paranoia about using undefined values, we end up with:As of this writing, the current version of the Wikipedia page on the Levenshtein distance features a dynamic programming version that is very similar to the code above. By writing in pseudocode, it avoids address arithmetic issues (observe how the words are indexed starting from 1 instead of 0, which enables the body of the code to look more “normal”), and by initializing all elements to zero it permits subtle bugs because an uninitialized table element is indistinguishable from a legitimate entry with edit distance of zero. The page also shows the recursive solution and alludes to memoization, but does not show it in code.

fun levenshtein(s1 :: List<String>, s2 :: List<String>):
  s1-len = s1.length()
  s2-len = s2.length()
  answers = array2d(s1-len + 1, s2-len + 1, none)

  for each(s1i from range(0, s1-len + 1)):
    put(s1i, 0, s1i)
  end
  for each(s2i from range(0, s2-len + 1)):
    put(0, s2i, s2i)
  end

  for each(s1i from range(0, s1-len)):
    for each(s2i from range(0, s2-len)):
      dist =
        if get(s1, s1i) == get(s2, s2i):
          lookup(s1i, s2i)
        else:
          min3(
            1 + lookup(s1i, s2i + 1),
            1 + lookup(s1i + 1, s2i),
            1 + lookup(s1i, s2i))
        end
      put(s1i + 1, s2i + 1, dist)
    end
  end

  lookup(s1-len, s2-len)
end

which is worth contrasting with the memoized version (<levenshtein-memo>).For more examples of canonical dynamic programming problems, see this page and think about how each can be expressed as a direct recursion.

28.3.5 Contrasting Memoization and Dynamic Programming

Now that we’ve seen two very different techniques for avoiding recomputation, it’s worth contrasting them. The important thing to note is that memoization is a much simpler technique: write the natural recursive definition; determine its space complexity; decide whether this is problematic enough to warrant a space-time trade-off; and if it is, apply memoization. The code remains clean, and subsequent readers and maintainers will be grateful for that. In contrast, dynamic programming requires a reorganization of the algorithm to work bottom-up, which can often make the code harder to follow and full of subtle invariants about boundary conditions and computation order.

That said, the dynamic programming solution can sometimes be more computationally efficient. For instance, in the Levenshtein case, observe that at each table element, we (at most) only ever use the ones that are from the previous row and column. That means we never need to store the entire table; we can retain just the fringe of the table, which reduces space to being proportional to the sum, rather than product, of the length of the words. In a computational biology setting (when using Smith-Waterman), for instance, this saving can be substantial. This optimization is essentially impossible for memoization.

In more detail, here’s the contrast:

Memoization

    

Dynamic Programming

Top-down

    

Bottom-up

Depth-first

    

Breadth-first

Black-box

    

Requires code reorganization

All stored calls are necessary

    

May do unnecessary computation

Cannot easily get rid of unnecessary data

    

Can more easily get rid of unnecessary data

Can never accidentally use an uninitialized answer

    

Can accidentally use an uninitialized answer

Needs to check for the presence of an answer

    

Can be designed to not need to check for the presence of an answer

As this table should make clear, these are essentialy dual approaches. What is perhaps left unstated in most dynamic programming descriptions is that it, too, is predicated on the computation always producing the same answer for a given input—i.e., being a pure function.

From a software design perspective, there are two more considerations.

First, the performance of a memoized solution can trail that of dynamic programming when the memoized solution uses a generic data structure to store the memo table, whereas a dynamic programming solution will invariably use a custom data structure (since the code needs to be rewritten against it anyway). Therefore, before switching to dynamic programming for performance reasons, it makes sense to try to create a custom memoizer for the problem: the same knowledge embodied in the dynamic programming version can often be encoded in this custom memoizer (e.g., using an array instead of list to improve access times). This way, the program can enjoy speed comparable to that of dynamic programming while retaining readability and maintainability.

Second, suppose space is an important consideration and the dynamic programming version can make use of significantly less space. Then it does make sense to employ dynamic programming instead. Does this mean the memoized version is useless?

Do Now!

What do you think? Do we still have use for the memoized version?

Yes, of course we do! It can serve as an oracle [Oracles for Testing] for the dynamic programming version, since the two are supposed to produce identical answers anyway—and the memoized version would be a much more efficient oracle than the purely recursive implemenation, and can therefore be used to test the dynamic programming version on much larger inputs.

In short, always first produce the memoized version. If you need more performance, consider customizing the memoizer’s data structure. If you need to also save space, and can arrive at a more space-efficient dynamic programming solution, then keep both versions around, using the former to test the latter (the person who inherits your code and needs to alter it will thank you!).

Exercise

We have characterized the fundamental difference between memoization and dynamic programming as that between top-down, depth-first and bottom-up, breadth-first computation. This should naturally raise the question, what about:
  • top-down, breadth-first

  • bottom-up, depth-first

orders of computation. Do they also have special names that we just happen to not know? Are they uninteresting? Or do they not get discussed for a reason?