17.5 Equality, Ordering, and Hashing
17.5.1 Converting Values to Ordered Values
In Making Sets Grow on Trees, we noted that a single comparison needs to eliminate an entire set of values. With numbers, we were able to accomplish that easily: every bigger or smaller number was excluded by a comparison. But what if the data in the set are not actually numbers? Then we have to convert an arbitrary datum into a datatype that permits such comparison. This is known as hashing.
A hash function consumes an arbitrary value and produces a comparable
representation of it (its hash)—
Let us now consider how one can compute hashes. If the input datatype
is a number, it can serve as its own hash. Comparison simply uses
numeric comparison (e.g., <
). Then, transitivity of <
ensures that if an element \(A\) is less than another element \(B\),
then \(A\) is also less than all the other elements bigger than
\(B\).
Suppose instead the input is a string. We can of course use the principle above for strings: e.g., replacing number inequality with string inequality. Strings have a lexicographic (or “alphabetic”) ordering that permit them to be treated similar to numbers.
But what if we are handed more complex datatypes?
- Consider a list of primes as long as the string. Raise each prime by the corresponding number, and multiply the result. For instance, if the string is represented by the character codes
[6, 4, 5]
(the first character has code6
, the second one4
, and the third5
), we get the hashnum-expt(2, 6) * num-expt(3, 4) * num-expt(5, 5)
or16200000
. - Simply add together all the character codes. For the above example, this would correspond to the has
6 + 4 + 5
or15
.
16200000
can only map to the input above, and none other).
This is also known as the Gödel encoding. This is
computationally expensive.
The second encoding is, of course, not invertible (e.g., simply
permute the characters and, by commutativity, the sum will be the
same), but computationally much cheaper. It is also easy to implement:
fun hash-of(s :: String):
fold({(a :: Number, b :: Number): a + b},
0,
string-to-code-points(s))
end
check:
hash-of("Hello") is 500
hash-of("World!") is 553
hash-of("🏴☠️") is 195692
end
Now let us consider more general datatypes. The principle of hashing
will be similar. If we have a datatype with several variants, we can
order the variants lexicographically, and use a numeric tag to
represent the variants, and recursively encode the datum and the
variant tag. For each field of a record, we need an ordering of the
fields—
4
is less than the hash of
3
! All we need is a function that is
non-trivial: not everything should be equal; and
deterministic: every time we ask for a hash, we should get the same answer.
Exercise
Why do we care about these two properties? Think about what would could go wrong if each one was violated.
17.5.2 Hashing in Practice
In practice, programmers do not want hash functions to do what we have
described above. While Gödel encoding is extremely expensive, even
computing hash-of
takes time linear in the size of a string,
which can get quite expensive if strings are large or we compute
hashes often or both.
Instead, many programming languages do something very pragmatic. They need a value that can be compared for equality and ordering [Equality and Ordering]. Integers, we’ve already seen, already fit this bill very nicely. But how to obtain an integer out of arbitrary values, even datatype instances, quickly?
Simple: They just use the memory address of the datum. Every
value has a memory address, and the language can obtain it in constant
time by looking up the directory. Granted, these values may be
allocated anywhere with respect to each other, but that’s okay—
In practice, however, things are not quite so simple. For instance,
suppose we want two structurally equivalent values to have the same
hash. If they are allocated in different addresses, they will hash
differently. Therefore, many languages that use such a strategy also
allow programmers to write their own hashing functions, often to work
in conjunction with this built-in notion of hashing. These end up
looking not too different from the hashing strategies we described
above. Therefore, some of that complexity is inescapable, especially
if a programmer wants structural rather than reference
equality—
In the rest of this material, we will therefore continue with the simple hash function above, for multiple reasons. First, it is sufficient to illustrate how hashing works. Second, in practice, when built-in hashing does not suffice, we do write (more complex versions of) functions like the above. And finally, because it’s all laid bare, it’s easy for us to experiment with.
17.5.3 Equality and Ordering
What we’ve seen [A Fine Balance: Tree Surgery] for the construction of balanced binary search trees is that we need some way of putting elements in order. In the examples we used numbers because they’re a very friendly datatype: they have several properties that we take for granted. However, not all data have these properties.
The critical property that numbers have is that they are orderable. This follows because they are comparable, and the comparison is ternary: it produces three answers, “less than”, “equal to”, and “greater than”.
However, not all data have this property. What are data that might not have these properties? Actually, there are multiple possible properties here: Is something orderable? Is something even comparable?
|
| Comparable |
| Orderable |
Numbers |
| Yes (but not Roughnums!) |
| Yes |
Booleans |
| Yes |
| Yes |
Data instances |
| Yes |
| Not by default |
Roughnums |
| No |
| Yes |
Functions |
| Not really |
| No |
So…life is complicated.
That means you could potentially misuse a BBST on the wrong kind of data. Ideally, we would want to know if we’re doing this. In Pyret’s type system we chose not to build this in, but in some languages, the type system actually lets you capture these properties.
(\x -> x + 1) < (\x -> x) |
* No instance for (Ord (Integer -> Integer)) |
arising from a use of `<' |