What we’ve seen [A Fine Balance: Tree Surgery] for the construction of balanced binary search trees is that we need some way of putting elements in order. In the examples we used numbers because they’re a very friendly datatype: they have several properties that we take for granted. However, not all data have these properties.
The critical property that numbers have is that they are orderable. This follows because they are comparable, and the comparison is ternary: it produces three answers, “less than”, “equal to”, and “greater than”.
However, not all data have this property. What are data that might not have these properties? Actually, there are multiple possible properties here: Is something orderable? Is something even comparable?
Yes (but not Roughnums!)
Not by default
So…life is complicated.
That means you could potentially misuse a BBST on the wrong kind of data. Ideally, we would want to know if we’re doing this. In Pyret’s type system we chose not to build this in, but in some languages, the type system actually lets you capture these properties.
(\x -> x + 1) < (\x -> x)
* No instance for (Ord (Integer -> Integer))
arising from a use of `<'
It’s easy to lose track of where we are. This all began because we wanted to talk about ordering, which was because we wanted to arrange things in balanced binary search trees, which we wanted because we wanted to quickly find elements or show that they weren’t there, which we wanted because we wanted to create…sets.
So now we’re finding that some datatypes lend themselves naturally to comparison and ordering and others do not. In particular, a BBST requires not only Eq but also Ord, which (in Haskell) is the stronger requirement, i.e., even fewer datatypes will satisfy it. But we may want to make sets with weaker data.
Which of these two do we really need? Either, both, neither?
Actually, we don’t really require Ord; it’s nice, for performance, but not essential. But what about Eq? Yeah, that’s pretty essential! The whole definition of a set is a collection with no duplicates, but without Eq, we can’t even talk about “duplicates”.
Above, there are two datatypes that satisfy Eq but not Ord: Booleans and datatype instances. Booleans are not that interesting: there are only four sets of Booleans one can make. But instances of data definitions are very common! And we definitely want to make sets of those. So it would be nice to have a way to create sets of it without first having to create a comparison function.
non-trivial: not everything should be equal; and
consistent: every time we ask for a given pair of values, we should get the same answer.
Why do we care about these two properties? Think about what would could go wrong if each one was violated.
We can achieve this through hashing.
Hashing in practice cannot do things like Gödel encodings. That would involve enormous, expensive computations! It would take time proportional to the size of the datum or worse! It would take up a lot of space and a lot of time! It would be a disaster.
What programming languages instead do is something very pragmatic. They need to convert a value into a Eq-able, Ord-able value. Integers, we’ve already seen, already fit this bill very nicely. But they want to convert arbitrary values, even datatype instances, into integers, and (unlike with Gödel encodings) quickly obtain compact representations. What do they do? They just use the memory location of the datum.
That’s it! Every value resides somewhere in memory; they just use that location as the hash. Problem solved. They already know where it is. Values may be allocated anywhere with respect to each other, but that’s okay, we only want consistency, not “meaningfulness”.
There is one catch to this scheme. Memory in most modern languages is handled by something called a garbage collector. This can move values around. That means the address keeps changing…which is terrible. So there are clever schemes for making sure this does not cause a problem for hashing. There’s a simple scheme that is slightly space inefficient, which is to simply record the object’s address when it is first allocated, and copy this around as the hash value. A slightly more sophisticated scheme is to not do this until the object is copied, and only do it then.
There is also one semantic downside to this simple strategy: it means we can get a hash value for any type. That includes types that should not be Eq, like functions. So you have to be very careful with such an implementation!