On this page:
20.1 Ordering and Equality
20.2 Remembering Where We Came From
20.3 Hashing in Practice

20 Orderability and Hashing

    20.1 Ordering and Equality

    20.2 Remembering Where We Came From

    20.3 Hashing in Practice

20.1 Ordering and Equality

What we’ve seen [A Fine Balance: Tree Surgery] for the construction of balanced binary search trees is that we need some way of putting elements in order. In the examples we used numbers because they’re a very friendly datatype: they have several properties that we take for granted. However, not all data have these properties.

The critical property that numbers have is that they are orderable. This follows because they are comparable, and the comparison is ternary: it produces three answers, “less than”, “equal to”, and “greater than”.

However, not all data have this property. What are data that might not have these properties? Actually, there are multiple possible properties here: Is something orderable? Is something even comparable?

  

Comparable

  

Orderable

Numbers

  

Yes (but not Roughnums!)

  

Yes

Booleans

  

Yes

  

Yes

Data instances

  

Yes

  

Not by default

Roughnums

  

No

  

Yes

Functions

  

Not really

  

No

So…life is complicated.

That means you could potentially misuse a BBST on the wrong kind of data. Ideally, we would want to know if we’re doing this. In Pyret’s type system we chose not to build this in, but in some languages, the type system actually lets you capture these properties.

In Haskell, for instance, there’s a mechanism called the type-class; in Java, there are interfaces. They aren’t really the same, but they’re useful to conflate for our purposes. Only things that meet a particular interface or type class provide certain operations. For instance, in Haskell, if you want to use == or /= (not equal), you have to be in the Eq type-class. Thus the comparable datatypes above would be part of Eq. Similarly, there’s a type-class Ord, which ensures the availability of (and requires the implementation of) operations like <, >, <=, and >=. In Haskell, everything that is Ord must also be Eq, i.e., Eq is weaker than Ord (things can be Eq without being Ord). Pyret’s Roughnums contradict that…but Haskell is chill with it. But if you try to compare two functions,

(\x -> x + 1) < (\x -> x)

you get an error like

* No instance for (Ord (Integer -> Integer))

    arising from a use of `<'

20.2 Remembering Where We Came From

It’s easy to lose track of where we are. This all began because we wanted to talk about ordering, which was because we wanted to arrange things in balanced binary search trees, which we wanted because we wanted to quickly find elements or show that they weren’t there, which we wanted because we wanted to create…sets.

So now we’re finding that some datatypes lend themselves naturally to comparison and ordering and others do not. In particular, a BBST requires not only Eq but also Ord, which (in Haskell) is the stronger requirement, i.e., even fewer datatypes will satisfy it. But we may want to make sets with weaker data.

Do Now!

Which of these two do we really need? Either, both, neither?

Actually, we don’t really require Ord; it’s nice, for performance, but not essential. But what about Eq? Yeah, that’s pretty essential! The whole definition of a set is a collection with no duplicates, but without Eq, we can’t even talk about “duplicates”.

Above, there are two datatypes that satisfy Eq but not Ord: Booleans and datatype instances. Booleans are not that interesting: there are only four sets of Booleans one can make. But instances of data definitions are very common! And we definitely want to make sets of those. So it would be nice to have a way to create sets of it without first having to create a comparison function.

The critical thing to remember is that we don’t actually need a meaningful ordering operation.Observe that Gödel encodings are not “meaningful”, either. We don’t actually care if it says that 4 < 3! All we need is a comparison operation that is
  • non-trivial: not everything should be equal; and

  • consistent: every time we ask for a given pair of values, we should get the same answer.

Exercise

Why do we care about these two properties? Think about what would could go wrong if each one was violated.

We can achieve this through hashing.

20.3 Hashing in Practice

Hashing in practice cannot do things like Gödel encodings. That would involve enormous, expensive computations! It would take time proportional to the size of the datum or worse! It would take up a lot of space and a lot of time! It would be a disaster.

What programming languages instead do is something very pragmatic. They need to convert a value into a Eq-able, Ord-able value. Integers, we’ve already seen, already fit this bill very nicely. But they want to convert arbitrary values, even datatype instances, into integers, and (unlike with Gödel encodings) quickly obtain compact representations. What do they do? They just use the memory location of the datum.

That’s it! Every value resides somewhere in memory; they just use that location as the hash. Problem solved. They already know where it is. Values may be allocated anywhere with respect to each other, but that’s okay, we only want consistency, not “meaningfulness”.

There is one catch to this scheme. Memory in most modern languages is handled by something called a garbage collector. This can move values around. That means the address keeps changing…which is terrible. So there are clever schemes for making sure this does not cause a problem for hashing. There’s a simple scheme that is slightly space inefficient, which is to simply record the object’s address when it is first allocated, and copy this around as the hash value. A slightly more sophisticated scheme is to not do this until the object is copied, and only do it then.

There is also one semantic downside to this simple strategy: it means we can get a hash value for any type. That includes types that should not be Eq, like functions. So you have to be very careful with such an implementation!