17.2 Making Sets Grow on Trees
In Representing Sets as Lists we saw multiple list representations of
sets. They all came with at least some operations having linear time
complexity—
Let’s start by noting that it seems better, if at all possible, to avoid storing duplicates. Duplicates are only problematic during insertion due to the need for a membership test. But if we can make membership testing cheap, then we would be better off using it to check for duplicates and storing only one instance of each value (which also saves us space). Thus, let’s try to improve the time complexity of membership testing (and, hopefully, of other operations too).
It seems clear that with a (duplicate-free) list representation of a
set, we cannot really beat linear time for membership checking. This
is because at each step, we can eliminate only one element from
contention which in the worst case requires a linear amount of work to
examine the whole set. Instead, we need to eliminate many more
elements with each comparison—
In our handy set of recurrences [Solving Recurrences], one stands out: \(T(k) = T(k/2) + c\). It says that if, with a constant amount of work we can eliminate half the input, we can perform membership checking in logarithmic time. This will be our goal.
Before we proceed, it’s worth putting logarithmic growth in
perspective. Asymptotically, logarithmic is obviously not as nice as
constant. However, logarithmic growth is very pleasant because it
grows so slowly. For instance, if an input doubles from size \(k\) to
\(2k\), its logarithm—
We have actually just made an extremely subtle assumption. In the list representation of sets, when we check one element for membership and eliminate it, we have eliminated only that one element. To obtain this logarithmic complexity, we need comparing against one element to remove an entire set of elements. Because we are constructing sets of numbers, we don’t need to confront this issue here. Instead, we go into it in much more detail in Converting Values to Ordered Values.
17.2.1 Using Binary Trees
Because logs come from trees.
data BT:
| leaf
| node(v :: Number, l :: BT, r :: BT)
end
fun is-in-bt(e :: Number, s :: BT) -> Boolean:
cases (BT) s:
| leaf => false
| node(v, l, r) =>
if e == v:
true
else:
is-in-bt(e, l) or is-in-bt(e, r)
end
end
end
How can we improve on this? The comparison needs to help us eliminate not only the root but also one whole sub-tree. We can only do this if the comparison “speaks for” an entire sub-tree. It can do so if all elements in one sub-tree are less than or equal to the root value, and all elements in the other sub-tree are greater than or equal to it. Of course, we have to be consistent about which side contains which subset; it is conventional to put the smaller elements to the left and the bigger ones to the right. This refines our binary tree definition to give us a binary search tree (BST).
Do Now!
Here is a candiate predicate for recognizing when a binary tree is in fact a binary search tree:fun is-a-bst-buggy(b :: BT) -> Boolean: cases (BT) b: | leaf => true | node(v, l, r) => (is-leaf(l) or (l.v <= v)) and (is-leaf(r) or (v <= r.v)) and is-a-bst-buggy(l) and is-a-bst-buggy(r) end end
Is this definition correct?
<=
instead of <
above
because even though we don’t want to permit duplicates when
representing sets, in other cases we might not want to be so
stringent; this way we can reuse the above implementation for other
purposes. But the definition above performs only a “shallow”
comparison. Thus we could have a root a with a right child,
b, such that b > a; and the b node
could have a left child c such that c < b;
but this does not guarantee that c > a. In fact, it is
easy to construct a counter-example that passes this check:
check:
node(5, node(3, leaf, node(6, leaf, leaf)), leaf)
satisfies is-a-bst-buggy # FALSE!
end
Exercise
Fix the BST checker.
type BST = BT%(is-a-bst)
TSet
s to be tree sets:
type TSet = BST
mt-set = leaf
fun is-in(e :: Number, s :: BST) -> Bool:
cases (BST) s:
| leaf => ...
| node(v, l :: BST, r :: BST) => ...
... is-in(l) ...
... is-in(r) ...
end
end
fun is-in(e :: Number, s :: BST) -> Boolean:
cases (BST) s:
| leaf => false
| node(v, l, r) =>
if e == v:
true
else if e < v:
is-in(e, l)
else if e > v:
is-in(e, r)
end
end
end
fun insert(e :: Number, s :: BST) -> BST:
cases (BST) s:
| leaf => node(e, leaf, leaf)
| node(v, l, r) =>
if e == v:
s
else if e < v:
node(v, insert(e, l), r)
else if e > v:
node(v, l, insert(e, r))
end
end
end
You should now be able to define the remaining operations. Of these,
size
clearly requires linear time (since it has to count all
the elements), but because is-in
and insert
both throw
away one of two children each time they recur, they take logarithmic
time.
Exercise
Suppose we frequently needed to compute the size of a set. We ought to be able to reduce the time complexity of
size
by having each tree ☛ cache its size, so thatsize
could complete in constant time (note that the size of the tree clearly fits the criterion of a cache, since it can always be reconstructed). Update the data definition and all affected functions to keep track of this information correctly.
17.2.2 Checking the Complexity
But wait a minute. Are we actually done? Our recurrence takes the
form \(T(k) = T(k/2) + c\), but what in our data definition guaranteed
that the size of the child traversed by is-in
will be half the
size?
Do Now!
Construct an example—
consisting of a sequence of insert
s to the empty tree—such that the resulting tree is not balanced. Show that searching for certain elements in this tree will take linear, not logarithmic, time in its size.
1
, 2
, 3
, and 4
, in order. The
resulting tree would be
check:
insert(4, insert(3, insert(2, insert(1, mt-set)))) is
node(1, leaf,
node(2, leaf,
node(3, leaf,
node(4, leaf, leaf))))
end
4
in this tree would have to examine all the set
elements in the tree. In other words, this binary search tree is
degenerate—Therefore, using a binary tree, and even a BST, does not guarantee
the complexity we want: it does only if our inputs have arrived in
just the right order. However, we cannot assume any input ordering;
instead, we would like an implementation that works in all cases.
Thus, we must find a way to ensure that the tree is always
balanced, so each recursive call in is-in
really does throw away half the elements.
Exercise
Observe that we have not talked about computing the size of the set. Even if we could assume that the binary tree is balanced, how do we determine the size in logarithmic-or-better time?
17.2.3 A Fine Balance: Tree Surgery
Let’s define a balanced binary search tree (BBST). It must obviously be a search tree, so let’s focus on the “balanced” part. We have to be careful about precisely what this means: we can’t simply expect both sides to be of equal size because this demands that the tree (and hence the set) have an even number of elements and, even more stringently, to have a size that is a power of two.
Exercise
Define a predicate for a BBST that consumes a
BT
and returns aBoolean
indicating whether or not it a balanced search tree.
Therefore, we relax the notion of balance to one that is both accommodating and sufficient. We use the term balance factor for a node to refer to the height of its left child minus the height of its right child (where the height is the depth, in edges, of the deepest node). We allow every node of a BBST to have a balance factor of \(-1\), \(0\), or \(1\) (but nothing else): that is, either both have the same height, or the left or the right can be one taller. Note that this is a recursive property, but it applies at all levels, so the imbalance cannot accumulate making the whole tree arbitrarily imbalanced.
Exercise
Given this definition of a BBST, show that the number of nodes is exponential in the height. Thus, always recurring on one branch will terminate after a logarithmic (in the number of nodes) number of steps.
Here is an obvious but useful observation: every BBST is also a BST (this was true by the very definition of a BBST). Why does this matter? It means that a function that operates on a BST can just as well be applied to a BBST without any loss of correctness.
So far, so easy. All that leaves is a means of creating a
BBST, because it’s responsible for ensuring balance. It’s easy to
see that the constant empty-set
is a BBST value. So that
leaves only insert
.
Here is our situation with insert
. Assuming we start with a
BBST, we can determine in logarithmic time whether the element is
already in the tree and, if so, ignore it.To implement a
bag we count how many of each element are in it, which does not
affect the tree’s height.
When inserting an element, given balanced trees, the
insert
for a BST takes only a logarithmic amount of time to
perform the insertion. Thus, if performing the insertion does not
affect the tree’s balance, we’re done. Therefore, we only need to
consider cases where performing the insertion throws off the balance.
Observe that because \(<\) and \(>\) are symmetric (likewise with \(<=\) and \(>=\)), we can consider insertions into one half of the tree and a symmetric argument handles insertions into the other half. Thus, suppose we have a tree that is currently balanced into which we are inserting the element \(e\). Let’s say \(e\) is going into the left sub-tree and, by virtue of being inserted, will cause the entire tree to become imbalanced.Some trees, like family trees (Data Design Problem – Ancestry Data) represent real-world data. It makes no sense to “balance” a family tree: it must accurately model whatever reality it represents. These set-representing trees, in contrast, are chosen by us, not dictated by some external reality, so we are free to rearrange them.
There are two ways to proceed. One is to consider all the places where we might insert \(e\) in a way that causes an imbalance and determine what to do in each case.
Exercise
Enumerate all the cases where insertion might be problematic, and dictate what to do in each case.
The number of cases is actually quite overwhelming (if you didn’t
think so, you missed a few...). Therefore, we instead attack the
problem after it has occurred: allow the existing BST insert
to insert the element, assume that we have an imbalanced tree,
and show how to restore its balance.The insight that a tree can
be made “self-balancing” is quite remarkable, and there are now many
solutions to this problem. This particular one, one of the oldest, is
due to G.M. Adelson-Velskii and E.M. Landis. In honor of their
initials it is called an AVL Tree, though the tree itself is quite
evident; their genius is in defining re-balancing.
Thus, in what follows, we begin with a tree that is balanced;
insert
causes it to become imbalanced; we have assumed that the
insertion happened in the left sub-tree. In particular, suppose a
(sub-)tree has a balance factor of \(2\) (positive because we’re
assuming the left is imbalanced by insertion). The procedure for
restoring balance depends critically on the following property:
Exercise
Show that if a tree is currently balanced, i.e., the balance factor at every node is \(-1\), \(0\), or \(1\), then
insert
can at worst make the balance factor \(\pm 2\).
The algorithm that follows is applied as insert
returns from
its recursion, i.e., on the path from the inserted value back to the
root. Since this path is of logarithmic length in the set’s size (due
to the balancing property), and (as we shall see) performs only a
constant amount of work at each step, it ensures that insertion also
takes only logarithmic time, thus completing our challenge.
p |
/ \ |
q C |
/ \ |
A B |
Let’s say that \(C\) is of height \(k\). Before insertion, the tree rooted at \(q\) must have had height \(k+1\) (or else one insertion cannot create imbalance). In turn, this means \(A\) must have had height \(k\) or \(k-1\), and likewise for \(B\).
Exercise
Why can they both not have height \(k+1\) after insertion?
17.2.3.1 Left-Left Case
p |
/ \ |
q C |
/ \ |
r B |
/ \ |
A1 A2 |
\(A_1 < r\).
\(r < A_2 < q\).
\(q < B < p\).
\(p < C\).
The height of \(A_1\) or of \(A_2\) is \(k\) (the cause of imbalance).
The height of the other \(A_i\) is \(k-1\) (see the exercise above).
The height of \(C\) is \(k\) (initial assumption; \(k\) is arbitrary).
The height of \(B\) must be \(k-1\) or \(k\) (argued above).
q |
/ \ |
r p |
/ \ / \ |
A1 A2 B C |
17.2.3.2 Left-Right Case
p |
/ \ |
q C |
/ \ |
A r |
/ \ |
B1 B2 |
\(A < q\).
\(q < B_1 < r\).
\(r < B_2 < p\).
\(p < C\).
Suppose the height of \(C\) is \(k\).
The height of \(A\) must be \(k-1\) or \(k\).
The height of \(B_1\) or \(B_2\) must be \(k\), but not both (see the exercise above). The other must be \(k-1\).
p |
/ \ |
r C |
/ \ |
q B2 |
/ \ |
A B1 |
r |
/ \ |
q p |
/ \ / \ |
A B1 B2 C |
17.2.3.3 Any Other Cases?
Were we a little too glib before? In the left-right case we said that only one of \(B_1\) or \(B_2\) could be of height \(k\) (after insertion); the other had to be of height \(k-1\). Actually, all we can say for sure is that the other has to be at most height \(k-2\).
Exercise
Can the height of the other tree actually be \(k-2\) instead of \(k-1\)?
If so, does the solution above hold? Is there not still an imbalance of two in the resulting tree?
Is there actually a bug in the above algorithm?