17.1 Representing Sets as Lists
Earlier [Sets as Collective Data] we introduced sets. Recall that the elements of a set have no specific order, and ignore duplicates.If these ideas are not familiar, please read Sets as Collective Data, since they will be important when discussing the representation of sets. At that time we relied on Pyret’s built-in representation of sets. Now we will discuss how to build sets for ourselves. In what follows, we will focus only on sets of numbers.
check:
[list: 1, 2, 3] is [list: 3, 2, 1, 1]
end
Set
is already
built into Pyret, so below we will use the name LSet
for a set
represented as a list.
mt-set :: Set
is-in :: (T, Set<T> -> Bool)
insert :: (T, Set<T> -> Set<T>)
union :: (Set<T>, Set<T> -> Set<T>)
size :: (Set<T> -> Number)
to-list :: (Set<T> -> List<T>)
insert-many :: (List<T>, Set<T> -> Set<T>)
mt-set
, easily gives us a to-set
function.Sets can contain many kinds of values, but not necessarily any kind: we need to be able to check for two values being equal (which is a requirement for a set, but not for a list!), which can’t be done with all values (such as functions). We discuss the nuances of this elsewhere [Equality and Ordering]. For now, we can ignore these issues by focusing on sets of (non-rough)numbers.
17.1.1 Representation Choices
type LSet = List
mt-set = empty
size
as
fun size<T>(s :: LSet<T>) -> Number:
s.length()
end
- There is a subtle difference between lists and sets. The list
[list: 1, 1]
is not the same as[list: 1]
because the first list has length two whereas the second has length one. Treated as a set, however, the two are the same: they both have size one. Thus, our implementation ofsize
above is incorrect if we don’t take into account duplicates (either during insertion or while computing the size). We might falsely make assumptions about the order in which elements are retrieved from the set due to the ordering guaranteed provided by the underlying list representation. This might hide bugs that we don’t discover until we change the representation.
We might have chosen a set representation because we didn’t need to care about order, and expected lots of duplicate items. A list representation might store all the duplicates, resulting in significantly more memory use (and slower programs) than we expected.
insert
to check whether an element is
already in the set and, if so, leave the representation unchanged;
this incurs a cost during insertion but avoids unnecessary duplication
and lets us use length
to implement size
. The other
option is to define insert
as link
—insert = link
17.1.2 Time Complexity
insert
, is-in
, and size
.
Suppose the size of the set is \(k\) (where, to avoid ambiguity,
we let \(k\) represent the number of distinct elements).
The complexity of these operations depends on whether or not we store
duplicates:
If we don’t store duplicates, then
size
is simplylength
, which takes time linear in \(k\). Similarly,is-in
only needs to traverse the list once to determine whether or not an element is present, which also takes time linear in \(k\). Butinsert
needs to check whether an element is already present, which takes time linear in \(k\), followed by at most a constant-time operation (link
).If we do store duplicates, then
insert
is constant time: it simplylink
s on the new element without regard to whether it already is in the set representation.is-in
traverses the list once, but the number of elements it needs to visit could be significantly greater than \(k\), depending on how many duplicates have been added. Finally,size
needs to check whether or not each element is duplicated before counting it.
Do Now!
What is the time complexity of
size
if the list has duplicates?
size
is
fun size<T>(s :: LSet<T>) -> Number:
cases (List) s:
| empty => 0
| link(f, r) =>
if r.member(f):
size(r)
else:
1 + size(r)
end
end
end
s
is \(k\) but the
actual number of elements in s
is \(d\), where
\(d \geq k\). To compute the time to run size
on \(d\)
elements, \(T(d)\), we should determine the number of operations in
each question and answer. The first question has a constant number of
operations,
and the first answer also a constant. The second question also has
a constant number of
operations. Its answer is a conditional, whose first question
(r.member(f)
needs to traverse the entire list, and hence has
\(O([k \rightarrow d])\) operations. If it succeeds, we recur on something of size
\(T(d-1)\); else we do the same but perform a constant more operations.
Thus \(T(0)\) is a constant, while the recurrence (in big-Oh terms) is
\begin{equation*}T(d) = d + T(d-1)\end{equation*}
17.1.3 Choosing Between Representations
|
| With Duplicates |
| Without Duplicates | ||||
|
|
|
|
|
|
|
|
|
Size of Set |
| constant |
| linear |
| linear |
| linear |
Size of List |
| constant |
| linear |
| linear |
| linear |
Which representation we choose is a matter of how much duplication we expect. If there won’t be many duplicates, then the version that stores duplicates pays a small extra price in return for some faster operations.
Which representation we choose is also a matter of how often we expect each operation to be performed. The representation without duplication is “in the middle”: everything is roughly equally expensive (in the worst case). With duplicates is “at the extremes”: very cheap insertion, potentially very expensive membership. But if we will mostly only insert without checking membership, and especially if we know membership checking will only occur in situations where we’re willing to wait, then permitting duplicates may in fact be the smart choice. (When might we ever be in such a situation? Suppose your set represents a backup data structure; then we add lots of data but very rarely—
indeed, only in case of some catastrophe— ever need to look for things in it.) Another way to cast these insights is that our form of analysis is too weak. In situations where the complexity depends so heavily on a particular sequence of operations, big-Oh is too loose and we should instead study the complexity of specific sequences of operations. We will address precisely this question later [Halloween Analysis].
Moreover, there is no reason a program should use only one representation. It could well begin with one representation, then switch to another as it better understands its workload. The only thing it would need to do to switch is to convert all existing data between the representations.
How might this play out above? Observe that data conversion is very
cheap in one direction: since every list without duplicates is
automatically also a list with (potential) duplicates, converting in
that direction is trivial (the representation stays unchanged, only
its interpretation changes). The other direction is harder: we have to
filter duplicates (which takes time quadratic in the number of
elements in the list). Thus, a program can make an initial guess about
its workload and pick a representation accordingly, but maintain
statistics as it runs and, when it finds its assumption is wrong,
switch representations—
17.1.4 Other Operations
Exercise
Implement the remaining operations catalogued above (<set-operations>) under each list representation.
Exercise
Implement the operationremove :: (Set<T>, T -> Set<T>)
under each list representation (renamingSet
appropriately. What difference do you see?
Do Now!
Suppose you’re asked to extend sets with these operations, as the set analog offirst
andrest
:one :: (Set<T> -> T) others :: (Set<T> -> T)
You should refuse to do so! Do you see why?
With lists the “first” element is well-defined, whereas sets are
defined to have no ordering. Indeed, just to make sure users of your
sets don’t accidentally assume anything about your implementation
(e.g., if you implement one
using first
, they may notice
that one
always returns the element most recently added to the
list), you really ought to return a random element of the set on each
invocation.
Unfortunately, returning a random element means the above interface is
unusable. Suppose s
is bound to a set containing 1
,
2
, and 3
. Say the first time one(s)
is invoked
it returns 2
, and the second time 1
. (This already
means one
is not a function.)
The third time it may again return 2
. Thus
others
has to remember which element was returned the last time
one
was called, and return the set sans that element. Suppose
we now invoke one
on the result of calling others
. That
means we might have a situation where one(s)
produces the same
result as one(others(s))
.
Exercise
Why is it unreasonable for
one(s)
to produce the same result asone(others(s))
?
Exercise
Suppose you wanted to extend sets with a
subset
operation that partitioned the set according to some condition. What would its type be?
Exercise
The types we have written above are not as crisp as they could be. Define a
has-no-duplicates
predicate, refine the relevant types with it, and check that the functions really do satisfy this criterion.