Union find data structures provide the following operations:
make_sets(n)
initializes a union-find data structure with n
singletonsfind(i)
returns a representative for the set of element i
union(i,j)
merges the sets containing i
and j
We represent our partition of the elements 0
to n - 1
by storing a parent element parent[i]
for every element i
which eventually leads to a representative of the set containing i
.
If an element itself is a representative, it is its own parent, i.e. parent[i] == i
.
Thus, if we start with singleton sets, every element is its own representative:
We can find the representative for a given set by simply following these parent pointers.
Let us now see how we can merge sets:
If we want to merge the elements 0 and 1 and the elements 2 and 3, we can do this by setting the parent of 0 to 1 and setting the parent of 2 to 3:
In this simple case, only the elements parent pointer itself must be changed.
If we however want to merge larger sets, we must always change the parent pointer of the representative of the set that is to be merged into another set: After merge(0,3), we have set the parent of the representative of the set containing 0 to the representative of the set containing 3
To make the example a bit more interesting, let's now also merge (4,5), (5,6) and (3,4):
The last notion I want to introduce is path compression:
If we want to find the representative of a set, we might have to follow several parent pointers before reaching the representative. We might make this easier by storing the representative for each set directly in their parent node. We lose the order in which we merged the elements, but can potentially have a large runtime gain. In our case, the only paths that aren't compressed are the paths from 0, 4 and 5 to 3:
The most basic implementation of a union-find data structure consists of an array parent
storing the a parent element for every element of the structure. Following these parent 'pointers' for an element i
leads us to the representative j = find(i)
of the set containing i
, where parent[j] = j
holds.
using std::size_t;
class union_find {
private:
std::vector<size_t> parent; // Parent for every node
public:
union_find(size_t n) : parent(n) {
for (size_t i = 0; i < n; ++i)
parent[i] = i; // Every element is its own representative
}
size_t find(size_t i) const {
if (parent[i] == i) // If we already have a representative
return i; // return it
return find(parent[i]); // otherwise return the parent's representative
}
void merge(size_t i, size_t j) {
size_t pi = find(i);
size_t pj = find(j);
if (pi != pj) { // If the elements are not in the same set:
parent[pi] = pj; // Join the sets by marking pj as pi's parent
}
}
};
If we do many merge
operations on a union-find data structure, the paths represented by the parent
pointers might be quite long. Path compression, as already described in the theory part, is a simple way of mitigating this issue.
We might try to do path compression on the whole data structure after every k-th merge operation or something similar, but such an operation could have a quite large runtime.
Thus, path compression is mostly only used on a small part of the structure, especially the path we walk along to find the representative of a set. This can be done by storing the result of the find
operation after every recursive subcall:
size_t find(size_t i) const {
if (parent[i] == i) // If we already have a representative
return i; // return it
parent[i] = find(parent[i]); // path-compress on the way to the representative
return parent[i]; // and return it
}
In our current implementation of merge
, we always choose the left set to be the child of the right set, without taking the size of the sets into consideration. Without this restriction, the paths (without path compression) from an element to its representative might be quite long, thus leading to large runtimes on find
calls.
Another common improvement thus is the union by size heuristic, which does exactly what it says: When merging two sets, we always set the larger set to be the parent of the smaller set, thus leading to a path length of at most steps:
We store an additional member std::vector<size_t> size
in our class which gets initialized to 1 for every element. When merging two sets, the larger set becomes the parent of the smaller set and we sum up the two sizes:
private:
...
std::vector<size_t> size;
public:
union_find(size_t n) : parent(n), size(n, 1) { ... }
...
void merge(size_t i, size_t j) {
size_t pi = find(i);
size_t pj = find(j);
if (pi == pj) { // If the elements are in the same set:
return; // do nothing
}
if (size[pi] > size[pj]) { // Swap representatives such that pj
std::swap(pi, pj); // represents the larger set
}
parent[pi] = pj; // attach the smaller set to the larger one
size[pj] += size[pi]; // update the size of the larger set
}
Another heuristic commonly used instead of union by size is the union by rank heuristic
Its basic idea is that we don't actually need to store the exact size of the sets, an approximation of the size (in this case: roughly the logarithm of the set's size) suffices to achieve the same speed as union by size.
For this, we introduce the notion of the rank of a set, which is given as follows:
One advantage of union by rank is the space usage: As the maximum rank increases roughly like , for all realistic input sizes, the rank can be stored in a single byte (since n < 2^255
).
A simple implementation of union by rank might look like this:
private:
...
std::vector<unsigned char> rank;
public:
union_find(size_t n) : parent(n), rank(n, 0) { ... }
...
void merge(size_t i, size_t j) {
size_t pi = find(i);
size_t pj = find(j);
if (pi == pj) {
return;
}
if (rank[pi] < rank[pj]) {
// link the smaller group to the larger one
parent[pi] = pj;
} else if (rank[pi] > rank[pj]) {
// link the smaller group to the larger one
parent[pj] = pi;
} else {
// equal rank: link arbitrarily and increase rank
parent[pj] = pi;
++rank[pi];
}
}
While in combination with path compression, union by rank nearly achieves constant time operations on union-find data structures, there is a final trick that allows us to get rid of the rank
storage altogether by storing the rank in out-of-bounds entries of the parent
array. It is based on the following observations:
parent
.parent[i]
is at most size - 1
, i.e. larger values are unused.This brings us to the following approach:
parent[i] == i
, we now identify representatives byparent[i] >= size
i
has rank parent[i] - size
parent[i] = size
instead of parent[i] = i
, i.e. each set is its own representative with rank 0.Since we only offset the rank values by size
, we can simply replace the rank
vector by the parent
vector in the implementation of merge
and only need to exchange the condition identifying representatives in find
:
Finished implementation using union by rank and path compression:
using std::size_t;
class union_find {
private:
std::vector<size_t> parent;
public:
union_find(size_t n) : parent(n, n) {} // initialize with parent[i] = n
size_t find(size_t i) const {
if (parent[i] >= parent.size()) // If we already have a representative
return i; // return it
return find(parent[i]); // otherwise return the parent's repr.
}
void merge(size_t i, size_t j) {
size_t pi = find(i);
size_t pj = find(j);
if (pi == pj) {
return;
}
if (parent[pi] < parent[pj]) {
// link the smaller group to the larger one
parent[pi] = pj;
} else if (parent[pi] > parent[pj]) {
// link the smaller group to the larger one
parent[pj] = pi;
} else {
// equal rank: link arbitrarily and increase rank
parent[pj] = pi;
++parent[pi];
}
}
};