Implementation of HashLife

This document collects some notes on the implementation of hashlife used in both hlife and Golly. Some of the technological considerations underlying these decisions are changing, and it is worthwhile to reconsider some of the design decisions as we move to a 64-bit, multithreaded world. The first part of this document describes things as they are; the second part discusses possible improvements.

* Basic Structures and Hashtable *

A Golly hashlife node currently looks like this:

   struct node {
      struct node *next ;              // chained hash pointer
      struct node *nw, *ne, *sw, *se ; // quadtree pointers
      struct node *res ;               // result
   } ;

On a 32-bit platform that is 24 bytes; on a 64-bit platform it is 48 bytes. Golly allocates nodes in contiguous blocks of 1000 to minimize any memory loss due to malloc overhead.

Golly uses a hashtable that looks like this:

   node **hashtable ;

That is, the hashtable is an array of pointers to the first element of each chain. Until Golly nears the memory limit set by the user, the average chain length is kept relatively short (an average of 1 for a full hashtable, and about 0.5 right after a resize, which works out to an average of about 0.75 while running). Once memory fills, the average chain length is allowed to grow so that more nodes fit in memory; for a full memory, the target average chain length is 4. In practice, most garbage collections free about 99% of the nodes, so the average chain length right after a garbage collection is nearly 0. Thus, once memory fills and garbage collection starts, the overall average chain length hovers somewhere around 2. Golly uses a move-to-front heuristic on the hash chains that minimizes the impact of the long chains. Further, since about 90% of the hashtable lookups succeed, it is relatively rare to walk an entire chain.

With a full hashtable and an average chain length of 4, the hashtable header pointer array itself requires about one byte per node. Thus, the total memory consumption in Golly per node is very close to 25 bytes.

I performed a set of experiments during 2002 and 2003 (I believe) to determine the best chain length. An average chain length of two (when memory was full) decreased the number of nodes that could fit in memory by about 4%, which increased the number of garbage collections somewhat and ended up making the program very slightly slower. An average chain length of eight made the chain searches start to take more time. I found, at that time, with the machines I had and the benchmark patterns I chose, that four was the best option.

In general, hashlife spends essentially all of its time doing hash table lookups. About 90% of the lookups succeed; about 10% fail. For every failed hashtable lookup, there is a hashtable insert. A typical hashtable lookup involves two memory misses: the first in the hashtable chain header array itself, and the second in the first node of the chain (which, thanks to move-to-front, is usually the node being sought). In general, hashlife tends to be memory bound; most of the time is spent waiting for memory accesses. For very large memories (gigabytes), on modern processors, each main data miss also involves a TLB miss, and servicing that TLB miss usually requires yet another memory access (to reach the leaf level of the page table). Use of 4K pages really hurts hashlife; I hope we can get large page support at some point. (The x86 hardware supports it, but operating systems typically make it very difficult to use.)
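To make the lookup path concrete, here is a minimal sketch of a chained find-or-insert with the move-to-front heuristic described above. The names find_node, newnode, and hashprime, and the reduction of the hash modulo the table size, are illustrative assumptions rather than the actual Golly code; the hash itself is essentially the node_hash linear combination shown below, widened so it is safe on 64-bit platforms.

   // Sketch only: chained lookup with move-to-front.  find_node,
   // newnode, and hashprime are illustrative names, not the actual
   // Golly identifiers; hashtable is the chain header array above.
   node *find_node(node *nw, node *ne, node *sw, node *se) {
      size_t h = ((size_t)se + 3*((size_t)sw +
                  3*((size_t)ne + 3*(size_t)nw))) % hashprime ;
      node *prev = 0 ;
      for (node *p = hashtable[h]; p; prev = p, p = p->next) {
         if (p->nw == nw && p->ne == ne && p->sw == sw && p->se == se) {
            if (prev) {                    // move-to-front heuristic
               prev->next = p->next ;
               p->next = hashtable[h] ;
               hashtable[h] = p ;
            }
            return p ;                     // about 90% of lookups end here
         }
      }
      node *p = newnode() ;                // every failed lookup becomes an insert
      p->nw = nw ; p->ne = ne ; p->sw = sw ; p->se = se ;
      p->res = 0 ;
      p->next = hashtable[h] ;
      hashtable[h] = p ;
      return p ;
   }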
In hlife and the hlifealgo algorithm in Golly, we use leaf nodes that hold an 8x8 array of cells. These leaf nodes are the same size as regular nodes, and they store the results of the 8x8 array advanced by two generations and advanced by one generation. In the ghashbase algorithm in Golly (which supports multistate algorithms), the leaf nodes are just nodes that hold four byte-sized states and no result information. In practice, for most patterns and rules, the leaf cells have no significant impact on performance; almost all of the work is done at higher levels.

The hash function we use is simply:

   #define node_hash(a,b,c,d) (((int)d)+3*(((int)c)+3*(((int)b)+3*((int)a)+3)))

That is, we use a linear combination of the addresses of the child nodes themselves, multiplied by (1, 3, 9, 27). This simplistic hash function appears to work well, yielding a good distribution of chain lengths. It may be worthwhile to change this function to permit more concurrent computation (for instance, using factors of (1, 3, 17, 257) and not chaining the multiplications).

* Result Caching *

We cache the result for each node directly in the node itself. The reason is that most nodes in a hashlife computation have a result computed for them; if we stored the results in a separate table, we would introduce significant additional memory (and time) overhead. It does mean we are restricted to a single result per node.

The generation calculation always proceeds by some power of two. In a pure, traditional hashlife, the result of a node that is 2^k x 2^k is always that node advanced by 2^{k-2} generations. In our hashlife, we permit the step size to be an arbitrary power of two, so for some nodes the result stored in the res field may represent a smaller step. For instance, if the step size is 1024, then all nodes up to 4096 x 4096 use the traditional 2^{k-2} step size, but all nodes larger than that use a step size of 1024. This means that if the step size is changed, we have to walk all of the nodes and invalidate the result wherever the step size for that node size has changed.

To make this faster in some common cases, we maintain an array of up to 1024 nodes that have "nontraditional" step sizes. That way, if the increment is changed such that only a small number of nodes have nontraditional step sizes, and only those might need to be invalidated, we can access them directly and invalidate them without walking the entire universe. (This only works when increasing the step size; we should enhance it to handle decreasing the step size too, by simply keeping the most recent largest 1024 nodes that have had results calculated for them. This is a relatively simple change and would make some step size decreases *much* faster. We should also make this array not a constant size, but rather a fraction of the main memory size, say 0.5%, so that very large memory sizes can benefit from it more.)

* Bit Twiddling *

At various points, hashlife needs to walk the tree structure and collect information. In order to mark which nodes have been visited and which have not, we sometimes twiddle the low-order bit of the next field or the res field. This means that, for now, all nodes must reside at even addresses. We could have used some sort of external bit vector, or an external hash set, but neither option was considered sufficiently practical or sufficiently fast; bear in mind that the nodes are not completely contiguous, so computing an index for a particular node address is not straightforward.
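As a small illustration of this trick, the helpers below set, test, and clear a mark in the low-order bit of the next field. The names are mine, not Golly's (the real code manipulates the bits inline), and the sketch relies only on nodes living at even addresses.

   // Sketch only: marking a node via the low-order bit of its next
   // pointer.  Requires <cstdint> for uintptr_t.
   #include <cstdint>

   inline int marked(const node *p) {
      return (int)((uintptr_t)p->next & 1) ;
   }
   inline void setmark(node *p) {
      p->next = (node *)((uintptr_t)p->next | 1) ;
   }
   inline void clearmark(node *p) {
      p->next = (node *)((uintptr_t)p->next & ~(uintptr_t)1) ;
   }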
Similarly, we will hijack the next field to hold more verbose information. Whenever we do this, we start by pulling the "tree of interest" out of the hashtable completely, so that we can use the next field with impunity; after the operation completes, we rehash the tree of interest. Because of this, while one of these operations is underway, we cannot perform other operations that require the hashtable.

The following operations use the mark bit in the next field:

- Garbage collection (we mark nodes that are reachable)
- Increment changing (or cache clearing)

The following operations use the mark bit in the res field:

- Population counting (which also stores the population in the next field)
- Writing a macrocell (we store the node index in the next field)

* Garbage Collection *

When memory fills, we must reclaim some nodes. We do this in a very simple way that attempts to be fast and encourages spatial and temporal locality. First, we walk the stack of active references, recursively marking the nodes that are reachable. This walk touches only the reachable nodes, so the time required is proportional to the number of live nodes; the live set tends to be 1% or less of all nodes, so this phase is usually quite fast. For some patterns run for days or weeks, the live set can grow to a substantial fraction of the total nodes, in which case this marking phase takes substantially longer.

Next, we completely zero the hashtable chain header array. We then walk the blocks of nodes we have allocated (a block is one of the contiguous chunks of nodes we malloc'd, as described earlier), and either rehash each node (if it is marked, and thus reachable) or add it to the free list. Walking the blocks in order is critical to maintaining the spatial and temporal locality that enhances the effectiveness of the processor caches. In addition, walking the blocks sequentially makes the garbage collection itself faster.

This policy of freeing every node that is not reachable from the current call stack tends to discard a lot of useful work that must then be redone as needed. I have not been able to come up with a better policy; some policies are better for some patterns, but on average this simple policy is the best I have been able to find. Certainly more experimentation is called for here.
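Concretely, the sweep phase described above can be sketched as follows. The names BLOCKSIZE, blocklist, freelist, and rehash_node are illustrative, not the actual Golly identifiers, and the sketch ignores leaf nodes and other details; marked() is the helper from the bit-twiddling sketch earlier, and hashtable and hashprime are the chain header array and table size assumed before.

   // Sketch only: the sweep phase of the collector described above.
   #include <cstring>

   static const int BLOCKSIZE = 1000 ;      // nodes per malloc'd chunk
   struct block {
      block *nextblock ;                    // blocks are kept on a list
      node nodes[BLOCKSIZE] ;
   } ;
   static block *blocklist ;                // all allocated blocks, in order
   static node *freelist ;                  // reclaimed nodes

   static void rehash_node(node *p) {       // push p back onto its chain
      size_t h = ((size_t)p->se + 3*((size_t)p->sw +
                  3*((size_t)p->ne + 3*(size_t)p->nw))) % hashprime ;
      p->next = hashtable[h] ;              // overwriting next clears the mark
      hashtable[h] = p ;
   }

   void sweep() {
      memset(hashtable, 0, hashprime * sizeof(node *)) ; // zero chain headers
      freelist = 0 ;
      for (block *b = blocklist; b; b = b->nextblock)    // walk blocks in order
         for (int i = 0; i < BLOCKSIZE; i++) {
            node *p = b->nodes + i ;
            if (marked(p))
               rehash_node(p) ;             // reachable: back into the hashtable
            else {
               p->next = freelist ;         // unreachable: onto the free list
               freelist = p ;
            }
         }
   }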
* Multithreading *

A big opportunity is the use of multithreading for hashlife. The main obstacle is the rate at which the hashtable is accessed (several million times per second); some locking scheme needs to be designed, or else a lock-free hashtable adopted. Further, we need some resolution for the case where two threads need the result of the same node. With dual-core processors common, and quad-core processors not unusual, this could give a dramatic improvement in what can be computed. Initially, garbage collection can remain single-threaded, since it does not currently take a substantial portion of the computation time; long term, we might want to multithread it too. This would not be an easy task; several options need to be considered, implemented, and compared. Further, if we do attempt a lock-free data structure, we would have to be *very* careful that it is indeed correct, and not just empirically correct.

* Open Hashing and Indexed Nodes *

The transition from 32-bit to 64-bit environments also means that we are transitioning from a cramped, fragmented virtual address space to a huge, unfragmented one. This means that rather than allocate memory in small chunks, as the current implementation must in order to work around the noncontiguous address space that 32-bit Windows and 32-bit Linux provide, we can instead allocate all nodes at once, as one contiguous region.

The downside of a 64-bit environment is that every pointer becomes eight bytes, so Golly as it stands would use an average of 50 bytes per node rather than 25. But a contiguous allocation of nodes makes it simple to use indices rather than pointers. If we use 32-bit indices rather than 64-bit pointers, we retain the 25-byte efficiency of the 32-bit platform while permitting up to 4 billion nodes (which is about 100GB of memory); a sketch of such an indexed node appears at the end of this note.

In addition, contiguous memory allows the consideration of open hashing and its variants (cuckoo hashing, for instance). The gain from open hashing would be, ideally, a single miss per lookup rather than two. The problem is that open hashing typically works best when the hashtable utilization is low. If we keep utilization low, we cannot store as many nodes, so we have to garbage collect more frequently, throwing away useful work. Striking a balance here is a real challenge, but there are good opportunities.

* Reference Counts *

One solution to the problem of garbage collection throwing away a lot of useful work is to always maintain a full hashtable: keep memory filled with nodes that have results, rather than filling memory and then throwing away almost all of those results in an aggressive garbage collection. We can do this by maintaining reference counts and collecting incrementally based on them. Incremental garbage collection has the additional benefit of letting Golly run more smoothly, without those annoying garbage collection pauses.

Using reference counts has two downsides. First, the size of the node structure increases from 24 bytes to 28 bytes. Secondly, maintaining the reference counts introduces both a lot of additional CPU instructions and a fair number of additional memory misses. I have implemented and experimented with this (in a 32-bit Golly environment), and so far it seems a wash: some patterns run somewhat faster, some run somewhat slower, and in general the difference is usually not discernible. Certainly it is worth further experimentation and exploration; I am sure improvements can be made to my reference-counted implementation.

* What Else? *

What other improvements should be added to this note? Some possible ideas:

- Make hashlife work on tori or bounded universes.
- Some sort of generational gc.
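As a concrete illustration of the indexed-node idea from the Open Hashing and Indexed Nodes section above, a 64-bit build could keep all nodes in one contiguous array and replace the pointers with 32-bit indices. This is only a sketch under assumed names (nodeindex, inode, nodepool, deref); it is not how Golly currently stores nodes.

   // Sketch only: an indexed node for a 64-bit build.  With 32-bit
   // indices the node is back to 24 bytes, and index 0 can serve as
   // the null value.
   #include <cstdint>
   #include <cstdlib>

   typedef uint32_t nodeindex ;             // room for up to 4 billion nodes

   struct inode {
      nodeindex next ;                      // chained hash link, as an index
      nodeindex nw, ne, sw, se ;            // quadtree children, as indices
      nodeindex res ;                       // cached result, as an index
   } ;                                      // 24 bytes, even on a 64-bit build

   static inode *nodepool ;                 // one contiguous allocation
   static nodeindex maxnodes ;

   void init_pool(nodeindex n) {            // allocate all nodes up front
      maxnodes = n ;
      nodepool = (inode *)std::calloc(n, sizeof(inode)) ;
   }

   inline inode *deref(nodeindex i) {       // turn an index back into a node
      return nodepool + i ;
   }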