We continue our port of the Land of Lisp game “Dice of Doom” from Part 1 and Part 2.
Source code can be found on Github.
We can now play 2x2 games with no problem. However, trying to play a 3x3 game poses a significant challenge for us - it can be very slow. In this part, we’ll take a look at improving the performance of the game for boards bigger than 2x2.
Performance Characteristics
Before doing anything we should look at the performance characteristics of the game, in particular the size of the problem. For an N x N board with 2 players and a max of 3 dice per cell, the number of board configurations is (2 x 3)^(N x N) = 6^(N x N). A 2x2 game has 1,296 possible boards, while a 3x3 one has 10,077,696. A 4x4 game has 2,821,109,907,456 possible configurations - that’s over two trillion!
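As a quick sanity check, these counts can be reproduced directly (a throwaway snippet, not part of the game code):

```haskell
-- Number of board configurations for an N x N board:
-- 2 players x 3 dice per cell, over N*N cells.
boardCount :: Integer -> Integer
boardCount n = (2 * 3) ^ (n * n)

-- boardCount 2 == 1296
-- boardCount 3 == 10077696
-- boardCount 4 == 2821109907456
```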
For the 2x2 boards, we can generate game trees for all scenarios and look at some stats, depending on whether A starts or B starts:
- The number of nodes in the tree.
- The eventual winners in each tree.
Getting the tree size by flattening the entire tree into a list and then getting the length of the list probably isn’t the most efficient way of doing it, but we’ll get to that in a while.
We can generate all possible 2x2 boards:
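One way to enumerate them (a sketch with placeholder types; the post’s real Board representation differs):

```haskell
import Control.Monad (replicateM)

-- Placeholder types for the sketch.
data Player = PlayerA | PlayerB deriving (Show, Eq)
type Cell   = (Player, Int)   -- occupying player and number of dice
type Board  = [Cell]          -- four cells for a 2x2 board

-- Every combination of player and 1-3 dice...
allCells :: [Cell]
allCells = [(p, d) | p <- [PlayerA, PlayerB], d <- [1, 2, 3]]

-- ...placed in each of the four cells, giving 6^4 = 1,296 boards.
all2x2Boards :: [Board]
all2x2Boards = replicateM 4 allCells
```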
The function to print out the board statistics for these boards is:
The results are given here.
Most of these games are short or very one-sided. The interesting games are the balanced ones: a larger number of nodes usually means a more even chance of A winning, B winning or a tie.
This doesn’t look too bad. How about a 3x3 game? We won’t generate all possible boards, but take a few random ones and play them.
Note: GHC vs GHCi
Whenever you’re doing performance work I recommend you compile your code with optimisations switched on, rather than running it through GHCi. To do this we’ll need to add a main function and change the module name to “Main”. Then compile with optimisations using the -O2 flag:
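As a rough sketch of what that looks like (module layout and file name here are assumptions, not the post’s actual code):

```haskell
-- Compile from the shell with something like:  ghc -O2 Main.hs
module Main where

main :: IO ()
main =
    -- In the real code this prints statistics for some test boards;
    -- here it is just a placeholder so the module compiles.
    putStrLn "board statistics go here"
```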
Playing this a few times:
Wow! There’s a huge variation here. The last game tree, when play is started by B, has over 5 million nodes.
That’s nothing: the following game tree has 1.4 BILLION nodes when play is started by A:
You might ask yourself how this was calculated, given that my laptop only has 8GB of RAM. Let’s say that each node is 100 bytes. This is assuming 1 Int == 4 bytes and summing up the Ints used in players, cells etc. The actual value is much larger, but let’s be conservative. We can fit 10 boards in 1KB, 10,000 boards in 1MB, 10 million boards in 1GB, which means that the max tree size I should be able to accommodate in memory is one with 80 million nodes. So how could I calculate the size of the above game tree?
The answer is laziness and garbage collection. I initially only wanted to know the tree sizes, so my main function was like this:
It took a couple of hours, but gave the following results:
The impressive thing was that it ran in constant memory. Looking at the process stats in Mac OS X’s Activity Monitor, the private memory usage never got above 2.5 MB.
Buoyed by this, I added more stats, including the winners. Re-running, it didn’t take long before I consumed all available memory.
The reason for this is that if I just get the tree size, unused nodes are garbage collected. I don’t need to generate the whole tree up-front to start calculating the tree size - laziness helps here. However, if I need the tree for separate calls to get the size and the winners, the nodes can’t be garbage collected before the second call, and the whole tree resides in memory.
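The same effect can be seen in miniature with a lazy list (an illustration of the principle, not the post’s code):

```haskell
main :: IO ()
main = do
    let xs = [1 .. 50000000 :: Integer]
    print (length xs)   -- constant space: cons cells are garbage
                        -- collected as soon as they have been counted
    -- print (sum xs)   -- a second use of xs would change that: the whole
                        -- list would have to stay live between the traversals
```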
Aside: Calculating the Size of the Tree
The initial version of treeSize flattens the entire tree and then gets its length. I thought this would be horribly inefficient, but that it would do me for a while. After getting an initial set of stats, I resolved to ‘fix’ the issue with treeSize:
Re-running this on a large tree (several million nodes) resulted in all memory being consumed.
The problem here is with the foldl function. It’s much better to use foldl'.
See the WikiBooks entry for more information on this.
Even with this, the function wasn’t as fast as the flattening one. I finally settled on the following simple function, which turns out to be faster still:
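The exact definition is in the source, but the idea is a direct count over the children, along these lines (placeholder node type, not the post’s actual tree type):

```haskell
-- A stand-in node type; the real game tree carries more data per node.
data Tree a = Node a [Tree a]

treeSize :: Tree a -> Integer
treeSize (Node _ children) = 1 + sum (map treeSize children)
```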
Shared Structure
There’s an interesting thing about the huge game tree above: if there are 1.4 billion nodes in a tree, but only 10 million possible board configurations, then there’s probably a lot of repetition going on - the same branches may occur multiple times within the tree.
Let’s see an example of this. The game tree used in Part 2 is a good example:
The shared nodes are coloured in various shades of yellow.
The Lisp implementation memoises the calls to get-tree by using a hash-based lookup table. By doing this it only calculates a sub-tree once. Let’s see what effect this memoisation has on our test trees.
By modifying the memoised function, we can log the ‘cache-hits’ vs the ‘cache-misses’:
Run the game-tree function on *test-board-3x3-e*:
As with the Haskell version, this runs in fairly constant space, but still takes a couple of hours.
For the board with 1.4 billion nodes, the results are interesting:
In other words, in the tree of 1.4 billion nodes, with 472,109 sub-trees, only 201,861 of these sub-trees had to be computed. Even if you play this game in the CLisp REPL, it only takes around 30 seconds for it to generate the game tree.
The Lisp memoization solves our two problems above:
- The time taken to generate the game tree is reduced. In addition, by memoising the other calls, the time taken to calculate moves is severely reduced.
- The space taken up by the tree is also a lot less.
This second point is interesting, and not immediately obvious on a casual glance at the Lisp code. Once a sub-tree has been stored in the hash table, all subsequent ‘cache hits’ returned by the game-tree function are referenced, not copied, in the parent. While logically the structure is a tree, several sub-trees share the same structure. In effect, this turns the tree into a Directed Acyclic Graph. The ‘magic’ behind this is the Lisp cons cell. If you want to find more information on this, Peter Seibel’s book Practical Common Lisp, in particular Chapter 12, They Called It Lisp For A Reason, is a good read.
Haskell also has cons cells, so several items in a list can refer to a single shared data structure. Using this, we can memoise the game tree creation in a similar way.
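For example (a toy illustration, not game code):

```haskell
-- Both xs and ys reference the same 'shared' list: the million-element
-- spine exists once in memory, reachable from two places.
shared :: [Int]
shared = [1 .. 1000000]

xs, ys :: [Int]
xs = 0 : shared
ys = 9 : shared

main :: IO ()
main = print (length xs + length ys)
```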
Aside: Profiling an Application
Before we second-guess the performance issues in the game, it’s good to get a real idea of where the bottlenecks lie by profiling it.
GHC allows us to turn on profiling:
To run, we need to tell the GHC runtime that we want to do the profiling (-p) and that we want garbage collection stats as well (-s): (This test was done using test3x3BoardB)
The profiling output is stored in DiceOfDoom-h.prof.
The GC stats are very telling. For test3x3BoardC we have:
There’s a lot of GC going on - which shouldn’t be a surprise as the game trees have millions of nodes.
For info on the interpretation of GC stats, see Running a compiled program in the GHC docs.
Memoising Tree Creation
In languages that allow mutation, memoising is a fairly straightforward process: take a function you want memoised, wrap it in another function that calls it only if it hasn’t already been called with those arguments, and store the result for subsequent calls.
Haskell doesn’t allow mutation directly, so we have to use some sort of state mechanism. The approach we’ll use is described in http://www.maztravel.com/haskell/memofib.html and in a thread on the comp.lang.haskell group.
In some ways the approach is similar to the Lisp one above: we have a function that stores calculated values in a map. However, we can’t just wrap the old function in another one - we have to modify the old function to enable us to memoise it.
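The shape of the approach, shown on a memoised Fibonacci rather than the game tree (a sketch in the spirit of the memofib article, not the post’s code):

```haskell
import qualified Data.Map as Map
import Control.Monad.State

-- The cache of already-computed results is threaded through the
-- recursive calls via the State monad.
type Memo = State (Map.Map Int Integer)

memoFib :: Int -> Memo Integer
memoFib n = do
    cache <- get
    case Map.lookup n cache of
        Just v  -> return v                 -- cache hit: reuse the result
        Nothing -> do
            v <- if n < 2
                    then return (fromIntegral n)
                    else do a <- memoFib (n - 1)
                            b <- memoFib (n - 2)
                            return (a + b)
            modify (Map.insert n v)         -- cache miss: store the result
            return v

fib :: Int -> Integer
fib n = evalState (memoFib n) Map.empty
```

Note how the cache has to be threaded explicitly: memoFib itself is written in terms of the Memo type, which is why the existing function has to be modified rather than simply wrapped.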
NOTE: The type of memoisation presented here is slightly different to what you’d expect if you were memoising functions in Python or Ruby. In the Haskell memoisation code above, the memoisation only works within recursive calls to the same function. It doesn’t work across non-recursive calls. So if you’re in GHCi and you do the following:
Then the second call to gameTree isn’t “cached” in any way - it goes through exactly the same process of trying to memoise recursive calls to itself, starting off with an empty map. We’ll see an example of memoising that can do this when we memoise the calls to get the ratings.
Fixing a Design Flaw
Before we apply memoisation to the game tree creation, we need to fix a flaw in the way the tree was designed. Recall that the game state is defined by:
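A sketch of the relevant shape (type and field names are placeholders apart from moveMade):

```haskell
data Player = PlayerA | PlayerB deriving (Show, Eq)
type Board  = [Int]
type Move   = (Int, Int)

data GameState = GameState
    { currentPlayer :: Player
    , currentBoard  :: Board
    , moveMade      :: Maybe Move   -- the move that produced this state
    } deriving (Show)
```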
We stored the moveMade in the game state to tell us that a particular move was made to get to this state. It made it easy to enumerate the child nodes and see what move would result in that board. However, several different moves from different boards can result in the same end state. If we want to memoise the game tree from a possible board, we need to remove the moveMade reference and put this into the parent. The game state should only have the current player, the current board, and the list of possible moves from that board.
To do this we’re going to put the moves made from a parent node to a child node into the parent itself, as an array of moves.
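After the change, the state carries the list of possible moves instead (again a placeholder sketch, not the post’s actual definition):

```haskell
data Player = PlayerA | PlayerB deriving (Show, Eq)
type Board  = [Int]
type Move   = (Int, Int)

data GameState = GameState
    { currentPlayer :: Player
    , currentBoard  :: Board
    , possibleMoves :: [Move]   -- the moves leading from here to the children
    } deriving (Show)
```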
This changes the gameTree function slightly; in particular, we don’t need the fromMove parameter any more. Two other functions that change are allowedMoves, which maps the possible moves to numbers, and showGameGraphTree, which is used to generate the Graphviz output:
The code is in DiceOfDoom-i.hs.
Setting up the Memoising Map
In order to memoise the calls to gameTree using the method above, we’ll need to be able to map a single something to a game tree. The gameTree function takes three parameters: a board, a player and a boolean indicating whether it’s the first move or not. We’ll wrap these three items in a single tuple. The structure of the code remains the same as the non-memoising version:
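A sketch of the scaffolding, with placeholder types standing in for the game’s real ones:

```haskell
import qualified Data.Map as Map
import Control.Monad.State

-- Placeholders for the sketch.
data Player   = PlayerA | PlayerB deriving (Eq, Ord, Show)
type Board    = [Int]
data GameTree = GameTree deriving Show

-- The three gameTree arguments wrapped into a single cache key.
type TreeKey  = (Board, Player, Bool)
type TreeMemo = State (Map.Map TreeKey GameTree)

-- Look the key up in the cache, building and storing the tree on a miss.
cached :: TreeKey -> TreeMemo GameTree -> TreeMemo GameTree
cached key build = do
    hit <- gets (Map.lookup key)
    case hit of
        Just t  -> return t
        Nothing -> do
            t <- build
            modify (Map.insert key t)
            return t
```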
Code in DiceOfDoom-j.hs
Let’s test this out on the 1.4 billion-node game tree:
Running this:
One minute and 40 seconds - that’s not bad. Memory usage peaks at 145MB.
If we profile this, and look at the GC stats for test3x3BoardC:
There’s a lot less GC going on, and the total memory in use has gone down from 7.8 GB to 425 MB.
Memoising the ratings functions
We can now play the large game above against another human:
However, if we try to play against the computer, we’re still in trouble:
It hangs for a long time. If we take a look at the profiling info, we can see that the various ratings functions are taking up a lot of time. We need to memoise these as well.
We can’t use the same memoising technique that we used to generate the tree. The childRatings function does make recursive calls to itself indirectly via ratePosition, but it is called several times independently by handleComputer. We’d really like to memoise separate calls.
We’ll take the approach used in Ugly Memoization by Lennart Augustsson. It uses unsafe IO and MVars.
The childRatings function takes a game tree and a player. However, we only need to memoise the root of the tree and the player, so the childRatingsM function takes the root as a parameter but does nothing with it.
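The general pattern, in condensed form (my own sketch of the ‘ugly memoisation’ idea, not the post’s childRatingsM):

```haskell
import qualified Data.Map as Map
import Control.Concurrent.MVar
import System.IO.Unsafe (unsafePerformIO)

-- A mutable Map hidden behind an MVar; separate top-level calls to the
-- memoised function share this cache.
memoIO :: Ord a => (a -> b) -> IO (a -> IO b)
memoIO f = do
    ref <- newMVar Map.empty
    return $ \x -> do
        cache <- readMVar ref
        case Map.lookup x cache of
            Just y  -> return y
            Nothing -> do
                let y = f x
                modifyMVar_ ref (return . Map.insert x y)
                return y

-- unsafePerformIO hides the IO, giving back an ordinary pure-looking function.
memo :: Ord a => (a -> b) -> (a -> b)
memo f = let f' = unsafePerformIO (memoIO f)
         in \x -> unsafePerformIO (f' x)

-- Usage: a memoised top-level wrapper; NOINLINE stops GHC from
-- duplicating the hidden cache.
slowSquare :: Integer -> Integer
slowSquare n = sum (replicate (fromIntegral n) n)   -- deliberately slow n*n

{-# NOINLINE fastSquare #-}
fastSquare :: Integer -> Integer
fastSquare = memo slowSquare
```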
Code in DiceOfDoom-k.hs.
With this in place, we can now play against the computer:
That’s it for part 3. The game is at the same stage as the end of chapter 15 in Land of Lisp. In the next part we’ll get the game to handle even larger boards.