Question
I'm looking for a fast and memory efficient approach for implementing Conway's Game of Life.
Constraints: a 96x128 board, approximately 2kB of RAM available, and a 52MHz processor (see the tech specs here: http://www.getinpulse.com/features).
My current naive solution that represents each cell as a single bit in a matrix (96*128/8=1,536 bytes) works but is too slow. What tricks can be used to improve performance?
Storing the coordinates of live cells (for example in this implementation http://dotat.at/prog/life/life.html) would use too much memory.
Answer 1:
Looks like a fun piece of hardware. Storing 1 bit per pixel of a 96x128 pixel display takes 12,288 bits. That's already three-quarters of the 16,384 bits you say are "available". If you can't even store 2 bits per cell, there's not a lot of room to do much.
A few ideas:
You have a 32-bit processor. There are several bit-twiddling tricks to take a bitmap and calculate the number of neighbors of several cells in parallel on such a processor.
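For illustration, here is a hedged C sketch of one such bit-parallel trick: give each of eight cells its own 4-bit counter inside a 32-bit word, so the eight neighbor counts are summed in parallel. The one-bit-per-cell layout (16 bytes per 128-cell row) and the helper names spread8, fetch10, and neighbour_counts8 are assumptions for the example, not anything from the post.

```c
#include <stdint.h>

/* Spread 8 cells (one per bit) into a 32-bit word with one 4-bit
 * counter per cell, so eight counts can be summed in parallel. */
static uint32_t spread8(uint32_t bits)
{
    uint32_t out = 0;
    for (int i = 0; i < 8; i++)
        out |= ((bits >> i) & 1u) << (4 * i);
    return out;
}

/* Fetch a 10-bit window centred on byte b of a row: cells 8*b-1 to
 * 8*b+8, so the horizontal neighbours of the edge cells are included.
 * Bits beyond the board edge are treated as dead. */
static uint32_t fetch10(const uint8_t *row, int b, int bytes_per_row)
{
    uint32_t v = (uint32_t)row[b] << 1;
    if (b > 0)                 v |= row[b - 1] >> 7;                  /* cell to the left  */
    if (b + 1 < bytes_per_row) v |= (uint32_t)(row[b + 1] & 1u) << 9; /* cell to the right */
    return v;
}

/* Neighbour counts (0..8) for the 8 cells of byte b, one count per
 * nibble of the result; pass an all-zero row for off-board rows. */
static uint32_t neighbour_counts8(const uint8_t *above, const uint8_t *cur,
                                  const uint8_t *below, int b, int bpr)
{
    uint32_t a = fetch10(above, b, bpr);
    uint32_t c = fetch10(cur,   b, bpr);
    uint32_t d = fetch10(below, b, bpr);

    /* For cell i (window bit i+1) the neighbours are window bits
     * i, i+1, i+2 of the rows above and below, and bits i, i+2 of the
     * current row.  All eight sums happen at once, one per nibble,
     * with no carry between nibbles since the maximum is 8. */
    return spread8(a) + spread8(a >> 1) + spread8(a >> 2)
         + spread8(c)                   + spread8(c >> 2)
         + spread8(d) + spread8(d >> 1) + spread8(d >> 2);
}
```

Each nibble of the result can then be compared against 2 and 3, together with the cell's own bit, to decide its next state.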
It's often faster to store the neighbor counts, incrementing all 8 neighbors at birth and decrementing all 8 neighbors at death, rather than recalculating the number of neighbors from scratch every generation -- but it doesn't look like you have enough memory for this approach.
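For completeness, a hedged sketch of the shape of that incremental approach; at one byte of count and one byte of state per cell it needs about 24kB, which is exactly why the idea is ruled out here. The names count, state, and set_cell are invented for the example.

```c
#include <stdint.h>

#define ROWS 96
#define COLS 128

static uint8_t count[ROWS][COLS];  /* live-neighbour count per cell */
static uint8_t state[ROWS][COLS];  /* current state per cell        */

/* On birth, bump the counts of the 8 surrounding cells; on death,
 * drop them.  The per-generation scan then only compares counts. */
static void set_cell(int r, int c, int born)
{
    state[r][c] = (uint8_t)born;
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            if (dr == 0 && dc == 0) continue;
            int rr = r + dr, cc = c + dc;
            if (rr < 0 || rr >= ROWS || cc < 0 || cc >= COLS) continue;
            if (born) count[rr][cc]++;
            else      count[rr][cc]--;
        }
    }
}
```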
Perhaps you could use 2x2 pixels per cell (so only 3,072 cells visible) or 3x3 pixels per cell (so only 1,376 cells visible), so your software does less work and so gives the illusion of running faster. (This also frees up enough RAM that you can do the neighbor count mentioned above).
Since many life patterns have large areas of dead space, perhaps have a small bitmap of "live regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region is entirely dead, and "1" if any of the cells in the corresponding region is alive. The next-generation update only needs to look at live regions and the 1-cell-wide border of dead regions; you can skip checking the 6x6 cell core of dead regions. Also, you can completely skip the entire 8x8 cell region if its 4 nearest neighboring regions are also dead.
Since many life patterns have large areas of static unchanging "still life" patterns, perhaps have a small bitmap of "dynamic regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region had no changes in the last generation update, and "1" if any of the cells in the corresponding region changed in the last update. The next generation update only needs to look at dynamic regions and the 1-cell-wide border of static regions; you can skip checking the 6x6 cell core of static regions, since it will be the same in the next generation.
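A hedged sketch of how such a region bitmap might look, assuming 8x8-cell regions packed one region-row per 16-bit word; live_regions, mark_region, and can_skip_region are invented names. The same structure works for the "dynamic regions" variant above: set bits for regions that changed instead of regions that are alive.

```c
#include <stdint.h>
#include <stdbool.h>

/* 12 region rows x 16 region columns covers a 96x128 board with
 * 8x8-cell regions; one bit per region, one uint16_t per region row. */
static uint16_t live_regions[12];

/* Mark the region containing cell (row, col) as live. */
static void mark_region(int row, int col)
{
    live_regions[row >> 3] |= (uint16_t)(1u << (col >> 3));
}

/* A region can be skipped entirely when it and its four nearest
 * neighbouring regions are all dead. */
static bool can_skip_region(int rr, int rc)
{
    uint16_t self  = live_regions[rr] >> rc;
    uint16_t left  = rc > 0  ? live_regions[rr] >> (rc - 1) : 0;
    uint16_t right = rc < 15 ? live_regions[rr] >> (rc + 1) : 0;
    uint16_t up    = rr > 0  ? live_regions[rr - 1] >> rc   : 0;
    uint16_t down  = rr < 11 ? live_regions[rr + 1] >> rc   : 0;
    return ((self | left | right | up | down) & 1u) == 0;
}
```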
If a pattern is "large enough", a representation based on Gosper's Hashlife may be able to store it in less RAM than storing a bitmap directly. Alas, I suspect you're far below the "large enough" threshold.
Answer 2:
It is quite common with small life universes to make them wrap round on all sides (to make a toroidal universe) but this requires double buffering. In your case this requires 3KB RAM but you only have 2KB.
If you don't wrap then you don't need to double-buffer the whole universe. You just need to avoid overwriting cells before you have finished using them as input to the calculation.
Say your universe is laid out as a conventional bitmap: a series of rows arranged sequentially in memory, with one spare blank row after the last one. For illustration, say the universe has four rows numbered 0 to 3, followed by the blank spare row 4:
0
1
2
3
4 (blank)
When you calculate the next generation, the new version of row 3 is calculated using rows 2, 3, and 4 (which is blank). You can write the new version of row 3 on top of row 4. Similarly, calculate the new row 2 from rows 1,2,3 and write it on top of row 3, since that data is no longer needed after calculating row 2. The new row 1 is calculated from rows 0,1,2 and written over row 2.
This means you only need memory for one extra row: 97*128 bits is 1,552 bytes.
The disadvantage is that your universe scrolls through memory and you have to have some mechanism to deal with this. Either copy it back to its original position after each calculation (which is slow) or arrange to be able to display it from any position in memory, and ensure that your calculation and display systems can wrap around from high to low addresses.
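To make the bookkeeping concrete, here is a hedged C sketch of one generation with this layout, choosing the "copy it back afterwards" option. The four-words-per-row packing and the names board, next_row, and step are assumptions; next_row uses a plain per-cell rule for clarity, and buffers its output because it deliberately overwrites the row below.

```c
#include <stdint.h>
#include <string.h>

#define ROWS 96
#define ROW_WORDS 4                 /* 128 cells = 4 x 32-bit words per row */

/* ROWS+1 rows: the extra row after the board is kept blank, so the
 * bottom row sees a dead neighbour row and has somewhere to scroll into. */
static uint32_t board[(ROWS + 1) * ROW_WORDS];

static int cell(const uint32_t *row, int col)
{
    return (int)((row[col >> 5] >> (col & 31)) & 1u);
}

/* Next generation of one row from (above, cur, below).  Written per
 * cell for clarity; the bit-parallel tricks from the other answers can
 * drop in here.  Buffered through tmp because out may alias below. */
static void next_row(const uint32_t *above, const uint32_t *cur,
                     const uint32_t *below, uint32_t *out)
{
    uint32_t tmp[ROW_WORDS] = {0};
    for (int col = 0; col < ROW_WORDS * 32; col++) {
        int n = 0;
        for (int d = -1; d <= 1; d++) {
            int c = col + d;
            if (c < 0 || c >= ROW_WORDS * 32) continue;
            n += cell(above, c) + cell(below, c);
            if (d != 0) n += cell(cur, c);
        }
        if (n == 3 || (n == 2 && cell(cur, col)))
            tmp[col >> 5] |= 1u << (col & 31);
    }
    memcpy(out, tmp, sizeof tmp);
}

/* One generation: work from the bottom row up, writing each new row one
 * row lower in memory, so only one spare row is needed.  The universe
 * ends up shifted down by a row; copy it back (the slow option above). */
static void step(void)
{
    static const uint32_t blank[ROW_WORDS] = {0};   /* dead border row */
    for (int r = ROWS - 1; r >= 0; r--) {
        const uint32_t *above = (r > 0) ? &board[(r - 1) * ROW_WORDS] : blank;
        next_row(above,
                 &board[r * ROW_WORDS],             /* old row r            */
                 &board[(r + 1) * ROW_WORDS],       /* old row r+1          */
                 &board[(r + 1) * ROW_WORDS]);      /* new row r lands here */
    }
    memmove(board, &board[ROW_WORDS], ROWS * ROW_WORDS * sizeof(uint32_t));
    memset(&board[ROWS * ROW_WORDS], 0, ROW_WORDS * sizeof(uint32_t));
}
```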
Answer 3:
Look at the chapter on the Game of Life in "The Zen of Code Optimization" by Michael Abrash. There was an implementation that coded the current and previous states of 3 cells into a single word, and used tricks with lookup tables and carry bits to generate the next states. It was blazingly fast.
Answer 4:
I would suggest starting with a routine to read a row of the board and generate two new row buffers, H and L, such that bit 'n' of H will be set if two or more of (bit n+1, bit n, bit n-1) were set in the original row, and bit 'n' of L will be set if an odd number of (bit n+1, bit n, bit n-1) were set in the original row.
Allocate a total of three pairs of row buffers (call them H0..H2 and L0..L2). Take each row of the source bitmap, compute its pair of buffers, and store it in an HL pair, keeping it and the previous two. Examining a word from all six buffers will reveal which of the 32 cells in the original matrix should be live if and only if they were previously, and which should be alive regardless of previous state.
For optimal performance in machine code, the three pairs of buffers may be interleaved; this may make it possible to achieve an execution speed of less than two cycles per pixel. Perhaps overkill at 52MHz (with only 12,288 pixels, that would be a frame rate of ~4000fps), but speed is cool.
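A hedged sketch of the core arithmetic described above, assuming 32-bit row words and ignoring the bits that cross word boundaries (a full version would pull those in from the neighbouring words). row_hl builds an H/L pair for one row word; next_word combines the three pairs bitwise so all 32 cells are decided in parallel. The function names are invented.

```c
#include <stdint.h>

/* For one 32-bit row word R, compute the pair (H, L): bit n of L is
 * the low bit and bit n of H the high bit of the sum of bits n-1, n,
 * n+1 of R.  Bits falling outside the word are treated as dead here. */
static void row_hl(uint32_t R, uint32_t *H, uint32_t *L)
{
    uint32_t a = R >> 1;                 /* bit n = original bit n+1 */
    uint32_t b = R;
    uint32_t c = R << 1;                 /* bit n = original bit n-1 */
    *L = a ^ b ^ c;                      /* parity of the three bits  */
    *H = (a & b) | (a & c) | (b & c);    /* 1 if two or more were set */
}

/* Combine the (H, L) pairs of the rows above, at, and below a row word
 * into the next generation of that word.  N is the 3x3 window sum
 * *including* the cell itself, so a cell is alive next generation when
 * N == 3, or when N == 4 and it is currently alive. */
static uint32_t next_word(uint32_t H0, uint32_t L0,   /* row above    */
                          uint32_t H1, uint32_t L1,   /* current row  */
                          uint32_t H2, uint32_t L2,   /* row below    */
                          uint32_t cur)               /* current bits */
{
    /* Bitwise addition of three 2-bit numbers per bit lane, giving the
     * four bit-planes s0..s3 of N (0..9). */
    uint32_t s0 = L0 ^ L1 ^ L2;
    uint32_t cL = (L0 & L1) | (L0 & L2) | (L1 & L2);   /* weight-2 carry */

    uint32_t u  = H0 ^ H1,  cu = H0 & H1;
    uint32_t v  = H2 ^ cL,  cv = H2 & cL;
    uint32_t s1 = u ^ v;
    uint32_t cw = u & v;
    uint32_t s2 = cu ^ cv ^ cw;
    uint32_t s3 = (cu & cv) | (cu & cw) | (cv & cw);

    uint32_t is3 = ~s3 & ~s2 &  s1 &  s0;   /* N == 3: alive regardless  */
    uint32_t is4 = ~s3 &  s2 & ~s1 & ~s0;   /* N == 4: alive iff before  */
    return is3 | (is4 & cur);
}
```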
Answer 5:
If you have license to abuse the rest of the 30KB on the device (i.e. the flash memory), you could store the board there. It's not ideal, but it is a way to work around the limited RAM.
Efficiency-wise, you're going to be trading CPU and memory performance against one another:
For memory efficiency, an array of bits is likely the optimal solution, but you lose CPU efficiency by iterating through that grid.
Depending on the capabilities of the processor and the size of memory addresses, you could use a linked list of cells to avoid iterating through the entire grid; that would save you time scanning through huge regions of dead cells.
Answer 6:
Stick with the bit array and skip the neighbor counts with this trick: create a lookup table with the possible bit patterns of neighboring cell blocks that create or maintain a live cell.
The "index" pattern can be packed into 8 bits if you want to minimize memory: 3 bits from the row above, two bits from the columns to either side, and 3 bits from the row below. You can encode the output as a single bit in a lookup table, taking up a total of only 256 bits. Use the index as a bit-shift count for your lookup array to "calculate" the result. Bit shifts and arithmetic OR operations are still very fast, and this technique eliminates counting neighboring live cells on the fly - or rather, the lookup table pre-processes the count and calculation.
The top processing bottlenecks should be checking for board edge conditions (i.e. the end of a row), handling word boundaries, and extracting and packing neighbor bits into an index value. With a 32-bit processor, you should be able to cycle through 30 cells very quickly before hitting a word boundary. Addressing cell rows could simply be a matter of adding the number of columns/32 if you pack the bits into word-sized ints. Cache the results into two spare new-life rows and copy a whole row when you're done processing one.
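Here is a hedged C sketch of such a rule table. One adjustment from the description above: because birth (exactly 3 neighbors) and survival (2 or 3 neighbors) need different patterns, the sketch folds the cell's own state in as a ninth index bit, making the table 512 bits (64 bytes, 16 words) rather than 256 bits; the shift-and-mask lookup itself works as described. The names build_rule_table and next_state are invented.

```c
#include <stdint.h>

/* One-bit-per-entry rule table: index = (centre << 8) | 8 packed
 * neighbour bits, value = the cell's state next generation.
 * 512 bits = 16 words = 64 bytes, built once at start-up. */
static uint32_t rule[16];

static void build_rule_table(void)
{
    for (int idx = 0; idx < 512; idx++) {
        int centre = idx >> 8;
        int n = 0;
        for (int b = 0; b < 8; b++)          /* count the neighbour bits */
            n += (idx >> b) & 1;
        if (n == 3 || (centre && n == 2))    /* birth or survival        */
            rule[idx >> 5] |= 1u << (idx & 31);
    }
}

/* Next state of one cell: pack its neighbourhood into an index, then
 * use shifts and a mask instead of counting neighbours at run time. */
static inline uint32_t next_state(uint32_t centre, uint32_t neighbours8)
{
    uint32_t idx = (centre << 8) | neighbours8;
    return (rule[idx >> 5] >> (idx & 31)) & 1u;
}
```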
There may be some pattern symmetries you can take advantage of to optimize things even more, but this may be enough. I think this technique will keep the processing and memory usage within your limitations.
Source: https://stackoverflow.com/questions/5059817/whats-an-efficient-implementation-of-conways-game-of-life-for-low-memory-uses