How do you save a tree data structure to binary file in Haskell

问题

I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:

-- For simplicity assume each Node has only 4 childs
data Tree = Node [Tree] | Leaf [Int]

And here is how I need the data look on disk:

Each node starts with four 32-bit offsets to it's children, then follow the childs.
I don't care much about the leafs, let's say it's just n consecutive 32-bit numbers.
For practival purposes I would need some node labels or some other additional data but right now I don't care about that much neither.

It apears to me that Haskellers first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem in the bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.

This is not something that Data.Binary.Put provides so I thought this must be a perfect application of Monad transformers. But even though it sounds cool and functional, so far I have not been successfull with this approach.

I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further but unfortunatelly I am still unable to solve the problem as a whole.

Here is what I've got so far, it still leaks too much memory to be practical.

I would love to have solution that uses such functional approach, but would be grateful for any other solution as well.

回答1:

Here is implementation of two pass solution proposed by sclv.

import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')

data Tree = Node [Tree] | Leaf [Word32] deriving Show

makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1

SizeTree mimics original Tree, it does not contain data but at each node it stores size of corresponding child in Tree.
We need to have SizeTree in memory, so it worth to make it more compact (e.g. replace Ints with uboxed words).

data SizeTree
  = SNode {sz :: Int, chld :: [SizeTree]}
  | SLeaf {sz :: Int}
  deriving Show

With SizeTree in memory it is possible to serialize original Tree in streaming fashion.

putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs          -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0                                   -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs       -- data length
  mapM_ putWord32be xs                         -- leaf data


mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0

It is important to prevent GHC from merging two passes into one (in which case it will hold tree in memory). Here it is done by feeding not tree but tree generator to the function.

serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size

main = L.writeFile "dump.bin" $ serialize makeTree 10

回答2:

There are two basic approaches I would consider. If the entire serialized structure will easily fit into memory, you can serialize each node into a lazy bytestring and just use the lengths for each of them to calculate the offset from the current position.

serializeTree (Leaf nums)  = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
 where
  childBs = map serializeTree subtrees
  offsets = scanl (\acc bs -> acc+L.length bs) (fromIntegral $ 2*length subtrees) childBs
  header = runPut (mapM_ putInt32 $ init offsets)

The other option is, after serializing a node, go back and re-write the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports this. It would involve working in IO and seeking to the correct locations.

回答3:

What I think you want is an explicit two pass solution. The first converts your tree into a size annotated tree. This pass forces the tree, but can be done, in fact, without any monadic machinery at all by tying the knot. The second pass is in the plain old Put monad, and given that the size annotations are already calculated, should be very straightforward.

回答4:

Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 Mbytes and then hangs on to that for the rest of the execution.

Note that I haven't tried reading the data back, so there may be lurking errors in my size and offset calculations.

-- Paste this into TreeBinary.hs, and compile with
--    ghc -O2 --make TreeBinary.hs -o TreeBinary

module Main where


import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B

import Data.List (init)
import Data.Monoid
import Data.Word


-- -------------------------------------------------------------------
-- Test data.

data Tree = Node [Tree] | Leaf [Word32] deriving Show

-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)

-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int.  This is an example of lazy programming instead
-- of lazy evaluation. 

makeTree :: Tree
makeTree = makeTree1 9
  where makeTree1 0 = Leaf [0..100]
        makeTree1 n = Node [ makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1 ]

-- --------------------------------------------------------------------
-- The actual serialisation code.


-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches), 
                           baseLength + sum subLengths)
   where
      (branches, subLengths) = unzip $ map serialiseTree ts
      baseLength = fromIntegral $ 1 + 4 * length ts
      offsets = init $ scanl (+) baseLength subLengths


main = do
   putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
   BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree

来源：https://stackoverflow.com/questions/5157979/how-do-you-save-a-tree-data-structure-to-binary-file-in-haskell

标签

haskell

functional-programming

binary-tree

monads

monad-transformers