Optimising Haskell data reading from file

忘掉有多难 2021-02-07 18:16

I am trying to implement Kosaraju's graph algorithm on a 3.5-million-line file where each row is two space-separated Ints representing a graph edge. To start I need to create a su

3 answers

你的背包 (original poster)

2021-02-07 18:59
Using maps:
- Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.
- Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.
Example for createGraph:
```
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node

createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
    where
        addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
        addFwd y _                   = Just (Node False [y] [])
        addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
        addBwd x _                   = Just (Node False [] [x])

        build :: Graph1 -> (Int, Int) -> Graph1
        build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc 
```
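The point about halving lookups can be sketched on a simplified graph that stores successor lists only. The names `Adjacency`, `addEdgeTwoPass`, `addEdgeAlter` and `addEdgeInsertWith` are made up for this illustration and are not from the question; the first function shows the lookup-then-insert shape the advice argues against, the other two do the same work in one traversal of the map.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Simplified graph for illustration: node name -> successor list.
type Adjacency = IM.IntMap [Int]

-- Two traversals per edge: one to look the key up, one to insert.
addEdgeTwoPass :: Adjacency -> (Int, Int) -> Adjacency
addEdgeTwoPass g (x, y) = case IM.lookup x g of
    Just ys -> IM.insert x (y : ys) g
    Nothing -> IM.insert x [y] g

-- One traversal per edge: alter sees the old value (if any) in place.
addEdgeAlter :: Adjacency -> (Int, Int) -> Adjacency
addEdgeAlter g (x, y) = IM.alter (Just . maybe [y] (y :)) x g

-- Same result with insertWith, which calls (++) newValue oldValue.
addEdgeInsertWith :: Adjacency -> (Int, Int) -> Adjacency
addEdgeInsertWith g (x, y) = IM.insertWith (++) x [y] g

-- Any of the three can be folded over the edge list.
buildWith :: (Adjacency -> (Int, Int) -> Adjacency) -> [(Int, Int)] -> Adjacency
buildWith step = foldl' step IM.empty
```

All three builders produce the same map; whether the single-traversal versions are measurably faster on a given input is something to benchmark rather than assume.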
Using vectors:
- Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.
- In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
- Use unboxed vectors when possible.
- Use unsafe functions when you're absolutely sure about the bounds.
- Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i.e. prefer mutable vectors to immutable vectors of mutable references).
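Two of the points above can be sketched briefly, assuming the vector package is available. `outDegrees` uses a pure accumulator on an unboxed vector, and `fibs` shows the self-reference trick: elements of a boxed vector are lazy, so an element may be defined in terms of other elements of the same vector. Both function names are hypothetical and chosen only for this illustration.

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

-- Pure accumulation into an unboxed vector: count outgoing edges per node.
-- Assumes every source index lies in [0, n).
outDegrees :: Int -> [(Int, Int)] -> U.Vector Int
outDegrees n edges = U.accum (+) (U.replicate n 0) [(x, 1) | (x, _) <- edges]

-- Self-referential boxed vector: each element is a lazy thunk that reads
-- earlier elements of the very vector being defined.
fibs :: Int -> V.Vector Integer
fibs n = table
  where
    table = V.generate n go
    go 0 = 0
    go 1 = 1
    go i = table V.! (i - 1) + table V.! (i - 2)
```

The self-reference only works with boxed vectors; an unboxed vector is strict in its elements, so the same definition would not terminate there.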
Example for createGraph:
```
import qualified Data.Vector as V

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node

createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
    graph    = V.replicate (maxIndex + 1) (Node False [] [])  -- valid indices are 0..maxIndex
    graph'   = V.accum (\(Node e f b) x -> Node e (x:f) b) graph  edges
    graph''  = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
```
Note that if there are gaps in the range of the node indices, then it'd be wise to either
1. Contiguously relabel the indices before doing anything else.
2. Introduce an empty constructor to Node to signify a missing index.
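Option 1 could look roughly like the following sketch, which uses IntSet and IntMap from containers. The function name `relabel` and its return shape are assumptions made for this example; it maps the distinct node names, in ascending order, to 0, 1, 2 and so on, and rewrites the edges accordingly.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- Returns the old-name -> new-index table together with the relabelled edges.
-- New indices are contiguous, starting at 0, in ascending order of old name.
relabel :: [(Int, Int)] -> (IM.IntMap Int, [(Int, Int)])
relabel edges = (index, [(index IM.! x, index IM.! y) | (x, y) <- edges])
  where
    -- Every node name that occurs as either endpoint of an edge.
    names = IS.fromList (concatMap (\(x, y) -> [x, y]) edges)
    -- toAscList yields distinct ascending keys, as fromDistinctAscList requires.
    index = IM.fromDistinctAscList (zip (IS.toAscList names) [0 ..])
```

The size of the returned table is the number of nodes, so the largest valid index for a vector-based graph is `IM.size index - 1`.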
Faster I/O:
- Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.
Example:
```
import qualified Data.ByteString.Char8 as BS
import System.IO

getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
    return [(a, b) | [a, b] <- pairs]
```
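For comparison, here is a sketch of the same reader on top of Data.Text. It is a variant, not the code above: `parseEdge` and `readEdges` are names invented for this example, malformed lines are reported through `Either` instead of `error`, and blank lines are skipped. Note that `Data.Text.IO.readFile` decodes using the current locale.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Read as TR

-- Parse one "a b" line; Left carries a description of what went wrong.
parseEdge :: T.Text -> Either String (Int, Int)
parseEdge line = case T.words line of
    [a, b] -> (,) <$> readInt a <*> readInt b
    _      -> Left ("expected two fields: " ++ T.unpack line)
  where
    -- Accept an optional sign and require the whole token to be consumed.
    readInt t = case TR.signed TR.decimal t of
        Right (n, rest) | T.null rest -> Right n
        _                             -> Left ("not an Int: " ++ T.unpack t)

-- Read a whole file, ignoring blank lines and stopping at the first bad line.
readEdges :: FilePath -> IO (Either String [(Int, Int)])
readEdges path =
    traverse parseEdge . filter (not . T.null . T.strip) . T.lines
        <$> TIO.readFile path
```

Whether Text or ByteString is faster for this input is worth measuring; for plain ASCII digits ByteString avoids the decoding step.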
Benchmarking:

Always do it, unlike me in this answer. Use criterion.
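A minimal criterion harness might look like the sketch below, assuming the criterion package is installed. Everything in it is illustrative: `sampleEdges` generates a made-up workload, and the two `outDegrees` functions are stand-ins chosen to compare Map against IntMap on Int keys. No timings are claimed here; the point is only the shape of the harness.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, env, nf)
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.Map.Strict as M

-- Hypothetical workload: n edges over n nodes (not the asker's data).
sampleEdges :: Int -> [(Int, Int)]
sampleEdges n = [(i, (i * 7919 + 1) `mod` n) | i <- [0 .. n - 1]]

-- The same computation on the two map types being compared.
outDegreesMap :: [(Int, Int)] -> M.Map Int Int
outDegreesMap = foldl' (\acc (x, _) -> M.insertWith (+) x 1 acc) M.empty

outDegreesIntMap :: [(Int, Int)] -> IM.IntMap Int
outDegreesIntMap = foldl' (\acc (x, _) -> IM.insertWith (+) x 1 acc) IM.empty

-- env builds and fully evaluates the input once, outside the timed code;
-- nf forces each result completely so laziness does not hide the work.
runBenchmarks :: IO ()
runBenchmarks =
    defaultMain
        [ env (pure (sampleEdges 100000)) $ \edges ->
            bgroup "outDegrees"
                [ bench "Map"    (nf outDegreesMap edges)
                , bench "IntMap" (nf outDegreesIntMap edges)
                ]
        ]
```

To run it, call `runBenchmarks` from `main` and compile with `-O2`, since unoptimised builds can give misleading comparisons.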