I am trying to implement Kosaraju\'s graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a su
This is not really an answer, I would rather comment András Kovács post, if I add those 50 points...
I have implemented the loading of the graph in both IntMap and MVector, in a attempt to benchmark mutability vs. immutability.
Both program use Attoparsec for the parsing. There is surely more economic way to do it, but Attoparsec is relatively fast compared to its high abstraction level (the parser can stand in one line). The guideline is to avoid String
and read
. read
is partial and slow, [Char]
is slow and not memory efficient, unless properly fused.
As András Kovács noted, IntMap is better than Map for Int keys. My code provides another example of alter
usage. If the node identifier mapping is dense, you may also want to use Vector and Array. They allow O(1) indexing by the identifier.
The mutable version handle on demand the exponential growth of the MVector. This avoid to precise an upper bound on node identifiers, but introduce more complexity (the reference on the vector may change).
I benchmarked with a file of 5M edges with identifiers in the range [0..2^16]. The MVector version is ~2x faster than the IntMap code (12s vs 25s on my computer).
The code is here [Gist].
I will edit when more profiling is done on my side.
Based pretty much on András' suggestions, I've reduced a 113 second task down to 24 (measured by stopwatch as I can't quite get Criterion to do anything yet) (and then down to 10 by compiling -O2)!!! I've attended some courses this last year that talked about the challenge of optimising for large datasets but this was the first time I faced a question that actually involved one, and it was as non-trivial as my instructors' suggested. This is what I have now:
import System.IO
import Control.Monad
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.ByteString.Char8 as BS
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
-- DFS uses a stack to store next points to explore, a list can do this
type Stack = [(NodeName, NodeName)]
getBytes :: FilePath -> IO [(Int, Int)]
getBytes path = do
lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
let
pairs = (map . map) (maybe (error "Can't read integers") fst . BS.readInt) lines
return [(a,b) | [a,b] <- pairs]
main = do
--list <- getLines' "testdata.txt" -- [String]
list <- getBytes "SCC.txt" -- [String]
let list' = createGraph' list
putStrLn $ show $ list' IM.! 66
-- return list'
bmark = defaultMain [
bgroup "1" [
bench "Sim test" $ whnf bmark' "SCC.txt"
]
]
bmark' :: FilePath -> IO ()
bmark' path = do
list <- getLines path
let
list' = createGraph list
putStrLn $ show $ list' IM.! 2
createGraph' :: [(Int, Int)] -> Graph1
createGraph' xs = foldl' build IM.empty xs
where
addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
addFwd y _ = Just (Node False [y] [])
addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
addBwd x _ = Just (Node False [] [x])
build :: Graph1 -> (Int, Int) -> Graph1
build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
And now on with the rest of the exercise....
IntMap
or HashMap
when possible. Both are significantly faster for Int
keys than Map
. HashMap
is usually faster than IntMap
but uses more RAM and has a less rich library.containers
package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph
implementation in the question. Example for createGraph
:
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
where
addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
addFwd y _ = Just (Node False [y] [])
addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
addBwd x _ = Just (Node False [] [x])
build :: Graph1 -> (Int, Int) -> Graph1
build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
generate
, iterate
, constructN
, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.Example for createGraph
:
import qualified Data.Vector as V
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node
createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
graph = V.replicate maxIndex (Node False [] [])
graph' = V.accum (\(Node e f b) x -> Node e (x:f) b) graph edges
graph'' = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
Note that if there are gaps in the range of the node indices, then it'd be wise to either
Node
to signify a missing index. Data.Text
or Data.ByteString
. In both cases there are also efficient functions for breaking input into lines or words. Example:
import qualified Data.ByteString.Char8 as BS
import System.IO
getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
return [(a, b) | [a, b] <- pairs]
Always do it, unlike me in this answer. Use criterion.