I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep.
The problem is that mapM is not as lazy as you might think: it builds the full result list, with one element per file path. And since the file IO you are using is lazy, you end up with a list holding one open file per path.
The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate:
getFileLine path = do
    theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    evaluate theHash
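As a minimal, self-contained sketch of this approach (an illustration, not the original code: it hashes files named on the command line and uses pureMD5's md5 in place of the hex/hash pair above):

module Main where

import Control.Exception (evaluate)
import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as BS

getFileLine :: FilePath -> IO String
getFileLine path = do
    theHash <- liftM (\c -> show (md5 c) ++ " " ++ path) (BS.readFile path)
    evaluate theHash  -- forcing the hash line drains the lazy read, so the handle can be closed

main :: IO ()
main = mapM_ (\p -> putStrLn =<< getFileLine p) =<< getArgs

Because evaluate forces the concatenation to weak head normal form, the digest itself has to be computed, which in turn consumes the whole file before the next one is opened.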
As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.
NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.
Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory. For some reason it still seems to run out of memory on very large files; I don't really understand Iteratee well enough yet to be able to tell why easily.
module Main where

import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8)
import System.Directory
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS
import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW

-- Run the hashing iteratee over one file, threading the MD5 context through StateT.
evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext

iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    -- At end of input, finalize the accumulated context into a digest.
    chunk s@(I.EOF Nothing) =
        get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    -- For each chunk, fold it into the running MD5 context and keep going.
    chunk (I.Chunk c) = do
        modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
        return $ I.Cont (I.IterateeG chunk) Nothing

fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path

main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path)
    =<< getRecursiveContents "."
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
    names <- getDirectoryContents topdir
    let properNames = filter (`notElem` [".", ".."]) names
    paths <- concatForM properNames $ \name -> do
        let path = topdir </> name
        isDirectory <- doesDirectoryExist path
        if isDirectory
            then getRecursiveContents path
            else do
                isFile <- doesFileExist path
                if isFile
                    then return [path]
                    else return []
    return (sort paths)

concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)
Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)
EDIT: sorry, I thought the problem was with the files, not directory reading/traversal. Ignore this.
No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).
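A minimal sketch of that recipe (an illustration, not the original poster's code: it uses pureMD5's md5 and takes file names from the command line):

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Digest.Pure.MD5 (md5)
import System.Environment (getArgs)
import System.IO (IOMode(ReadMode), hClose, openFile)
import qualified Data.ByteString.Lazy as BS

hashOne :: FilePath -> IO String
hashOne path = do
    h <- openFile path ReadMode
    contents <- BS.hGetContents h
    let !digest = md5 contents  -- forcing the digest consumes the whole file
    hClose h                    -- so it is now safe to close the handle explicitly
    return (show digest ++ " " ++ path)

main :: IO ()
main = mapM_ (\p -> putStrLn =<< hashOne p) =<< getArgs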
Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread on Haskell-Cafe.
It defers an IO operation (here, opening a file) until its result is actually required. This makes it possible to avoid opening all the files at once, and instead to read and process them sequentially (opening each one lazily).
Now, I believe, mapM getFileLine opens all the files but does not start reading from them until putStr . unlines runs. Thus a lot of thunks with open file handles float around, and that is the problem. (Please correct me if I am wrong.)
A modified example with unsafeInterleaveIO has been running against a 100 GB directory for several minutes now, in constant space.
import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString.Lazy as BS

getList :: FilePath -> IO [String]
getList p =
    let getFileLine path =
            liftM (\c -> (show . md5 $ c) ++ " " ++ path)
                  (unsafeInterleaveIO $ BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p
(I switched to the pureMD5 implementation of the hash.)
P.S. I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one was quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.
Lazy IO is very bug-prone.
As dons suggested, you should use strict IO.
You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.
import Control.Monad (when)
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)

hashFile :: FilePath -> IO BS.ByteString
hashFile =
    fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024

strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
    takeWhile (not . BS.null) $ do
        handle <- liftIO $ openFile filename ReadMode
        repeat () -- this makes the lines below loop
        chunk <- liftIO $ BS.hGet handle chunkSize
        when (BS.null chunk) . liftIO $ hClose handle
        return chunk
I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.
Monadic lists/ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken and also don't come with useful functions like takeWhile).
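A hypothetical usage sketch (not part of the original answer), wiring hashFile to the question's getRecursiveContents and assuming the Binary encoding of MD5Digest is the raw 16-byte digest, could look like this:

import Numeric (showHex)

main :: IO ()
main = mapM_ report =<< getRecursiveContents "."
  where
    report path = do
        digest <- hashFile path            -- strict: the file is fully read and closed here
        putStrLn (toHex digest ++ " " ++ path)
    toHex = concatMap hexByte . BS.unpack  -- render the digest bytes as lowercase hex
    hexByte w = let s = showHex w "" in if length s < 2 then '0' : s else s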
Edit: my assumption was that the user was opening thousands of very small files; it turns out the files are very large. Laziness will be essential.
Well, you'll need to use a different IO mechanism: either strict IO, or lazy IO that is managed more carefully (for example with iteratees).
I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.
For example, you can replace your lazy IO with System.IO.Strict, yielding:
import qualified System.IO.Strict as S

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                   (S.readFile path)
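If you would rather keep the bytestring benefits instead of unpacking, a hedged variant of the same idea (not part of this answer: it swaps in pureMD5's md5, reads each file as a strict ByteString, and assumes the question's getRecursiveContents) might look like:

import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where getFileLine path =
              liftM (\c -> show (md5 (BL.fromChunks [c])) ++ " " ++ path)
                    (B.readFile path)  -- strict read: the handle is closed as soon as the file is read

Note that a strict read pulls each whole file into memory, so for the very large files mentioned in the edit above a chunked approach (iteratees or the monadic-list version) is still preferable.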