Haskell lazy I/O and closing files

前端 未结 7 804
攒了一身酷
攒了一身酷 2020-12-07 23:35

I\'ve written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep

相关标签:
7条回答
  • 2020-12-07 23:43

    The problem is that mapM is not as lazy as you think - it results in a full list with one element per file path. And the file IO you are using is lazy, so you get a list with one open file per file path.

    The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate:

    getFileLine path = do
      theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
      evaluate theHash
    

    As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.

    0 讨论(0)
  • 2020-12-07 23:48

    NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.


    Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory.

    For some reason it still seems to run out of memory on very large files—I don't really understand Iteratee well enough yet to be able to tell why easily.

    module Main where
    
    import Control.Monad.State
    import Data.Digest.Pure.MD5
    import Data.List (sort)
    import Data.Word (Word8) 
    import System.Directory 
    import System.FilePath ((</>))
    import qualified Data.ByteString.Lazy as BS
    
    import qualified Data.Iteratee as I
    import qualified Data.Iteratee.WrappedByteString as IW
    
    evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext
    
    iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
    iteratee = I.IterateeG chunk
      where
        chunk s@(I.EOF Nothing) =
          get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
        chunk (I.Chunk c) = do
          modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
          return $ I.Cont (I.IterateeG chunk) Nothing
    
    fileLine :: FilePath -> MD5Digest -> String
    fileLine path c = show c ++ " " ++ path
    
    main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path) 
       =<< getRecursiveContents "."
    
    getRecursiveContents :: FilePath -> IO [FilePath]
    getRecursiveContents topdir = do
      names <- getDirectoryContents topdir
    
      let properNames = filter (`notElem` [".", ".."]) names
    
      paths <- concatForM properNames $ \name -> do
        let path = topdir </> name
    
        isDirectory <- doesDirectoryExist path
        if isDirectory
          then getRecursiveContents path
          else do
            isFile <- doesFileExist path
            if isFile
              then return [path]
              else return []
    
      return (sort paths)
    
    concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
    concatForM xs f = liftM concat (forM xs f)
    

    Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)

    0 讨论(0)
  • 2020-12-07 23:57

    EDIT: sorry, thought the problem was with the files, not diectory reading/traversal. Ignore this.

    No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).

    0 讨论(0)
  • 2020-12-08 00:01

    unsafeInterleaveIO?

    Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread in Haskell Cafe.

    It defers an input-output operation (opening a file) until it is actually required. Thus it is possible to avoid opening all files at once, and instead read and process them sequentially (open them lazily).

    Now, I believe, mapM getFileLine opens all files but does not start reading from them until putStr . unlines. Thus a lot of thunks with open file handlers float around, this is the problem. (Please correct me if I am wrong).

    An example

    A modified example with unsafeInterleaveIO is running against a 100 GB directory for several minutes now, in constant space.

    getList :: FilePath -> IO [String]
    getList p =
      let getFileLine path =
            liftM (\c -> (show . md5 $ c) ++ " " ++ path)
            (unsafeInterleaveIO $ BS.readFile path)
      in mapM getFileLine =<< getRecursiveContents p 
    

    (I changed for pureMD5 implementation of the hash)

    P.S. I am not sure if this is good style. I believe that solutions with iteretees and strict IO are better, but this one is quicker to make. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.

    0 讨论(0)
  • 2020-12-08 00:04

    Lazy IO is very bug-prone.

    As dons suggested, you should use strict IO.

    You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.

    import Control.Monad.ListT (ListT) -- List
    import Control.Monad.IO.Class (liftIO) -- transformers
    import Data.Binary (encode) -- binary
    import Data.Digest.Pure.MD5 -- pureMD5
    import Data.List.Class (repeat, takeWhile, foldlL) -- List
    import System.IO (IOMode(ReadMode), openFile, hClose)
    import qualified Data.ByteString.Lazy as BS
    import Prelude hiding (repeat, takeWhile)
    
    hashFile :: FilePath -> IO BS.ByteString
    hashFile =
        fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024
    
    strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
    strictReadFileChunks chunkSize filename =
        takeWhile (not . BS.null) $ do
            handle <- liftIO $ openFile filename ReadMode
            repeat () -- this makes the lines below loop
            chunk <- liftIO $ BS.hGet handle chunkSize
            when (BS.null chunk) . liftIO $ hClose handle
            return chunk
    

    I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.

    Monadic lists/ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken and also don't come with useful functions like takeWhile)

    0 讨论(0)
  • 2020-12-08 00:04

    Edit: my assumption was that the user was opening thousands of very small files, it turns out they are very large. Laziness will be essential.

    Well, you'll need to use a different IO mechanism. Either:

    • Strict IO (process the files with Data.ByteString or System.IO.Strict
    • or, Iteratee IO (for experts only at the moment).

    I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.

    For example, you can replace your lazy IO with System.IO.Strict, yielding:

    import qualified System.IO.Strict as S
    
    getList :: FilePath -> IO [String]
    getList p = mapM getFileLine =<< getRecursiveContents p
        where
            getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                     (S.readFile path)
    
    0 讨论(0)
提交回复
热议问题