I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep.
The problem is that mapM is not as lazy as you might think: it builds the full result list, with one element per file path. And since the file IO you are using is lazy, you end up with a list holding one open file per path.
The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate:
getFileLine path = do
    theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    evaluate theHash
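As a minimal, self-contained sketch of this approach (an illustration, not the original code: it hashes files named on the command line and uses pureMD5's md5 in place of the hex/hash pair above):

module Main where

import Control.Exception (evaluate)
import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as BS

getFileLine :: FilePath -> IO String
getFileLine path = do
    theHash <- liftM (\c -> show (md5 c) ++ " " ++ path) (BS.readFile path)
    evaluate theHash  -- forcing the hash line drains the lazy read, so the handle can be closed

main :: IO ()
main = mapM_ (\p -> putStrLn =<< getFileLine p) =<< getArgs

Because evaluate forces the concatenation to weak head normal form, the digest itself has to be computed, which in turn consumes the whole file before the next one is opened.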
As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.
NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.
Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory. For some reason it still seems to run out of memory on very large files; I don't really understand Iteratee well enough yet to be able to tell why easily.
module Main where

import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8)
import System.Directory
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS
import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW

-- Run the hashing iteratee over one file, threading the MD5 context through StateT.
evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext

iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    -- At end of input, finalize the accumulated context into a digest.
    chunk s@(I.EOF Nothing) =
        get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    -- For each chunk, fold it into the running MD5 context and keep going.
    chunk (I.Chunk c) = do
        modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
        return $ I.Cont (I.IterateeG chunk) Nothing

fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path

main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path)
    =<< getRecursiveContents "."
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
    names <- getDirectoryContents topdir
    let properNames = filter (`notElem` [".", ".."]) names
    paths <- concatForM properNames $ \name -> do
        let path = topdir </> name
        isDirectory <- doesDirectoryExist path
        if isDirectory
            then getRecursiveContents path
            else do
                isFile <- doesFileExist path
                if isFile
                    then return [path]
                    else return []
    return (sort paths)

concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)
Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)
EDIT: sorry, I thought the problem was with the files, not directory reading/traversal. Ignore this.
No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).
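A minimal sketch of that recipe (an illustration, not the original poster's code: it uses pureMD5's md5 and takes file names from the command line):

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Digest.Pure.MD5 (md5)
import System.Environment (getArgs)
import System.IO (IOMode(ReadMode), hClose, openFile)
import qualified Data.ByteString.Lazy as BS

hashOne :: FilePath -> IO String
hashOne path = do
    h <- openFile path ReadMode
    contents <- BS.hGetContents h
    let !digest = md5 contents  -- forcing the digest consumes the whole file
    hClose h                    -- so it is now safe to close the handle explicitly
    return (show digest ++ " " ++ path)

main :: IO ()
main = mapM_ (\p -> putStrLn =<< hashOne p) =<< getArgs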
Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread on Haskell-Cafe.
It defers an IO operation (here, opening a file) until its result is actually required. This makes it possible to avoid opening all the files at once, and instead to read and process them sequentially (opening each one lazily).
Now, I believe, mapM getFileLine opens all the files but does not start reading from them until putStr . unlines runs. Thus a lot of thunks with open file handles float around, and that is the problem. (Please correct me if I am wrong.)
A modified example with unsafeInterleaveIO has been running against a 100 GB directory for several minutes now, in constant space.
import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString.Lazy as BS

getList :: FilePath -> IO [String]
getList p =
    let getFileLine path =
            liftM (\c -> (show . md5 $ c) ++ " " ++ path)
                  (unsafeInterleaveIO $ BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p
(I switched to the pureMD5 implementation of the hash.)
P.S. I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one was quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.
Lazy IO is very bug-prone.
As dons suggested, you should use strict IO.
You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.
import Control.Monad (when)
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)

hashFile :: FilePath -> IO BS.ByteString
hashFile =
    fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024

strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
    takeWhile (not . BS.null) $ do
        handle <- liftIO $ openFile filename ReadMode
        repeat () -- this makes the lines below loop
        chunk <- liftIO $ BS.hGet handle chunkSize
        when (BS.null chunk) . liftIO $ hClose handle
        return chunk
I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.
Monadic lists/ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken and also don't come with useful functions like takeWhile).
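A hypothetical usage sketch (not part of the original answer), wiring hashFile to the question's getRecursiveContents and assuming the Binary encoding of MD5Digest is the raw 16-byte digest, could look like this:

import Numeric (showHex)

main :: IO ()
main = mapM_ report =<< getRecursiveContents "."
  where
    report path = do
        digest <- hashFile path            -- strict: the file is fully read and closed here
        putStrLn (toHex digest ++ " " ++ path)
    toHex = concatMap hexByte . BS.unpack  -- render the digest bytes as lowercase hex
    hexByte w = let s = showHex w "" in if length s < 2 then '0' : s else s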
Edit: my assumption was that the user was opening thousands of very small files; it turns out the files are very large. Laziness will be essential.
Well, you'll need to use a different IO mechanism: either strict IO, or lazy IO that is managed more carefully (for example with iteratees).
I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.
For example, you can replace your lazy IO with System.IO.Strict, yielding:
import qualified System.IO.Strict as S

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                   (S.readFile path)
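If you would rather keep the bytestring benefits instead of unpacking, a hedged variant of the same idea (not part of this answer: it swaps in pureMD5's md5, reads each file as a strict ByteString, and assumes the question's getRecursiveContents) might look like:

import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where getFileLine path =
              liftM (\c -> show (md5 (BL.fromChunks [c])) ++ " " ++ path)
                    (B.readFile path)  -- strict read: the handle is closed as soon as the file is read

Note that a strict read pulls each whole file into memory, so for the very large files mentioned in the edit above a chunked approach (iteratees or the monadic-list version) is still preferable.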