Question
I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).
To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a tagsoup based parser and then outputting/formatting the resultant list.
This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.
What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with Tagsoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
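For what it's worth, reading each file strictly sidesteps the descriptor leak entirely, since the handle is closed before the next file is opened. A minimal sketch, assuming the bytestring package (readFileStrict is a name made up here, and Char8.unpack is byte-per-char, which is fine for ASCII-heavy HTML):

```haskell
import qualified Data.ByteString.Char8 as B

-- Strict read: B.readFile pulls the whole file into memory and closes
-- the handle before returning, so no descriptor stays open across mapM.
readFileStrict :: FilePath -> IO String
readFileStrict path = fmap B.unpack (B.readFile path)
```

With this, mapM over 4500 paths only ever holds one open handle at a time.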
Here's some code. I apologize for the naivety:
import System.FilePath
import Text.HTML.TagSoup

data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
  where
    title =
      innerText $
        (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
          tags
    base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
    tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path

-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined

main :: IO ()
main = do
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."
  -- Do stuff with metas, which will cause files to actually be read.
  mapM_ print metas
Answer 1:
The quick and dirty solution:
parseMetaDataFile path = withFile path ReadMode $ \h -> do
    res@(MetaData x y) <- fmap readMetaData $ hGetContents h
    _ <- Control.Exception.evaluate (length (x ++ y))
    return res
A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
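A minimal sketch of that NFData approach, assuming the deepseq package. The readMetaData here is a trivial stand-in, not the asker's tagsoup parser; the point is that evaluate (force ...) fully evaluates the result before withFile closes the handle:

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

data MetaData = MetaData String String deriving (Show, Eq)

-- Forcing a MetaData to normal form forces both fields.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base

-- Placeholder parser: first line as title, second as base href.
-- (The real version would use tagsoup, as in the question.)
readMetaData :: String -> MetaData
readMetaData input =
  case lines input of
    (t : b : _) -> MetaData t b
    _           -> MetaData "" ""

-- The whole structure is forced before the handle is closed,
-- so nothing lazy escapes that would try to read a closed file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  evaluate (force (readMetaData contents))
```

With GHC.Generics, the instance can also be derived instead of written by hand.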
Answer 2:
If you want to keep the current design you must make sure parseMetaDataFile has consumed the entire string from readFile before returning. When readFile reaches end-of-file the file descriptor will be closed.
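One way to consume the whole string before returning, sketched with a stand-in readMetaData (the real one is the tagsoup parser in the question): forcing the length of the contents drives readFile to end-of-file, which is what makes lazy IO release the descriptor.

```haskell
import Control.Exception (evaluate)

data MetaData = MetaData String String deriving (Show, Eq)

-- Hypothetical stand-in for the tagsoup-based parser in the question.
readMetaData :: String -> MetaData
readMetaData input = MetaData input ""

parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = do
  contents <- readFile path
  -- Forcing the length reads the file to EOF, so the handle is
  -- closed before this function returns.
  _ <- evaluate (length contents)
  return (readMetaData contents)
```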
Source: https://stackoverflow.com/questions/5943250/processing-too-many-xml-files-with-tagsoup