Question
I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).
To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a tagsoup based parser and then outputting/formatting the resultant list.
This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.
What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with Tagsoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
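For what it's worth, reading each file strictly sidesteps the descriptor leak entirely, since the handle is closed before the next file is opened. A minimal sketch, assuming the bytestring package (readFileStrict is a name made up here, and Char8.unpack is byte-per-char, which is fine for ASCII-heavy HTML):

```haskell
import qualified Data.ByteString.Char8 as B

-- Strict read: B.readFile pulls the whole file into memory and closes
-- the handle before returning, so no descriptor stays open across mapM.
readFileStrict :: FilePath -> IO String
readFileStrict path = fmap B.unpack (B.readFile path)
```

With this, mapM over 4500 paths only ever holds one open handle at a time.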
Here's some code. I apologize for the naivety:
import System.FilePath
import Text.HTML.TagSoup

data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
  where
    title =
      innerText $
        (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
          tags
    base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
    tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path

-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined

main :: IO ()
main = do
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."
  -- Do stuff with metas, which will cause files to actually be read.
  mapM_ print metas
Answer 1:
The quick and dirty solution:
parseMetaDataFile path = withFile path ReadMode $ \h -> do
    res@(MetaData x y) <- fmap readMetaData $ hGetContents h
    _ <- Control.Exception.evaluate (length (x ++ y))
    return res
A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
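A minimal sketch of that NFData approach, assuming the deepseq package. The readMetaData here is a trivial stand-in, not the asker's tagsoup parser; the point is that evaluate (force ...) fully evaluates the result before withFile closes the handle:

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

data MetaData = MetaData String String deriving (Show, Eq)

-- Forcing a MetaData to normal form forces both fields.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base

-- Placeholder parser: first line as title, second as base href.
-- (The real version would use tagsoup, as in the question.)
readMetaData :: String -> MetaData
readMetaData input =
  case lines input of
    (t : b : _) -> MetaData t b
    _           -> MetaData "" ""

-- The whole structure is forced before the handle is closed,
-- so nothing lazy escapes that would try to read a closed file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  evaluate (force (readMetaData contents))
```

With GHC.Generics, the instance can also be derived instead of written by hand.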
Answer 2:
If you want to keep the current design you must make sure parseMetaDataFile has consumed the entire string from readFile before returning. When readFile reaches end-of-file the file descriptor will be closed.
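One way to consume the whole string before returning, sketched with a stand-in readMetaData (the real one is the tagsoup parser in the question): forcing the length of the contents drives readFile to end-of-file, which is what makes lazy IO release the descriptor.

```haskell
import Control.Exception (evaluate)

data MetaData = MetaData String String deriving (Show, Eq)

-- Hypothetical stand-in for the tagsoup-based parser in the question.
readMetaData :: String -> MetaData
readMetaData input = MetaData input ""

parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = do
  contents <- readFile path
  -- Forcing the length reads the file to EOF, so the handle is
  -- closed before this function returns.
  _ <- evaluate (length contents)
  return (readMetaData contents)
```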
Source: https://stackoverflow.com/questions/5943250/processing-too-many-xml-files-with-tagsoup