Heap memory buildup with xml-conduit parseBytes

Question

I'm parsing some rather large XML files with xml-conduit's streaming interface (parseBytes, https://hackage.haskell.org/package/xml-conduit-1.8.0/docs/Text-XML-Stream-Parse.html#v:parseBytes ), but the heap profile shows memory building up over the run, even on a small test file.

[Figure: heap profile of the run on the small test file]

[Figure: top heap users in that profile]

The actual data shouldn't take up that much heap: if I serialise the result and read it back in, resident memory use is in the kilobytes, versus the megabytes seen here.

The minimal example I've managed to reproduce this with:

{-# LANGUAGE BangPatterns      #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Control.Monad
import           Control.Monad.IO.Class
import           Data.Conduit
import           Data.Conduit.Binary    (sourceFile)
import qualified Data.Conduit.List      as CL
import           Data.Text              (Text)
import           Text.XML.Stream.Parse

type Y = [(Text, Text)]

main :: IO ()
main = do
  res1 <- runConduitRes $
          sourceFile "test.xml"
          .| Text.XML.Stream.Parse.parseBytes def
          .| parseMain
          .| CL.foldM get []
  print res1

get :: (MonadIO m, Show a) => [a] -> [a] -> m [a]
get acc !vals = do
  liftIO $! print vals           -- this oughta force it?
  return $! take 1 vals ++ acc

parseMain = void $ tagIgnoreAttrs "Period" parseDetails

parseDetails = many parseParam >>= yield

parseParam = tag' "param" parseParamAttrs $ \idAttr -> do
  value <- content
  return (idAttr, value)

parseParamAttrs = do
  idAttr <- requireAttr "id"
  attr "name"
  return idAttr

Answer 1:

If I change get to just return ["hi"] or something similar, the buildup goes away. So it seems the returned Text values keep a reference to the larger text they were sliced from (Data.Text slicing is zero-copy, cf. the comment at https://hackage.haskell.org/package/text-0.11.2.0/docs/Data-Text.html#g:18 ), so the whole underlying buffer can't be garbage collected even though we only use small parts of it.
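The sharing described above can be made visible with a small sketch. It peeks at the Text constructor exported from Data.Text.Internal purely for illustration; the names backingOffset, bigInput, sliced and copied are made up for this example, and the 1000-character input is only a stand-in for a large parser chunk. Assuming the usual array/offset/length layout of Text, the slice should report a non-zero offset into the big buffer, while the copy should start at offset 0 in a buffer of its own.

```haskell
module Main where

import qualified Data.Text as T
import Data.Text.Internal (Text (..))

-- | Offset of a Text value into its backing array.
-- Relies on the internal (array, offset, length) representation;
-- for illustration only, not something to depend on in real code.
backingOffset :: T.Text -> Int
backingOffset (Text _ off _) = off

-- | Stand-in for a large chunk of input held by the parser.
bigInput :: T.Text
bigInput = T.append (T.replicate 1000 (T.pack "x")) (T.pack "needle")

-- | A small slice of the big input: zero-copy, so it still points into
-- (and keeps alive) the whole backing array of bigInput.
sliced :: T.Text
sliced = snd (T.breakOn (T.pack "needle") bigInput)

-- | The same characters in a freshly allocated array of their own.
copied :: T.Text
copied = T.copy sliced

main :: IO ()
main = do
  print (backingOffset sliced)  -- offset into the large shared buffer
  print (backingOffset copied)  -- offset into the small private buffer
  print (sliced == copied)      -- same contents either way
```

Both values compare equal, which is why the problem is easy to miss: only the retained memory differs.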

Our fix is to use Data.Text.copy on any attributes we want to yield:

-- assumes: import qualified Data.Text as T
someattr <- requireAttr "n"
yield (T.copy someattr)

which lets us parse with nearly constant memory use.
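One detail worth keeping in mind is laziness: an unevaluated T.copy thunk still refers to the original slice, so the copy only helps once it has been forced. Below is a minimal sketch of a helper that copies and forces both fields of an (id, value) pair like the ones in the question. The name detachPair is hypothetical and not part of xml-conduit or text, and the sample chunk just stands in for parser input.

```haskell
module Main where

import qualified Data.Text as T

-- | Hypothetical helper: copy both fields of a parsed (id, value) pair
-- into fresh buffers, forcing the copies as soon as the pair is evaluated,
-- so no thunk keeps the original slices (and their buffer) alive.
detachPair :: (T.Text, T.Text) -> (T.Text, T.Text)
detachPair (k, v) =
  let k' = T.copy k
      v' = T.copy v
  in k' `seq` v' `seq` (k', v')

main :: IO ()
main = do
  -- Stand-in for a chunk of input that the slices would otherwise retain.
  let chunk  = T.pack "<param id=\"p1\">42</param>"
      idAttr = T.take 2 (T.drop 11 chunk)  -- slice covering the id
      value  = T.take 2 (T.drop 15 chunk)  -- slice covering the content
  print (detachPair (idAttr, value))       -- prints ("p1","42")
```

In the question's parseParam this could be used as `return $! detachPair (idAttr, value)`, though whether the extra forcing is needed depends on how the results are consumed downstream.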

(And we might consider using https://markkarpov.com/post/short-bs-and-text.html#shorttext if we want to save even more memory.)

Source: https://stackoverflow.com/questions/55417620/heap-memory-buildup-with-xml-conduit-parsebytes

Tags

haskell

memory-leaks

xml-conduit