问题
here's some xml i'm parsing:
<?xml version="1.0" encoding="utf-8"?>
<data>
<row ows_Document='Weekly Report 10.21.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Daily Update 10.20.2020'
ows_Category='Daily Update'/>
<row ows_Document='Weekly Report 10.14.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Weekly Report 10.07.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Spanish: Reporte Semanal 07.10.2020'
ows_Category='Weekly Report'/>
</data>
i've been trying to figure out how to get the conduit parser to reject records unless ows_Category
is Weekly Report
and ows_Document
doesn't contain Spanish
. at first, i used a dummy value (in parseDoc'
below) to filter them out after parsing, but then i realized i should be able to use Maybe
(in the otherwise identical parseDoc
below), together with join
to collapse out my Maybe
layer with the one used by tag'
event parser that fails based on name or attribute matches. it compiles, but behaves bizarrely, apparently not even trying to send certain elements to the parser! how could this be?
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.Foldable
import Data.String
import qualified Data.Text as T
import Data.XML.Types
import Text.XML.Stream.Parse
newtype Doc = Doc
{ name :: String
} deriving (Show)
main :: IO ()
main = do
r <- L8.readFile "oha.xml"
let doc = Doc . T.unpack
check (x,y) a b = if y == "Weekly Report" && not (T.isInfixOf "Spanish" x) then a else b
t :: (MonadThrow m, MonadIO m) => ((T.Text, T.Text) -> ConduitT Event o m c)
-> ConduitT Event o m (Maybe c)
t f = tag' "row" ((,) <$> requireAttr "ows_Document" <*> requireAttr "ows_Category") $ \x -> do
liftIO $ print x
f x
parseDoc, parseDoc' :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc = (join <$>) . t $ \z@(x,_) -> return $ check z (Just $ doc x) Nothing -- this version doesn't get sent all of the data! why!?!?
parseDoc' = t $ \z@(x,_) -> return $ doc $ check z x $ T.pack bad -- dummy value
parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
-> ConduitT Event o m [Doc]
parseDocs = f tagNoAttr "data" . many'
f g n = force (n <> " required") . g (fromString n)
go p = runConduit $ parseLBS def r .| parseDocs p
bad = "no good"
traverse_ print =<< go parseDoc
putStrLn ""
traverse_ print =<< filter ((/= bad) . name) <$> go parseDoc'
output -- notice how parseDoc
isn't even sent one of the records (one that should succeed, from 10.14), while parseDoc'
behaves as expected:
("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.07.2020"}
("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.14.2020","Weekly Report")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.14.2020"}
Doc {name = "Weekly Report 10.07.2020"}
when i tried further simplifying by removing everything to do with ows_Category
, suddenly parseDoc
worked fine, establishing the soundness of the idea? when i instead removed everything to do with ows_Document
, the problem remained.
i suspect i'm supposed to be doing this with requireAttrRaw, but i haven't been able to make sense of it and can't find doc/examples.
does this have to do with Applicative
-- now that i think about it, it shouldn't be able to fail based on examining values, right?
UPDATES
i found this answer from the author for a previous version of the library, which includes the intriguing force "fail msg" $ return Nothing
in a similar situation, but that abandons all parsing instead of just failing the current parse.
this comment suggests i need to throw an exception, and in the source, they use something like lift $ throwM $ XmlException "failed check" $ Just event
, but like force ... return Nothing
, this kills all parsing, instead of just the current parser. also i don't know how to get my hands on the event
.
here's a merged pull request claiming to have addressed this issue, but it doesn't discuss how to use it, only that it is "trivial" :)
ANSWER
to be explicit about the answer:
parseAttributes :: AttrParser (T.Text, T.Text)
parseAttributes = do
d <- requireAttr "ows_Document"
c <- requireAttr "ows_Category"
ignoreAttrs
guard $ not (T.isInfixOf "Spanish" d) && c == "Weekly Report"
return d
parseDoc :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc = tag' "row" parseAttributes $ return . doc
or, since in this case the attribute values can be checked independently:
parseAttributes = requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
<* requireAttrRaw' "ows_Category" ("Weekly Report" ==)
<* ignoreAttrs
where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n',as) ->
asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as
but the latter leaves open these questions regarding requireAttrRaw
:
- shouldn't we need to know the namespace if we're in charge of verifying
Name
? - why does
requireAttrRaw
send us[Content]
instead of twoMaybe Content
, one each forContentText
andContentEntity
? - what are we supposed to do with
ContentEntity
"For pass-through parsing"?
回答1:
tl;dr In tag' "row" parseAttributes parseContent
, the check
function belongs to parseAttributes
, not to parseContent
.
Why it does not behave as expected
xml-conduit is (notably) designed around the following invariants:
- when parsers are of type
ConduitT Event o m (Maybe a)
, theMaybe
layer encodes whetherEvent
s have been consumed tag' parseName parseAttributes parseContent
consumesEvent
s if and only if bothparseName
andparseAttributes
succeedtag' parseName parseAttributes parseContent
runsparseContent
if and only if bothparseName
andparseAttributes
succeed
In parseDoc
:
- the
check
function is called in theparseContent
part; at this stage,tag'
is already committed to consumeEvent
s, as per invariant 2 - a stack of 2
Maybe
layers arejoin
ed together:- the output of the
check
function, which encodes whether the current<row/>
element is relevant - the "standard"
Maybe
layer fromtag'
signature, which encodes whetherEvent
s have been consumed, as per invariant 1
- the output of the
This essentially breaks invariant 1: when check
returns Nothing
, parseDoc
returns Nothing
despite consuming Event
s of the whole <row/>
element.
This results in undefined behavior of all combinators of xml-conduit, notably many' (analyzed below.)
Why it behaves the way it does
The many' combinator relies on invariant 1 to do its job. It is defined as many' consumer = manyIgnore consumer ignoreAnyTreeContent, that is:
- try
consumer
- if
consumer
returnsNothing
, then skip element or content using ignoreAnyTreeContent, assuming it hasn't been consumed yet byconsumer
, and recurse back to step (1)
In your case, consumer
returns Nothing
for the Daily Update 10.20.2020
item, even though the complete <row/>
element has been consumed. Therefore, ignoreAnyTreeContent is run as a means to skip that particular <row/>
, but actually ends up skipping the next one instead (Weekly Report 10.14.2020
).
How to achieve the expected behavior
Move the check
logic to the parseAttributes
part, so that Event
consumption becomes coupled to whether check
passes.
来源:https://stackoverflow.com/questions/64492279/why-doesnt-runconduit-send-all-the-data