I am writing a program that counts the frequencies of n-grams in a corpus. I already have a function that consumes a stream of tokens and produces n-grams of a single order:

    ngram :: Monad m => Int -> Conduit t m [t]
    trigrams = ngram 3
    countFreq :: (Ord t, Monad m) => Consumer [t] m (Map [t] Int)

At the moment I can only connect one stream consumer to a stream source:

    tokens --- trigrams --- countFreq

How do I connect multiple stream consumers to the same stream source? I would like to have something like this:

               .--- unigrams --- countFreq
               |--- bigrams  --- countFreq
    tokens ----|--- trigrams --- countFreq
               '--- ...      --- countFreq

A plus would be to run each consumer in parallel.
EDIT: Thanks to Petr I came up with this solution:

    spawnMultiple orders = do
        chan    <- atomically newBroadcastTMChan
        results <- forM orders $ \_ -> newEmptyMVar
        threads <- forM (zip results orders) $
                       forkIO . uncurry (sink chan)
        forkIO . runResourceT $ sourceFile "test.txt"
                             $$ javascriptTokenizer
                             =$ sinkTMChan chan
        forM results readMVar
      where
        sink chan result n = do
            chan' <- atomically $ dupTMChan chan
            freqs <- runResourceT $ sourceTMChan chan'
                                 $$ ngram n
                                 =$ frequencies
            putMVar result freqs

I'm assuming you want all your sinks to receive all values.

I'd suggest:

- Use `newBroadcastTMChan` to create a new channel (`Control.Concurrent.STM.TMChan` from stm-chans).
- Use this channel to build a sink using `sinkTMChan` from `Data.Conduit.TMChan` (stm-conduit) for your main producer.
- For each client, use `dupTMChan` to create its own copy for reading. Start a new thread that will read this copy using `sourceTMChan`.
- Collect results from your threads.
- Be sure your clients can read the data as fast as it is produced; otherwise unread items pile up in the channel and you can get a heap overflow.

(I haven't tried it, let us know how it works.)
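
To make the broadcast-and-duplicate idea concrete, here is a minimal single-threaded sketch. It is an assumption-laden stand-in, not the stm-chans API: it uses `newBroadcastTChanIO` and `dupTChan` from the plain `stm` package (a `TChan` cannot be closed, unlike a `TMChan`), and the helper names `drain` and `demo` are made up for illustration. It shows that every duplicate sees every value written after it was created, and that a value written before any duplicate exists is seen by nobody.

```haskell
import Control.Concurrent.STM

-- Hypothetical helper: read everything currently available, without blocking.
drain :: TChan a -> STM [a]
drain ch = do
  next <- tryReadTChan ch
  case next of
    Nothing -> return []
    Just x  -> (x :) <$> drain ch

-- Hypothetical demo: one write-only broadcast channel, two reading copies.
demo :: IO ([Int], [Int])
demo = do
  chan <- newBroadcastTChanIO
  atomically $ writeTChan chan 0                  -- no duplicate exists yet: this value is lost
  a <- atomically $ dupTChan chan                 -- first reader's private copy
  b <- atomically $ dupTChan chan                 -- second reader's private copy
  atomically $ mapM_ (writeTChan chan) [1, 2, 3]
  as <- atomically $ drain a
  bs <- atomically $ drain b
  return (as, bs)

main :: IO ()
main = demo >>= print   -- prints ([1,2,3],[1,2,3])
```

Each copy keeps its own read position, so a slow reader holds on to everything it has not consumed yet; that is where the memory growth mentioned in the last bullet comes from.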

Update: One way to collect the results is to create an `MVar` for each consumer thread. Each thread would `putMVar` its result once it has finished, and your main thread would `takeMVar` on all these `MVar`s, thus waiting for every thread to finish. For example, if `vars` is a list of your `MVar`s, the main thread would issue `mapM takeMVar vars` to collect all the results.
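
Putting the pieces together, a rough sketch of the whole fan-out could look like the following. It deliberately avoids conduit and stm-chans so that it only needs `base`, `stm` and `containers`: a `TChan (Maybe t)` stands in for the closable channel, with `Nothing` as an assumed end-of-stream marker, and `countNGrams` and `fanOut` are hypothetical names rather than anything from the question. The channel is duplicated in the main thread before the producer starts, since a broadcast channel does not keep items for readers that do not exist yet.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM, forM_)
import qualified Data.Map.Strict as Map

-- Hypothetical consumer: reads tokens until Nothing arrives and counts n-grams
-- of order n, keeping a sliding window of the last n tokens.
countNGrams :: Ord t => Int -> TChan (Maybe t) -> IO (Map.Map [t] Int)
countNGrams n ch = go [] Map.empty
  where
    go window acc = do
      next <- atomically (readTChan ch)
      case next of
        Nothing  -> return acc
        Just tok ->
          let window' = drop (length window + 1 - n) (window ++ [tok])
              acc'    = if length window' == n
                          then Map.insertWith (+) window' 1 acc
                          else acc
          in acc' `seq` go window' acc'

-- Hypothetical driver: one consumer thread per order, one MVar per thread.
fanOut :: Ord t => [Int] -> [t] -> IO [Map.Map [t] Int]
fanOut orders tokens = do
  chan <- newBroadcastTChanIO
  vars <- forM orders $ \n -> do
    own <- atomically (dupTChan chan)      -- duplicate before any token is written
    var <- newEmptyMVar
    _   <- forkIO (countNGrams n own >>= putMVar var)
    return var
  _ <- forkIO $ do                         -- producer thread
    forM_ tokens (atomically . writeTChan chan . Just)
    atomically (writeTChan chan Nothing)   -- signal end of stream to every copy
  mapM takeMVar vars                       -- blocks until every consumer is done

main :: IO ()
main = do
  [uni, bi] <- fanOut [1, 2] (words "a b a b a")
  print (Map.toList uni)   -- [(["a"],3),(["b"],2)]
  print (Map.toList bi)    -- [(["a","b"],2),(["b","a"],2)]
```

Note that this sketch has no error handling: if a consumer thread dies with an exception, its `MVar` is never filled and `takeMVar` waits forever. The channel is also unbounded, so the caveat about slow consumers still applies.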

Source: https://stackoverflow.com/questions/17931053/conduit-multiple-stream-consumers