Question
I'm writing a program that reads multiple URLs (one per line) from standard input, slightly adapts them, and makes an HTTP request for each of them in parallel. Responses are printed to standard output. Here's the code:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad
import Network.Wreq
import Control.Concurrent.MSem
import Control.Concurrent.Async
import Control.Concurrent (threadDelay)
import qualified Data.Traversable as T
main :: IO ()
main = void $ mapPool 4 (const processUrl) [1..]
mapPool :: T.Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
mapPool max f xs = do
  semaphore <- new max
  mapConcurrently (with semaphore . f) xs
processUrl :: IO ()
processUrl = do
  param <- getLine
  response <- get (url ++ param)
  print response
url = "http://example.com/resources?param="
Parallelism is hard-coded here to four. The problem arises when some of the IO actions (HTTP requests) in a batch fail: by design, Control.Concurrent.Async.mapConcurrently cancels the remaining actions as soon as one of them throws. In my case the last batch always seems to fail, because standard input hits EOF, an exception is thrown, and the program outputs:
my-program-exe: <stdin>: hGetLine: end of file
Is there an alternative to mapConcurrently that does not cancel all the other actions when one of them ends with an exception? If not, is there a better way to approach this kind of task?
Answer 1:
Is there an alternative to mapConcurrently that does not cancel all the other actions when one of them ends with an exception?
Here the exception is quite predictable, so perhaps we should handle the problem at the source, for example by checking for EOF before reading each line. We could put that check in an IO (Maybe String) action that uses Nothing to signify EOF.
import System.IO

getLineMaybe :: IO (Maybe String)
getLineMaybe =
  do isEOF <- hIsEOF stdin
     if isEOF
       then return Nothing
       else Just <$> System.IO.getLine
There's another problem with your example: writing to standard output from several threads at once is likely to produce garbled output. Writing to stdout should be done from a single thread, and arguably the same goes for reading from stdin.
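For illustration, here is a minimal sketch of one way to serialize writes: funnel every print through a shared MVar used as a lock. The helper and its name are hypothetical, not part of the solution developed below:

import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- Hypothetical helper: each thread takes the lock for the duration of
-- one write, so lines from different threads cannot interleave.
putStrLnLocked :: MVar () -> String -> IO ()
putStrLnLocked lock s = withMVar lock (\_ -> putStrLn s)

-- the lock is created once, e.g. in main: lock <- newMVar ()

The answer itself takes a different route: bounded queues with a single writer thread.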
Perhaps we could have two (closeable and bounded) concurrent queues, one in which we put lines read from stdin, and another in which we put processed results to be written later. Connecting one to the other there would be a number of worker threads.
Using the packages async, stm, and stm-chans:
import Control.Concurrent.Async
import Control.Concurrent.STM
import Control.Concurrent.STM.TVar
import qualified Control.Concurrent.STM.TBMQueue as Q -- closeable, bounded queue
and this helper function:
untilNothing :: IO (Maybe a) -> (a -> IO ()) -> IO () -> IO ()
untilNothing action handler finalizer =
  let go = do mx <- action
              case mx of
                Nothing -> finalizer
                Just x -> do handler x
                             go
  in go
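For example, combined with getLineMaybe from above, this would echo standard input to standard output until EOF (a toy usage just to show the shape; the name echoLines is illustrative):

echoLines :: IO ()
echoLines = untilNothing getLineMaybe putStrLn (putStrLn "<eof>")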
we can write a general function like the following:
data ConcConf = ConcConf
  { pendingQueueSize :: Int
  , doneQueueSize :: Int
  , concurrencyLevel :: Int
  } deriving Show
concPipeline :: ConcConf -> IO (Maybe a) -> (a -> IO b) -> (b -> IO ()) -> IO ()
concPipeline conf reader transformer writer =
  do src <- atomically $ Q.newTBMQueue (pendingQueueSize conf)
     dst <- atomically $ Q.newTBMQueue (doneQueueSize conf)
     workersLeft <- atomically $ newTVar (concurrencyLevel conf)
     let gang = replicateConcurrently_ (concurrencyLevel conf)
         pipeline =
           -- single thread feeding the source queue, closing it at EOF
           untilNothing reader
                        (\a -> atomically $ Q.writeTBMQueue src a)
                        (atomically $ Q.closeTBMQueue src)
           `concurrently_`
           -- single thread draining the destination queue
           untilNothing (atomically $ Q.readTBMQueue dst)
                        writer
                        (pure ())
           `concurrently_`
           -- worker threads connecting reader and writer
           gang (untilNothing (atomically $ Q.readTBMQueue src)
                              (\a -> do b <- transformer a
                                        atomically $ Q.writeTBMQueue dst b)
                              -- last one remaining closes shop
                              (atomically $ do modifyTVar' workersLeft pred
                                               c <- readTVar workersLeft
                                               if c == 0
                                                 then Q.closeTBMQueue dst
                                                 else pure ()))
     pipeline
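To connect this back to the original program, a hypothetical main could wire concPipeline to getLineMaybe, an HTTP fetch, and print (get is wreq's, as in the question; the queue sizes and URL prefix here are assumptions for illustration):

main :: IO ()
main = concPipeline conf getLineMaybe fetch print
  where
    conf = ConcConf { pendingQueueSize = 16
                    , doneQueueSize = 16
                    , concurrencyLevel = 4 }
    fetch param = get ("http://example.com/resources?param=" ++ param)

Because writer runs in the single draining thread, the garbled-stdout problem goes away, and EOF on stdin now shuts the pipeline down cleanly instead of raising an exception.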
Source: https://stackoverflow.com/questions/50306343/using-mapconcurrently-to-read-stdin-make-http-calls-and-write-to-stdout-in-para