Haskell: Scan Through a List and Apply A Different Function for Each Element

我寻月下人不归 2021-02-06 11:43

I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what i

4 answers
  • 2021-02-06 11:56

    I show a solution for two types of line, but it is easily extended to five types of line by using a five-tuple instead of a two-tuple.

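    -- Assumed: B in the question is a qualified ByteString import like this one.
    import qualified Data.ByteString.Char8 as B
    import Data.Monoid
    
    -- Classify one line; exactly one component of the pair is non-empty.
    eachLine :: B.ByteString -> ([Atom], [Sheet])
    eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
                | isASheet bs = ([], [ {- calculate a Sheet -} ])
                | otherwise = error "eachLine: unrecognised line"
    
    -- Merge the per-line pairs into one pair of lists.
    allLines :: [B.ByteString] -> ([Atom], [Sheet])
    allLines bss = mconcat (map eachLine bss)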
    

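    The magic is done by mconcat from Data.Monoid (part of base, which ships with GHC). It works here because lists are monoids, and a tuple of monoids is itself a monoid.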
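
    A minimal sketch of the effect on a few hand-written pairs, with Int and String as stand-ins (an assumption for illustration only) for Atom and Sheet:

```haskell
-- Stand-in types for illustration only; the real Atom and Sheet
-- come from the asker's program.
type Atom  = Int
type Sheet = String

-- Each line contributes to exactly one component of the pair.
perLine :: [([Atom], [Sheet])]
perLine = [([1], []), ([], ["a"]), ([2], []), ([], ["b"])]

-- mconcat combines the pairs componentwise, concatenating the lists
-- in each component and preserving input order.
combined :: ([Atom], [Sheet])
combined = mconcat perLine
-- combined == ([1,2],["a","b"])
```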

    (On a point of style: personally I would define a Line type, a parseLine :: B.ByteString -> Line function and write eachLine bs = case parseLine bs of .... But this is peripheral to your question.)
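
    That style could look roughly like the sketch below. The Atom and Sheet wrappers, the "ATOM"/"SHEET" keywords and the use of String instead of ByteString are all assumptions made to keep the example self-contained; unlike the version above, unrecognised lines are skipped rather than raising an error.

```haskell
-- Hypothetical stand-ins: the real Atom and Sheet types, and the real
-- line formats, come from the asker's program.
newtype Atom  = Atom String  deriving (Eq, Show)
newtype Sheet = Sheet String deriving (Eq, Show)

-- One constructor per kind of line the file can contain.
data Line
  = AtomLine Atom
  | SheetLine Sheet
  | OtherLine
  deriving (Eq, Show)

-- Assumed format: the first word of the line names its kind.
parseLine :: String -> Line
parseLine s = case words s of
  ("ATOM"  : rest) -> AtomLine  (Atom  (unwords rest))
  ("SHEET" : rest) -> SheetLine (Sheet (unwords rest))
  _                -> OtherLine

-- Dispatch on the parsed line; each case fills one component.
eachLine :: String -> ([Atom], [Sheet])
eachLine s = case parseLine s of
  AtomLine a   -> ([a], [])
  SheetLine sh -> ([], [sh])
  OtherLine    -> ([], [])   -- skipped here; use error if that suits better

-- foldMap f is the same as mconcat . map f.
allLines :: [String] -> ([Atom], [Sheet])
allLines = foldMap eachLine
```

    Separating parsing from accumulation means a new kind of line only needs a new constructor and a new case; the compiler can warn about any case left unhandled.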

  • 2021-02-06 12:01

    It is a good idea to introduce a new ADT, e.g. "Summary", instead of tuples. Then, since you want to accumulate Summary values, you can make it an instance of Monoid. You classify each line with the help of classifier functions (e.g. isAtom, isSheet, etc.) and combine the per-line results with Monoid's mconcat function (as suggested by @dave4420).

    Here is the code (it uses String instead of ByteString, but it is quite easy to change):

    module Classifier where
    
    import Data.List
    import Data.Monoid
    
    data Summary = Summary
      { atoms :: [String]
      , sheets :: [String]
      , digits :: [String]
      } deriving (Show)
    
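    -- Since GHC 8.4 (base 4.11), Semigroup is a superclass of Monoid,
    -- so the combining operation is defined as (<>).
    instance Semigroup Summary where
      Summary as1 ss1 ds1 <> Summary as2 ss2 ds2 =
        Summary (as1 <> as2) (ss1 <> ss2) (ds1 <> ds2)
    
    instance Monoid Summary where
      mempty = Summary [] [] []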
    
    classify :: [String] -> Summary
    classify = mconcat  . map classifyLine
    
    classifyLine :: String -> Summary
    classifyLine line
      | isAtom line  = Summary [line] [] [] -- or "mempty { atoms = [line] }"
      | isSheet line = Summary [] [line] []
      | isDigit line = Summary [] [] [line]
      | otherwise    = mempty -- or "error" if you need this  
    
    isAtom, isSheet, isDigit :: String -> Bool
    isAtom = isPrefixOf "atom"
    isSheet = isPrefixOf "sheet"
    isDigit = isPrefixOf "digits"
    
    input :: [String]
    input = ["atom1", "sheet1", "sheet2", "digits1"]
    
    test :: Summary
    test = classify input
    
  • 2021-02-06 12:06

    If you have only 2 alternatives, using Either might be a good idea. In that case combine your functions, map the list, and use lefts and rights to get the results:

    import Data.Either
    
    -- first sample function, returning String
    f1 x = show $ x `div` 2
    
    -- second sample function, returning Int
    f2 x = 3*x+1
    
    -- combined function returning Either String Int
    hotpo x = if even x then Left (f1 x) else Right (f2 x)
    
    xs = map hotpo [1..10] 
    -- [Right 4,Left "1",Right 10,Left "2",Right 16,Left "3",Right 22,Left "4",Right 28,Left "5"]
    
    lefts xs 
    -- ["1","2","3","4","5"]
    
    rights xs
    -- [4,10,16,22,28]
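
    If you want both lists, partitionEithers from Data.Either returns them together instead of calling lefts and rights separately. A small sketch that reuses the sample functions above (the explicit Int signatures are added here for the example):

```haskell
import Data.Either (partitionEithers)

-- Same sample functions as above, with explicit types.
f1 :: Int -> String
f1 x = show (x `div` 2)

f2 :: Int -> Int
f2 x = 3 * x + 1

hotpo :: Int -> Either String Int
hotpo x = if even x then Left (f1 x) else Right (f2 x)

-- partitionEithers separates the Lefts from the Rights,
-- preserving the relative order within each list.
results :: ([String], [Int])
results = partitionEithers (map hotpo [1 .. 10])
-- results == (["1","2","3","4","5"],[4,10,16,22,28])
```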
    
  • 2021-02-06 12:17

    First of all, I think that the answers others have supplied will work at least 95% of the time. It's always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in these cases trying to enumerate all possibilities is difficult/time-consuming/error-prone. Or, you're writing multiple variants of the same sort of thing (manually inlining multiple folds into one) and you'd like to capture the abstraction.

    Fortunately, there are a few techniques that can help.

    The framework solution

    (somewhat self-evangelizing)

    First, the various "iteratee/enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following:

    import Data.Iteratee as I
    import Data.Iteratee.Char
    import qualified Data.ByteString.Char8 as B
    import Data.Maybe
    
    -- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
    -- if you want to just return them as a list, you can use the built-in
    -- stream2list function
    
    -- next, create stream transformers
    -- given at :: B.ByteString -> Maybe Atom
    -- create a stream transformer from ByteString lines to Atoms
    atIter :: Monad m => Enumeratee [B.ByteString] [Atom] m a
    atIter = I.mapChunks (mapMaybe at)
    
    -- likewise, given ot :: B.ByteString -> Maybe Sheet
    otIter :: Monad m => Enumeratee [B.ByteString] [Sheet] m a
    otIter = I.mapChunks (mapMaybe ot)
    
    -- finally, combine multiple processors into one
    -- if you have more than one processor, you can use zip3, zip4, etc.
    procFile :: Monad m => Iteratee [B.ByteString] m ([Atom],[Sheet])
    procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)
    
    -- and run it on some data
    runner :: FilePath -> IO ([Atom],[Sheet])
    runner filename = do
      resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
      run resultIter
    

    One benefit this gives you is extra composability. You can create transformers as you like, and just combine them with zip. You can even run the consumers in parallel if you like (although only if you're working in the IO monad, and probably not worth it unless the consumers do a lot of work) by changing to this:

    import Data.Iteratee.Parallel
    
    parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)
    

    The result of doing so isn't the same as a single for-loop - this will still perform multiple traversals of the data. However, the traversal pattern has changed. This will load a certain amount of data at once (defaultBufSize bytes) and traverse that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.

    Hopefully this will demonstrate the difference:

    Data.List.zip:
      x1 x2 x3 .. x_n
                       x1 x2 x3 .. x_n
    
    Data.Iteratee.zip:
      x1 x2      x3 x4      x_n-1 x_n
           x1 x2      x3 x4           x_n-1 x_n
    

    If you're doing enough work that parallelism makes sense, this isn't a problem at all. Thanks to memory locality, performance is much better than with multiple traversals over the entire input, as Data.List.zip would make.

    The beautiful solution

    If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's "Beautiful folding" post and Conal Elliott's follow-up posts on the same idea. The essential idea is that you can create data structures to represent folds and zips, and combining these lets you create a new, combined fold/zip function that only needs one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.
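
    To give a flavour of the idea, here is a minimal sketch in the spirit of those posts, not their exact code. The names Fold, runFold and collect, and the prefix-based classifiers, are hypothetical and chosen for this example:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

-- A fold packaged as data: a step function, an initial state,
-- and a final extraction function. The state type is hidden.
data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

instance Functor (Fold a) where
  fmap f (Fold step start done) = Fold step start (f . done)

-- Combining two folds pairs their states, so both advance
-- together during one traversal of the input.
instance Applicative (Fold a) where
  pure x = Fold (\() _ -> ()) () (const x)
  Fold stepF startF doneF <*> Fold stepX startX doneX =
    Fold step (startF, startX) done
    where
      -- Force both new states so thunks do not pile up inside the pair.
      step (sf, sx) a =
        let sf' = stepF sf a
            sx' = stepX sx a
        in sf' `seq` sx' `seq` (sf', sx')
      done (sf, sx) = doneF sf (doneX sx)

-- Run a fold with a single strict left fold over the list.
runFold :: Fold a b -> [a] -> b
runFold (Fold step start done) = done . foldl' step start

-- Collect every element satisfying a predicate, in input order.
collect :: (a -> Bool) -> Fold a [a]
collect p = Fold (\acc x -> if p x then x : acc else acc) [] reverse

-- Hypothetical classifiers, standing in for the asker's own tests.
isAtom, isSheet :: String -> Bool
isAtom  = (== "atom")  . take 4
isSheet = (== "sheet") . take 5

-- Two folds zipped into one; the list is traversed once.
atomsAndSheets :: Fold String ([String], [String])
atomsAndSheets = (,) <$> collect isAtom <*> collect isSheet
```

    Running runFold atomsAndSheets over the list of lines yields both result lists after one pass, and further consumers can be added with another <*>.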
