How to read multiline fwf format where row may or may not flow multiline

房东的猫 posted on 2019-12-24 09:59:23

Question


I get a trade report from one of my brokers as a text file, shown below. I am trying to parse it to do some analysis. The problem is that each record spans multiple rows, including one aggregate row (where the BUY or SELL quantity is marked with *) and the fee rows below it.

  TRADE   SETTL  AT      BUY            SELL      CONTRACT DESCRIPTION           EX TRADE PRICE CC   DEBIT(DR)/CREDIT
 ------- ------- -- -------------- -------------- ------------------------------ -- ----------- -- --------------------
 11/26/2         F1                            1  JAN 13 SOYBEAN MEAL            01   424.70    US
                                                  ELECTRONIC TRADE
                 F1                            1*                                    COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .81DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               2.03DR
 11/28/2         F1             1                 DEC 12 SWISS FRANC             16  107.490    US
                                                  ELECTRONIC TRADE
                 F1             1*                                                   COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .54DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               1.76DR
 11/29/2         F1             2                 MAR 13 NEW COCOA               06    24.61    US
                                                  ELECTRONIC TRADE
                 F1             2*                                                   COMMISSION US               2.40DR
                 F1                                                     EXCHANGE & CLEARING FEE US               4.00DR
                 F1                                                                     NFA FEE US                .04DR
                 F1                                                     TOTAL COMMISSION & FEES US               6.44DR
 12/03/2         F1             1                 DEC 12 IMM EURO FX             16     1.30000 US
                                                  ELECTRONIC TRADE
                 F1             1*                                                   COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .54DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               1.76DR
 12/07/2         F1                            3  DEC 12 US $ INDEX              13    80.245   US
                                                  ELECTRONIC TRADE
 12/07/2         F1             3                 DEC 12 US $ INDEX              13    80.250   US
                                                  ELECTRONIC TRADE
                 F1             3*             3*                                    COMMISSION US               7.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US               8.10DR
                 F1                                                                     NFA FEE US                .12DR
                 F1                                                     TOTAL COMMISSION & FEES US              15.42DR

At the moment I am only interested in the aggregated info, i.e. CONTRACT DESCRIPTION, the BUY and SELL quantities marked with *, and the fields below them, i.e. the COMMISSION, EXCHANGE & CLEARING FEE, NFA FEE and TOTAL COMMISSION & FEES values given in the last column, DEBIT(DR)/CREDIT.

Any pointers on how I can go about doing this?

I tried using read.fwf, but it doesn't work for me because the multiline layout is not the same for each record.

Ultimately, if nothing works, I will have to write a line-by-line parser, which I am trying to avoid for now to see if it can be done in a more elegant manner.


Answer 1:


Since your data are grouped by date, I read the file with scan and process each group with lapply.

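# Read the file as whitespace-separated tokens.
dat <- scan('yourfile_name', what = 'character')
# Each date token opens a group; length(dat) + 1 closes the last group
# so that its final token is not dropped.
ids <- c(grep('[0-9]+/[0-9]+/[0-9]', dat), length(dat) + 1)
lapply(head(seq_along(ids), -1), function(x)
{
  y <- dat[ids[x]:(ids[x+1] - 1)]
  list( desc = paste(y[4:8], collapse = ' '),  # tokens 4-8: description, usually plus the EX code
        dd = y[1],                              # trade date
       debit_credit = y[grep('.*DR', y)],      # fee amounts such as "1.20DR"
       trde_price = as.numeric(y[grep('[0-9]+[.][0-9]+$', y)])
       )
})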
[[1]]
[[1]]$desc
[1] "JAN 13 SOYBEAN MEAL 01"
[[1]]$dd
[1] "11/26/2"
[[1]]$debit_credit
[1] "1.20DR" ".81DR"  ".02DR"  "2.03DR"
[[1]]$trde_price
[1] 424.7

[[2]]
[[2]]$desc
[1] "DEC 12 SWISS FRANC 16"

.....

PS: This approach loses the buy/sell (B/S) information. Hope this helps.
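One possible way to keep the buy/sell information is to slice each line by the column positions of the dashed ruler instead of splitting on whitespace. The following is only a sketch under several assumptions that are not part of the answer above: the ruler line is present, every record ends with a "TOTAL COMMISSION & FEES" line, fee labels are right-aligned across the description, EX and price columns, and all amounts carry a DR suffix. The names parse_report, trade, fee and note are hypothetical, and the sample report is a shortened stand-in built with sprintf so that its columns line up with the ruler.

```r
# Stand-in for the broker report (field widths taken from the dashed ruler).
widths <- c(7, 7, 2, 14, 14, 30, 2, 11, 2, 20)
ruler  <- paste0(" ", paste(strrep("-", widths), collapse = " "))
trade  <- function(date, buy, sell, desc, ex, price)
  sprintf(" %-7s %-7s %-2s %14s %14s %-30s %-2s %11s %-2s",
          date, "", "F1", buy, sell, desc, ex, price, "US")
fee    <- function(buy, sell, label, amount)
  sprintf(" %-7s %-7s %-2s %14s %14s %45s %-2s %20s",
          "", "", "F1", buy, sell, label, "US", amount)
note   <- sprintf("%50s%s", "", "ELECTRONIC TRADE")

report <- c(
  "  TRADE   SETTL  AT      BUY            SELL      CONTRACT DESCRIPTION",
  ruler,
  trade("11/26/2", "", "1", "JAN 13 SOYBEAN MEAL", "01", "424.70"), note,
  fee("", "1*", "COMMISSION", "1.20DR"),
  fee("", "", "EXCHANGE & CLEARING FEE", ".81DR"),
  fee("", "", "NFA FEE", ".02DR"),
  fee("", "", "TOTAL COMMISSION & FEES", "2.03DR"),
  trade("12/07/2", "", "3", "DEC 12 US $ INDEX", "13", "80.245"), note,
  trade("12/07/2", "3", "", "DEC 12 US $ INDEX", "13", "80.250"), note,
  fee("3*", "3*", "COMMISSION", "7.20DR"),
  fee("", "", "EXCHANGE & CLEARING FEE", "8.10DR"),
  fee("", "", "NFA FEE", ".12DR"),
  fee("", "", "TOTAL COMMISSION & FEES", "15.42DR")
)

parse_report <- function(lines) {
  # Column start/end positions come from the runs of dashes in the ruler.
  ruler_at <- grep("^ *-[- ]*$", lines)[1]
  m     <- gregexpr("-+", lines[ruler_at])[[1]]
  start <- as.integer(m)
  end   <- start + attr(m, "match.length") - 1L
  body  <- lines[-seq_len(ruler_at)]
  body  <- body[nzchar(trimws(body))]
  field <- function(i, j = i) trimws(substring(body, start[i], end[j]))

  date   <- field(1); buy <- field(4); sell <- field(5); desc <- field(6)
  label  <- field(6, 8)   # fee labels span columns 6 to 8 (assumption)
  amount <- field(10)
  # A record ends with its TOTAL line, so the following line opens a new one.
  rec <- cumsum(c(TRUE, head(label == "TOTAL COMMISSION & FEES", -1)))

  starred <- function(x) {  # aggregate quantity marked with *, else NA
    x <- grep("[*]$", x, value = TRUE)
    if (length(x)) as.numeric(sub("[*]$", "", x[1])) else NA_real_
  }
  rows <- lapply(split(seq_along(body), rec), function(i) {
    has_amt <- nzchar(amount[i])
    fees <- as.numeric(sub("DR$", "", amount[i][has_amt]))  # debits only
    names(fees) <- label[i][has_amt]
    data.frame(
      contract   = desc[i][nzchar(date[i])][1],  # first dated line of the record
      buy        = starred(buy[i]),
      sell       = starred(sell[i]),
      commission = fees[["COMMISSION"]],
      exch_clear = fees[["EXCHANGE & CLEARING FEE"]],
      nfa        = fees[["NFA FEE"]],
      total      = fees[["TOTAL COMMISSION & FEES"]],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

result <- parse_report(report)
print(result)
```

With a real file, readLines('yourfile_name') would take the place of the report vector. Grouping on the TOTAL line rather than on the date keeps the two 12/07/2 trades in a single record, which a date-based split would break apart. Credit amounts (a CR suffix) are not handled in this sketch and would come out as NA.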




Answer 2:


agstudy's answer looks very helpful. I'm going to suggest an alternative approach: fix the bleeping input file first. If you can't get to the source program and change its output format, at the very least you can do the following in any text editor (even, dare I say it, Microsoft Word :-) ).

Edit: the suggestions below are backwards, i.e. you probably want to keep only the end-of-lines that are followed by a date string and remove all the others. The concept is the same, but modify the search term to find "anything but..." a date. Sorry for the misdirection.

Do a global search and replace for a paragraph mark (end of line) followed by two digits and a "/", and replace it with a tab followed by the same two digits and "/".

In Word, this would be FIND what ^13([0-9]{2,2}/) REPLACE with ^t\1 ; editors supporting regular expressions will do it a little differently. Once the line breaks are fixed, your source file has one (longish) row for each date entry, and you can easily extract the columns of interest.
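The same idea, in the corrected direction from the Edit above (keep only the line breaks that precede a date), can also be sketched in R rather than in a text editor. This is a minimal illustration, not the answerer's own code: collapse_records is a hypothetical name, the sample lines are abbreviated, and a date is assumed to look like two digits, a slash, two digits and a slash at the start of a line.

```r
# Join every line to the most recent line that starts with a date, using a
# tab as the separator, so that each dated entry becomes one long row.
collapse_records <- function(lines) {
  is_start <- grepl("^ *[0-9]{2}/[0-9]{2}/", lines)
  rec  <- cumsum(is_start)   # 0 for header lines that precede the first date
  keep <- rec > 0
  vapply(split(lines[keep], rec[keep]),
         function(x) paste(trimws(x), collapse = "\t"),
         character(1), USE.NAMES = FALSE)
}

# Abbreviated sample lines (real input would come from readLines).
sample_lines <- c(
  "  TRADE   SETTL  AT   BUY   SELL   CONTRACT DESCRIPTION",
  " 11/26/2   F1   1  JAN 13 SOYBEAN MEAL  01  424.70  US",
  "           ELECTRONIC TRADE",
  "           F1   1*  COMMISSION US  1.20DR",
  " 11/28/2   F1   1  DEC 12 SWISS FRANC  16  107.490  US",
  "           F1   1*  COMMISSION US  1.20DR"
)

rows <- collapse_records(sample_lines)
print(rows)
```

Note that a record containing two dated trade lines, such as the 12/07/2 entries in the question, would be split into two rows by this rule, with the fees attached only to the second one, so some post-processing would still be needed for those cases.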



Source: https://stackoverflow.com/questions/15105923/how-to-read-multiline-fwf-format-where-row-may-or-may-not-flow-multiline
