Question
I get a trade report from one of my brokers as a text file, shown below. I am trying to parse it to do some analysis. The problem is that each record spans multiple rows, including one aggregate row (marked with * in the BUY or SELL column) and the fee rows below it.
TRADE SETTL AT BUY SELL CONTRACT DESCRIPTION EX TRADE PRICE CC DEBIT(DR)/CREDIT
------- ------- -- -------------- -------------- ------------------------------ -- ----------- -- --------------------
11/26/2 F1 1 JAN 13 SOYBEAN MEAL 01 424.70 US
ELECTRONIC TRADE
F1 1* COMMISSION US 1.20DR
F1 EXCHANGE & CLEARING FEE US .81DR
F1 NFA FEE US .02DR
F1 TOTAL COMMISSION & FEES US 2.03DR
11/28/2 F1 1 DEC 12 SWISS FRANC 16 107.490 US
ELECTRONIC TRADE
F1 1* COMMISSION US 1.20DR
F1 EXCHANGE & CLEARING FEE US .54DR
F1 NFA FEE US .02DR
F1 TOTAL COMMISSION & FEES US 1.76DR
11/29/2 F1 2 MAR 13 NEW COCOA 06 24.61 US
ELECTRONIC TRADE
F1 2* COMMISSION US 2.40DR
F1 EXCHANGE & CLEARING FEE US 4.00DR
F1 NFA FEE US .04DR
F1 TOTAL COMMISSION & FEES US 6.44DR
12/03/2 F1 1 DEC 12 IMM EURO FX 16 1.30000 US
ELECTRONIC TRADE
F1 1* COMMISSION US 1.20DR
F1 EXCHANGE & CLEARING FEE US .54DR
F1 NFA FEE US .02DR
F1 TOTAL COMMISSION & FEES US 1.76DR
12/07/2 F1 3 DEC 12 US $ INDEX 13 80.245 US
ELECTRONIC TRADE
12/07/2 F1 3 DEC 12 US $ INDEX 13 80.250 US
ELECTRONIC TRADE
F1 3* 3* COMMISSION US 7.20DR
F1 EXCHANGE & CLEARING FEE US 8.10DR
F1 NFA FEE US .12DR
F1 TOTAL COMMISSION & FEES US 15.42DR
At the moment I am only interested in the aggregated info, i.e. CONTRACT DESCRIPTION, the BUY and SELL quantities marked with *, and the fields below them, i.e. the COMMISSION, EXCHANGE & CLEARING FEE, NFA FEE and TOTAL COMMISSION & FEES values given in the last column, DEBIT(DR)/CREDIT.
Any pointers on how I can go about doing this?
I tried using read.fwf, but it doesn't work for me because the multiline format is not the same for each record.
Ultimately, if nothing works, I will have to write a line-by-line parser, which I am trying to avoid for now to see if it can be done in a more elegant manner.
Answer 1:
Since your data are grouped by date, I read the file with scan and process each group with lapply.
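# Read the file as a flat vector of whitespace-separated tokens
dat <- scan('yourfile_name', what = 'character')
# A record starts at each date-like token; the extra index one past the
# end closes the last record so that its final token is not dropped
ids <- c(grep('[0-9]+/[0-9]+/[0-9]', dat), length(dat) + 1)
lapply(head(seq_along(ids), -1), function(x)
{
  # Tokens belonging to record x
  y <- dat[ids[x]:(ids[x + 1] - 1)]
  list(desc         = paste(y[4:8], collapse = ' '),
       dd           = y[1],
       # Amounts in the last column end in DR
       debit_credit = y[grep('DR$', y)],
       # The trade price is the only token that ends in a decimal number
       trde_price   = as.numeric(y[grep('[0-9]+[.][0-9]+$', y)])
  )
})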
[[1]]
[[1]]$desc
[1] "JAN 13 SOYBEAN MEAL 01"
[[1]]$dd
[1] "11/26/2"
[[1]]$debit_credit
[1] "1.20DR" ".81DR" ".02DR" "2.03DR"
[[1]]$trde_price
[1] 424.7
[[2]]
[[2]]$desc
[1] "DEC 12 SWISS FRANC 16"
.....
PS: I lose the B/S (buy/sell) information. Hope this helps.
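If the starred quantities and the named fee rows are needed together, a line-based variant of the same grouping idea can keep them. The following is only a minimal sketch, not the method from the answer above: `parse_trades` and its column names are hypothetical, it assumes every record ends with a TOTAL COMMISSION & FEES line and that amounts end in DR, and because it ignores column positions it cannot tell a BUY quantity from a SELL quantity.

```r
# Hypothetical helper (not from the original answer): one row per record.
parse_trades <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # Assumption: each record ends with its TOTAL line, so the record id
  # increases on the line that follows every TOTAL line.
  is_total <- grepl("TOTAL COMMISSION & FEES", lines, fixed = TRUE)
  rec <- cumsum(c(FALSE, head(is_total, -1))) + 1

  # Numeric part of a trailing "1.20DR"-style amount on the first matching line.
  amount <- function(x, pattern) {
    hit <- grep(pattern, x, value = TRUE)
    if (length(hit) == 0) return(NA_real_)
    as.numeric(sub("^.*?([0-9]*\\.[0-9]+)DR$", "\\1", hit[1], perl = TRUE))
  }

  rows <- lapply(split(lines, rec), function(x) {
    trade <- grep("^[0-9]+/[0-9]+/[0-9]+", x, value = TRUE)
    star  <- grep("\\* +COMMISSION", x, value = TRUE)

    # Description: text between the quantity and the two-digit exchange code.
    desc <- if (length(trade) == 0) NA_character_ else
      sub("^[0-9/]+ +\\S+ +[0-9]+ +(.*?) +[0-9]{2} +[0-9.]+ +[A-Z]{2}$",
          "\\1", trade[1], perl = TRUE)

    # Starred quantities, in the order they appear on the aggregate line.
    qty <- if (length(star) == 0) NA_character_ else
      paste(regmatches(star[1],
                       gregexpr("[0-9]+(?=\\*)", star[1], perl = TRUE))[[1]],
            collapse = "/")

    data.frame(desc         = desc,
               starred_qty  = qty,
               commission   = amount(x, "\\* +COMMISSION"),
               exchange_fee = amount(x, "EXCHANGE & CLEARING FEE"),
               nfa_fee      = amount(x, "NFA FEE"),
               total        = amount(x, "TOTAL COMMISSION & FEES"),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage on two records copied from the question
sample_lines <- c(
  "11/26/2 F1 1 JAN 13 SOYBEAN MEAL 01 424.70 US",
  "ELECTRONIC TRADE",
  "F1 1* COMMISSION US 1.20DR",
  "F1 EXCHANGE & CLEARING FEE US .81DR",
  "F1 NFA FEE US .02DR",
  "F1 TOTAL COMMISSION & FEES US 2.03DR",
  "12/07/2 F1 3 DEC 12 US $ INDEX 13 80.245 US",
  "ELECTRONIC TRADE",
  "12/07/2 F1 3 DEC 12 US $ INDEX 13 80.250 US",
  "ELECTRONIC TRADE",
  "F1 3* 3* COMMISSION US 7.20DR",
  "F1 EXCHANGE & CLEARING FEE US 8.10DR",
  "F1 NFA FEE US .12DR",
  "F1 TOTAL COMMISSION & FEES US 15.42DR"
)
res <- parse_trades(sample_lines)
print(res)
```

With a real file, `parse_trades(readLines('yourfile_name'))` would be the call. Grouping on the TOTAL line rather than on the date keeps the two 12/07 fills in the same record as their shared fee rows.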
Answer 2:
agstudy's answer looks very helpful. I'm going to suggest an alternative approach: fix the bleeping input file first. If you can't get to the source program and change the output format, at the very least you can do the following in any text editor (even, dare I say it, Microsoft Word :-) ).
Edit: the suggestions below are backwards, i.e. you probably want to keep only the end-of-lines that are followed by a date string. The concept is the same, but modify the search term to find "anything but...". Sorry for the misdirection.
Do a global search and replace for a paragraph mark (end of line) followed by two digits and a "/", and replace it with a tab followed by the same two digits and "/".
In Word, this would be FIND what ^13([0-9]{2,2}/) REPLACE with ^t\1; editors supporting regexp will do it a little differently.
Now your source file has one (longish) row for each date entry and you can easily extract the columns of interest.
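The same flattening can also be done in R instead of a text editor. This is a minimal sketch following the corrected idea from the Edit note (keep only the line breaks that precede a date string); `flatten_records` is a hypothetical name, and it assumes a record starts on any line beginning with two digits and a "/".

```r
# Hypothetical helper: join every dated line and its continuation lines
# into one tab-separated string.
flatten_records <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # A new record starts on each line that begins with "dd/".
  starts <- grepl("^[0-9]{2}/", lines)
  rec <- cumsum(starts)

  # Drop anything before the first dated line (for example, the header rows).
  keep <- rec > 0
  unname(vapply(split(lines[keep], rec[keep]), paste, character(1),
                collapse = "\t"))
}

# Usage on a few lines copied from the question
flat <- flatten_records(c(
  "TRADE SETTL AT BUY SELL CONTRACT DESCRIPTION EX TRADE PRICE CC DEBIT(DR)/CREDIT",
  "11/26/2 F1 1 JAN 13 SOYBEAN MEAL 01 424.70 US",
  "ELECTRONIC TRADE",
  "F1 1* COMMISSION US 1.20DR",
  "11/28/2 F1 1 DEC 12 SWISS FRANC 16 107.490 US",
  "ELECTRONIC TRADE"
))
print(flat)
```

Note that splitting on dated lines means two fills that share one aggregate row, such as the two 12/07 entries in the question, end up as two separate rows, and only the second one carries the fee fields.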
Source: https://stackoverflow.com/questions/15105923/how-to-read-multiline-fwf-format-where-row-may-or-may-not-flow-multiline