问题
I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.
The tab-delimited format is the following:
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,
1092 - 1093 + 1X
B consistently has two columns:
1093 HRDCPMRFYT
Question: How do you parse a file with "alternating" rows with different formats?
Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:
1092 - 1093 + 1X 1093 HRDCPMRFYT
1093 + 1094 - 1X 1094 BSZSDFJRVF
1094 + 1095 + 1X 1095 SSTFCLEPVV
...
回答1:
One way to go is to read in your data with readLines
, pull out the bits you want, and pass to read.table
to form the dataframe. So if the rows are alternating then:
txt <-
'1092 - 1093 + 1X
1093 HRDCPMRFYT
1093 + 1094 - 1X
1094 BSZSDFJRVF
1094 + 1095 + 1X
1095 SSTFCLEPVV'
rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]),
read.table(text=rd[c(FALSE, TRUE)]))
Change textConnection(txt)
to your file path
Another way is to read in only once and then post-process
r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]
回答2:
You can run shell commands using fread
. In Win10, you can even run some linux utilities such as sed
Hence, you can simply do
fread("sed '$!N;s/\\n/ /' test.tab")
# V1 V2 V3 V4 V5 V6
# 1: 1092 - 1093 + 1X 1093 HRDCPMRFYT
# 2: 1093 + 1094 - 1X 1094 BSZSDFJRVF
# 3: 1094 + 1095 + 1X 1095 SSTFCLEPVV
(sed syntax taken from here)
Data
text <- "1092 - 1093 + 1X
1093 HRDCPMRFYT
1093 + 1094 - 1X
1094 BSZSDFJRVF
1094 + 1095 + 1X
1095 SSTFCLEPVV"
# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
来源:https://stackoverflow.com/questions/50315948/how-to-parse-tab-delimited-data-of-different-formats-into-a-data-table-data-fr