How to parse tab-delimited data (of different formats) into a data.table/data.frame?

余生长醉 提交于 2019-12-07 08:23:33

问题


I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.

The tab-delimited format is the following:

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,

1092    -   1093    +   1X

B consistently has two columns:

1093    HRDCPMRFYT

Question: How do you parse a file with "alternating" rows with different formats?

Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:

1092    -    1093    +    1X    1093    HRDCPMRFYT
1093    +    1094    -    1X    1094    BSZSDFJRVF
1094    +    1095    +    1X    1095    SSTFCLEPVV
... 

回答1:


One way to go is to read in your data with readLines, pull out the bits you want, and pass to read.table to form the dataframe. So if the rows are alternating then:

txt <- 
'1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV'


rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]), 
           read.table(text=rd[c(FALSE, TRUE)]))

Change textConnection(txt) to your file path


Another way is to read in only once and then post-process

r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]



回答2:


You can run shell commands using fread. In Win10, you can even run some linux utilities such as sed

Hence, you can simply do

fread("sed '$!N;s/\\n/ /' test.tab")
#      V1 V2   V3 V4      V5         V6
# 1: 1092  - 1093  + 1X 1093 HRDCPMRFYT
# 2: 1093  + 1094  - 1X 1094 BSZSDFJRVF
# 3: 1094  + 1095  + 1X 1095 SSTFCLEPVV

(sed syntax taken from here)


Data

text <- "1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")


来源:https://stackoverflow.com/questions/50315948/how-to-parse-tab-delimited-data-of-different-formats-into-a-data-table-data-fr

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!