Possible to change the record delimiter in R?

Question

Is it possible to change the record/observation/row delimiter when reading in data (e.g. with read.table) from a text file? It's straightforward to adjust the field delimiter using sep="", but I haven't found a way to change the record delimiter from an end-of-line character.

I am trying to read in pipe delimited text files in which many of the entries are long strings that include carriage returns. R treats these CRs as end-of-line, which begins a new row incorrectly and screws up the number of records and field order.

I would like to use a different delimiter instead of a CR. As it turns out, each row begins with the same string, so if I could use something like \nString to identify the true end-of-line, the table would import correctly. Here's a simplified example of what one of the text files might look like.

V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,

Should read into R as

V1      V2       V3      V4
String  A        5       some text
String  B        2       more text and more text
String  B        7       some different text
String  A        N/A     N/A

I can open the files in a text editor and clean them with a find/replace before reading in, but a systematic solution within R would be great. Thanks for your help.

Answer 1:

We can read the lines in and collapse them afterwards. g takes the value 0 for the header, 1 for the first record (and for any follow-on lines that belong to it), 2 for the next record, and so on. tapply then pastes together the lines that share a value of g, giving L2, and finally we re-read the collapsed lines with read.csv:

Lines <- "V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,"

L <- readLines(textConnection(Lines))

g <- cumsum(grepl("^String", L))
L2 <- tapply(L, g, paste, collapse = " ")

DF <- read.csv(text = L2, as.is = TRUE)
DF$V4[ DF$V4 == "" ] <- NA

This gives:

> DF
      V1 V2 V3                      V4
1 String  A  5               some text
2 String  B  2 more text and more text
3 String  B  7     some different text
4 String  A NA                    <NA>
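
The same idea can be wrapped in a small helper so it applies to the pipe-delimited files from the question. This is only a sketch: the function name read_multiline_records and its arguments are made up for illustration, and it assumes every true record starts with a known prefix and that fields contain no quoting that needs to be honoured.

```r
# Hypothetical helper (not part of base R): merge continuation lines
# into the record they belong to, then parse the merged records.
read_multiline_records <- function(lines, start = "^String", sep = ",", glue = " ") {
  # Record index: 0 for the header, incremented at each line matching `start`.
  g <- cumsum(grepl(start, lines))
  # Paste each group of physical lines into one logical record.
  merged <- unname(vapply(split(lines, g), paste, character(1), collapse = glue))
  # Assumption: no quoting or comment characters; empty fields become NA.
  read.table(text = merged, sep = sep, header = TRUE, quote = "",
             comment.char = "", na.strings = "", stringsAsFactors = FALSE)
}

# Example with pipe-delimited input; for a real file use readLines("yourfile.txt").
pipe_lines <- c("V1|V2|V3|V4",
                "String|A|5|some text",
                "String|B|2|more text and",
                "more text",
                "String|B|7|some different text",
                "String|A||")
DF2 <- read_multiline_records(pipe_lines, sep = "|")
```

Setting na.strings = "" replaces the separate step that converts empty strings to NA.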

Answer 2:

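If you're on Linux or macOS, consider using a command-line tool such as sed instead. Note that N appends the next line to the current one, so the input is processed in pairs of lines; this works for the sample file, where the continuation line is the second line of its pair. Here are two slightly different approaches: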

# keep the \n
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/"\\1\\n\\2"/\' test.txt'))
#      V1 V2 V3                       V4
#1 String  A  5                some text
#2 String  B  2 more text and\nmore text
#3 String  B  7      some different text
#4 String  A NA

# get rid of the \n and replace with a space
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/\\1 \\2/\' test.txt'))
#      V1 V2 V3                      V4
#1 String  A  5               some text
#2 String  B  2 more text and more text
#3 String  B  7     some different text
#4 String  A NA
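
Where sed is not available, a similar substitution can be sketched in R alone with a Perl-style negative lookahead, which comes close to treating \nString as the record delimiter. It assumes the whole file fits in memory as one string and that no continuation line itself starts with "String,".

```r
# Sample data as a single string; for a real file one could use
# paste(readLines("test.txt"), collapse = "\n") instead.
Lines <- "V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,"

# Replace every newline that is NOT followed by the record prefix with a space,
# so only newlines that start a new record survive.
fixed <- gsub("\n(?!String,)", " ", Lines, perl = TRUE)

# Parse the repaired text; empty fields are read as NA.
DF <- read.csv(text = fixed, as.is = TRUE, na.strings = "")
```

The newline after the header is kept because it is followed by "String,", so the header row is still parsed separately.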

Source: https://stackoverflow.com/questions/16115887/possible-to-change-the-record-delimiter-in-r

Tags

regex

delimiter