Question
I have a tab-delimited file like this:
RS1->2001 HAPLO1 AAACAAGGAGGAGAAGGAAA ...
RS1->2001 HAPLO2 CAACAAAGAGGAGAAGGAAA ...
RS1->2002 HAPLO1 AAAAAAGGAGGAAAAGGAAA ...
RS1->20020 HAPLO2 CAACAAGGAGGAAGCAGAGC ...
RS1->20021 HAPLO2 CAACAAGGAGGAAGCAGAGC ...
In R it is easy to read in these three columns. My problem is that I need to separate the third column character by character. The end result should look something like this:
RS1->2001 HAPLO1 A A A C ...
RS1->2001 HAPLO2 C A A C ...
RS1->2002 HAPLO1 A A A A ...
RS1->20020 HAPLO2 C A A C ...
RS1->20021 HAPLO2 C A A C ...
I could first read the three columns in and then split each entry of the third column into characters, but this is annoying; I would much prefer to get it right from the start.
If the first two columns did not exist, I could achieve the goal with
read.fwf('test.csv', widths=rep(1, 300))
I am wondering whether I can read the first two columns using the tab delimiter and then read the third column by fixed width.
Answer 1:
The two main options that come to mind are strsplit (as mentioned in the comments and in @Ricardo's answer) and read.fwf. read.fwf won't work directly with your data, but it can work on a column of data that has already been read in if you use the textConnection() function.
Here's a basic example:
## Create a tab-separated file named "test.txt" in your working directory
## (`sep = ""` stops `cat` from inserting a space between the strings)
cat("2001\tHAPLO1\tAAACAAGGAGGAGAAGGAAA\n",
    "2001\tHAPLO2\tCAACAAAGAGGAGAAGGAAA\n",
    "2002\tHAPLO1\tAAAAAAGGAGGAAAAGGAAA\n",
    "20020\tHAPLO2\tCAACAAGGAGGAAGCAGAGC\n",
    "20021\tHAPLO2\tCAACAAGGAGGAAGCAGAGC\n",
    file = "test.txt", sep = "")
## Read it in with `read.delim`
mydata <- read.delim("test.txt", header = FALSE, stringsAsFactors = FALSE)
## Use `read.fwf` on the third column
## Replace "widths" with whatever the maximum width is for that column
## If max width is not known, you can use something like
## `widths = rep(1, max(nchar(mydata$V3)))`
cbind(mydata[-3],
      read.fwf(file = textConnection(mydata$V3), widths = rep(1, 20)))
#      V1     V2 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1  2001 HAPLO1  A  A  A  C  A  A  G  G  A   G   G   A   G   A   A   G   G   A   A   A
# 2  2001 HAPLO2  C  A  A  C  A  A  A  G  A   G   G   A   G   A   A   G   G   A   A   A
# 3  2002 HAPLO1  A  A  A  A  A  A  G  G  A   G   G   A   A   A   A   G   G   A   A   A
# 4 20020 HAPLO2  C  A  A  C  A  A  G  G  A   G   G   A   A   G   C   A   G   A   G   C
# 5 20021 HAPLO2  C  A  A  C  A  A  G  G  A   G   G   A   A   G   C   A   G   A   G   C
Note: If you did not use stringsAsFactors = FALSE, you would have to change your file argument to:
file = textConnection(as.character(mydata$V3))
Answer 2:
As @Ananda alludes to in the comments, strsplit, if asked to split on "", will split every letter.
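# The file is tab-delimited and has no header row, so use read.delim;
# stringsAsFactors = FALSE keeps column 3 as character for strsplit.
fContents <- read.delim("/path/to/file.csv", header = FALSE, stringsAsFactors = FALSE)
# This will chop it up for you.
strsplit(fContents[, 3], "")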
To combine the pieces, use cbind:
# do.call(rbind, ...) assumes every string in column 3 has the same length
cbind(fContents[, -3],
      do.call(rbind, strsplit(fContents[, 3], ""))
)
# or if you'd like to keep the columns ordered (and there are more than 3):
cbind(fContents[, 1:2],
      do.call(rbind, strsplit(fContents[, 3], "")),
      fContents[, 4:ncol(fContents)]
)
Answer 3:
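import csv

# Read the tab-delimited input and write a tab-delimited output in which
# every character of the third column becomes its own field.
with open('/path/to/file.csv', 'r', newline='') as infile, \
        open('/path/to/newfile.csv', 'w', newline='') as outfile:
    file_read = csv.reader(infile, delimiter='\t')
    file_write = csv.writer(outfile, delimiter='\t')
    for i in file_read:
        first = i[0]
        second = i[1]
        third = i[2]  # the third column is at index 2
        splitchar = list(third)  # one list element per character
        file_write.writerow([first, second] + splitchar)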
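The Python answer above comes with no explanation, so here is a rough, self-contained sketch of the same idea applied to an in-memory string instead of files on disk. The helper name split_third_column is hypothetical (it is not part of the original answer), and the sample text is the first two rows of the test data from the read.fwf answer. Skipping rows with fewer than three fields is an assumption about how malformed lines should be handled.

```python
import csv
import io


def split_third_column(text):
    """Split the third tab-separated field of every line into single characters.

    Hypothetical helper for illustration only.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) < 3:
            # Assumption: silently skip blank or short lines
            continue
        # list("AAAC") gives ["A", "A", "A", "C"]; any columns after the
        # third are kept in their original order
        rows.append(fields[:2] + list(fields[2]) + fields[3:])
    return rows


sample = (
    "2001\tHAPLO1\tAAACAAGGAGGAGAAGGAAA\n"
    "2001\tHAPLO2\tCAACAAAGAGGAGAAGGAAA\n"
)
for row in split_third_column(sample):
    print("\t".join(row))
```

Because the function returns plain lists, each row can be passed directly to csv.writer's writerow to produce a tab-delimited file that R can then load with read.delim.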
Source: https://stackoverflow.com/questions/17976628/read-table-by-delimiter-then-by-fixed-width-in-r