Import fixed width data file with no line separator

主宰稳场 提交于 2019-12-23 09:28:01

问题


I have fixed width data files (.dbf) that don't have line separators. Here is what two lines of that datafile looks like:

20141101 77h  3.210                                  0    3 20141102 76h  3.090                                  0    3 

The widths of one line is c(8,4,7,41) for date (8), some time measure (4), the data point (7), and some other columns that i can summarize in one "rest" column (41). After one line there is no separator and the next line is just appended to the first line. All time steps are basically written consecutively in one massive line. There is exclusively numbers, characters and white space in this file.

With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line due to lack of line separator.

Is there an argument to tell read.fwf() when to start reading the new line when there is no line separator? Or should i use a different read command?

Thanks in advance.


回答1:


A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):

txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
                              function(x) substr(txt, x, x+59)))

library(dplyr)
library(tidyr)
dfx %>% 
  separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>% 
  mutate_all(trimws)

which gives:

        VA  VB    VC VD VE
1 20141101 77h 3.210  0  3
2 20141102 76h 3.090  0  3

To get different column names , just replace c(paste0("V",LETTERS[1:5]) with a vector of columnnames you want.

If you want to transform the columns into the correct classes instead of into character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all.




回答2:


Maybe not the best idea but this should work:

content <- scan('filepath','character',sep='~') # Warning choose a sep not appearing in datas to get the whole file.
# Split content in lines:
lines <- regmatches(content,gregexpr('.{60}',content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)

The idea is to read the whole file, get each occurence of 60 chars into a single entry, write this to a tempfile, and read the data from this tempfile before deleting the temporary file.

Another approach is doable with regexes and package stringr (still with content resulting from scan above):

library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)

which gives:

        V1   V2      V3                                        V4
1 20141101  77h   3.210                                   0    3 
2 20141102  76h   3.090                                   0    3 

str_match_all return a list, here with 1 element because there's only one line as input, so we remove it with [[1]].

Now the return is 5 columns, the first one being the full match, others being the capture groups so we subset the matrix on columns 2 to 5 to get only the 4 columns we need and wrap it in as.data.frame to get a data.frame at end.

you can then name the columns with colnames(d) <- c('date','time','data_point','rest')

If you wish to clean up the white spaces you can wrap the str_extract_all result in trimws (thanks to @jaap for the remind of this function) like this:

td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)

Output:

        X1  X2    X3     X4
1 20141101 77h 3.210 0    3
2 20141102 76h 3.090 0    3



回答3:


In addition to the other answers, some general info about dbf files:

Unless this is a one time read of a static file, it would be best to check the file/fields structure first in case that changes over time. See here for the internal structure of a dbf file.

But maybe even more important:

Each record in a dbf file is preceded by one byte for the delete flag. If this is a space, the record is not deleted, if it's an asterisk * the record is marked for deletion (records are not removed from a dbf file until the file is packed), and you probably want to skip those records. The first part of the data could also be overwritten with "DELETED" for example.

So, in your record c(8,4,7,41), the last byte of the rest column (41) is actually the delete flag of the record that follows it - and the last record in the file will only have 40 bytes for that field (but if you're lucky, the file has an EOF marker (0x1a), so maybe you didn't have a problem with the size there).

Thus, your record should actually be: c(1,8,4,7,40), where the 1 is the delete flag, and starting one byte sooner.



来源:https://stackoverflow.com/questions/35222149/import-fixed-width-data-file-with-no-line-separator

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!