Importing text data into R and removing extraneous headers and other unwanted text

Submitted by 喜你入骨 on 2020-02-04 01:37:10

Question


I have a large text file that contains data from the Uniform Crime Report. Ideally, I would like to import only the data and leave out the other extraneous material in the file. The actual data is delimited by spaces, and whenever the data runs onto another "page" the header information repeats itself. I first tried to import the data (and only the data) using the following code, adding my own headers manually:

  data <- read.fwf("2010SHRall.txt", 
        c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),   
        skip=5,       
        col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"), 
        strip.white=FALSE)

This works until line 51, where it quits. I'm definitely a novice R programmer, and I tried to Google the answer as well as search Stack Overflow, but I am at a loss for where to go from here. Here is a link to the text file that I am trying to import. Again, I am trying to import the data and remove any rows that contain header info or other pieces that are not needed for the complete dataset.

Any help anyone could offer would be greatly appreciated.


Answer 1:


This should probably work:

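# Read the whole file into a character vector, one element per line
text <- readLines('/tmp/2010SHRall.txt')
# A data group starts after the column-header line ...
group.start <- '^      AGENCY'
# ... and ends at a line starting with "B" or "0END OF GROUP"
group.end <- '(^B)|(^0END OF GROUP)'
data <- character()
inside.group <- FALSE
for (line in text) {
  if (inside.group) {
    if (grepl(group.end, line))
      inside.group <- FALSE          # end marker: leave the group, drop the line
    else
      data <- append(data, line)     # data line: keep it
  } else if (grepl(group.start, line)) {
    inside.group <- TRUE             # header line: enter the group, drop the line
  }
}
# Parse only the kept lines as fixed-width fields
read.fwf(textConnection(data),
         widths=c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),
         header=FALSE,
         col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"),
         strip.white=TRUE)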

It keeps all lines in between lines that match the group.start and group.end regular expressions and discards the rest.
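
The same filtering can also be written without an explicit loop, which may help on a large file because the result vector is not grown one line at a time. The following is only a sketch under assumptions: the sample lines, the shortened widths, and the names is.start, is.end, last.start, last.end, keep, and shr are made up for illustration and are not taken from the real file. A line is kept when the most recent start marker comes after the most recent end marker. Unlike the loop above, this version also drops a header line that shows up inside an open group.

```r
# Hypothetical sample standing in for the real file (layout is an assumption)
text <- c(
  "1 SUPPLEMENTARY HOMICIDE REPORT",
  "      AGENCY  G MO",
  "    001AL00100",
  "    002AL00100",
  "0END OF GROUP",
  "1 SUPPLEMENTARY HOMICIDE REPORT",
  "      AGENCY  G MO",
  "    003AL00200",
  "B page break"
)

group.start <- "^      AGENCY"
group.end   <- "(^B)|(^0END OF GROUP)"

is.start <- grepl(group.start, text)
is.end   <- grepl(group.end, text)
idx      <- seq_along(text)

# Index of the latest start / end marker seen at or before each line (0 if none)
last.start <- cummax(ifelse(is.start, idx, 0L))
last.end   <- cummax(ifelse(is.end, idx, 0L))

# Inside a group: the latest start is more recent than the latest end.
# The header lines themselves are excluded.
keep <- last.start > last.end & !is.start
data <- text[keep]

# Toy widths and names for the sample only; use the full vectors for the real file
shr <- read.fwf(textConnection(data),
                widths = c(-4, 3, 8),
                header = FALSE,
                col.names = c("AGE", "AGENCY"),
                strip.white = TRUE)
```

For the real file, text would come from readLines() and the widths and col.names arguments would be the full vectors shown in the answer.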



Source: https://stackoverflow.com/questions/13389196/importing-text-data-into-r-and-removing-extraneous-headers-and-other-unwanted-te
