How to read text files and create a data frame in R

非 Y 不嫁゛ submitted on 2019-12-06 06:52:26

Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.

library(stringr) # For str_trim 

# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1",  dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1",  dat$address)

# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)

dat = dat[,c(1:2,7:8,4:6)]

dat
      LastName  FirstName streetno           streetname       city state        zip
1        Bania  Thomas M.      725    Commonwealth Ave.     Boston    MA      02215
2      Barnaby      David      373        W. Geneva St.   Wms. Bay    WI      53191
3       Bausch       Judy      373        W. Geneva St.   Wms. Bay    WI      53191
...
41      Wright       Greg      791  Holmdel-Keyport Rd.    Holmdel    NY 07733-1988
42     Zingale    Michael     5640        S. Ellis Ave.    Chicago    IL      60637
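
To try the splitting idea without the file, here is a minimal sketch on two inline lines. The sample lines and their exact spacing are assumptions modeled on the output above, not copied from addr.txt, and `sample_df` is just an illustrative name.

```r
# Hypothetical sample lines; the exact spacing in addr.txt is assumed here
lines <- c(
  "Bania       Thomas M.     725 Commonwealth Ave.    Boston      MA  O2215",
  "Barnaby     David         373 W. Geneva St.        Wms. Bay    WI  53191"
)

# Split wherever two or more spaces occur, then bind the pieces row-wise
parts <- strsplit(lines, split = " {2,}")
sample_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(sample_df) <- c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate the leading digits from the rest of the address
sample_df$streetno   <- as.numeric(sub("^([0-9]+) .*$", "\\1", sample_df$address))
sample_df$streetname <- sub("^[0-9]+ ", "", sample_df$address)

# Replace a letter O in the zip with a zero
sample_df$zip <- gsub("O", "0", sample_df$zip)
sample_df
```

Because single spaces are never treated as separators, values such as "Thomas M." and "Wms. Bay" stay in one field. A line where two fields are separated by only one space would need extra handling.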
Pankaj Sharma

Try this.

x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" , 
  what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))

data<-as.data.frame(x)
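
One thing worth checking before relying on this: with its default `sep`, scan() treats any run of whitespace as a delimiter, so a value with an internal space is read as more than one token. A small sketch on a single line (the line and its spacing are assumed, not taken from the file):

```r
# Hypothetical sample line; spacing is assumed to resemble addr.txt
line <- "Bania       Thomas M.     725 Commonwealth Ave.    Boston      MA  O2215"

# With the default sep, scan() splits on every run of whitespace
tokens <- scan(text = line, what = "", quiet = TRUE)
tokens
length(tokens)  # compare against the seven fields expected per record
```

If the token count differs from seven on some lines, the values will not line up with the seven names given in `what`.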

The problem here is not how to read the data into R, but that the data is not structured with regular delimiters between its variable-length fields. In addition, the zip code field contains some letter "O" characters that should be the digit "0".

So here is a way to use regular expression substitution to add in delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done and so that you can adjust them as you find exceptions in your input text. (For instance, some city names like "Wms. Bay" are two words.)

addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt)       # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt)    # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt)       # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt)                # city, by elimination

addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
##     LastName   FirstName streetno         streetname      city state    zip
## 1      Bania   Thomas M.      725  Commonwealth Ave.    Boston    MA  02215
## 2    Barnaby       David      373      W. Geneva St.  Wms. Bay    WI  53191
## 3     Bausch        Judy      373      W. Geneva St.  Wms. Bay    WI  53191
## 4    Bolatto     Alberto      725  Commonwealth Ave.    Boston    MA  02215
## 5  Carlstrom        John      933        E. 56th St.   Chicago    IL  60637
## 6 Chamberlin  Richard A.      111         Nowelo St.      Hilo    HI  96720
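
Since the regular expressions may need adjusting for exceptions, one way to find the lines that need attention is to count the comma-separated fields per line before calling read.csv(). This is a sketch on made-up delimited lines; `delimited` stands in for the processed character vector, and the third line is deliberately malformed.

```r
# Hypothetical delimited lines standing in for the processed text
delimited <- c(
  "Bania, Thomas M., 725, Commonwealth Ave., Boston, MA, 02215",
  "Barnaby, David, 373, W. Geneva St., Wms. Bay, WI, 53191",
  "Oddline, Pat, Main St., Hilo, HI, 96720"  # made up: street number missing
)

# Count fields on each line and flag any line that does not have seven
con <- textConnection(delimited)
n_fields <- count.fields(con, sep = ",")
close(con)

bad <- which(n_fields != 7)
delimited[bad]  # lines whose patterns the substitutions did not cover
```

Any line listed in `bad` points at a pattern that one of the substitutions missed.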

I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.

## get the page as text
txt <- RCurl::getURL(
    "https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
    ## add most comma-separators, then the last for the house number
    text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)), 
    header = FALSE,
    ## set the column names
    col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#     LastName  FirstName streetno        streetname     city state   zip
# 1      Bania  Thomas M.      725 Commonwealth Ave.   Boston    MA O2215
# 2    Barnaby      David      373     W. Geneva St. Wms. Bay    WI 53191
# 3     Bausch       Judy      373     W. Geneva St. Wms. Bay    WI 53191
# 4    Bolatto    Alberto      725 Commonwealth Ave.   Boston    MA O2215
# 5  Carlstrom       John      933       E. 56th St.  Chicago    IL 60637
# 6 Chamberlin Richard A.      111        Nowelo St.     Hilo    HI 96720
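
The result above still shows a letter "O" at the start of some zip codes. A minimal sketch of one way to repair that after reading, using a made-up two-row data frame (`zips` is a stand-in name, not part of the code above):

```r
# Hypothetical two-row stand-in for the zip column read above
zips <- data.frame(zip = c("O2215", "53191"), stringsAsFactors = FALSE)

# Replace a leading letter O with a zero; as.character() guards against factors
zips$zip <- sub("^O", "0", as.character(zips$zip))
zips$zip
```

This assumes the zip column was read as text. If it were read as a number, leading zeros would be lost.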