Requesting data from the Centers for Disease Control using RSocrata or XML in R


Question


My goal is to obtain a weekly time series of legionellosis cases, from week 1 of 1996 through week 46 of 2016, from this website maintained by the Centers for Disease Control and Prevention (CDC). A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:

# install.packages('rvest')
library(rvest)

## Code to build all of the URLs
getUrls <- function(y1, y2, clist) {
  root  <- "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 <- "&mmwr_week="
  root2 <- "&mmwr_table=2"
  root3 <- "&request=Submit&mmwr_location="

  urls <- NULL
  for (year in y1:y2) {
    for (week in 1:53) {
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}

TabList <- c("A", "B")  # extend this to pull as many parts of the table as needed

WEB <- as.data.frame(getUrls(1996, 2014, TabList))  # only applies to 1996-2014; after 2014 the root URL changes
head(WEB)


# Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='

webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]

# Test whether Legionellosis is in the table; grep() returns the indices of
# the columns where the text is found. Use this to filter to only the pages
# you need and to select only those columns.
test <- grep("Leg", sb)
sb <- sb[, c(1, test)]


### This code only works if the table has 3 heading rows; it needs to be
### generalized to handle the other table layouts.
# Build the column names from the heading rows, then drop those rows.
colnames(sb) <- paste(sb[2, ], sb[3, ], sep = "_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3), ]

# Remove the commas from the numbers so the columns can be converted to
# numeric values; this only matters for counts of 1,000 or more.
Dat <- sapply(sb, FUN = function(x)
  as.character(gsub(",", "", as.character(x), fixed = TRUE)))

Dat <- as.data.frame(Dat, stringsAsFactors = FALSE)
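
For reference, a minimal sketch of applying the single-page extraction over the URL list might look like the following. scrapePage is a hypothetical helper name, and the sketch assumes the table layout handled above; pages whose second table has no Legionellosis column are simply skipped, which is exactly the fragility described next.

# A minimal sketch, not a finished scraper: it assumes the table layout
# handled above and skips pages that do not match it.
scrapePage <- function(url) {
  webpage <- read_html(url)
  tables <- html_table(html_nodes(webpage, 'table'), fill = TRUE)
  if (length(tables) < 2) return(NULL)   # page has no data table
  sb <- tables[[2]]
  hit <- grep("Leg", sb)
  if (length(hit) == 0) return(NULL)     # no Legionellosis column on this page
  sb[, c(1, hit)]
}

# Test on the first few URLs built by getUrls() above.
pages <- lapply(as.character(WEB[1:5, 1]), scrapePage)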

However, the code is not finished, and I thought it might be best to use the API instead, since the structure and layout of the tables change across webpages. That way we wouldn't have to comb through the tables to figure out when the layout changes and adjust the scraping code accordingly. So I attempted to pull the data from the API.

Now, I found two help documents from the CDC that describe how to get the data. One appears to cover data from 2014 onward using RSocrata, which can be seen here, while the other is more general and uses an XML-format request over HTTP, which can be seen here. The XML request requires a database ID, which I could not find. I then stumbled onto RSocrata and decided to try that instead, but the code snippet provided, together with the app token I set up, did not work:

install.packages("RSocrata")
library("RSocrata")

df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")

How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.


Answer 1:


I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.

In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
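
For example, a minimal sketch of the token-free request, using the same dataset ID as in the question (head() is just to inspect the result):

library("RSocrata")

# Same resource as in the question, with the $$app_token parameter dropped.
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au")
head(df)

If your version of RSocrata supports it, passing the token as a function argument, read.socrata(url, app_token = "..."), also sidesteps the URL-escaping problem discussed in that issue thread.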



Source: https://stackoverflow.com/questions/40616596/requesting-data-from-the-center-for-disease-control-using-rsocrata-or-xml-in-r
