How to find common variables in a list of datasets & reshape them in R?

Question

    setwd("C:\\Users\\DATA")
    temp = list.files(pattern="*.dta")
    for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
    grep(pattern="_m", temp, value=TRUE)

Here I create a list of my datasets and read them into R. I then attempt to use grep to find all variable names containing the pattern _m. Obviously this doesn't work, because it simply returns all file names that contain _m. So essentially, I want my code to loop through the list of datasets, find variables ending with _m, and return a list of the datasets that contain these variables.

I'm quite unsure how to do this, as I'm fairly new to coding and R.

Apart from needing to know in which databases these variables are, I also need to be able to make changes (reshape them) to these variables.

Answer 1:

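First, assign may not work the way you think. It expects the variable name as a single string (a character vector of length one, as it is called in R); if you give it a longer character vector, it uses only the first element (see ?assign for more info). In your loop, temp[i] is a single string, so each data frame ends up in a separate variable named after its file, which is awkward to refer to and leaves you with nothing convenient to loop over afterwards.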
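To make the behaviour of assign concrete, here is a minimal sketch. It uses a small in-memory data frame in place of a file read with read.dta13, and the file name and column names are made up for illustration.

```r
# Hypothetical file name and toy data standing in for read.dta13() output
file_name <- "survey_2015.dta"
toy <- data.frame(id = 1:3, income_m = c(10, 20, 30))

# assign() creates a variable whose name is the string itself
assign(file_name, toy)
exists("survey_2015.dta")      # TRUE

# The name is not syntactically valid, so it has to be fetched with get()
# or quoted with backticks
names(get(file_name))
names(`survey_2015.dta`)

# Alternative: keep every data frame in one named list, which is easy to loop over
datasets <- list()
datasets[[file_name]] <- toy
names(datasets)
```

Keeping the data frames in a single named list is usually easier to work with than creating one global variable per file.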

What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.

If you are looking for column names, you can do something like this:

    library(readstata13)  # provides read.dta13()

    myList <- character()
    for (i in seq_along(temp)) {

        # save the content of your file in a data frame
        df <- read.dta13(temp[i], nonint.factors = TRUE)

        # identify the names of the columns ending in "_m"
        varMatch <- grep(pattern = "_m$", colnames(df), value = TRUE)

        # check whether at least one column matches the pattern
        if (length(varMatch) > 0) {
            myList <- c(myList, temp[i]) # save the file name if there is a match
        }

    }

If you are looking at the content of a column instead, have a look at the dplyr package, which is very useful for manipulating data frames.

A good introduction to dplyr is available in the package vignette (run vignette("dplyr") in R).
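As a rough illustration of what dplyr offers here, the sketch below selects the columns ending in _m and filters rows on the content of one of them. The data frame and its column names are invented for the example, and it assumes dplyr is installed.

```r
library(dplyr)

# Toy data standing in for one imported file (hypothetical columns)
df <- data.frame(
  id       = 1:4,
  income_m = c(10, 20, 30, 40),
  age      = c(25, 31, 47, 52)
)

# Keep only the columns whose names end in "_m"
m_cols <- select(df, ends_with("_m"))

# Keep only the rows where a given "_m" column satisfies a condition
high <- filter(df, income_m > 15)
```

Both calls return a new data frame and leave df itself unchanged.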

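Note that in R, growing a vector by appending to it inside a loop can become very slow for long loops, because c() builds a new vector on every iteration. For a handful of files this does not matter.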
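One way to avoid growing a vector is to compute one TRUE/FALSE per dataset with vapply, which allocates the whole result up front. The sketch below assumes the data frames have already been read into a named list; the toy list here, with made-up file and column names, stands in for that.

```r
# Toy data frames standing in for the imported files (hypothetical names)
datasets <- list(
  "a.dta" = data.frame(id = 1:2, income_m = c(1, 2)),
  "b.dta" = data.frame(id = 1:2, age = c(30, 40)),
  "c.dta" = data.frame(weight_m = 5, height_m = 6)
)

# One logical value per dataset: does any variable name end in "_m"?
has_m <- vapply(
  datasets,
  function(d) any(grepl("_m$", names(d))),
  logical(1)
)

# Names of the datasets that contain at least one such variable
matching <- names(datasets)[has_m]
```

With real files, the list could be built with something like lapply over the file names, calling read.dta13 on each.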

Answer 2:

Here is one way to figure out which files have variables with names ending in "_m":

    # setup
    library(readstata13)  # provides read.dta13()
    setwd("C:\\Users\\DATA")
    # list.files() takes a regular expression, not a glob
    temp <- list.files(pattern = "\\.dta$")
    # logical vector to be filled in
    inFileVec <- logical(length(temp))

    # loop through each file
    for (i in seq_along(temp)) {
      # read file
      fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)

      # fill in vector with TRUE if any variable ends in "_m"
      inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
    }

In the last line of the loop, names returns the variable names, grepl returns a logical vector indicating whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether at least one TRUE was returned by grepl.

    # print out these file names
    temp[inFileVec]
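To see what names, grepl and any each contribute, here is a small step-by-step sketch on a toy data frame (the column names are invented). It also shows why the $ anchor matters, and one possible way to modify only the matching variables; the division by 10 is an arbitrary placeholder for whatever change is actually needed.

```r
# Toy data standing in for one imported file (hypothetical columns)
df <- data.frame(id = 1:3, income_m = c(10, 20, 30), id_main = c(7, 8, 9))

vars <- names(df)                  # "id" "income_m" "id_main"
ends_in_m <- grepl("_m$", vars)    # FALSE TRUE FALSE
any(ends_in_m)                     # TRUE

# Without the `$` anchor, "_m" matches anywhere in the name
contains_m <- grepl("_m", vars)    # FALSE TRUE TRUE

# Apply a change to the matching columns only (placeholder transformation)
df[ends_in_m] <- lapply(df[ends_in_m], function(x) x / 10)
```

The same logical vector can be reused to select, rename, or transform the matching columns before any further reshaping.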

Source: https://stackoverflow.com/questions/38128038/how-to-find-common-variables-in-a-list-of-datasets-reshape-them-in-r

Tags

loops

variables

dataset

reshape