R - identify which columns contain currency data $

问题

I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data all currency columns are identified as factor and I need to convert them to numeric. The dataset it too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (ex. starts with '$') and then pass that list of columns to be cleaned.

name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)

clean <- function(ttt){
as.numeric(gsub('[^a-zA-z0-9.]','', ttt))
}
sapply(emp_data, clean)

The issue in this example is that this sapply works on all columns resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to.. in this example salary.

回答1:

Using dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then change the accordingly.

library(dplyr)
library(stringr)

emp_data %>%
  mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
            ~as.numeric(str_replace_all(., '[$,]', '')))

回答2:

Taking advantage of the powerful parsers the readr package offers out of the box:

my_parser <- function(col) {
  # Try first with parse_number that handles currencies automatically quite well
  res <- suppressWarnings(readr::parse_number(col))
  if (is.null(attr(res, "problems", exact = TRUE))) {
    res
  } else {
    # If parse_number fails, fall back on parse_guess
    readr::parse_guess(col)
    # Alternatively, we could simply return col without further parsing attempt
  }
}

library(dplyr)

emp_data %>% 
  mutate(foo = "USD13.4",
         bar = "£37") %>% 
  mutate_all(my_parser)

#   name   salary  foo bar
# 1 john 23456.33 13.4  37
# 2 carl 45677.43 13.4  37
# 3 hank 76234.88 13.4  37

回答3:

A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.

doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols], 
                              function(x) as.numeric(gsub('\\$|,', '', x)))

来源：https://stackoverflow.com/questions/51364015/r-identify-which-columns-contain-currency-data

标签

currency

data-cleaning