Data cleaning of dollar values and percentage in R

Submitted by 假如想象 on 2019-11-28 12:01:47

Question


I've been looking for an R package that converts dollar values to plain numeric values, but I can't seem to find one (in the plyr package, for example). All I really need is to remove the $ sign and translate "M" and "K" into millions and thousands respectively.

To reproduce, you can use the code below:

require(XML)
theurl <- "http://www.kickstarter.com/help/stats"
html <- htmlParse(theurl)

allProjects <- readHTMLTable(html)[[1]]
names(allProjects) <-  c("Category","LaunchedProjects","TotalDollars","SuccessfulDollars","UnsuccessfulDollars","LiveDollars","LiveProjects","SuccessRate")

The data looks like this:

> tail(allProjects)
      Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars
8         Food            3,069     $16.79 M          $13.18 M             $2.78 M   $822.64 K
9      Theater            4,155     $13.45 M          $12.01 M             $1.22 M   $217.86 K
10      Comics            2,242     $12.88 M          $11.07 M           $941.31 K   $862.18 K
11     Fashion            2,799      $9.62 M           $7.59 M             $1.44 M   $585.98 K
12 Photography            2,794      $6.76 M           $5.48 M             $1.06 M   $220.75 K
13       Dance            1,185      $3.43 M           $3.13 M           $225.82 K     $71,322
   LiveProjects SuccessRate
8           189      39.27%
9           111      64.09%
10          134      46.11%
11          204      27.24%
12           83      36.81%
13           40      70.22%

I ended up writing my own function:

dollarToNumber <- function(vectorInput) {
  result <- c()
  for (dollarValue in vectorInput) {
    if (is.factor(dollarValue)) {  
      dollarValue = levels(dollarValue)
    }
    dollarValue <- gsub("(\\$|,)","",dollarValue)
    if(grepl(" K",dollarValue)) {
      dollarValue <- as.numeric(gsub(" K","",dollarValue)) * 1000
    } else if (grepl(" M",dollarValue)) {
      dollarValue <- as.numeric(gsub(" M","",dollarValue)) * 1000000
    }  
    if (!is.numeric(dollarValue)) {
      dollarValue <- as.numeric(dollarValue)
    }
    result <- append(result,dollarValue)
  }
    result
}

Then I used it to get what I wanted:

 allProjects <- transform(allProjects,
                          LaunchedProjects = as.numeric(gsub(",","",levels(LaunchedProjects))),
                          TotalDollars = dollarToNumber(TotalDollars),
                          SuccessfulDollars = dollarToNumber(SuccessfulDollars),
                          UnsuccessfulDollars = dollarToNumber(UnsuccessfulDollars),
                          LiveDollars = dollarToNumber(LiveDollars),
                          LiveProjects = as.numeric(LiveProjects),
                          SuccessRate = as.numeric(gsub("%","",SuccessRate))/100)

Which gives me the result below:

> str(allProjects)
'data.frame':   13 obs. of  8 variables:
 $ Category           : Factor w/ 13 levels "Art","Comics",..: 6 8 4 9 12 11 1 7 13 2 ...
 $ LaunchedProjects   : num  10006 1185 1860 20025 2242 ...
 $ TotalDollars       : num  1.11e+08 9.68e+07 6.89e+07 6.66e+07 4.31e+07 ...
 $ SuccessfulDollars  : num  90990000 84960000 59020000 59390000 34910000 ...
 $ UnsuccessfulDollars: num  16640000 7900000 6830000 5480000 3700000 ...
 $ LiveDollars        : num  3090000 3970000 3010000 1750000 4470000 ...
 $ LiveProjects       : num  13 7 6 11 3 10 8 4 1 2 ...
 $ SuccessRate        : num  0.394 0.338 0.382 0.541 0.334 ...
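
Two columns in this output deserve a second look. LiveProjects shows 13 7 6 11 ..., which are the integer codes of the factor rather than the counts in the table, because as.numeric() on a factor returns its codes. Likewise, levels(LaunchedProjects) returns the distinct levels in sorted order, not one value per row. A small self-contained sketch of the difference (the vector `f` is made-up illustration data, not taken from the Kickstarter table):

```r
# Made-up factor for illustration only
f <- factor(c("189", "40", "111"))

as.numeric(f)                 # 2 3 1: integer codes of the levels, not the values
levels(f)                     # "111" "189" "40": sorted distinct levels, not row order
as.numeric(as.character(f))   # 189 40 111: the values themselves, in row order
```

In the transform() call, as.numeric(gsub(",", "", as.character(LaunchedProjects))) and as.numeric(as.character(LiveProjects)) would keep the values aligned with their rows.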

I'm new to R and the code I've written feels ugly. Surely there's a better way to do this without reinventing the wheel? I've tried the apply, aaply, and ddply functions with no success (I was also trying to avoid the for loop). On top of that, when dealing with the SuccessRate column, I couldn't find anything like an as.percentage function in R. What am I missing?

Any guidance will be much appreciated!


Answer 1:


A solution that uses parse and eval:

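ToNumber <- function(X)
{
  # Strip "$" and ",", then turn each suffix into arithmetic: M -> *1e+6, K -> *1e+3, % -> *1e-2
  A <- gsub("\\$|,","",as.character(X))
  A <- gsub("M","*1e+6",A,fixed=TRUE)
  A <- gsub("K","*1e+3",A,fixed=TRUE)
  A <- gsub("%","*1e-2",A,fixed=TRUE)
  # Evaluate each string as an R expression; a non-numeric column raises an error and is returned unchanged
  B <- try(sapply(A,function(a){eval(parse(text=a))}),silent=TRUE)
  if (is.numeric(B)) return (as.numeric(B)) else return(X)
}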

#----------------------------------------------------------------------
# Example:
X <-
  read.table( header=TRUE,
              text = 
   'Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars  LiveProjects SuccessRate
        Food            3,069    "$16.79 M"         "$13.18 M"            "$2.78 M"  "$822.64 K" 189      39.27%
     Theater            4,155    "$13.45 M"         "$12.01 M"            "$1.22 M"  "$217.86 K" 111      64.09%
      Comics            2,242    "$12.88 M"         "$11.07 M"          "$941.31 K"  "$862.18 K" 134      46.11%
     Fashion            2,799     "$9.62 M"          "$7.59 M"            "$1.44 M"  "$585.98 K" 204      27.24%
 Photography            2,794     "$6.76 M"          "$5.48 M"            "$1.06 M"  "$220.75 K"  83      36.81%
       Dance            1,185     "$3.43 M"          "$3.13 M"          "$225.82 K"    "$71,322"  40      70.22%' )

numX <- as.data.frame(lapply(as.list(X),ToNumber))

options(width=1000)
print(numX,row.names=FALSE)

#    Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars LiveProjects SuccessRate
#        Food             3069     16790000          13180000             2780000      822640          189      0.3927
#     Theater             4155     13450000          12010000             1220000      217860          111      0.6409
#      Comics             2242     12880000          11070000              941310      862180          134      0.4611
#     Fashion             2799      9620000           7590000             1440000      585980          204      0.2724
# Photography             2794      6760000           5480000             1060000      220750           83      0.3681
#       Dance             1185      3430000           3130000              225820       71322           40      0.7022
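
Because eval(parse()) runs whatever text it is given, a column that happens to contain valid R code would be executed rather than rejected. The following is a minimal sketch of the same idea without evaluation. The name `to_number_safe` and its lookup table are illustrative assumptions, not part of the answer above. It validates each string against a strict pattern, splits off the suffix, and multiplies by the matching factor.

```r
# Hypothetical helper: convert strings such as "$16.79 M", "3,069" or "39.27%"
# to numbers without eval(parse()).
to_number_safe <- function(x) {
  s <- gsub("[$,[:space:]]", "", as.character(x))   # drop "$", "," and spaces
  # Accept only: digits, optional decimals, optional K / M / % suffix
  ok <- grepl("^-?[0-9]+(\\.[0-9]+)?[KM%]?$", s)
  if (!all(ok)) return(x)   # not a purely numeric column: return it unchanged
  suffix <- sub("^[^KM%]*", "", s)                  # "", "K", "M" or "%"
  value  <- as.numeric(sub("[KM%]$", "", s))
  multiplier <- c(K = 1e3, M = 1e6, "%" = 1e-2)
  factor_by <- ifelse(suffix == "", 1, multiplier[suffix])
  unname(value * factor_by)
}
```

It can be applied column by column in the same way as ToNumber, for example as.data.frame(lapply(X, to_number_safe)). Columns containing any value that does not match the pattern (including NA) come back untouched.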



Answer 2:


One thing that makes R different from other languages you might be used to is that it's better to do things in a "vectorized" way, to operate on a whole vector at a time rather than looping through each individual value. So your dollarToNumber function can be rewritten without the for loop:

dollarToNumber_vectorised <- function(vector) {
  # Want the vector as character rather than factor while
  # we're doing text processing operations
  vector <- as.character(vector)
  vector <- gsub("(\\$|,)","", vector)
  # Create a numeric vector to store the results in, this will give you
  # warning messages about NA values being introduced because the " K" values
  # can't be converted directly to numeric
  result <- as.numeric(vector)
  # Find all the "$N K" values, and modify the result at those positions
  k_positions <- grep(" K", vector)
  result[k_positions] <- as.numeric(gsub(" K","", vector[k_positions])) * 1000
  # Same for the "$N M" values
  m_positions <- grep(" M", vector)
  result[m_positions] <- as.numeric(gsub(" M","", vector[m_positions])) * 1000000
  return(result)
}

It still gives the same output as your original function:

> dollarToNumber_vectorised(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934
# Don't worry too much about this warning
Warning message:
In dollarToNumber_vectorised(allProjects$LiveDollars) :
  NAs introduced by coercion
> dollarToNumber(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934
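
The coercion warning comes from calling as.numeric() on strings that still carry a " K" or " M" suffix. If you would rather not see it, one possible variant strips the suffix before converting and applies the multiplier afterwards. This is only a sketch, and the name `dollar_to_number_quiet` is made up for illustration:

```r
# Hypothetical variant of dollarToNumber_vectorised that raises no coercion warning
dollar_to_number_quiet <- function(x) {
  x <- gsub("[$,]", "", as.character(x))   # "$822.64 K" -> "822.64 K"
  # Multiplier per element: 1000 for " K", 1e6 for " M", otherwise 1
  multiplier <- ifelse(grepl(" K", x, fixed = TRUE), 1e3,
                ifelse(grepl(" M", x, fixed = TRUE), 1e6, 1))
  # Remove the suffix first so that as.numeric() only ever sees a plain number
  as.numeric(sub(" [KM]$", "", x)) * multiplier
}
```

The trade-off is that a genuinely malformed value still produces NA with a warning, which is usually what you want to be told about.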


Source: https://stackoverflow.com/questions/15014333/data-cleaning-of-dollar-values-and-percentage-in-r
