Data cleaning in Excel sheets using R

后端 未结 3 1202
暗喜
暗喜 2021-01-27 01:30

I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S

相关标签:
3条回答
  • 2021-01-27 01:45

    There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.

    You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.

    The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean

    your_df <- data.frame(ID=1:2000)
    your_df$BranchNames <- sample(letters,2000, replace=T)
    your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
    unique.names <- sort(unique(your_df$BranchNames))
    

    Now that we have a sorted list of unique values, we can create a listing of recodes:

    Let's say we wanted to rename A through G as just A

    your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
    

    And you'd repeat the process above eliminating or group the unique names as appropriate.

    0 讨论(0)
  • 2021-01-27 01:49

    The car package has a recode function. See it's help page for worked examples.

    In fact an argument could be made that this should be a closed question:

    Why is recode in R not changing the original values?

    How to recode a variable to numeric in R?

    Recode/relevel data.frame factors with different levels

    And a few more questions easily identifiable with a search: [r] recode

    EDIT: I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.

    my_recode <- function(fac, levslist){ nfac <- factor(fac);
        inlevs <- levels(nfac);
        othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
          # levslist of the form ::::    list(
          #     animal = c("cow", "pig"),
          #     bird = c("eagle", "pigeon") )
     levels(nfac)<- c(levslist, all_others =othrlevs); nfac}
    
     df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"), 
                  stringsAsFactors = FALSE)
     df$type <- my_recode(df$name, list(
         animal = c("cow", "pig"),
         bird = c("eagle", "pigeon") ) )
     df
    #-----------
        name       type
    1    cow     animal
    2    pig     animal
    3  eagle       bird
    4 pigeon       bird
    5  zebra all_others
    
    0 讨论(0)
  • 2021-01-27 01:57

    You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)

    If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.

    It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.

    The videos at the official web site are useful.

    0 讨论(0)
提交回复
热议问题