Data cleaning in Excel sheets using R

暗喜 2021-01-27 01:30

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the branch name is specified as (Computer Science and Engineering, C.S.E, C.S

3 answers
  •  小蘑菇 (original poster)
    2021-01-27 01:45

    There is no one-size-fits-all solution for this type of problem. From what I understand, you have branch names that are inconsistently labelled.

    You would like to see C.S.E. everywhere, but what you actually have is CS, Computer Science, CSE, etc., and perhaps a number of other branch names with the same problem.

    The first thing I would do is get a unique list of the branch names in the file. Here is an example using R's built-in letters vector so you can see what I mean:

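    your_df <- data.frame(ID = 1:2000)
    # Stand-in for the real column: 2000 random lowercase letters
    your_df$BranchNames <- sample(letters, 2000, replace = TRUE)
    your_df$BranchNames <- as.character(your_df$BranchNames) # only needed if the column is a factor
    # Sorted vector of the distinct values
    unique.names <- sort(unique(your_df$BranchNames))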
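
    To make the "unique list" step more concrete, here is a minimal sketch on a few made-up labels (the branch vector is hypothetical, not your data). It counts each spelling with table(), then strips dots, whitespace, and case differences, which is an extra normalisation step I am assuming is acceptable for your data, so that fewer distinct values need manual review.

    ```r
    # Hypothetical labels standing in for the real BranchNames column
    branch <- c("C.S.E", "CSE", "Computer Science and Engineering",
                "C.S", "cse ", "CSE")

    # Count each distinct spelling, most frequent first
    counts <- sort(table(branch), decreasing = TRUE)
    print(counts)

    # Drop dots and whitespace and upper-case everything,
    # so trivially different spellings collapse to one key
    branch_key <- toupper(gsub("[.[:space:]]", "", branch))
    unique_keys <- sort(unique(branch_key))
    print(unique_keys)
    ```

    Whatever survives this kind of normalisation (here the full name and the short "CS" form) still has to be recoded by hand.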
    

    Now that we have a sorted list of unique values, we can create a listing of recodes:

    Let's say we wanted to recode the letters a through g (the first seven unique values) as just "A":

    # unique.names[1:7] is "a" to "g", provided every letter occurs in the sample
    your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
    

    You would then repeat the process above, eliminating or grouping the unique names as appropriate.
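
    If there are many groups, repeating that line by hand gets tedious. One possible alternative is a lookup table kept as a named character vector. This is only a sketch: the data frame below is made up, the "Mech" and "M.E" labels are invented for illustration, and recodes is a hypothetical name.

    ```r
    # Hypothetical data; "Mech" and "M.E" are invented for illustration
    your_df <- data.frame(
      ID = 1:6,
      BranchNames = c("Computer Science and Engineering", "C.S.E", "C.S",
                      "CSE", "Mech", "M.E"),
      stringsAsFactors = FALSE
    )

    # Lookup table: names are the raw spellings, values are the canonical label
    recodes <- c(
      "Computer Science and Engineering" = "C.S.E",
      "C.S" = "C.S.E",
      "CSE" = "C.S.E",
      "M.E" = "Mech"
    )

    # Replace only the values found in the lookup; leave everything else untouched
    hit <- your_df$BranchNames %in% names(recodes)
    your_df$BranchNames[hit] <- unname(recodes[your_df$BranchNames[hit]])

    print(table(your_df$BranchNames))
    ```

    The advantage of this layout is that all the recoding decisions sit in one place, so they are easy to review and extend as new spellings turn up.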
