Data cleaning in Excel sheets using R

后端未结

关注

 3  1207

I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S

相关标签:

3条回答

小蘑菇

2021-01-27 01:45
There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.

You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.

The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean
```
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
```
Now that we have a sorted list of unique values, we can create a listing of recodes:

Let's say we wanted to rename A through G as just A
```
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
```
And you'd repeat the process above eliminating or group the unique names as appropriate.
0 讨论(0)
发布评论:

提交评论
- 加载中...
无人及你

2021-01-27 01:49
The car package has a recode function. See it's help page for worked examples.

In fact an argument could be made that this should be a closed question:

Why is recode in R not changing the original values?

How to recode a variable to numeric in R?

Recode/relevel data.frame factors with different levels

And a few more questions easily identifiable with a search: [r] recode

EDIT: I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
```
my_recode <- function(fac, levslist){ nfac <- factor(fac);
    inlevs <- levels(nfac);
    othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
      # levslist of the form ::::    list(
      #     animal = c("cow", "pig"),
      #     bird = c("eagle", "pigeon") )
 levels(nfac)<- c(levslist, all_others =othrlevs); nfac}

 df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"), 
              stringsAsFactors = FALSE)
 df$type <- my_recode(df$name, list(
     animal = c("cow", "pig"),
     bird = c("eagle", "pigeon") ) )
 df
#-----------
    name       type
1    cow     animal
2    pig     animal
3  eagle       bird
4 pigeon       bird
5  zebra all_others
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
攒了一身酷

2021-01-27 01:57

You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)

If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.

It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.

The videos at the official web site are useful.

0 讨论(0)
发布评论:

提交评论
- 加载中...