Data cleaning in Excel sheets using R

暗喜 2021-01-27 01:30

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the branch name is specified as (Computer Science and Engineering, C.S.E, C.S

3 answers

小蘑菇 (original poster)

2021-01-27 01:45
There is no one-size-fits-all solution for this type of problem. From what I understand, you have branch names that are inconsistently labelled.

You would like to see C.S.E., but what you actually have is CS, Computer Science, CSE, etc., and perhaps a number of other branch names that are also inconsistent.

The first thing I would do is get a unique list of the branch names in the file. I'll provide an example using the built-in `letters` vector so you can see what I mean:
```
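# Toy data: 2000 rows, with a random lowercase letter standing in for a branch name
your_df <- data.frame(ID = 1:2000)
your_df$BranchNames <- sample(letters, 2000, replace = TRUE)
your_df$BranchNames <- as.character(your_df$BranchNames) # only needed if the column is a factor
# Sorted vector of the distinct values currently in the column
unique.names <- sort(unique(your_df$BranchNames))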
```
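With real branch names, the same inspection step might look like the sketch below. The sample values are made up for illustration (the actual column would come from your Excel sheet), and `table()` is used here in addition to `unique()` so you can also see how often each spelling occurs.

```r
# Hypothetical sample values; the real column would come from the Excel sheet
branches <- c("Computer Science and Engineering", "C.S.E", "CSE",
              "cse ", "C.S.E", "Mechanical")

# Frequency of each distinct spelling, most common first
variant_counts <- sort(table(branches), decreasing = TRUE)
print(variant_counts)

# Sorted distinct spellings, to review by eye
unique_branches <- sort(unique(branches))
print(unique_branches)
```

Note that `"cse "` (with a trailing space) counts as its own value, which is the kind of inconsistency this step is meant to surface.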
Now that we have a sorted list of unique values, we can create a listing of recodes:

Let's say we wanted to recode a through g as just a:
```
# assumes all 26 letters occur in the sample, so unique.names[1:7] is "a" to "g"
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "a"
```
You'd then repeat this process, eliminating or grouping the unique names as appropriate.
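If there are many variants, repeating that assignment by hand gets tedious. One possible alternative, shown here only as a sketch, is to keep the recodes in a named vector and look every value up at once. The data, the `BranchClean` column, and the canonical labels below are all assumptions made for illustration; substitute the variants you actually find in your own list of unique names.

```r
# Hypothetical data; these branch spellings are made up for illustration
your_df <- data.frame(
  ID = 1:6,
  BranchNames = c("Computer Science and Engineering", "C.S.E", "CSE",
                  "cse ", "Mechanical", "Mech"),
  stringsAsFactors = FALSE
)

# Normalise case, surrounding whitespace and punctuation before matching
key <- toupper(trimws(your_df$BranchNames))
key <- gsub("[^A-Z ]", "", key)

# Named vector: normalised variant -> canonical label (labels are assumptions)
recodes <- c(
  "COMPUTER SCIENCE AND ENGINEERING" = "C.S.E.",
  "CSE"                              = "C.S.E.",
  "MECHANICAL"                       = "Mechanical Engineering",
  "MECH"                             = "Mechanical Engineering"
)

# Look up each key; values without a defined recode come back as NA
cleaned <- unname(recodes[key])

# Keep the original value wherever no recode was defined
your_df$BranchClean <- ifelse(is.na(cleaned), your_df$BranchNames, cleaned)
print(your_df)
```

Writing the result to a new column rather than overwriting `BranchNames` makes it easy to compare the original and cleaned values before committing to the change.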