How to use R to check data consistency (make sure no contradiction between case and value)?

问题

Let's say I have:

Person   Movie    Rating
Sally    Titanic  4
Bill     Titanic  4
Rob      Titanic  4
Sue      Cars     8
Alex     Cars     **9**
Bob      Cars     8

As you can see, there is a contradiction for Alex. All the same movies should have the same ranking, but there was a data error entry for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in excel or something? Is there a command on R that will return all the cases where there are data contradictions between two columns?

Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?

Thanks

回答1:

Here's a data.table solution

Define the function

Myfunc <- function(x) {
  temp <- table(x)  
  names(temp)[which.max(temp)]
}

library(data.table)

Create a column with the correct rating (by reference)

setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
#    Person   Movie Rating CorrectRating
# 1:  Sally Titanic      4             4
# 2:   Bill Titanic      4             4
# 3:    Rob Titanic      4             4
# 4:    Sue    Cars      8             8
# 5:   Alex    Cars      9             8
# 6:    Bob    Cars      8             8

Or If you want to remove the "bad" ratings

df[Rating == CorrectRating][]
#    Person   Movie Rating CorrectRating
# 1:  Sally Titanic      4             4
# 2:   Bill Titanic      4             4
# 3:    Rob Titanic      4             4
# 4:    Sue    Cars      8             8
# 5:    Bob    Cars      8             8

回答2:

It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.

You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group), along with the "Mode" function defined in this answer that finds the most common item in a vector:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))

This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:

newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))

If you want to fix them, do

newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))

You can test the above with a reproducible version of your data:

dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))

回答3:

If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.

tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)

which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]

回答4:

I would add a variable for mode so I can see if there is anything weird going on with the data, like missing data, text, many different answers instead of the rare anomaly,etc. I used "x" as your dataset

# one of many functions to find mode, could use any other
modefunc <- function(x){
  names(table(x))[table(x)==max(table(x))]
}

# add variable for mode split by Movie
x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc)

# do whatever you want with the records that are different
x[x$Rating != x$mode, ]

If you want another function for mode, try other functions for mode

来源：https://stackoverflow.com/questions/27463155/how-to-use-r-to-check-data-consistency-make-sure-no-contradiction-between-case

标签

data-cleaning