Using R, getting a “Can't bind data because some arguments have the same name” error with dplyr::select

不知归路 2021-01-05 03:48
# Use read.table() to create data frames from the unzipped files below
x.train <- read.table("UCI HAR Dataset/train/X_train.txt")
subject.train <- re         


        
4 answers
  •  醉梦人生
    2021-01-05 04:21

    I had exactly the same problem and I think I'm looking at the same dataset as you. It's motion sensor data from a smart phone, isn't it?

    The problem is exactly what the error message says! That dang set has duplicate column names. Here's how I explored it. I couldn't use your dput commands, so I couldn't try out your data. I'm showing my code and results. I suggest you substitute your variable, dataset_test, where I have samsungData.

    Here's the error. If you just select the dataset, but don't indicate the columns, the error message identifies the duplicates.

    select(samsungData)
    

    That gave me this error, which is just what your own dplyr error was trying to tell you.

    Error: Columns "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-9,16", "fBodyAcc-bandsEnergy()-17,24", "fBodyAcc-bandsEnergy()-25,32", "fBodyAcc-bandsEnergy()-33,40", ... must have a unique name

    Then I wanted to see where that first column was duplicated. (I don't think I'll ever work well with regular expressions, but this one made me mad and I wanted to find it.)

    has_dupe_col <- grep("fBodyAcc\\-bandsEnergy\\(\\)\\-1,8", names(samsungData))
    names(samsungData)[has_dupe_col]
    

    Results:

    [1] "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8"
    

    That showed me that the same column name appears in three positions. That won't play nicely in dplyr.
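
If the regular expression is the painful part, the same lookup can be done without escaping anything. The sketch below is base R only and runs on a small made-up data frame (`toy` and its column names are stand-ins, not the real dataset); it compares an exact match with `==` against `grep()` with `fixed = TRUE`, which treats the pattern as a literal string.

```r
# Hypothetical stand-in for samsungData: four columns, one name used three times.
toy <- data.frame(1:2, 3:4, 5:6, 7:8)
names(toy) <- c("fBodyAcc-bandsEnergy()-1,8",
                "tBodyAcc-mean()-X",
                "fBodyAcc-bandsEnergy()-1,8",
                "fBodyAcc-bandsEnergy()-1,8")

target <- "fBodyAcc-bandsEnergy()-1,8"

# Exact comparison: positions whose name equals the target.
dupe_pos <- which(names(toy) == target)

# Literal substring match: no escaping needed because fixed = TRUE
# switches off regular-expression interpretation.
dupe_pos_fixed <- grep(target, names(toy), fixed = TRUE)

print(dupe_pos)
print(dupe_pos_fixed)
```

Note that `fixed = TRUE` still matches substrings, so on the real data it would also pick up any longer name that contains the target; the `==` version only returns exact matches.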

    Then I wanted to see a frequency table for all the column names and call out the duplicates.

    names_freq <- as.data.frame(table(names(samsungData)))
    names_freq[names_freq$Freq > 1, ]
    

    A bunch of them appear three times! Here are just a few.

                                    Var1 Freq
    9        fBodyAcc-bandsEnergy()-1,16    3
    10       fBodyAcc-bandsEnergy()-1,24    3
    11        fBodyAcc-bandsEnergy()-1,8    3
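
As an alternative to the frequency table, base R's `duplicated()` can list the repeated names together with the positions where each one occurs. This is only a sketch on a made-up vector of names (`col_names` is hypothetical); on the real data you would use `names(samsungData)` instead.

```r
# Hypothetical column names standing in for names(samsungData).
col_names <- c("a", "b", "a", "c", "b", "a")

# duplicated() flags the second and later occurrences of each name;
# unique() collapses them to one entry per repeated name.
dupe_names <- unique(col_names[duplicated(col_names)])

# For each repeated name, collect every position where it appears.
dupe_positions <- lapply(
  setNames(dupe_names, dupe_names),
  function(nm) which(col_names == nm)
)

print(dupe_names)
print(dupe_positions)
```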
    

    Conclusion:

    The tool (dplyr) isn't broken; the data is defective. If you want to use dplyr to select from this dataset, you're going to have to locate those duplicate column names and do something about them. Maybe you change the column names. (Do that before handing the data to dplyr, for example with base R's make.unique() on names(); mutate creates or modifies columns rather than renaming them, and dplyr verbs generally refuse a data frame with duplicated names.) On the other hand, maybe they're supposed to be duplicated and they're there because they're a time series or some iteration of experimental observations. Maybe then what you need to do is merge those columns into one and provide another dimension (variable) to distinguish them.
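
If renaming turns out to be the right call, one way to do it is in base R before the data frame ever reaches dplyr. The sketch below uses a made-up data frame (`toy` and its column names are hypothetical) and `make.unique()`, which appends a numeric suffix to the second and later occurrences of each name. Whether a suffix is a meaningful label for your columns is a separate question that the data has to answer.

```r
# Hypothetical data frame with a duplicated column name.
toy <- data.frame(1:2, 3:4, 5:6)
names(toy) <- c("energy", "energy", "mean")

# make.unique() keeps the first occurrence as-is and suffixes the rest.
names(toy) <- make.unique(names(toy), sep = "_")

print(names(toy))
```

After this step every name is distinct, so the duplicate-name complaint from select should no longer apply.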

    That's the analysis part of data analysis. You'll have to dig into the data to see what the right answer is. Either that, or the question you're trying to answer need not even include those duplicate columns, in which case you throw them away and sleep peacefully.
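
If you decide the duplicated columns are not needed at all, throwing them away can also be done in base R. This is a minimal sketch on a made-up data frame (`toy`, `is_dupe`, and `clean` are hypothetical names); it drops every column whose name occurs more than once, including the first occurrence.

```r
# Hypothetical data frame: "energy" appears twice, the other names once.
toy <- data.frame(1:2, 3:4, 5:6, 7:8)
names(toy) <- c("subject", "energy", "energy", "mean")

# TRUE for every column whose name is shared with another column.
is_dupe <- names(toy) %in% names(toy)[duplicated(names(toy))]

# Keep only the columns with a name that occurs exactly once.
clean <- toy[, !is_dupe, drop = FALSE]

print(names(clean))
```

To keep the first occurrence of each name instead, `toy[, !duplicated(names(toy)), drop = FALSE]` would do that, at the cost of silently discarding the later columns.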

    Welcome to data science! At best, it's just 10% cool math and machine learning. 90% is putting on gloves and a mask and wiping up crap like this in your data.
