Question
In the dataset I'm using, the outcome I'm trying to predict is an assessment with four levels: 1 [Good] to 4 [Bad].
My model seems to work using the polr function to predict values with ordered logistic regression, though it gives me the warning message "In cbind(race, partisanship, sex, age) : number of rows of result is not a multiple of vector length (arg 4)", because, as far as I can see, some cells got imported as blanks instead of NAs.
Here's what the output looks like:
mydata <- read.csv("~/Desktop/R/mydata.csv")
attach(mydata)
> y <- as.factor(assessment)
> x <- cbind(race, partisanship, sex, age)
Warning message:
In cbind(race, partisanship, sex, age) :
number of rows of result is not a multiple of vector length (arg 4)
>
> olr <- polr(y ~ x, mydata)
> summary(olr)
Re-fitting to get Hessian
Call:
polr(formula = y ~ x, data = mydata)
Coefficients:
                 Value Std. Error t value
xrace          0.49485   0.214426  2.3078
xpartisanship -0.00990   0.002942 -3.3654
xsex          -0.21304   0.299763 -0.7107
xage           0.01486   0.006812  2.1819
Intercepts:
      Value Std. Error t value
1|2 -1.4763     0.8253 -1.7887
2|3  1.8049     0.8237  2.1913
3|4  2.4739     0.8290  2.9842
Residual Deviance: 667.1306
AIC: 681.1306
(1401 observations deleted due to missingness)
I tried to combat the problem by adding na.strings = "" and x[x==""] <- NA before I define x. It looks better in the summary output, but I still get the warning.
It's the race column that for some reason imports missing cells as blanks instead of NAs: when I look at the .csv file using View(mydata) in RStudio, I see blanks instead of NAs in the race column, while all the other columns have NAs where I'm missing data. When I look at the output, though, it shows NAs.
For example, in RStudio, row 7 already shows an NA for partisanship, but row 10 shows a blank for race:
> head(x, 10)
      race partisanship age
 [1,]    2         97.4  80
 [2,]    2         96.7  75
 [3,]    3         95.0  70
 [4,]    3         87.7  65
 [5,]    3         85.2  60
 [6,]    3          4.7  50
 [7,]    3           NA  40
 [8,]    3          9.1  30
 [9,]    3          1.1  80
[10,]   NA         10.2  75
Does anybody have any ideas on how I can get rid of this warning? And is there a way to import all .csv files with NAs so I know everything's lining up properly?
EDIT: If it helps, after doing a bit more research, it looks like the missing values showing up as blanks instead of NAs stem from manual editing of the data to clean it up before loading it into R. Most of the data I have to import requires a bit of clean-up first, so I don't know how to get around doing this.
Thanks!
Answer 1:
It's getting to be a long string of comments, so let me put it into an answer.
It appears, from the cbind warning, that age, sex, partisanship, and race are not the same length. This is a serious problem. It means that somewhere in your data, the link between age[n], sex[n], partisanship[n], and race[n] has been broken.
This might be the result of running na.omit on one or more of the vectors. NAs should be there when you don't know an answer. If you know the age, sex, partisanship, and race of every participant except for the age of participant 12, you need an NA in age[12] so that everything lines up. If you remove the NA, what was in age[13] ends up in age[12] and so gets matched with sex[12], partisanship[12], and race[12] instead of with sex[13], partisanship[13], and race[13]. If age was originally, say, 42 long, it is now 41 long, so age[42] no longer has any value, and R is warning you that it forced things to work by wrapping around and using age[1] in row 42.
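Here is a minimal sketch of that wrap-around on made-up vectors. The values and the length of 5 are purely illustrative and are not taken from your data:

```r
# Hypothetical toy data: five participants, age unknown for participant 3
age <- c(80, 75, NA, 65, 60)
sex <- c(1, 2, 1, 2, 1)

# Dropping the NA shortens age to 4 elements and shifts later values up
age_short <- age[!is.na(age)]

# cbind() recycles the shorter vector to fill 5 rows and emits the
# "not a multiple of vector length" warning
x <- cbind(age_short, sex)
print(x)

# Row 3 now pairs participant 4's age with participant 3's sex,
# and row 5 reuses age_short[1]
```

The result has the right number of rows, which is why a model can still be fitted, but from the third row on the ages belong to the wrong participants.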
Does that make sense?
So you need to figure out how the vectors became different lengths in the first place.
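One way to check this is sketched below. It reads the file so that blank cells become NA and then confirms that every column has the same length. The CSV text is a hypothetical stand-in for mydata.csv, and the letter codes for race are invented so that the column is read as text, which is the case where blanks survive the import:

```r
# Hypothetical CSV text standing in for mydata.csv; the race cell in the
# second data row is blank and one partisanship cell is a literal NA
csv_text <- "assessment,race,partisanship,sex,age
1,A,97.4,1,80
2,,96.7,2,75
3,B,NA,1,70
4,B,87.7,2,65"

# Treat empty cells as well as the string NA as missing on import;
# strip.white = TRUE trims surrounding spaces before that comparison
mydata <- read.csv(text = csv_text,
                   na.strings = c("", "NA"),
                   strip.white = TRUE)

# Columns of a data frame always share one length, so rows stay aligned
col_lengths <- sapply(mydata, length)
print(col_lengths)

# Missing values per column, to confirm the blank became NA
na_counts <- colSums(is.na(mydata))
print(na_counts)
```

If the lengths agree here but cbind still warns, one possible cause (an assumption, since your workspace isn't shown) is a leftover object such as age in the workspace, which takes precedence over the attached column of the same name. Referring to the columns through the data frame, for example polr(as.factor(assessment) ~ race + partisanship + sex + age, data = mydata), avoids both attach and cbind and lets polr drop the incomplete rows itself.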
Source: https://stackoverflow.com/questions/23145430/how-can-i-make-sure-all-my-csv-data-gets-imported-as-na-instead-of-blank-in-r