My data set testdata has 2 variables named PWGTP and AGEP. The data are in a .csv file.
When I do:
> head(testdata)
The variables show up as
ï..PWGTP AGEP
23 55
26 56
24 45
22 51
25 54
23 35
So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.
HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:
Error: id variables not found in data: ï..PWGTP
Similarly, when I use some function to refer to the variable PWGTP, I get the message:
Error: id variables not found in data: PWGTP
2 Questions:
1. Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?
2. It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?
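This is a UTF-8 BOM (Byte Order Mark) issue: the file starts with the three BOM bytes, and here R has read them as part of the first column name.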
To prevent this from happening, 2 options:
- Save your file as UTF-8 without BOM / signature -- or --
- Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv
Example:
mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
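To try this without touching the real data, here is a minimal sketch that writes a small CSV with a BOM to a temporary file and reads it back. The temporary file and its contents are made up for illustration; only the fileEncoding argument comes from the answer above.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The file and its two data rows are invented for this illustration.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("PWGTP,AGEP\n23,55\n26,56\n"), con)
close(con)

# Inspect the first three bytes to confirm that a BOM is present.
bom <- readBin(tmp, what = "raw", n = 3)
print(bom)

# Declaring the encoding lets R discard the BOM before parsing the header.
fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(fixed))

unlink(tmp)
```

Checking the first three bytes with readBin is also a quick way to confirm that a BOM is really the cause before changing how the file is saved.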
It is possible that the column name in the file is something like 1 PWGTP, i.e. with a space between a number (or some other character) and the name, which results in .. when the file is read into R. One way to prevent this would be to use check.names = FALSE in read.csv/read.table:
d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
However, it is better not to have a name that starts with a number or contains spaces.
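As a rough illustration of what check.names does, the sketch below reads the same in-memory CSV twice. The header 1 PWGTP is only the guessed name from this answer, not the OP's actual header, and the data rows are invented.

```r
# In-memory CSV with a hypothetical header that is not a valid R name.
csv_text <- "1 PWGTP,AGEP\n23,55\n26,56\n"

# Default (check.names = TRUE): the header goes through make.names().
checked <- read.csv(text = csv_text)
print(names(checked))      # "X1.PWGTP" "AGEP"

# check.names = FALSE: the header is kept exactly as written in the file.
unchecked <- read.csv(text = csv_text, check.names = FALSE)
print(names(unchecked))    # "1 PWGTP" "AGEP"

# A non-syntactic name has to be quoted or accessed with [[ ]].
print(unchecked[["1 PWGTP"]])
```

The last line shows the cost of keeping such a name: it cannot be used with $ unless it is wrapped in backticks.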
So, if the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to change the column names:
names(d1) <- sub(".*\\.+", "", names(d1))
As an example
sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
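Note that this pattern is applied to every column name and removes everything up to the last run of dots, so a name such as my.var would also be shortened. A narrower alternative, sketched here on a made-up data frame standing in for testdata (the my.var column is hypothetical), is to rename the first column by position:

```r
# Hypothetical stand-in for testdata; my.var is an invented extra column.
testdata <- data.frame(x = c(23, 26), AGEP = c(55, 56), my.var = c(1, 2))
names(testdata)[1] <- "\u00ef..PWGTP"   # the mangled name, written with an escape

# The regex approach changes any name that contains a dot.
stripped <- sub(".*\\.+", "", names(testdata))
print(stripped)            # PWGTP, AGEP, var

# Renaming by position changes only the first column.
names(testdata)[1] <- "PWGTP"
print(names(testdata))     # PWGTP, AGEP, my.var
```

Renaming by position does not depend on how the mangled name is displayed or typed, which may help when R reports that it cannot find ï..PWGTP.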
Source: https://stackoverflow.com/questions/37802797/prevent-variable-name-getting-mangled-by-read-csv-read-table