I am trying to read in some data from a text file that looks like this:
2009-08-09 - 2009-08-15 0 2 0
2009-08-16 - 2009-08-22 0 1 0
2009-08-23
The file you are reading is probably using some encoding other than ASCII.
?read.table
shows
read.table(file, header = FALSE, sep = "", quote = "\"'",
...
fileEncoding = "", encoding = "unknown")
fileEncoding: character string: if non-empty declares the encoding used
on a file (not a connection) so the character data can be
re-encoded. See 'file'.
So perhaps try setting the fileEncoding parameter. If you don't know the encoding, try "UTF-8" or "CP1252". If neither works, pastebin a snippet of your actual file and we may be able to identify the encoding.
What you see here:
ÿþ
is the Byte Order Mark (BOM) for UTF-16LE or UCS-2LE. See Wikipedia (Byte Order Mark) for an explanation. Your file might contain characters from non-Latin scripts that need this encoding, or it might have been created by some Windows software that saves files with a BOM. The BOM is placed at the very beginning of the file, before all other data.
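To confirm that a file really starts with a BOM, one option is to look at its first few bytes. The sketch below is only an illustration: detect_bom is a hypothetical helper name, not part of R, and it only knows the UTF-8 and UTF-16 signatures (a UTF-32LE file also starts with FF FE and would be reported as UTF-16LE).

```r
# Hypothetical helper: report which BOM (if any) a file starts with.
detect_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = 4L)
  starts_with <- function(sig) {
    length(bytes) >= length(sig) && all(bytes[seq_along(sig)] == sig)
  }
  if (starts_with(as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8")
  if (starts_with(as.raw(c(0xFF, 0xFE)))) return("UTF-16LE")  # shows up as ÿþ in Latin-1
  if (starts_with(as.raw(c(0xFE, 0xFF)))) return("UTF-16BE")
  NA_character_
}

# Demo on a throwaway file: BOM FF FE followed by "2" encoded as UTF-16LE.
tmp <- tempfile(fileext = ".dat")
writeBin(as.raw(c(0xFF, 0xFE, 0x32, 0x00)), tmp)
detect_bom(tmp)
```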
R sees these characters and believes the data start here. Try:
(1) If you don't need this encoding, simply open your data in a text editor (like Vim), change the encoding, save, and read into R. (In Vim do :write ++enc=utf-8 new_file_name.txt, then close the file and open the newly saved version, then do :set nobomb, just to be sure, then :wq.)
(2) If you need the encoding or don't want to go through a text editor, tell R what encoding the file is in. You might experiment with:
read.table("file.dat", fileEncoding = "UTF-16")
read.table("file.dat", fileEncoding = "UTF-16LE")
read.table("file.dat", fileEncoding = "UTF-16-LE")
read.table("file.dat", fileEncoding = "UCS-2LE")
If none of these work, try the solution given in this related question: How to detect the right encoding for read.csv? Also check the R Data Import/Export manual, which has a section on files with a BOM.
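For option (2), here is a minimal, self-contained sketch. It assumes your platform's iconv supports the name "UTF-16LE" (encoding names are platform-dependent), and it builds a small temporary file with a BOM to stand in for file.dat, using the two sample rows from the question.

```r
# Build a small UTF-16LE file with a BOM, mimicking the file in the question.
txt <- "2009-08-09 - 2009-08-15 0 2 0\n2009-08-16 - 2009-08-22 0 1 0\n"
body <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
path <- tempfile(fileext = ".dat")  # hypothetical stand-in for file.dat
writeBin(c(as.raw(c(0xFF, 0xFE)), body), path)

# Declare the encoding so R re-encodes the bytes before parsing.
dat <- read.table(path, fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)
str(dat)
```

If the first value still shows stray characters in front of the date, try one of the other encoding names listed above.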
Your separator could be spaces rather than tabs. If you leave the sep argument as "", it will use any kind of white space.
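A quick way to see this, as a sketch using the sample rows from the question (the second row is deliberately rewritten with tabs for illustration):

```r
# One row separated by spaces, one by tabs.
rows <- c("2009-08-09 - 2009-08-15 0 2 0",
          "2009-08-16\t-\t2009-08-22\t0\t1\t0")

# sep = "" (the default) splits on any run of white space.
dat <- read.table(text = rows, sep = "", stringsAsFactors = FALSE)
dim(dat)
```

Both rows are split into the same six columns, so the kind of white space does not matter with the default separator.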
EDIT: Actually, the encoding does sound more likely as the source of the problem.
Read in the file with readLines, then check the encoding with Encoding.
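A rough sketch of that check, using a throwaway ASCII file in place of your data. Keep in mind that Encoding only reports the encoding a string is marked with ("unknown", "latin1", "UTF-8" or "bytes"); it does not detect the encoding of the file. validUTF8 is a separate base R check for whether the bytes form valid UTF-8.

```r
# Throwaway file standing in for the real data (pure ASCII here).
tmp <- tempfile(fileext = ".txt")
writeLines("2009-08-09 - 2009-08-15 0 2 0", tmp)

x <- readLines(tmp)
Encoding(x)    # declared mark for each line; pure ASCII strings are "unknown"
validUTF8(x)   # TRUE where the bytes are valid UTF-8
```

If readLines warns about embedded nuls on your real file, that is a strong hint that the file is UTF-16 rather than an 8-bit encoding.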