Question
Sometimes a Byte Order Mark (BOM) is present at the beginning of a .CSV file. The symbol is not visible when you open the file in Notepad or Excel; however, when you read the file in R using various methods, you will see different symbols in the name of the first column. Here is an example.
A sample CSV file with a BOM at the beginning:
ID,title,clean_title,clean_title_id
1,0 - 0,,0
2,"""0 - 1,000,000""",,0
27448,"20yr. rope walker
igger",Rope Walker Igger,1832700817
Reading through read.csv in the base R package:
(x1 = read.csv("file1.csv",stringsAsFactors = FALSE))
# ï..ID raw_title semi_clean semi_clean_id
# 1 1 0 - 0 0
# 2 2 "0 - 1,000,000" 0
# 3 27448 20yr. rope walker\nigger Rope Walker Igger 1832700817
Reading through fread in the data.table package:
(x2 = data.table::fread("file1.csv"))
# ID raw_title semi_clean semi_clean_id
# 1: 1 0 - 0 0
# 2: 2 ""0 - 1,000,000"" 0
# 3: 27448 20yr. rope walker\rigger Rope Walker Igger 1832700817
Reading through read_csv in the readr package:
(x3 = readr::read_csv("file1.csv"))
# <U+FEFF>ID raw_title semi_clean semi_clean_id
# 1 1 0 - 0 <NA> 0
# 2 2 "0 - 1,000,000" <NA> 0
# 3 27448 20yr. rope walker\rigger Rope Walker Igger 1832700817
Notice the different characters in front of the variable name ID.
Here are the results when you run names on each of these:
names(x1)
# [1] "ï..ID" "raw_title" "semi_clean" "semi_clean_id"
names(x2)
# [1] "ID" "raw_title" "semi_clean" "semi_clean_id"
names(x3)
# [1] "ID" "raw_title" "semi_clean" "semi_clean_id"
In x3, there is nothing 'visible' in front of ID, but when you check:
names(x3)[[1]]=="ID"
# [1] FALSE
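To see why the comparison fails, it can help to look at the code points and bytes behind the name rather than its printed form. The following is a minimal sketch, assuming the hidden character is U+FEFF (which the `<U+FEFF>` in the read_csv output suggests); `first_name` is a hypothetical stand-in for `names(x3)[[1]]`.

```r
# Hypothetical stand-in for names(x3)[[1]]: "ID" preceded by U+FEFF
first_name <- "\uFEFFID"

# Prints like "ID", but the comparison with the plain string fails
is_plain_id <- first_name == "ID"

# Code points of each character; U+FEFF is 65279 in decimal
code_points <- utf8ToInt(first_name)

# Raw UTF-8 bytes; a UTF-8 BOM is the byte sequence ef bb bf
raw_bytes <- charToRaw(enc2utf8(first_name))

print(is_plain_id)
print(code_points)
print(raw_bytes)
```

On a real data frame, the same calls could be applied to `names(x3)[[1]]` directly.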
How can I get rid of these unwanted characters in each case? PS: Please add more methods of reading CSV files, the problems they run into, and the solutions.
Answer 1:
For read.csv in base R use:
x1 = read.csv("file1.csv",stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
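A small self-contained check of this option, as a sketch: it writes a throwaway CSV that starts with the UTF-8 BOM bytes and reads it back with `fileEncoding = "UTF-8-BOM"`. The temp file and its two-column contents are made up for illustration and are not the file1.csv from the question.

```r
# UTF-8 BOM as raw bytes
bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Write a tiny illustrative CSV that begins with the BOM
tmp <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("ID,title\n1,a\n2,b\n")), tmp)

# "UTF-8-BOM" tells the connection to drop a leading BOM if one is present
with_bom_handling <- read.csv(tmp, stringsAsFactors = FALSE,
                              fileEncoding = "UTF-8-BOM")

print(names(with_bom_handling))
```

Without the `fileEncoding` argument, what ends up in the first column name depends on the locale, which is why the garbled prefix can look different from machine to machine.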
For fread, use:
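library(data.table)
x2 = fread("file1.csv")
setnames(x2, 1, "ID") # rename the first column by position, since the BOM prefix cannot be typed reliably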
For read_csv, use:
x3 = readr::read_csv("file1.csv")
setDT(x3) # convert to a data.table (data.table package) so that setnames can be used
setnames(x3, "\uFEFFID", "ID")
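If you would rather not convert to a data.table, roughly the same rename can be done in base R with a regular expression. This is only a sketch: `strip_bom_names` is a hypothetical helper, and it assumes the prefix is the single character U+FEFF, as in the read_csv case. It does not cover the `ï..` form produced by read.csv.

```r
# Hypothetical helper: drop a leading U+FEFF from every column name
strip_bom_names <- function(df) {
  names(df) <- sub("^\uFEFF", "", names(df))
  df
}

# Illustrative data frame whose first name carries the hidden character
x <- data.frame(a = 1:2, b = c("p", "q"), stringsAsFactors = FALSE)
names(x)[1] <- "\uFEFFID"

x <- strip_bom_names(x)
print(names(x)[[1]] == "ID")
```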
One non-R solution is to open the file in Notepad++, change the encoding to "Encoding in UTF-8 without BOM", and save the file.
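For completeness, the same "save without BOM" step can be approximated from R by copying the file and skipping the first three bytes when they match the UTF-8 BOM. This is a rough sketch: `remove_bom_from_file` is a hypothetical helper, the paths are temp files made up for the demo, and it reads the whole file into memory, so it is only suitable for files that fit in RAM.

```r
# UTF-8 BOM as raw bytes
bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Hypothetical helper: copy path_in to path_out, dropping a leading UTF-8 BOM
remove_bom_from_file <- function(path_in, path_out) {
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, path_out)
  invisible(path_out)
}

# Demo on a throwaway file that starts with a BOM
bom_path <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("ID,title\n1,a\n")), bom_path)

clean_path <- tempfile(fileext = ".csv")
remove_bom_from_file(bom_path, clean_path)

# First three bytes of the cleaned file are now "ID," rather than the BOM
print(readBin(clean_path, what = "raw", n = 3))
```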
Source: https://stackoverflow.com/questions/39593637/dealing-with-byte-order-mark-bom-in-r