I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513.
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then you could use the regular expression "([0-9]+),([0-9]+)" (the double quotes are part of the pattern) and replace it with \1\2 to get:
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
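The same replacement can also be sketched in R itself, without leaving the session. This is only an illustration: the `lines` vector below is a hypothetical in-memory copy of the example file, and the pattern (like the Notepad++ one) only handles a single comma per quoted number, i.e. values below one million.

```r
# Hypothetical in-memory copy of the example file shown above
lines <- c('"1,234","123","1,234"',
           '"234","123","1,234"',
           '123,456,789')

# Same pattern as in Notepad++: the double quotes are part of the match,
# so only quoted values containing one comma are rewritten
cleaned <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", lines)

# Parse the cleaned text directly instead of writing a temporary file
x <- read.csv(text = cleaned, header = FALSE)
str(x)
```

The unquoted third row is left alone because its commas are real field separators.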
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
You can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
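Putting those pieces together, a minimal end-to-end sketch might look like the following. It reuses the `lines` string from above; the use of `read.csv(text = ...)` with `strip.white = TRUE` is my own assumption about how one would finish the job, not something stated in the answer. Note that this pattern would also merge adjacent unquoted numeric fields such as `123,456`, so it only suits files where genuine fields are separated by more than a bare comma between digits.

```r
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

# Split the text into one element per line via a text connection
con <- textConnection(lines)
raw <- readLines(con)
close(con)

# Drop only the commas that sit directly between two digits
cleaned <- gsub("([0-9]+),([0-9])", "\\1\\2", raw)

# Assumption: parse the cleaned lines as CSV, trimming stray whitespace
x <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
str(x)
```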
It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by read.csv2 (automagically) or by read.table (by setting the 'dec' parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
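As a rough sketch of the colClasses idea: read.csv converts a column to an unknown class name through `methods::as()`, so one can register a coercion from character to a custom class. The class name `num.with.commas`, the column names, and the sample data below are all hypothetical, chosen just for illustration.

```r
library(methods)

# Hypothetical class name; it exists only so that colClasses can refer to it
setClass("num.with.commas")

# Coercion used by read.csv for columns declared as "num.with.commas"
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Hypothetical sample data: values with thousand separators must be quoted
csv_text <- 'name,amount\n"a","1,234"\n"b","12,345,678"'

df <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas"))
str(df)
```

The advantage of this design is that the conversion happens during the read, so no second pass over the data frame is needed.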
Using the read_delim function, which is part of the readr library, you can specify an additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*The semicolon in the second line means that read_delim will read semicolon-separated values.
This reads numbers that use a comma as the decimal mark as proper numbers.
Regards
Mateusz Kania
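For the thousands-separator case in the original question, readr's locale also has a `grouping_mark` setting, and `parse_number()` drops grouping marks while parsing. The snippet below is a small sketch on hypothetical character vectors, assuming the readr package is installed; it is not part of the answer above.

```r
library(readr)

# Default locale: "." is the decimal mark and "," the grouping mark
us_style <- parse_number(c("1,513", "20,000", "100"))

# European style: "," is the decimal mark and "." the grouping mark
eu_style <- parse_number("1.513,5",
                         locale = locale(decimal_mark = ",",
                                         grouping_mark = "."))
```

The same parser can be applied to a column while reading a file by declaring it with `col_number()` in `col_types`.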
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
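If the column positions are not known in advance, one possible variation is to detect the columns to convert from their contents. This is only a sketch: the helper `looks_numeric`, the sample data, and the regular expression are my own assumptions, and missing values are not handled.

```r
# Hypothetical sample data standing in for file.csv
csv_text <- 'name,qty,price\n"a","1,200","20,000"\n"b","100","12,111"'
x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Hypothetical helper: TRUE when every value is a plain number or a
# number with well-formed comma grouping (optionally with decimals)
looks_numeric <- function(v) {
  all(grepl("^-?([0-9]+|[0-9]{1,3}(,[0-9]{3})+)(\\.[0-9]+)?$", v))
}

# Pick the columns to convert by content rather than by position
col2cvt <- names(x)[vapply(x, looks_numeric, logical(1))]
x[col2cvt] <- lapply(x[col2cvt], function(v) as.numeric(gsub(",", "", v)))
str(x)
```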