Question
Writing a data frame that mixes small integer entries (values below 1000) with "large" ones (1000 or more) to a CSV file with write_csv() produces a mix of scientific and non-scientific notation. If the first 1000 rows hold only small values and a large value appears later, read_csv() seems to get confused by this mix and returns NA for the scientific-notation entries:
test_write_read <- function(small_value,
                            n_fills,
                            position,
                            large_value) {
  tib <- tibble(a = rep(small_value, n_fills))
  tib$a[position] <- large_value
  write_csv(tib, "tib.csv")
  tib <- read_csv("tib.csv")
}
The following lines do not cause any problem:
tib <- test_write_read(small_value = 1,
                       n_fills = 1001,
                       position = 1000,  # position <= 1000
                       large_value = 1000)
tib <- test_write_read(1, 1001, 1001, 999)
tib <- test_write_read(1000, 1001, 1000, 1)
However, the following lines do:
tib <- test_write_read(small_value = 1,
                       n_fills = 1001,
                       position = 1001,  # position > 1000
                       large_value = 1000)
tib <- test_write_read(1, 1002, 1001, 1000)
tib <- test_write_read(999, 1001, 1001, 1000)
A typical output:
problems(tib)
## A tibble: 1 x 5
# row col expected actual file
# <int> <chr> <chr> <chr> <chr>
#1 1001 a no trailing characters e3 'tib.csv'
tib %>% tail(n = 3)
## A tibble: 3 x 1
# a
# <int>
#1 999
#2 999
#3 NA
The CSV file:
$ tail -n3 tib.csv
#999
#999
#1e3
I am running:
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
with tidyverse_1.2.1 (loading readr_1.1.1)
Is that a bug that should be reported?
Answer 1:
Adding the two answers, both correct, plus the rationale, as a Community Wiki.
read_csv has an argument guess_max, which by default will be set to 1000. So read_csv only reads the first 1000 records before trying to figure out how each column should be parsed. Increasing guess_max to be larger than the total number of rows should fix the problem. – Marius 4 hours ago
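A minimal sketch of that fix (the guess_max value of 2000 is arbitrary; anything larger than the row count works):
# Guess column types from the first 2000 rows, past the 1e3 entry at row 1001
tib <- read_csv("tib.csv", guess_max = 2000)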
You could also specify col_types = ..., as double or character. – CPak 3 hours ago
Using @CPak's suggestion will make your code more reproducible and your analyses more predictable in the long run. That's a primary reason read_csv() spits out a message about the colspec upon reading (so you can copy it and use it). Copy it, modify it, and tell it to use a different type.
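For instance (a sketch; the column name a and the file name come from the question):
# Parse column a explicitly instead of relying on type guessing
tib <- read_csv("tib.csv", col_types = cols(a = col_double()))
# or keep the raw text and convert it yourself later:
tib <- read_csv("tib.csv", col_types = cols(a = col_character()))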
Answer 2:
I just installed the dev version of readr with devtools::install_github("tidyverse/readr"), so now I have readr_1.2.0, and the NA problem went away. But column "a" is now "guessed" by read_csv() as dbl (whether or not there is a large integer in it), whereas it was correctly read as int before, so if I need it as int I still have to do an as.integer() conversion. At least now it does not crash my code.
tib <- test_write_read(1, 1002, 1001, 1000)
tib %>% tail(n = 3)
## A tibble: 3 x 1
# a
# <dbl>
#1 1.00
#2 1000
#3 1.00
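The extra step mentioned above is a one-line base-R coercion (a sketch, nothing readr-specific):
# read_csv() now guesses dbl, so coerce back to int explicitly
tib$a <- as.integer(tib$a)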
The large value is still written as 1e3 by write_csv(), though, so in my opinion this is not quite a final solution.
$ tail -n3 tib.csv
#1
#1e3
#1
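A possible write-side workaround (my own sketch, not from the answers above): keep the column as integer before writing, since to my knowledge write_csv() only applies the 1e3-style formatting to doubles, never to integer columns:
library(readr)
library(tibble)

tib <- tibble(a = rep(1L, 1002))  # the 1L literal keeps the column integer
tib$a[1001] <- 1000L
write_csv(tib, "tib.csv")         # the file now contains 1000, not 1e3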
Source: https://stackoverflow.com/questions/48218646/write-csv-read-csv-with-scientific-notation-after-1000th-row