Question
readr::read_csv is misreading some column types in a file I am loading, so I want to use cols to set them manually.
In ?read_csv, it says the col_types argument should be "One of ‘NULL’, a ‘cols()’ specification, or a string. See ‘vignette("column-types")’ for more details". However, vignette("column-types") gives "vignette("column-types") not found", so I tried ?cols. It says it accepts "column objects created by ‘col_*()’ or their abbreviated character names".
What are the acceptable functions or abbreviated character names, and where do I find that information? I am using readr 1.1.1, by the way.
Answer 1:
There are col_double, col_integer, col_character, col_date, col_factor, etc.
library(readr)
mtcars <- read_csv(readr_example("mtcars.csv"), col_types =
  cols(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double(),
    hp = col_integer(),
    drat = col_double(),
    vs = col_integer(),
    wt = col_double(),
    qsec = col_double(),
    am = col_integer(),
    gear = col_integer(),
    carb = col_integer()
  )
)
mtcars
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.
mtcars_select <- read_csv(readr_example("mtcars.csv"),
col_types = cols_only(mpg = 'd', cyl = 'i', hp = 'i',
qsec = 'd', gear = 'i'),
na = c("NA", "N/A", "-9999", "-999"))
mtcars_select
#> # A tibble: 32 x 5
#> mpg cyl hp qsec gear
#> <dbl> <int> <int> <dbl> <int>
#> 1 21 6 110 16.5 4
#> 2 21 6 110 17.0 4
#> 3 22.8 4 93 18.6 4
#> 4 21.4 6 110 19.4 3
#> 5 18.7 8 175 17.0 3
#> 6 18.1 6 105 20.2 3
#> 7 14.3 8 245 15.8 3
#> 8 24.4 4 62 20 4
#> 9 22.8 4 95 22.9 4
#> 10 19.2 6 123 18.3 4
#> # ... with 22 more rows
Or, even shorter, with a single string that has one character per column:
mtcars <- read_csv(readr_example("mtcars.csv"), col_types = "di_i__d__i_")
mtcars
#> # A tibble: 32 x 5
#> mpg cyl hp qsec gear
#> <dbl> <int> <int> <dbl> <int>
#> 1 21 6 110 16.5 4
#> 2 21 6 110 17.0 4
#> 3 22.8 4 93 18.6 4
#> 4 21.4 6 110 19.4 3
#> 5 18.7 8 175 17.0 3
#> 6 18.1 6 105 20.2 3
#> 7 14.3 8 245 15.8 3
#> 8 24.4 4 62 20 4
#> 9 22.8 4 95 22.9 4
#> 10 19.2 6 123 18.3 4
#> # ... with 22 more rows
Ref:
https://cran.r-project.org/web/packages/readr/vignettes/readr.html
https://www.rdocumentation.org/packages/readr/versions/1.1.1/topics/cols
Answer 2:
I also think this is not obviously documented. You can read the source code of col_types.R from readr, which tells you the abbreviations:
"_" = ,
"-" = col_skip(),
"?" = col_guess(),
c = col_character(),
D = col_date(),
d = col_double(),
i = col_integer(),
l = col_logical(),
n = col_number(),
T = col_datetime(),
t = col_time()
The way to set column types is to pass named arguments to cols(), one per column:
col_types = cols(column_1 = col_integer(), column_2 = col_character())
or, if you are using col_names, just pass a vector of column types of the same length, in the same order.
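As a minimal sketch of how the abbreviations map onto the col_*() functions, the following writes a small temporary file (the file and its column names a and b are made up for illustration) and reads it twice: once with a compact string and once with an explicit cols() specification. Both calls should give the same column types.

```r
library(readr)

# Hypothetical two-column file, created only for this illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,x", "2,y"), tmp)

# Compact string: one character per column, in file order.
by_string <- read_csv(tmp, col_types = "ic")

# Equivalent explicit specification using the col_*() functions.
by_cols <- read_csv(tmp, col_types = cols(a = col_integer(), b = col_character()))

# Compare the resulting column classes.
string_classes <- vapply(by_string, function(x) class(x)[1], character(1))
cols_classes <- vapply(by_cols, function(x) class(x)[1], character(1))
print(string_classes)
print(cols_classes)

unlink(tmp)
```

The compact string is convenient when every column is listed in order; the named form is easier to read and does not depend on column position.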
If the reason for overriding the defaults is that read_csv is guessing the type wrong, you may be able to overcome this by using spec_csv and allowing more rows to be used in guessing the types (by default it uses 1,000). For example:
x <- spec_csv(filename, guess_max = 2000)
read_csv(filename, col_types = x)
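To make the effect of guess_max concrete, here is a small sketch under stated assumptions: the file is a made-up temporary CSV whose column y is empty for the first 1,200 rows and holds text only after that, so a guess based on the early rows alone could miss the text. Letting the guesser see every row should produce a character column.

```r
library(readr)

# Hypothetical file: column y is empty for the first 1200 rows and
# contains text only afterwards.
tmp <- tempfile(fileext = ".csv")
ids <- 1:1500
y <- ifelse(ids <= 1200, "", "abc")
writeLines(c("id,y", paste(ids, y, sep = ",")), tmp)

# Let the guesser look at up to 2000 rows, which covers the whole file.
spec <- spec_csv(tmp, guess_max = 2000)
df <- read_csv(tmp, col_types = spec)

# Inspect the type chosen for y.
print(class(df$y))

unlink(tmp)
```

Raising guess_max costs some extra time on large files, so setting the problem columns explicitly with cols() may still be the better choice when you already know the types.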
Answer 3:
This may not be a complete list of the available col_*() suffixes, but it's close:
_logical
_integer
_double
_number
_character
_datetime
_date
_time
_factor
From the column-types vignette:
If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.
df3 <- read_csv( readr_example("challenge.csv"), col_types = cols( x = col_double(), y = col_date(format = "") ) )
The article focuses on the different type parsers, which are enumerated by section (Atomic vectors, Dates/times, etc.). For every parse_*() function there is an equivalent col_*() function:
Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble.
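The pairing can be illustrated with a short sketch: the same strings are parsed once as a plain vector with parse_integer() and once as a column with col_integer(). The temporary file and its column name n are assumptions made up for this example.

```r
library(readr)

raw <- c("1", "2", "3")

# Parse a character vector directly.
parsed <- parse_integer(raw)

# Parse the same values as a column of a (hypothetical) one-column file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("n", raw), tmp)
df <- read_csv(tmp, col_types = cols(n = col_integer()))

# Both routes should give an integer vector.
print(class(parsed))
print(class(df$n))

unlink(tmp)
```

Trying a parse_*() function on a few sample values is a quick way to check a type or format before committing it to a cols() specification.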
Source: https://stackoverflow.com/questions/50651898/what-are-permissible-column-objects-of-the-form-col-used-in-readr