What are permissible column objects of the form “col_*()” used in readr?

Submitted by 强颜欢笑 on 2019-12-02 01:08:04

Question


readr::read_csv is misreading some column types in a file I am loading so I want to use cols to set them manually.

In ?read_csv, the col_types argument is described as "One of ‘NULL’, a ‘cols()’ specification, or a string. See ‘vignette("column-types")’ for more details". However, vignette("column-types") fails with "vignette("column-types") not found", so I tried ?cols. It says it accepts "column objects created by ‘col_*()’ or their abbreviated character names".

What are the acceptable functions or abbreviated character names and where do I find that information? readr 1.1.1 btw.


Answer 1:


The available functions include col_double(), col_integer(), col_character(), col_date(), and col_factor(), among others. For example:

library(readr)

mtcars <- read_csv(readr_example("mtcars.csv"), col_types = 
                     cols(
                       mpg = col_double(),
                       cyl = col_integer(),
                       disp = col_double(),
                       hp = col_integer(),
                       drat = col_double(),
                       vs = col_integer(),
                       wt = col_double(),
                       qsec = col_double(),
                       am = col_integer(),
                       gear = col_integer(),
                       carb = col_integer()
                     )
)
mtcars

#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ... with 22 more rows

Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.

mtcars_select <- read_csv(readr_example("mtcars.csv"), 
                          col_types = cols_only(mpg = 'd', cyl = 'i', hp = 'i', 
                                                qsec = 'd', gear = 'i'),
                          na = c("NA", "N/A", "-9999", "-999"))
mtcars_select

#> # A tibble: 32 x 5
#>      mpg   cyl    hp  qsec  gear
#>    <dbl> <int> <int> <dbl> <int>
#>  1  21       6   110  16.5     4
#>  2  21       6   110  17.0     4
#>  3  22.8     4    93  18.6     4
#>  4  21.4     6   110  19.4     3
#>  5  18.7     8   175  17.0     3
#>  6  18.1     6   105  20.2     3
#>  7  14.3     8   245  15.8     3
#>  8  24.4     4    62  20       4
#>  9  22.8     4    95  22.9     4
#> 10  19.2     6   123  18.3     4
#> # ... with 22 more rows

Or, even shorter, pass the compact string directly as col_types (one character per column, in file order):

mtcars <- read_csv(readr_example("mtcars.csv"), col_types = "di_i__d__i_")
mtcars

#> # A tibble: 32 x 5
#>      mpg   cyl    hp  qsec  gear
#>    <dbl> <int> <int> <dbl> <int>
#>  1  21       6   110  16.5     4
#>  2  21       6   110  17.0     4
#>  3  22.8     4    93  18.6     4
#>  4  21.4     6   110  19.4     3
#>  5  18.7     8   175  17.0     3
#>  6  18.1     6   105  20.2     3
#>  7  14.3     8   245  15.8     3
#>  8  24.4     4    62  20       4
#>  9  22.8     4    95  22.9     4
#> 10  19.2     6   123  18.3     4
#> # ... with 22 more rows

Ref:

https://cran.r-project.org/web/packages/readr/vignettes/readr.html
https://www.rdocumentation.org/packages/readr/versions/1.1.1/topics/cols




Answer 2:


I also think this is not obviously documented. You can read the source code of col_types.R from readr which tells you the abbreviations:

"_" = ,
"-" = col_skip(),
"?" = col_guess(),
c = col_character(),
D = col_date(),
d = col_double(),
i = col_integer(),
l = col_logical(),
n = col_number(),
T = col_datetime(),
t = col_time()
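
As a rough check of the mapping above, the sketch below reads the same small file twice, once with the abbreviations and once with the full col_*() objects, and compares the resulting column classes. The temporary file and its column names (id, name, score) are made up for illustration.

```r
library(readr)

# Hypothetical three-column file written to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a,2.5", "2,b,3.5"), tmp)

# Abbreviated form: i = integer, c = character, d = double.
by_abbrev <- read_csv(tmp, col_types = "icd")

# Equivalent full form using the col_*() constructors.
by_object <- read_csv(
  tmp,
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    score = col_double()
  )
)

# Both calls should produce the same column classes.
identical(sapply(by_abbrev, class), sapply(by_object, class))
```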

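The way to set column types is to pass named column specifications to cols(), one per column you want to control:

col_types = cols(column_1 = col_integer(), column2 = col_character())

or, if you would rather match by position (for example, when you are supplying col_names), pass unnamed specifications, one for each column and in the same order.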

If the reason for overriding the defaults is that read_csv is guessing the type wrong, then you may be able to fix this by using spec_csv and allowing more rows to be used in guessing the types (by default it uses 1,000). For example:

# Build the column specification from the first 2000 rows, then reuse it
x <- spec_csv(filename, guess_max = 2000)
read_csv(filename, col_types = x)
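
The guess_max idea can be tried on a small synthetic file. This is only a sketch: the file contents and the column name value are invented, and it passes guess_max straight to read_csv rather than going through spec_csv. Because guess_max is larger than the number of rows, the single text value at the end is seen during guessing and the column should come back as character.

```r
library(readr)

# Hypothetical file: 3000 numeric rows followed by one text value.
tmp <- tempfile(fileext = ".csv")
writeLines(c("value", as.character(seq_len(3000)), "abc"), tmp)

# guess_max exceeds the row count, so every row informs the type guess.
wide_guess <- read_csv(tmp, guess_max = 5000)

# Inspect the column specification that read_csv settled on.
spec(wide_guess)
```

If the guess is still wrong for a real file, setting the column explicitly with cols() remains the more predictable option.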



Answer 3:


This may not be a complete list of the available col_*() suffixes, but it's close:

_logical
_integer
_double
_number
_character
_datetime
_date
_time
_factor
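
To compare this list against whatever readr version is installed, one option is to list the exported names that start with col_. This is a minimal sketch using base R's ls() on the attached package; the variable name col_funs is arbitrary.

```r
library(readr)

# Names exported by the attached readr package that begin with "col_".
# The pattern excludes cols(), cols_only(), and similar "cols_" helpers.
col_funs <- ls("package:readr", pattern = "^col_")
col_funs
```

The exact output depends on the installed version, so it is not reproduced here.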

From the column-types vignette:

If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.

df3 <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date(format = "")
  )
)

The vignette focuses on the different type parsers, which are enumerated by section (atomic vectors, dates/times, etc.). For every parse_*() function there is an equivalent col_*() function:

Each parse_() is coupled with a col_() function, which will be used in the process of parsing a complete tibble.
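
The pairing can be illustrated with dates: parse_date() works on a character vector, while col_date() carries the same format argument into a read_csv() call. The sketch below uses a made-up one-column file and an assumed day/month/year format.

```r
library(readr)

# Vector parser: converts a character vector directly.
parsed <- parse_date("01/02/2019", format = "%d/%m/%Y")

# Column counterpart: the same format, applied while reading a file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("when", "01/02/2019"), tmp)
df <- read_csv(tmp, col_types = cols(when = col_date(format = "%d/%m/%Y")))

# Compare the value read from the file with the directly parsed value.
df$when == parsed
```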



Source: https://stackoverflow.com/questions/50651898/what-are-permissible-column-objects-of-the-form-col-used-in-readr
