How to skip reading certain columns in readr [duplicate]

Submitted by 微笑、不失礼 on 2020-05-14 14:39:47

Question


I have a simple csv file called "test.csv" with the following content:

colA,colB,colC
1,"x",12
2,"y",34
3,"z",56

Let's say I want to skip reading in colA and just read in colB and colC. I want a general way to do this because I have lots of files to read in and sometimes colA is called something else altogether but colB and colC are always the same.

According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep:

read_csv('test.csv', col_types = list(colB = col_character(), colC = col_numeric()))

By not mentioning colA it should get dropped from the output. However, the resulting data frame is:

Source: local data frame [3 x 3]

      colA colB colC
    1    1    x   12
    2    2    y   34
    3    3    z   56

Am I doing something wrong or is the read_csv documentation not correct? According to the help file:

If a list, it must contain one "collector" for each column. If you only want to read a subset of the columns, you can use a named list (where the names give the column names). If a column is not mentioned by name, it will not be included in the output.


Answer 1:


There is an answer out there; I just didn't search hard enough: https://github.com/hadley/readr/issues/132

Apparently this was a documentation issue that has been corrected. This functionality may eventually get added but Hadley thought it was more useful to be able to just update one column type and not drop the others.

Update: The functionality has since been added.

The following code is from the readr documentation:

read_csv("iris.csv", col_types = cols_only(Species = col_factor(c("setosa", "versicolor", "virginica"))))

This reads only the Species column of the iris data set. To read only specific columns, you must also pass a column specification for each column you keep, e.g. col_factor(), col_double(), etc.
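Applied to the test.csv from the question, the same idea might look like the sketch below. It assumes a readr version that provides cols_only(); the temporary file and the variable names path and kept are only there to make the example self-contained, and col_double() is used for colC as an assumption about the desired type.

```r
library(readr)

# Recreate the question's test.csv in a temporary location (illustrative only).
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# cols_only() reads just the named columns; colA is not mentioned, so it is skipped.
kept <- read_csv(path, col_types = cols_only(
  colB = col_character(),
  colC = col_double()
))

names(kept)
```

Because the skipped column is never named, this should keep working when colA is called something else in other files, as long as colB and colC keep their names.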




Answer 2:


"According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep"

WRONG: read_csv('test.csv', col_types=list(colB='c', colC='c'))

No, the doc is misleading. You have to either specify that unnamed columns get dropped (class '_' or col_skip()), or else explicitly specify their class as NULL:

read_csv('test.csv', col_types=list('*'='_', colB='c', colC='c'))

read_csv('test.csv', col_types=list('colA'='_', colB='c', colC='c'))
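For comparison, here is a minimal sketch of two skip forms that readr documents: col_skip() inside cols() for a column skipped by name, and the compact col_types string, where "_" skips a column by position. The temporary file and the names path, by_name and by_position are illustrative, and typing colC as integer is an assumption.

```r
library(readr)

# Recreate the question's test.csv in a temporary location (illustrative only).
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# Skip one column by name; the remaining columns are typed explicitly.
by_name <- read_csv(path, col_types = cols(
  colA = col_skip(),
  colB = col_character(),
  colC = col_integer()
))

# Skip by position with the compact string:
# "_" skips the first column, "c" is character, "i" is integer.
by_position <- read_csv(path, col_types = "_ci")
```

The positional form does not depend on what the first column is called, which may suit the case where colA has a different name from file to file; it does assume every file has the same three columns in the same order.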


Source: https://stackoverflow.com/questions/31150351/how-to-skip-reading-certain-columns-in-readr
