Pandas read_csv dtype leading zeros

再見小時候 2020-11-29 10:45

So I'm reading in a station codes CSV file from NOAA, which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT",


        
5 answers
  • 2020-11-29 11:02

    It looks like you have to specify the length of the string if you don't want it to be an object.
    For example:

    dtype={'USAF': '|S6'}
    

    I can't find the reference for this, but I seem to recall Wes discussing this very issue (perhaps in a talk). He suggested that NumPy doesn't allow "proper" variable-length strings (see this question/answer), and that sizing the array to the longest string is usually very space inefficient: even a short string takes up as much space as the longest one.

    As @Wes points out, this is also a case where:

    dtype={'USAF': object}
    

    works just as well.
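
A minimal sketch of the dtype={'USAF': object} approach. The sample rows are made up for illustration (they are not real NOAA records), and the file is simulated with io.StringIO:

```python
import io

import pandas as pd

# Hypothetical sample in the spirit of the station file; not real NOAA data.
csv_text = '''\
"USAF","WBAN","STATION NAME"
"006852","99999","SAMPLE A"
"007005","99999","SAMPLE B"
'''

# Only USAF is forced to object; WBAN is left to pandas' type inference.
stations = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": object})

usaf_values = stations["USAF"].tolist()
wban_values = stations["WBAN"].tolist()
print(usaf_values)  # leading zeros kept, values are Python strings
print(wban_values)  # inferred as integers
```

Only the columns named in the dtype mapping are affected, so the rest of the file still gets the usual type inference.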

  • 2020-11-29 11:03

    This problem caused me all sorts of headaches when parsing a file with serial numbers. For unknown reasons, 00794 and 000794 are two distinct serial numbers. I eventually came up with:

    converters = {'serial_number': str}
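
As a rough illustration of why the converter matters here, the sketch below reads the same two serial numbers with and without it. The column name part and the sample rows are assumptions made up for the example:

```python
import io

import pandas as pd

# Hypothetical two-row file; the "part" column is invented for illustration.
csv_text = "serial_number,part\n00794,widget\n000794,gadget\n"

# Default parsing: both serials are read as the integer 794.
collapsed = pd.read_csv(io.StringIO(csv_text))["serial_number"].tolist()

# With a str converter, the raw text of each field is kept.
converters = {"serial_number": str}
parts = pd.read_csv(io.StringIO(csv_text), converters=converters)
serials = parts["serial_number"].tolist()

print(collapsed)
print(serials)
```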
    
  • This is an issue of pandas dtype guessing.

    pandas sees only digits in the column and infers a numeric dtype, which drops the leading zeros.

    To stop pandas from guessing, set the dtype you want explicitly: object

    pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})
    

    That will do the trick.
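
To see the guessing in action, here is a small sketch that reads the same text twice and compares the resulting dtypes. The column names code and city are placeholders invented for the example:

```python
import io

import pandas as pd

# Made-up sample data with a zero-padded column.
csv_text = "code,city\n007,Springfield\n042,Shelbyville\n"

# No dtype given: pandas infers an integer column.
guessed = pd.read_csv(io.StringIO(csv_text))

# dtype given: the column stays as text.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"code": object})

print(guessed["code"].dtype, guessed["code"].tolist())
print(explicit["code"].dtype, explicit["code"].tolist())
```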

  • 2020-11-29 11:20

    You can pass a dictionary of functions to converters where the keys are numeric column indices. So, if you don't know what your column names will be, you can do this (provided you have fewer than 100 columns).

    pd.read_csv('some_file.csv', converters={i: str for i in range(100)})
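
A small sketch of the index-keyed converters idea. To keep the example conservative, the range here matches the three columns of a made-up sample exactly rather than over-provisioning to 100:

```python
import io

import pandas as pd

# Made-up sample; the column names are never referenced by the converters.
csv_text = "a,b,c\n001,02,x\n010,20,y\n"

n_columns = 3  # assumption: matches the sample above
by_position = {i: str for i in range(n_columns)}

df = pd.read_csv(io.StringIO(csv_text), converters=by_position)

first_col = df["a"].tolist()
second_col = df["b"].tolist()
print(first_col, second_col)
```

Because the keys are positions, the same mapping works regardless of what the header row calls the columns.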

  • 2020-11-29 11:20

    With pandas 1.0 or later, how about:

    pd.read_csv(..., dtype={"my_confusing_col": "string"})
    

    Note that this uses the string column dtype, which represents missing values as pd.NA. All leading zeros are, of course, preserved.
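
A brief sketch of the "string" dtype behaviour, assuming pandas 1.0 or later. The column names and rows are invented, and the second row deliberately leaves the code field empty:

```python
import io

import pandas as pd

# Made-up sample; the second row has a missing code.
csv_text = "code,label\n007,first\n,second\n042,third\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"code": "string"})

first_code = df["code"].iloc[0]
missing_code = df["code"].iloc[1]
missing_mask = df["code"].isna().tolist()

print(first_code)    # zero-padded text is kept
print(missing_code)  # the empty field is a missing value
print(missing_mask)
```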
