Question
I'm reading a fixed-width file full of missing data, so pandas.read_fwf comes in handy. There is an empty line after the header, so I'm passing skip_blank_lines=True, but this appears to have no effect: the first row is still full of NaN/NaT:
import io
import pandas
s="""USAF WBAN STATION NAME CTRY ST CALL LAT LON ELEV(M) BEGIN END
007018 99999 WXPOD 7018 +00.000 +000.000 +7018.0 20110309 20130730
007026 99999 WXPOD 7026 AF +00.000 +000.000 +7026.0 20120713 20170822
007070 99999 WXPOD 7070 AF +00.000 +000.000 +7070.0 20140923 20150926
008260 99999 WXPOD8270 +00.000 +000.000 +0000.0 20050101 20100920
008268 99999 WXPOD8278 AF +32.950 +065.567 +1156.7 20100519 20120323
008307 99999 WXPOD 8318 AF +00.000 +000.000 +8318.0 20100421 20100421
008411 99999 XM20 20160217 20160217
008414 99999 XM18 20160216 20160217
008415 99999 XM21 20160217 20160217
008418 99999 XM24 20160217 20160217
010000 99999 BOGUS NORWAY NO ENRS 20010927 20041019
010010 99999 JAN MAYEN(NOR-NAVY) NO ENJA +70.933 -008.667 +0009.0 19310101 20200111
010013 99999 ROST NO 19861120 19880105
010014 99999 SORSTOKKEN NO ENSO +59.792 +005.341 +0048.8 19861120 20200110
"""
print(pandas.read_fwf(io.StringIO(s), parse_dates=["BEGIN", "END"],
skip_blank_lines=True))
Which results in:
USAF WBAN STATION NAME ... ELEV(M) BEGIN END
0 NaN NaN NaN ... NaN NaT NaT
1 7018.0 99999.0 WXPOD 7018 ... 7018.0 2011-03-09 2013-07-30
2 7026.0 99999.0 WXPOD 7026 ... 7026.0 2012-07-13 2017-08-22
3 7070.0 99999.0 WXPOD 7070 ... 7070.0 2014-09-23 2015-09-26
4 8260.0 99999.0 WXPOD8270 ... 0.0 2005-01-01 2010-09-20
5 8268.0 99999.0 WXPOD8278 ... 1156.7 2010-05-19 2012-03-23
6 8307.0 99999.0 WXPOD 8318 ... 8318.0 2010-04-21 2010-04-21
7 8411.0 99999.0 XM20 ... NaN 2016-02-17 2016-02-17
8 8414.0 99999.0 XM18 ... NaN 2016-02-16 2016-02-17
9 8415.0 99999.0 XM21 ... NaN 2016-02-17 2016-02-17
10 8418.0 99999.0 XM24 ... NaN 2016-02-17 2016-02-17
11 10000.0 99999.0 BOGUS NORWAY ... NaN 2001-09-27 2004-10-19
12 10010.0 99999.0 JAN MAYEN(NOR-NAVY) ... 9.0 1931-01-01 2020-01-11
13 10013.0 99999.0 ROST ... NaN 1986-11-20 1988-01-05
14 10014.0 99999.0 SORSTOKKEN ... 48.8 1986-11-20 2020-01-10
[15 rows x 11 columns]
Row 0 still has NaN/NaT in every column. I was expecting row 0 to be the first non-empty data row, starting with 007018. Why does skip_blank_lines=True appear to have no effect? How can I tell pandas to skip the blank line? Am I doing something wrong?
Answer 1:
One detail missing from your code is the widths parameter.
But that is not all. Unfortunately, read_fwf appears to have a bug that makes it ignore the skip_blank_lines parameter.
To work around it, define the following class, whose readline method skips empty lines:
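class LineFilter(io.TextIOBase):
    """File-like wrapper that hides blank lines from the reader."""
    def __init__(self, iterable):
        self.iterable = iterable
    def readline(self):
        # Return the next non-blank line unchanged, so column offsets stay intact.
        for line in self.iterable:
            if line.strip():
                return line
        return ""  # an empty string signals end of file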
Then run:
df = pandas.read_fwf(LineFilter(io.StringIO(s)), widths=[7, 6, 30, 8, 6, 8, 9, 8, 9, 9],
                     parse_dates=["BEGIN", "END"], na_filter=False)
As you can see, I added na_filter=False to block conversion of empty strings to NaN values.
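A simpler variation on the same idea, as a minimal sketch: if the input fits in memory, the blank lines can be filtered out before read_fwf ever sees them, with no custom class. The sample text, column names and widths below are made up for illustration and are not the station file from the question.

```python
import io

import pandas as pd

# Hypothetical three-column fixed-width sample (widths 6, 10, 5) with a
# blank line after the header, mimicking the layout in the question.
raw = (
    f"{'ID':<6}{'NAME':<10}{'VALUE':<5}\n"
    "\n"
    f"{'001':<6}{'alpha':<10}{'1.5':<5}\n"
    f"{'002':<6}{'beta':<10}{'2.5':<5}\n"
)

# Keep only lines that contain something other than whitespace.
kept = [line for line in raw.splitlines(keepends=True) if line.strip()]
cleaned = io.StringIO("".join(kept))

df = pd.read_fwf(cleaned, widths=[6, 10, 5])
print(df)
```

Because the lines are passed through unchanged, the column offsets given in widths still line up with the data.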
Answer 2:
If there is a column that is sure to have a value in every real data row, you can drop the rows where that column is missing, which removes the row produced by the blank line.
Try the following:
df.dropna(subset=['WBAN'], how='all', inplace=True)
print(df.head())
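The snippet above assumes df has already been read. A self-contained sketch of the same approach follows, again using a made-up three-column sample rather than the station file from the question. It drops any row in which every column is missing, so it does not depend on picking one particular column.

```python
import io

import pandas as pd

# Hypothetical fixed-width sample (widths 6, 10, 5) with a blank line
# after the header.
raw = (
    f"{'ID':<6}{'NAME':<10}{'VALUE':<5}\n"
    "\n"
    f"{'001':<6}{'alpha':<10}{'1.5':<5}\n"
    f"{'002':<6}{'beta':<10}{'2.5':<5}\n"
)

df = pd.read_fwf(io.StringIO(raw), widths=[6, 10, 5])

# Drop rows in which every column is missing, then renumber the index
# so that the first real data row becomes row 0.
df = df.dropna(how="all").reset_index(drop=True)
print(df)
```

Note that this cleans up after the fact: as in the output shown in the question, integer columns that picked up a NaN from the blank row may already have been read as floats.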
Source: https://stackoverflow.com/questions/59757478/why-is-pandas-read-fwf-not-skipping-the-blank-line-as-instructed