pandas extractall() is not extracting all cases given a regex?

Question

I have a nested list of strings from which I would like to extract the dates. The date format is:

Two digits (from 01 to 12), a hyphen, three letters (a valid month abbreviation), a hyphen, and two digits, for example: 08-Jan—07 or 03-Oct—01

I tried to use the following regex:

r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}'

Then I tested it as follows:

import pandas as pd
df = pd.DataFrame({'blobs':['6-Feb- 1 4 Facebook’s virtual-reality division created a 3-EBÚ7 11 network of 500 free demo stations in Best Buy stores to give people a taste of VR using the Oculus Rift 90 GT 48 headset. But according to a Wednesday report from Business Insider, about 200 of the demo stations will close after low interest from consumers. 17-Feb-2014',
                         'I think in a store environment getting people to sit down and go through that experience of getting a headset on and getting set up is quite a difficult thing to achieve,” said Geoff Blaber, a CCS Insight analyst. 29—Oct-2012 Blaber 32 FAX 2978 expects that it will get easier when companies can convince  18-Oct-12 credit cards. '
                            ]})
df

Then:

df['blobs'].str.extractall(r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}')

However, this does not work. The regex above does not return the dates, only the hyphens (-):

    Col
0   NaN
1    -
2    -
3   NaN
4   NaN
5    -
...
n    -

How can I fix it in order to get the following?

           Col
0 6-Feb-14, 17-Feb-2014
1 29—Oct-2012, 18-Oct-12

UPDATE

I also tried the following:

import re
df['col'] = df.blobs.apply(lambda x: re.findall(r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}',x))
s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
s.name = "col"
df = df.drop('col')
df

However, this raised an error:

ValueError                                Traceback (most recent call last)
<ipython-input-4-5e9a34bd159f> in <module>()
      3 s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
      4 s.name = "col"
----> 5 df = df.drop('col')
      6 df

/usr/local/lib/python3.5/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
   1905                 new_axis = axis.drop(labels, level=level, errors=errors)
   1906             else:
-> 1907                 new_axis = axis.drop(labels, errors=errors)
   1908             dropped = self.reindex(**{axis_name: new_axis})
   1909             try:

/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in drop(self, labels, errors)
   3260             if errors != 'ignore':
   3261                 raise ValueError('labels %s not contained in axis' %
-> 3262                                  labels[mask])
   3263             indexer = indexer[~mask]
   3264         return self.delete(indexer)

ValueError: labels ['col'] not contained in axis

Answer 1:

When you use Series.str.extract or Series.str.extractall, the captured substrings are returned, not the whole matches. In your pattern the only capturing group is (—|-), which is why you get only the separators back. To get the dates, make sure you capture (i.e. add ( and ) around) the part of the pattern you need to grab.
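To illustrate the capturing-group behaviour, here is a minimal sketch. The sample strings are shortened, hypothetical stand-ins for the question's blobs, and the names s, months, sep_only and full are introduced only for this example.

```python
import pandas as pd

# Hypothetical sample data, shortened from the question's blobs.
s = pd.Series(["Report dated 17-Feb-2014", "Sent 29—Oct-2012 and 18-Oct-12"])

months = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# Only the first separator sits inside a capturing group,
# so extractall returns just that separator for every match.
sep_only = s.str.extractall(r"\d{2}(—|-)" + months + r"-\d{2,4}")

# Wrapping the whole pattern in one capturing group returns the full date.
full = s.str.extractall(r"(\d{2}[—-]" + months + r"-\d{2,4})")

print(sep_only[0].tolist())  # the separators only
print(full[0].tolist())      # the complete dates
```

Both results are DataFrames with one row per match and a column named 0 for the single unnamed group.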

Because each row contains several expected matches, extractall is awkward here: it returns one row per match rather than one row per input string. You may use Series.str.findall instead, which returns the whole matches when no capturing group is defined in the pattern.

Use

rx = r'\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b'
df['Col'] = df['blobs'].str.findall(rx).apply(','.join)

The .apply(','.join) call converts each list of matches to a comma-separated string in the Col column.
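As a rough end-to-end check, the approach can be run on a small DataFrame. The rows below are simplified, hypothetical versions of the question's data: the garbled "6-Feb- 1 4" from the original blob is written here as "6-Feb-14", since the pattern cannot match a year that is split by spaces.

```python
import pandas as pd

# Simplified, hypothetical rows modelled on the question's data.
df = pd.DataFrame({"blobs": [
    "6-Feb-14 demo stations will close. 17-Feb-2014",
    "said an analyst. 29—Oct-2012 more text 18-Oct-12 credit cards.",
    "no dates in this row",
]})

rx = (r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
      r"[-–—](?:\d{4}|\d{2})\b")

# findall yields one list of whole matches per row (no capturing groups).
found = df["blobs"].str.findall(rx)

# Join each list into a single comma-separated string.
df["Col"] = found.apply(",".join)

print(df["Col"].tolist())
```

A row without any match produces an empty list, which joins to an empty string rather than NaN.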

The pattern means:

\b - a word boundary
\d{1,2} - 1 or 2 digits
[-–—] - a hyphen, en dash, or em dash
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) - any of the 12 abbreviated month names
[-–—] - a hyphen, en dash, or em dash
(?:\d{4}|\d{2}) - 4 or 2 digits
\b - a word boundary
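The effect of the word boundaries and the 4-or-2-digit year can be checked with the standard re module. This is only a sketch: the test strings and the helper name has_date are made up for illustration.

```python
import re

rx = re.compile(
    r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[-–—](?:\d{4}|\d{2})\b"
)

def has_date(text):
    """Return True if the pattern matches anywhere in text."""
    return rx.search(text) is not None

# Strings that should match.
assert has_date("17-Feb-2014")
assert has_date("6-Feb-14")       # one-digit day, two-digit year
assert has_date("29—Oct-2012")    # em dash as the first separator
assert has_date("18–Oct–12")      # en dashes as separators

# Strings that should not match.
assert not has_date("17-Feb-201")    # three-digit year: \b fails after "20"
assert not has_date("117-Feb-2014")  # three-digit day: no \b inside "117"
assert not has_date("17-Feb-20145")  # five-digit year
assert not has_date("3-EBÚ7")        # not a month abbreviation
```

Without the trailing \b, the three-digit and five-digit years would be partially matched, which is why the boundaries are worth keeping.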

Source: https://stackoverflow.com/questions/42254384/pandas-extractall-is-not-extracting-all-cases-given-a-regex

Tags

python

regex

python-3.x

pandas