I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
Data looks like:
      time  result
1    09:00    +52A
I often use list comprehensions for these kinds of tasks because they tend to be faster. There can be big differences in performance between the various ways of modifying every element of a Series within a DataFrame, and a list comprehension is often the fastest. See the code race below for this task:
import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
Try this using a regular expression:
import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))
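For completeness, the same pattern can also be applied in a vectorized way with Series.str.replace. This is just a sketch; the explicit regex=True argument assumes a reasonably recent pandas version (older versions treat the pattern as a regex by default):
import pandas as pd
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
# single vectorized regex substitution over the whole column
data['result'] = data['result'].str.replace('[-+A-Za-z]', '', regex=True)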
There's a bug here: currently you cannot pass arguments to str.lstrip and str.rstrip: http://github.com/pydata/pandas/issues/2411
EDIT: 2012-12-07 this works now on the dev branch:
In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]:
1     52
2     62
3     44
4     30
5    110
Name: result
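If the cleaned values are ultimately meant to be numeric, a natural follow-up (a sketch, assuming every stripped string contains only digits) is to cast the column after stripping:
# strip the signs and trailing letters, then cast the plain digit strings to int
df['result'] = df['result'].str.lstrip('+-').str.rstrip('aAbBcC').astype(int)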