Question
I have a function find_country_from_connection_ip
which takes an ip, and after some processing returns a country. Like below:
def find_country_from_connection_ip(ip):
    # Do some processing
    return country
I am using the function inside the apply method, like below:
df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)
As it is pretty straightforward, what I want is to evaluate a new column from an existing column in the DataFrame, which has >400000 rows.
It runs, but it is terribly slow and throws a warning like below:
...........: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [38]:
I understand the problem, but can't quite figure out how to use loc with apply and lambda.
N.B. Please suggest a more efficient alternative solution if you have one that achieves the same end result.
**** EDIT ********
The function is mainly a lookup on an mmdb database, like below:
import re
import subprocess

def find_country_from_ip(ip):
    result = subprocess.Popen("mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "), stdout=subprocess.PIPE).stdout.read()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    else:
        final_output = subprocess.Popen("mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip).split(" "), stdout=subprocess.PIPE).stdout.read()
        return re.search(r'\"(.+?)\"', final_output).group(1)
This is nevertheless a costly operation, and when you have a DataFrame with >400000 rows, it is bound to take time. But how much? That is the question. It takes about 2 hours, which I think is far too long.
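Independent of the lookup backend, much of the cost comes from resolving the same IP repeatedly when it appears on many rows. A minimal sketch of memoizing the lookup with functools.lru_cache; the lookup function here is a hypothetical stand-in that just counts calls, not the real mmdblookup invocation:

```python
import functools

CALLS = {"n": 0}  # counts how often the underlying lookup actually runs

def find_country_from_ip(ip):
    # Stand-in for the expensive subprocess/database lookup.
    CALLS["n"] += 1
    return "Thailand" if ip.startswith("1.179") else "Unknown"

@functools.lru_cache(maxsize=None)
def cached_lookup(ip):
    return find_country_from_ip(ip)

ips = ["1.179.136.40", "1.179.136.40", "8.8.8.8", "1.179.136.40"]
countries = [cached_lookup(ip) for ip in ips]
# 4 rows, but only 2 underlying lookups (one per distinct IP)
```

With 400K rows but far fewer distinct IPs, caching alone can cut the number of expensive lookups dramatically.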
Answer 1:
I would use the maxminddb-geolite2 (GeoLite) module for that.

First, install the maxminddb-geolite2 module:
pip install maxminddb-geolite2
Python Code:
import numpy as np
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*\-\s*',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of them, with the IPs as both index and values
unique_ips = pd.Series(unique_ips, index=unique_ips)

# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()
Output:
In [90]: x
Out[90]:
ip country
0 000.000.000.000 NaN
1 001.002.004.000 NaN
2 001.002.008.000 NaN
3 001.009.096.105 NaN
4 001.009.102.251 NaN
5 001.009.106.186 NaN
6 001.016.000.000 NaN
7 001.055.241.140 NaN
8 001.093.021.147 NaN
9 001.179.136.040 NaN
10 001.179.138.224 Thailand
11 001.179.140.200 Thailand
12 001.179.146.052 NaN
13 001.179.147.002 Thailand
14 001.179.153.216 Thailand
15 001.179.164.124 Thailand
16 001.179.167.188 Thailand
17 001.186.188.000 NaN
18 001.202.096.052 NaN
19 001.204.179.141 China
20 002.051.000.165 NaN
21 002.056.000.000 NaN
22 002.095.041.202 NaN
23 002.135.237.106 Kazakhstan
24 002.135.237.250 Kazakhstan
... ... ...
Timing: for 171,884 unique IPs:
In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop
In [86]: unique_ips.shape
Out[86]: (171884,)
Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:
In [93]: 400000/171884*15
Out[93]: 34.90726303786274
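The unique-then-map trick used above is worth isolating: look up each distinct IP once, then broadcast the result back to all rows. A small self-contained sketch, with a toy lookup table standing in for the GeoLite reader (the IPs and countries here are made up for illustration):

```python
import pandas as pd

# Toy lookup standing in for get_country; the real one queries the GeoLite DB.
def get_country(ip):
    return {'1.1.1.1': 'Australia', '8.8.8.8': 'United States'}.get(ip)

x = pd.DataFrame({'ip': ['1.1.1.1', '8.8.8.8', '1.1.1.1', '8.8.8.8']})

# Series indexed by the unique IPs; apply runs get_country once per unique IP,
# and map broadcasts the result to every row of the original column.
unique_ips = pd.Series(x['ip'].unique(), index=x['ip'].unique())
x['country'] = x['ip'].map(unique_ips.apply(get_country))
```

The expensive function runs len(unique) times instead of len(df) times, which is where the speedup comes from when IPs repeat.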
Answer 2:
Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another DataFrame.
Let's explore this a bit
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df
def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
No Problems
Let's make some problems
# This should make a copy
print(bool(df.is_copy))
df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))
False
A IP
0 x 1
1 y 2
2 z 3
True
Perfect, now we have a copy. Let's perform the same assignment with apply:
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy if __name__ == '__main__':
How do you fix it? Wherever you created df, you can use df.loc. In my example above, the slice df = df[['A', 'IP']] triggered the copy. If I had used loc instead, I'd have avoided this mess.
print(bool(df.is_copy))
df = df.loc[:]
print(df)
print(bool(df.is_copy))
False
A IP
0 x 1
1 y 2
2 z 3
False
You need to either find where df is created and use loc or iloc instead when you slice the source DataFrame. Or, you can simply do this...
df.is_copy = None
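Note that is_copy was deprecated and later removed in newer pandas versions, so on a recent pandas a more future-proof fix is to take an explicit .copy() when slicing. A minimal sketch of that alternative:

```python
import pandas as pd

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

# An explicit copy breaks the link to the original frame, so the
# later column assignment raises no SettingWithCopyWarning.
df = df[['A', 'IP']].copy()
df['Country'] = df.IP.apply(lambda ip: {1: 'A', 2: 'B', 3: 'C'}[ip])
```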
The full demonstration
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df = df[:]
df.is_copy = None
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
Answer 3:
IIUC you can use your custom function with Series.apply this way:
df['Country'] = df['IP'].apply(find_country_from_ip)
Sample:
df = pd.DataFrame({'IP': [1, 2, 3],
                   'B': [4, 5, 6]})

def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country

df['Country'] = df['IP'].apply(find_country_from_ip)
print(df)
B IP Country
0 4 1 6
1 5 2 7
2 6 3 8
Answer 4:
First and foremost, @MaxU's answer is the way to go: efficient and ideal for parallel application on a vectorized pd.Series/DataFrame.

Below is a comparison of the performance of two popular libraries for returning location data given IP address info. TL;DR: use the geolite2 method.
1. The geolite2 package from the maxminddb-geolite2 library
Input
# !pip install maxminddb-geolite2
import time
import numpy as np
from geolite2 import geolite2

geo = geolite2.reader()
df_1 = train_data.loc[:50, ['IP_Address']]

def IP_info_1(ip):
    try:
        x = geo.get(ip)
    except ValueError:  # faulty IP value
        return np.nan
    try:
        return x['country']['names']['en'] if x is not None else np.nan
    except KeyError:    # faulty key value
        return np.nan

s_time = time.time()
# map IP --> country
# apply(fn) applies fn on all pd.Series elements
df_1['country'] = df_1.loc[:, 'IP_Address'].apply(IP_info_1)
print(df_1.head(), '\n')
print('Time:', str(time.time() - s_time) + 's \n')

print(type(geo.get('48.151.136.76')))
Output
IP_Address country
0 48.151.136.76 United States
1 94.9.145.169 United Kingdom
2 58.94.157.121 Japan
3 193.187.41.186 Austria
4 125.96.20.172 China
Time: 0.09906983375549316s
<class 'dict'>
2. The DbIpCity package from the ip2geotools library
Input
# !pip install ip2geotools
import time
import numpy as np
from ip2geotools.databases.noncommercial import DbIpCity

s_time = time.time()
df_2 = train_data.loc[:50, ['IP_Address']]

def IP_info_2(ip):
    try:
        return DbIpCity.get(ip, api_key='free').country
    except:
        return np.nan

df_2['country'] = df_2.loc[:, 'IP_Address'].apply(IP_info_2)
print(df_2.head())
print('Time:', str(time.time() - s_time) + 's')

print(type(DbIpCity.get('48.151.136.76', api_key='free')))
Output
IP_Address country
0 48.151.136.76 US
1 94.9.145.169 GB
2 58.94.157.121 JP
3 193.187.41.186 AT
4 125.96.20.172 CN
Time: 80.53318452835083s
<class 'ip2geotools.models.IpLocation'>
A reason for the huge time difference could be the data structure of the output: direct subsetting of dictionaries seems far more efficient than indexing into the specialized ip2geotools.models.IpLocation object.

Also, the output of the 1st method is a dictionary containing geolocation data; subset it accordingly to obtain the needed info:
x = geolite2.reader().get('48.151.136.76')
print(x)
>>>
{'city': {'geoname_id': 5101798, 'names': {'de': 'Newark', 'en': 'Newark', 'es': 'Newark', 'fr': 'Newark', 'ja': 'ニューアーク', 'pt-BR': 'Newark', 'ru': 'Ньюарк'}},
'continent': {'code': 'NA', 'geoname_id': 6255149, 'names': {'de': 'Nordamerika', 'en': 'North America', 'es': 'Norteamérica', 'fr': 'Amérique du Nord', 'ja': '北アメリカ', 'pt-BR': 'América do Norte', 'ru': 'Северная Америка', 'zh-CN': '北美洲'}},
'country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}},
'location': {'accuracy_radius': 1000, 'latitude': 40.7355, 'longitude': -74.1741, 'metro_code': 501, 'time_zone': 'America/New_York'},
'postal': {'code': '07102'},
'registered_country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}},
'subdivisions': [{'geoname_id': 5101760, 'iso_code': 'NJ', 'names': {'en': 'New Jersey', 'es': 'Nueva Jersey', 'fr': 'New Jersey', 'ja': 'ニュージャージー州', 'pt-BR': 'Nova Jérsia', 'ru': 'Нью-Джерси', 'zh-CN': '新泽西州'}}]}
Source: https://stackoverflow.com/questions/40211314/pandas-fastest-way-to-resolve-ip-to-country