问题
I have a set of about 200,000 IP Addresses and 10,000 subnets of the form(1.1.1.1/24). For every IP Address I need to check whether it belongs to one of these subnets, but since it is a such a large dataset and I have less computational power, I would like an efficient implementation for this.
On searching, one method I found was this (https://stackoverflow.com/a/820124/7995937):
from netaddr import IPNetwork, IPAddress
if IPAddress("192.168.0.1") in IPNetwork("192.168.0.0/24"):
print "Yay!"
But since I have to loop this over 200,000 IP Addresses, and for each address loop over 10,000 subnets, I am unsure if this is efficient. My first doubt, is checking "IPAddress() in IPNetwork()" just a linear scan or is it optimized in some way?
The other solution I came up with was to make a list with all the IPs contained in the IP Subnets(which comes to about 13,000,000 IPs without duplicates), and then sorting it. If I do this, then in my loop over the 200,000 IP Addresses I only need to do a binary search for each IP, over a larger set of IP Addresses.
for ipMasked in ipsubnets: # Here ipsubnets is the list of all subnets
setUnmaskedIPs = [str(ip) for ip in IPNetwork(ipMasked)]
ip_list = ip_list + setUnmaskedIPs
ip_list = list(set(ip_list)) # To eliminate duplicates
ip_list.sort()
I could then just perform binary search in the following manner:
for ip in myIPList: # myIPList is the list of 200,000 IPs
if bin_search(ip,ip_list):
print('The ip is present')
Is this method more efficient than the other one? Or is there any other more efficient way to perform this task?
回答1:
Okay, So sorting takes O(nlogn), In case of 13,000,000 you end up doing O(13000000log(13000000)). Then You are iterating over 200000 IP and doing binary search O(logn) on that sorted list on 13000000. I sincerely doubt that's best solution. I suggest you use map
from netaddr import IPNetwork, IPAddress
l_ip_address = map(IPAddress, list_of_ip_address)
l_ip_subnet = map(IPNetwork, list_of_subnets)
if any(x in y for x in l_ip_address for y in l_ip_subnet):
print "FOUND"
回答2:
This is probably not the best possible solution, but I'd suggest using a set rather than a list. Sets are optimized for checking if any given value is present in the set, so you're replacing your binary search with a single operation. Instead of:
ip_list = list(set(ip_list))
just do:
ip_set = set(ip_list)
and then the other part of your code becomes:
for ip in myIPList: # myIPList is the list of 200,000 IPs
if ip in ip_set:
print('The ip is present')
Edit: and to make things a bit more memory-efficient you can skip creating an intermediate list as well:
ip_set = set()
for ipMasked in ipsubnets:
ip_set.update([str(ip) for ip in IPNetwork(ipMasked)])
回答3:
Your IP address in in a subnet if N leading bits of that address match N leading bits of one of the N-bit subnets. So, start by making a list of empty sets. Encode each subnet as a 32-bit integer with the trailing bits masked out. For example, 1.2.3.4/23 equals (0x01020304 & 0xfffffe00) equals 0x01020200. Add this number to the 23rd set in the list, ie subnets[23]
. Continue for all the subnets.
To see if an IP address is in your subnets, encode the IP address in the same way as a 32-bit number ipaddr
and then (something like, untested code)
for N in range( 32, 0, -1)
mask = ( 0xffffffff >> (32-N) ) << (32-N)
if (ipaddr & mask) in subnets[N] :
# have found ipaddr in one of our subnets
break # or do whatever...
else
# have not found ipaddr
Looking up a number in a set at worst O(log N) where N in the number of elements in the set. This code does it at most 32 times for the worst case of an ip address that is not in the sets of subnets. If the majority of the addresses are expected to be present, there's an optimisation to test the sets with the most elemnnts first. That might be
for N in ( 24, 16, 8, 29, 23, 28, 27, 26, 25, 22, 15, 21 ... )
or you could calculate the optimal sequence at runtime.
来源:https://stackoverflow.com/questions/44262437/how-to-efficiently-check-if-a-given-ip-address-belong-to-an-ip-subnetwork-in-pyt