Question
Objective: Write Python 2.7 code to extract IPv4 addresses from a string.
String content example:
The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).
As you can see from the above, I am struggling to find a way to parse through a txt file that may contain IPs depicted in multiple forms of "censorship" (to prevent hyper-linking).
I'm thinking that a regular expression is the way to go. Maybe something along the lines of: any grouping of four ints 0-255 or 000-255 separated by anything in a 'separators list', which would consist of periods, brackets, parentheses, or any of the other aforementioned examples. This way, the 'separators list' could be updated as needed.
I'm not sure whether this is the right approach, or even possible, so any help with this is greatly appreciated.
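The 'separators list' idea can be sketched roughly as follows. This is only an illustration, not code from the thread: `SEPARATORS`, `separator_regex`, and `find_ips` are hypothetical names, and the single optional space allowed around each separator is an assumption based on the sample text. It is written to run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical, editable list of separators (an assumption, not from the post).
SEPARATORS = [".", "[.]", "(.)", "[dot]", "(dot)"]

# One octet, 0-255, with optional leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"


def separator_regex(separators):
    """Return a regex fragment matching any of the listed separators."""
    # re.escape makes brackets and dots literal; longest entries are tried first.
    ordered = sorted(separators, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # Allow one optional space on either side ("192 .168" or "192. 168").
    return r" ?(?:%s) ?" % alternatives


def find_ips(text, separators=SEPARATORS):
    """Find obfuscated IPv4 addresses and return them in plain dotted form."""
    sep = separator_regex(separators)
    ip_pattern = re.compile(r"%s(?:%s%s){3}" % (OCTET, sep, OCTET))
    # Replace every matched separator with a plain dot.
    return [re.sub(sep, ".", m.group(0)) for m in ip_pattern.finditer(text)]
```

Adding a new obfuscation style then only means appending a string to `SEPARATORS`. Like any pattern without boundary checks, this sketch can still return part of an invalid address such as 192.168.0.256.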
Update: Thanks to recursive's answer below, I now have the following code working for the above example. It will...
- find the IPs
- place them into a list
- clean them of the spaces/braces/etc
- and replace the uncleaned list entry with the cleaned one.
Caveat: The code below does not account for invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it drops the trailing 6 and 3 from those examples. If the first octet is invalid (e.g. 256.10.10.10), it drops the leading 2 (resulting in 56.10.10.10).
import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    for item in ips:
        location = ips.index(item)
        ip = re.sub("[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        ips.remove(item)
        ips.insert(location, ip)
    return ips

myFile = open('***INSERT FILE PATH HERE***')
fileContent = myFile.read()
IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
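One way to soften the caveat about invalid addresses is to add lookarounds, so that a candidate is rejected when it touches extra digits or a further ".digit" group. The following is a minimal sketch under that assumption, not part of the original code: `STRICT` and `extract_strict` are hypothetical names, and the guard only looks at plain digits and dots next to the match, so an extra octet joined by a censored separator (for example 192.168.1.2[.]3) would still slip through.

```python
import re

# Same octet and separator rules as the code above, with non-capturing groups.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

STRICT = re.compile(
    r"(?<![\d.])"                            # not preceded by a digit or a dot
    + OCTET + r"(?:" + SEP + OCTET + r"){3}"
    + r"(?!\d|\.\d)"                         # not followed by a digit or by ".digit"
)


def extract_strict(text):
    """Return cleaned IPs, skipping candidates that are part of a longer number."""
    cleaned = []
    for match in STRICT.finditer(text):
        ip = re.sub(r"[ ()\[\]]", "", match.group(0))  # drop spaces and brackets
        cleaned.append(ip.replace("dot", "."))
    return cleaned
```

With this guard, 192.168.0.256, 256.10.10.10, and 192.168.1.2.3 are skipped entirely instead of being truncated, while a sentence-ending period after an address is still tolerated.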
Answer 1:
Here is a regex that works:
import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print ips
# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']
The regex has a few main parts, which I will explain here:
- ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
  This matches the numerical parts of the IP address. | means "or". The first case handles numbers from 0 to 199 with or without leading zeroes. The second two cases handle numbers over 199.
- [ (\[]?(\.|dot)[ )\]]?
  This matches the "dot" parts. There are three sub-components:
  - [ (\[]? The "prefix" for the dot. Either a space, an open paren, or an open square bracket. The trailing ? means that this part is optional.
  - (\.|dot) Either "dot" or a period.
  - [ )\]]? The "suffix". Same logic as the prefix.
- {3} means repeat the previous component 3 times.
- The final element is another number, which is the same as the first, except it is not followed by a dot.
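To make the breakdown concrete, the same pattern can be assembled from named pieces. This is just a sketch: the constant names (`OCTET`, `PREFIX`, `DOT`, `SUFFIX`, `SEPARATOR`, `PATTERN`) and the helper `find_raw` are illustrative and do not appear in the answer, but the resulting string is character-for-character the regex shown above.

```python
import re

OCTET = r"([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])"  # one number, 0-255
PREFIX = r"[ (\[]?"                                 # optional space, "(" or "["
DOT = r"(\.|dot)"                                   # a period or the word "dot"
SUFFIX = r"[ )\]]?"                                 # optional space, ")" or "]"

SEPARATOR = PREFIX + DOT + SUFFIX

# Three "number + separator" groups, then a final number; the outer
# parentheses capture the whole address as group 1.
PATTERN = "(" + "(" + OCTET + SEPARATOR + "){3}" + OCTET + ")"


def find_raw(text):
    """Return the matched addresses exactly as written in the text."""
    return [groups[0] for groups in re.findall(PATTERN, text)]
```

Because `re.findall` returns one tuple per match when the pattern has groups, the first element of each tuple is the full (still censored) address.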
Answer 2:
Description
This regex will match each of the four octets of what looks like an IP address. Each octet is placed into its own capture group for collection.
(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])
Given the following sample text, this regex will match all 10 embedded IP strings in their entirety, including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5
The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).
The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
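The joining step described above might look like the sketch below. The names `PATTERN` and `normalized_ips` are hypothetical; the octet expression and the `\D{1,5}` separator are taken from the regex in this answer. It runs on Python 2.7 and Python 3.

```python
import re

# One capture group per octet, as in the regex above.
OCTET = r"(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])"

# Four octets separated by 1 to 5 non-digit characters.
PATTERN = re.compile(r"\D{1,5}".join([OCTET] * 4))


def normalized_ips(text):
    """Rebuild each match as a dotted string from its four capture groups."""
    return [".".join(match.groups()) for match in PATTERN.finditer(text)]
```

Leading zeros are kept as written. Since `\D{1,5}` accepts any short run of non-digits, this approach will also join unrelated numbers such as "1, 2, 3, 4" into an address, so it suits text where numbers mostly appear inside IPs.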
Answer 3:
The code below will...
- find IPs in strings even when censored (ex: 192.168.1[dot]20 or 10.10.10 .21)
- place them into a list
- clean them of the censorship (spaces/braces/parentheses)
- and replace the uncleaned list entry with the cleaned one.
Caveat: The code below does not account for invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it drops the trailing digit (6 and 3 in those examples). If the first octet is invalid (e.g. 256.10.10.10), it drops the leading digit (resulting in 56.10.10.10).
import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    for item in ips:
        location = ips.index(item)
        ip = re.sub("[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        ips.remove(item)
        ips.insert(location, ip)
    return ips

myFile = open('***INSERT FILE PATH HERE***')
fileContent = myFile.read()
IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
Answer 4:
Extract and Categorize IPv4 Addresses (Even When Censored)
Note: This is just an implementation of a class I wrote for extracting IPv4 Addresses. I will likely update my class with a method for this functionality in the future. You can find it on my GitHub page.
What I'm demonstrating below is the following:
- Cleaning up your string content example
- Bringing your string data into a list
- Using the ExtractIPs() class to parse and categorize IPv4 Addresses
This class returns a dictionary containing 4 lists:
- Valid IPv4 Addresses
- Public IPv4 Addresses
- Private IPv4 Addresses
- Invalid IPv4 Addresses
ExtractIPs class
#!/usr/bin/env python
"""Extract and Classify IP Addresses."""

import re  # Use Regular Expressions.

__program__ = "IPAddresses.py"
__author__ = "Johnny C. Wachter"
__copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
__license__ = "MIT"
__version__ = "0.0.1"
__maintainer__ = "Johnny C. Wachter"
__contact__ = "wachter.johnny@gmail.com"
__status__ = "Development"


class ExtractIPs(object):
    """Extract and Classify IP Addresses From Input Data."""

    def __init__(self, input_data):
        """Instantiate the Class."""
        self.input_data = input_data
        self.ipv4_results = {
            'valid_ips': [],  # Store all valid IP Addresses.
            'invalid_ips': [],  # Store all invalid IP Addresses.
            'private_ips': [],  # Store all Private IP Addresses.
            'public_ips': []  # Store all Public IP Addresses.
        }

    def extract_ipv4_like(self):
        """Extract IP-like strings from input data.

        :rtype : list
        """
        ipv4_like_list = []
        ip_like_pattern = re.compile(r'([0-9]{1,3}\.){3}([0-9]{1,3})')
        for entry in self.input_data:
            if re.match(ip_like_pattern, entry):
                if len(entry.split('.')) == 4:
                    ipv4_like_list.append(entry)
        return ipv4_like_list

    def validate_ipv4_like(self):
        """Validate that IP-like entries fall within the appropriate range."""
        if self.extract_ipv4_like():
            # We're gonna want to ignore the below two addresses.
            ignore_list = ['0.0.0.0', '255.255.255.255']
            # Separate the Valid from Invalid IP Addresses.
            for ipv4_like in self.extract_ipv4_like():
                # Split the 'IP' into parts so each part can be validated.
                parts = ipv4_like.split('.')
                # All part values should be between 0 and 255.
                if all(0 <= int(part) < 256 for part in parts):
                    if not ipv4_like in ignore_list:
                        self.ipv4_results['valid_ips'].append(ipv4_like)
                else:
                    self.ipv4_results['invalid_ips'].append(ipv4_like)
        else:
            pass

    def classify_ipv4_addresses(self):
        """Classify Valid IP Addresses."""
        if self.ipv4_results['valid_ips']:
            # Now we will classify the Valid IP Addresses.
            for valid_ip in self.ipv4_results['valid_ips']:
                private_ip_pattern = re.findall(
                    r"""
                    (^127\.0\.0\.1)|  # Loopback
                    (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range
                    # Matching the 172.16/12 Range takes several matches
                    (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|
                    (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range
                    # Match APIPA Range.
                    (^169\.254\.\d{1,3}\.\d{1,3})
                    # VERBOSE for a clean look of this RegEx.
                    """,
                    valid_ip, re.VERBOSE
                )
                if private_ip_pattern:
                    self.ipv4_results['private_ips'].append(valid_ip)
                else:
                    self.ipv4_results['public_ips'].append(valid_ip)
        else:
            pass

    def get_ipv4_results(self):
        """Extract and classify all valid and invalid IP-like strings.

        :returns : dict
        """
        self.extract_ipv4_like()
        self.validate_ipv4_like()
        self.classify_ipv4_addresses()
        return self.ipv4_results
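As a rough cross-check of the class's range and private/public logic, the same rules can be written with integer comparisons instead of regular expressions. This is a sketch only: `is_valid_ipv4` and `is_private_ipv4` are hypothetical helpers, not part of the ExtractIPs class, and the ranges simply mirror the ones listed in the class (loopback 127.0.0.1, 10/8, 172.16/12, 192.168/16, and the APIPA range 169.254/16).

```python
def is_valid_ipv4(ip):
    """True if ip has four all-digit parts, each between 0 and 255."""
    parts = ip.split(".")
    return len(parts) == 4 and all(
        part.isdigit() and 0 <= int(part) < 256 for part in parts
    )


def is_private_ipv4(ip):
    """True if a valid ip falls in one of the ranges the class treats as private."""
    first, second = [int(part) for part in ip.split(".")[:2]]
    return (
        ip == "127.0.0.1"                          # loopback (single address, as in the class)
        or first == 10                             # 10/8
        or (first == 172 and 16 <= second <= 31)   # 172.16/12
        or (first == 192 and second == 168)        # 192.168/16
        or (first == 169 and second == 254)        # APIPA
    )
```

`is_private_ipv4` assumes its argument already passed `is_valid_ipv4`. Values with leading zeros such as 101.099.098.000 are accepted because `int("099")` evaluates to 99.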
Example Extraction With Censorship
# input_string is assumed to hold the sample text from the question.
censored = re.compile(
    r"""
    \(\.\)|
    \(dot\)|
    \[\.\]|
    \[dot\]|
    ( \.)
    """,
    re.VERBOSE | re.IGNORECASE
)

data_list = input_string.split()  # Bring your input string to a list.
clean_list = []  # List to store the cleaned up input.

for entry in data_list:
    # Remove undesired leading and trailing characters.
    clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')
    clean_list.append(clean_entry)  # Add the entry to the clean list.

clean_unique_list = list(set(clean_list))  # Remove duplicates in list.

# Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
results = ExtractIPs(clean_list).get_ipv4_results()

for k, v in results.iteritems():
    # After all that work, make sure the results are nicely presented!
    print("\n%s: %s" % (k, v))
Results:
public_ips: ['8.8.8.8', '101.099.098.000']
valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']
invalid_ips: []
private_ips: ['192.168.1.1']
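The results above contain only the uncensored addresses, because the `censored` pattern in the example is compiled but never applied to the input. One possible way to use such a pattern is to normalise the text before splitting it. The sketch below is an assumption about how that could work, not the author's method: `CENSORED` and `decensor` are hypothetical names, and the whitespace rules are a guess based on the sample text.

```python
import re

# Hypothetical normalisation pattern, loosely based on `censored` above.
CENSORED = re.compile(
    r"""
    \s*(?:\(\.\)|\(dot\)|\[\.\]|\[dot\])\s*  # bracketed dot, optional spaces around it
    |\s+\.\s*(?=\d)                           # space(s) before a dot: "192 .168"
    |\.\s+(?=\d)                              # dot, space(s), then a digit: "192. 168"
    """,
    re.VERBOSE | re.IGNORECASE,
)


def decensor(text):
    """Replace every censored separator with a plain dot."""
    return CENSORED.sub(".", text)
```

The output of `decensor(input_string)` could then be split and cleaned as in the example. Note that the whitespace rules also rewrite ordinary prose such as "step 2. 3 items", so this fits best when the text is mostly a list of addresses.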
Source: https://stackoverflow.com/questions/17327912/python-parse-ipv4-addresses-from-string-even-when-censored