I'm trying to write a Python script to extract lines from a file. The file is a text dump of Python suds output.
I want to:
Several suggestions on your code:
Stripping all non-alphanumeric characters is unnecessary and a waste of time; there is no need whatsoever to build linelist. Are you aware you can simply use plain old string.find("ArrayOf_xsd_string") or re.search(...)?
As for your regex, note that _ counts as a word character, so it is matched by \w and not by \W; [\W_] therefore strips underscores too, including the ones in ArrayOf_xsd_string. The bigger problem is that the assignment inside the loop overwrites the line you just read:
    for line in f:
        line = re.compile('[\W_]+') # overwrites the line you just read??
        line.sub('', string.printable)
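For comparison, here is a minimal sketch of what that loop body was presumably meant to do: compile the pattern once under its own name, apply it to each line, and spot the marker with a plain substring test. The names NON_ALNUM, clean and has_marker are illustrative only, and the sample line is an assumed shape for a suds dump, not taken from the question.

```python
import re

# Compile once, under a name that does not shadow the loop variable.
NON_ALNUM = re.compile(r'[\W_]+')

def clean(line):
    """Return `line` with every run of non-alphanumeric characters removed."""
    return NON_ALNUM.sub('', line)

def has_marker(line, marker='ArrayOf_xsd_string'):
    """Plain substring test; no regex or cleaning is needed to spot the marker."""
    return marker in line

# Assumed example of a marker line in the dump.
sample = '   (ArrayOf_xsd_string){\n'
print(has_marker(sample))   # True
print(clean(sample))        # ArrayOfxsdstring
```

Note that cleaning the line first also removes the underscores from the marker, which is one more reason to test for the marker on the raw line.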
Here's my version, which reads the file directly, and also handles multiple matches:
    with open('data.txt', 'r') as f:
        theDict = {}
        found = -1
        for (lineno, line) in enumerate(f):
            if found < 0:
                if line.find('ArrayOf_xsd_string') >= 0:
                    found = lineno
                    entries = []
                continue
            # Grab following 6 lines...
            if 2 <= (lineno-found) <= 6+1:
                # strip the newline too, or the trailing quote and comma stay
                entry = line.strip(' "{}[]=:,\r\n')
                entries.append(entry)
            # then create a dict with the key from line 5
            if (lineno-found) == 6+1:
                key = entries.pop(4)
                theDict[key] = entries
                print('%s %s' % (key, ','.join(entries)))  # comma-separated, no quotes
                #break  # if you want to end on first match
                found = -1  # to process multiple matches
And the output is exactly what you wanted (that's what ','.join(entries) was for):
123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc
Let's have some fun with iterators!
    class SudsIterator(object):
        """extracts xsd strings from suds text file, and returns a
        (key, (value1, value2, ...)) tuple with key being the 5th field"""
        def __init__(self, filename):
            self.data_file = open(filename)
        def __enter__(self):             # __enter__ and __exit__ are there to support
            return self                  # `with SudsIterator(...) as blah` syntax
        def __exit__(self, exc_type, exc_val, exc_tb):
            self.data_file.close()
        def __iter__(self):
            return self
        def next(self):                  # in Python 3+ this should be __next__
            """looks for the next 'ArrayOf_xsd_string' item and returns it as a
            tuple fit for stuffing into a dict"""
            data = self.data_file
            for line in data:
                if 'ArrayOf_xsd_string' not in line:
                    continue
                ignore = next(data)
                val1 = next(data).strip()[1:-2]  # discard beginning whitespace,
                val2 = next(data).strip()[1:-2]  # quotes, and comma
                val3 = next(data).strip()[1:-2]
                val4 = next(data).strip()[1:-2]
                key = next(data).strip()[1:-2]
                val5 = next(data).strip()[1:-2]
                break
            else:
                self.data_file.close()   # make sure file gets closed
                raise StopIteration()    # no more records
            return key, (val1, val2, val3, val4, val5)
        __next__ = next                  # alias so the class also iterates on Python 3

    data = dict()
    for key, value in SudsIterator('data.txt'):
        data[key] = value

    print(data)
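The same idea can also be written as a generator function, which avoids the class boilerplate. This is only a sketch: iter_suds_records is a made-up name, and the record layout in SAMPLE (marker line, one line to skip, then six quoted fields with the key in fifth place) is an assumption based on the descriptions above, not something shown in the question.

```python
import io

def iter_suds_records(lines, marker='ArrayOf_xsd_string'):
    """Yield (key, values) for each record that follows a marker line.

    `lines` is any iterator of text lines, for example an open file.
    """
    lines = iter(lines)
    for line in lines:
        if marker not in line:
            continue
        next(lines, None)  # skip the line right after the marker
        # Six quoted fields follow; strip whitespace, quotes and commas.
        fields = [next(lines, '').strip().strip('",') for _ in range(6)]
        key = fields.pop(4)  # the 5th field is the key
        yield key, tuple(fields)

# Assumed layout of one record in the dump (hypothetical sample data).
SAMPLE = io.StringIO(
    '(ArrayOf_xsd_string){\n'
    '   item[] = \n'
    '      "001",\n'
    '      "ABCD",\n'
    '      "1234",\n'
    '      "wordy type stuff",\n'
    '      "123456",\n'
    '      "more stuff, etc",\n'
    '}\n'
)

data = dict(iter_suds_records(SAMPLE))
print(data)
```

With a real file you would pass the open file object instead of SAMPLE, inside a with block so the file is closed when the loop ends.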
If you want to extract a specific number of lines after a specific line that matches, you may as well simply read the file into a list with readlines, loop through it to find the match, and then take the next N lines from the list. Alternatively, you could use a while loop with readline, which is preferable if the files can get large.
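A rough sketch of the readline approach, assuming you only need the lines after the first match. The function name lines_after_match and the demo data are made up for illustration; io.StringIO stands in for an open file.

```python
import io

def lines_after_match(f, marker, count):
    """Return the `count` lines that follow the first line containing `marker`.

    Reads with readline() so only one line is held in memory at a time.
    """
    while True:
        line = f.readline()
        if not line:          # empty string means end of file
            return []
        if marker in line:
            break
    result = []
    while len(result) < count:
        line = f.readline()
        if not line:          # file ended before `count` lines were read
            break
        result.append(line.rstrip('\n'))
    return result

# Hypothetical stand-in for an open file.
demo = io.StringIO('a\nArrayOf_xsd_string\nb\nc\nd\n')
print(lines_after_match(demo, 'ArrayOf_xsd_string', 2))  # ['b', 'c']
```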
The following is the most straightforward fix to your code I can think of, but it's not necessarily the best overall implementation. I suggest following my tips above unless you have good reasons not to, or you just want to get the job done ASAP by hook or by crook ;)
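    import re

    newlines = []
    for i in range(len(linelist)):
        # search the current line, not the literal marker string
        if re.search(r'ArrayOf_xsd_string', linelist[i]):
            # skip the line right after the match, then take the next 18
            for l in linelist[i+2:i+20]:
                newlines.append(l)
    print(newlines)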
This should do what you want, if I have interpreted your requirements properly. It says: skip the line immediately after the match, then take the following 18 lines (up to but not including the 20th line after the match) and append them to newlines one at a time. Passing the whole slice to a single append call would not work, because the slice would become a single element of newlines.
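A small illustration of that last point, with made-up data. It also shows list.extend, which adds the items of a list in one call and could replace the inner loop.

```python
chunk = ['b', 'c']

nested = []
nested.append(chunk)   # adds the list itself as one element
print(nested)          # [['b', 'c']]

flat = []
flat.extend(chunk)     # adds each item of the list
print(flat)            # ['b', 'c']
```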
Have fun and good luck :)