I have a huge data file in which a specific string is repeated after a fixed number of lines.
I want to count the jump between the first two 'Rank' occurrences.
I assume you want to find the number of lines in each block, where a block starts with a line that contains 'Rank'. For example, there are 3 blocks in your sample: the 1st has 4 lines, the 2nd has 4 lines, and the 3rd has 1 line:
from itertools import groupby

def block_start(line, start=[None]):
    # The mutable default keeps state between calls: the flag flips on every
    # 'Rank' line, so all lines of one block share the same key.
    if 'Rank' in line:
        start[0] = not start[0]
    return start[0]

with open(filename) as file:
    block_sizes = [sum(1 for line in block)  # find number of lines in a block
                   for _, block in groupby(file, key=block_start)]  # group
print(block_sizes)
# -> [4, 4, 1]
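The sample file is not reproduced here, so the following is only a minimal sketch on hypothetical in-memory data (the `sample` lines and the `block_sizes` helper are assumptions for illustration). It computes the same block sizes without a mutable default argument, which keeps its state between calls and would carry over if the key function were reused on a second file:

```python
def block_sizes(lines, marker='Rank'):
    """Return the number of lines in each block that starts with a marker line.

    Lines that appear before the first marker line are ignored.
    """
    sizes = []
    for line in lines:
        if marker in line:
            sizes.append(1)      # a marker line opens a new block
        elif sizes:
            sizes[-1] += 1       # any other line extends the current block
    return sizes

# Hypothetical data laid out like the description above: 4 + 4 + 1 lines.
sample = [
    'Rank 1', 'a', 'b', 'c',
    'Rank 2', 'd', 'e', 'f',
    'Rank 3',
]
print(block_sizes(sample))  # -> [4, 4, 1]
```

Because it accepts any iterable of lines, the same helper could be called with an open file object instead of the list.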
If all blocks have the same number of lines, or you just want to find the number of lines in the first block that starts with 'Rank':
count = None
with open(filename) as file:
    for line in file:
        if 'Rank' in line:
            if count is None:  # found the start of the 1st block
                count = 1
            else:  # found the start of the 2nd block
                break
        elif count is not None:  # inside the 1st block
            count += 1
print(count)  # -> 4
Counting the jump between the first two 'Rank' occurrences:
def find_jumps(filename):
    first = True
    count = 0
    with open(filename) as f:
        for line in f:
            if 'Rank' in line:
                if first:
                    count = 0
                    # set this to 1 if you want to include one of the 'Rank' lines.
                    first = False
                else:
                    return count
            else:
                count += 1
Don't use .readlines() when a simple generator expression counting the lines without Rank is enough:
count = sum(1 for l in open(filename) if 'Rank' not in l)
'Rank' not in l is enough to test if the string 'Rank' is not present in a string. Looping over the open file is looping over all the lines. The sum() function will add up all the 1s, which are generated for each line not containing Rank, giving you a count of lines without Rank in them.
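As a quick check of that counting idiom, here is a small sketch on hypothetical in-memory data; `io.StringIO` stands in for the open file, and the sample text is an assumption:

```python
import io

# Hypothetical stand-in for the data file: 3 `Rank` lines, 6 other lines.
fake_file = io.StringIO(
    'Rank 1\na\nb\nc\n'
    'Rank 2\nd\ne\nf\n'
    'Rank 3\n'
)

# Iterating over a file-like object yields one line at a time, so the
# whole file is never held in memory at once.
count = sum(1 for l in fake_file if 'Rank' not in l)
print(count)  # -> 6
```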
If you need to count the lines from Rank to Rank, you need a little itertools.takewhile magic:
import itertools

with open(filename) as f:
    # skip until we reach `Rank`; takewhile is lazy, so it has to be consumed
    for _ in itertools.takewhile(lambda l: 'Rank' not in l, f):
        pass
    # takewhile will have read a line with `Rank` now
    # count the lines *without* `Rank` between them
    count = sum(1 for l in itertools.takewhile(lambda l: 'Rank' not in l, f))
    count += 1  # we skipped at least one `Rank` line.
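Two properties of `itertools.takewhile` matter here: it reads nothing until it is iterated over, and when it stops it has already consumed the first line that failed the test. A minimal sketch of both points on hypothetical data (`io.StringIO` stands in for the file, and the helper name `lines_from_rank_to_rank` is an assumption):

```python
import io
import itertools

def lines_from_rank_to_rank(f, marker='Rank'):
    """Count lines from the first marker line up to, not including, the second.

    Assumes the input contains at least one marker line.
    """
    def no_marker(line):
        return marker not in line

    # Consume everything up to and including the first marker line.
    for _ in itertools.takewhile(no_marker, f):
        pass
    # Count the lines that follow, stopping at the next marker line or at EOF.
    between = sum(1 for _ in itertools.takewhile(no_marker, f))
    return between + 1  # include the first marker line itself

fake_file = io.StringIO('header\nRank 1\na\nb\nc\nRank 2\nd\n')
print(lines_from_rank_to_rank(fake_file))  # -> 4
```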
7 lines of code:
count = 0
for line in open("yourfile.txt"):
    if "Rank" in line:
        if count > 0: break  # second 'Rank' line: the first block is complete
        count = 1  # first 'Rank' line starts the count
    elif count > 0: count += 1
print(count)