Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem …
To read tokens from a file one by one, you could use the re
module to generate tokens from a memory-mapped file:
#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap
def generate_tokens(filename, pattern):
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)
# sum all integers in a file specified on the command line
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))
It works even if the file doesn't fit in memory.
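The matches above are bytes objects, since mmap exposes the file as a bytes-like buffer. If you want decoded string tokens instead, one possible variation (the br'\S+' pattern and the encoding parameter are my additions, not part of the answer above) is:

def generate_text_tokens(filename, encoding='utf-8'):
    # Reuse the mmap-based generator above with a "run of non-whitespace
    # bytes" pattern and decode each match to str.
    for m in generate_tokens(filename, br'\S+'):
        yield m.group().decode(encoding)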
You can read a file in chunks with file.read(size)
. I would not recommend reading one byte at a time, however, as this will drastically affect performance. The following snippet (Python 2; not much tested, use at your own risk) reads the file in chunks and yields numbers. You'll have to scan through the file first to determine each row's starting position, though (see the offset sketch after the example below).
def values_chunks(file_object, pos_from=0, chunk_size=32 * 1024):
    file_object.seek(pos_from)
    eol = False
    tail = ''
    while True:
        chunk = file_object.read(chunk_size)
        raw_data = tail + chunk
        parts = raw_data.split('\n', 1)  # split once to check for an eol
        if len(parts) > 1:
            eol = True
        raw_data = parts[0]  # in either case we only need the first element
        raw_values = raw_data.split()
        if not eol and chunk and raw_data and not raw_data[-1].isspace():
            # the chunk may end in the middle of a number;
            # carry the partial token over to the next read
            tail = raw_values[-1]
            raw_values = raw_values[:-1]
        else:
            tail = ''
        for value in raw_values:
            yield int(value)
        if not chunk or eol:  # eof/eol
            break
>>> with open('test', 'wb') as test:
... test.write(' '.join(map(str, range(10**5))))
... test.write('\n')
... test.write(' '.join(map(str, range(10**4))))
...
>>> values = list(values_chunks(open('test', 'rb')))
>>> len(values)
100000
>>> sum(values)
4999950000L
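As for finding each row's starting position: a single pre-pass that accumulates line lengths is one cheap way to collect pos_from offsets. This helper is a sketch of mine, not part of the answer above:

def line_offsets(path):
    # Return the byte offset at which each line of the file starts.
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:  # lines keep their trailing '\n', so lengths add up
            offsets.append(pos)
            pos += len(line)
    return offsets

With that, values_chunks(open('test', 'rb'), pos_from=line_offsets('test')[1]) would stream the numbers of the second row.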
Here is a generator that processes a file one character at a time and yields tokens when whitespace is encountered.
def generate_tokens(path):
    with open(path, 'r') as fp:
        buf = []
        while True:
            ch = fp.read(1)
            if ch == '':  # EOF
                break
            elif ch.isspace():
                if buf:
                    yield ''.join(buf)
                    buf = []
            else:
                buf.append(ch)
        if buf:
            # flush the last token if the file doesn't end with whitespace
            yield ''.join(buf)

if __name__ == '__main__':
    for token in generate_tokens('input.txt'):
        print(token)
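As an earlier answer notes, read(1) pays Python-level call overhead for every character, even though the underlying file object is buffered. A chunked variant of the same whitespace-splitting idea, sketched under the assumption that any chunk size around tens of kilobytes is fine, could look like this:

def generate_tokens_chunked(path, chunk_size=64 * 1024):
    # Same logic as above, but the per-call overhead is amortized over
    # large reads; tokens may still span chunk boundaries safely
    # because buf persists across iterations.
    buf = []
    with open(path, 'r') as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:  # EOF
                break
            for ch in chunk:
                if ch.isspace():
                    if buf:
                        yield ''.join(buf)
                        buf = []
                else:
                    buf.append(ch)
        if buf:
            # flush the final token if the file doesn't end with whitespace
            yield ''.join(buf)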
To be more generic, it looks like you might be able to use the re
module as described in this related question; just feed it input from your file piece by piece to avoid reading the whole file at once (a rough sketch follows):
Python equivalent of Ruby's StringScanner?
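re only matches against in-memory strings or buffers, so one way to approximate a scanner over a stream is to match chunk by chunk and carry an unfinished token across the boundary. A minimal sketch, assuming whitespace-delimited tokens (the function name, default pattern, and chunk size are all my own choices):

import re

def re_tokens(file_object, pattern=re.compile(r'\S+'), chunk_size=64 * 1024):
    carry = ''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:  # EOF: anything carried over is a complete token
            if carry:
                yield carry
            return
        buf = carry + chunk
        tokens = pattern.findall(buf)
        if tokens and not buf[-1:].isspace():
            # the buffer ends mid-token; hold the partial match back
            carry = tokens[-1]
            tokens = tokens[:-1]
        else:
            carry = ''
        for tok in tokens:
            yield tok

This relies on the pattern matching runs that cannot contain the delimiter (as \S+ does); a pattern that could match across whitespace would need smarter carry-over handling.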
# python, read token file
# Put the token on the first line of a token.txt file.
token = open("token.txt", "r").readline() # I've opted to just save my token to a text file.
token = token.rstrip()
...
print(token)
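A small improvement: a with block closes the file promptly instead of leaking the handle. A sketch of the same read:

with open("token.txt", "r") as f:
    token = f.readline().rstrip()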