Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem …
To read tokens from a file one by one, you could use the re
module to generate tokens from a memory-mapped file:
#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap
def generate_tokens(filename, pattern):
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)
# sum all integers in a file specified on the command line
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))
It works even if the file doesn't fit in memory.
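The matches above are bytes objects, since mmap exposes the file as a bytes-like buffer. If you want decoded string tokens instead, one possible variation (the br'\S+' pattern and the encoding parameter are my additions, not part of the answer above) is:

def generate_text_tokens(filename, encoding='utf-8'):
    # Reuse the mmap-based generator above with a "run of non-whitespace
    # bytes" pattern and decode each match to str.
    for m in generate_tokens(filename, br'\S+'):
        yield m.group().decode(encoding)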
You can read a file in chunks with file.read(size)
. I would not recommend reading one byte at a time, however, as this will drastically affect performance. The following snippet (Python 2; not much tested, use at your own risk) reads the file in chunks and yields numbers. You'll have to scan through the file first to determine each row's starting position, though (see the offset sketch after the example below).
def values_chunks(file_object, pos_from=0, chunk_size=32 * 1024):
    file_object.seek(pos_from)
    eol = False
    tail = ''
    while True:
        chunk = file_object.read(chunk_size)
        raw_data = tail + chunk
        parts = raw_data.split('\n', 1)  # split once to check for an eol
        if len(parts) > 1:
            eol = True
        raw_data = parts[0]  # in either case we only need the first element
        raw_values = raw_data.split()
        if not eol and chunk and raw_data and not raw_data[-1].isspace():
            # the chunk may end in the middle of a number;
            # carry the partial token over to the next read
            tail = raw_values[-1]
            raw_values = raw_values[:-1]
        else:
            tail = ''
        for value in raw_values:
            yield int(value)
        if not chunk or eol:  # eof/eol
            break
>>> with open('test', 'wb') as test:
... test.write(' '.join(map(str, range(10**5))))
... test.write('\n')
... test.write(' '.join(map(str, range(10**4))))
...
>>> values = list(values_chunks(open('test', 'rb')))
>>> len(values)
100000
>>> sum(values)
4999950000L
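As for finding each row's starting position: a single pre-pass that accumulates line lengths is one cheap way to collect pos_from offsets. This helper is a sketch of mine, not part of the answer above:

def line_offsets(path):
    # Return the byte offset at which each line of the file starts.
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:  # lines keep their trailing '\n', so lengths add up
            offsets.append(pos)
            pos += len(line)
    return offsets

With that, values_chunks(open('test', 'rb'), pos_from=line_offsets('test')[1]) would stream the numbers of the second row.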
Here is a generator that processes a file one character at a time and yields tokens when whitespace is encountered.
def generate_tokens(path):
    with open(path, 'r') as fp:
        buf = []
        while True:
            ch = fp.read(1)
            if ch == '':  # EOF
                break
            elif ch.isspace():
                if buf:
                    yield ''.join(buf)
                    buf = []
            else:
                buf.append(ch)
        if buf:
            # flush the last token if the file doesn't end with whitespace
            yield ''.join(buf)

if __name__ == '__main__':
    for token in generate_tokens('input.txt'):
        print(token)
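As an earlier answer notes, read(1) pays Python-level call overhead for every character, even though the underlying file object is buffered. A chunked variant of the same whitespace-splitting idea, sketched under the assumption that any chunk size around tens of kilobytes is fine, could look like this:

def generate_tokens_chunked(path, chunk_size=64 * 1024):
    # Same logic as above, but the per-call overhead is amortized over
    # large reads; tokens may still span chunk boundaries safely
    # because buf persists across iterations.
    buf = []
    with open(path, 'r') as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:  # EOF
                break
            for ch in chunk:
                if ch.isspace():
                    if buf:
                        yield ''.join(buf)
                        buf = []
                else:
                    buf.append(ch)
        if buf:
            # flush the final token if the file doesn't end with whitespace
            yield ''.join(buf)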
To be more generic, it looks like you might be able to use the re
module as described in this related question; just feed it input from your file piece by piece to avoid reading the whole file at once (a rough sketch follows):
Python equivalent of Ruby's StringScanner?
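re only matches against in-memory strings or buffers, so one way to approximate a scanner over a stream is to match chunk by chunk and carry an unfinished token across the boundary. A minimal sketch, assuming whitespace-delimited tokens (the function name, default pattern, and chunk size are all my own choices):

import re

def re_tokens(file_object, pattern=re.compile(r'\S+'), chunk_size=64 * 1024):
    carry = ''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:  # EOF: anything carried over is a complete token
            if carry:
                yield carry
            return
        buf = carry + chunk
        tokens = pattern.findall(buf)
        if tokens and not buf[-1:].isspace():
            # the buffer ends mid-token; hold the partial match back
            carry = tokens[-1]
            tokens = tokens[:-1]
        else:
            carry = ''
        for tok in tokens:
            yield tok

This relies on the pattern matching runs that cannot contain the delimiter (as \S+ does); a pattern that could match across whitespace would need smarter carry-over handling.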
# python, read token file
# Put the token on the first line of a token.txt file.
token = open("token.txt", "r").readline() # I've opted to just save my token to a text file.
token = token.rstrip()
...
print(token)
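A small improvement: a with block closes the file promptly instead of leaking the handle. A sketch of the same read:

with open("token.txt", "r") as f:
    token = f.readline().rstrip()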