How to read tokens without reading whole line or file

别那么骄傲 2020-12-02 00:51

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem

4 Answers
  • 2020-12-02 01:27

    To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

    #!/usr/bin/env python3
    import re
    import sys
    from mmap import ACCESS_READ, mmap    
    
    def generate_tokens(filename, pattern):
        # open in binary mode: the mmap exposes bytes, so pattern must be bytes too
        with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
            yield from re.finditer(pattern, mm)
    
    # sum all integers in a file specified at the command-line
    print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'\d+')))
    

    It works even if the file doesn't fit in memory, because the operating system pages the mapped file in on demand instead of loading it all at once.
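
One caveat with the generator above: each match object refers to the mapped file, which is closed once the generator finishes, so a match kept around afterwards may no longer be usable. A minimal sketch that copies each token out while the map is still open, assuming whitespace-separated UTF-8 tokens; the name `iter_tokens` and its `pattern` and `encoding` parameters are hypothetical and not part of the answer:

```python
import mmap
import re


def iter_tokens(filename, pattern=rb'\S+', encoding='utf-8'):
    """Yield each regex match in the file as a decoded str."""
    # Note: mmap raises ValueError for an empty file.
    with open(filename, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in re.finditer(pattern, mm):
            # m.group() is a bytes copy, independent of the mapping
            yield m.group().decode(encoding)
```

Because the yielded values are plain strings, something like `list(iter_tokens(path))` stays valid after the file has been closed.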

  • 2020-12-02 01:27

    You can read the file in chunks with file.read(size). I would not recommend reading 1 byte at a time, however, as this drastically hurts performance. The following snippet (not much tested, use at your own risk) reads the file in chunks and yields the numbers of one row. To read a row other than the first, you'll have to scan the file once beforehand to find where each row starts and pass that position as pos_from.

    def values_chunks(file_object, pos_from=0, chunk_size=32*1024):
        """Yield the integers of one row, starting at byte offset pos_from.

        file_object must be opened in binary mode.
        """
        file_object.seek(pos_from)
        tail = b''  # partial token carried over from the previous chunk
        while True:
            chunk = file_object.read(chunk_size)
            # keep only the data up to the first newline, if there is one
            data, eol, _ = (tail + chunk).partition(b'\n')
            raw_values = data.split()
            # the chunk ended in the middle of a token: hold the last piece back
            if chunk and not eol and raw_values and not data[-1:].isspace():
                tail = raw_values.pop()
            else:
                tail = b''
            for value in raw_values:
                yield int(value)
            if not chunk or eol:  # end of file or end of line
                break

    >>> rows = [range(10**5), range(10**4)]
    >>> with open('test', 'wb') as test:
    ...     for row in rows:
    ...         _ = test.write(' '.join(map(str, row)).encode() + b'\n')
    ...
    >>> with open('test', 'rb') as test:
    ...     values = list(values_chunks(test))
    ...
    >>> len(values)
    100000
    >>> sum(values)
    4999950000
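
The preliminary pass that finds where each row starts could look roughly like the sketch below. It assumes rows end with `\n` and that the file is opened in binary mode; the helper name `row_offsets` is hypothetical and not part of the answer:

```python
def row_offsets(file_object, chunk_size=32 * 1024):
    """Return the byte offsets at which each row of the file starts."""
    file_object.seek(0)
    offsets = [0]  # the first row starts at the beginning of the file
    pos = 0        # absolute offset of the current chunk
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        start = 0
        while True:
            i = chunk.find(b'\n', start)
            if i == -1:
                break
            offsets.append(pos + i + 1)  # next row starts after the newline
            start = i + 1
        pos += len(chunk)
    # a final newline does not start a new row
    if len(offsets) > 1 and offsets[-1] == pos:
        offsets.pop()
    return offsets
```

Each returned offset can then be passed as `pos_from` to `values_chunks` to read that row alone.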
    
  • 2020-12-02 01:47

    Here is a generator that processes a file one character at a time and yields tokens when whitespace is encountered.

    def generate_tokens(path):
        with open(path, 'r') as fp:
            buf = []
            while True:
                ch = fp.read(1)
                if ch == '':
                    # end of file: emit the last token if the file
                    # does not end with whitespace
                    if buf:
                        yield ''.join(buf)
                    break
                elif ch.isspace():
                    if buf:
                        yield ''.join(buf)
                        buf = []
                else:
                    buf.append(ch)

    if __name__ == '__main__':
        for token in generate_tokens('input.txt'):
            print(token)
    

    To be more generic, it looks like you might be able to use the re module as described in the question "Python equivalent of Ruby's StringScanner?". Just feed the input from your file through a generator to avoid reading the whole file at once.
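
One way the regex idea might be combined with chunked reads is sketched below. It is only an illustration: the name `regex_tokens` and its parameters are hypothetical, it expects a text-mode file-like object, and it assumes a simple pattern such as `\S+` or `\d+`, where a match can only grow by characters appended at its end and no token is larger than available memory:

```python
import re


def regex_tokens(file_object, pattern=r'\S+', chunk_size=32 * 1024):
    """Yield regex matches from a text-mode file-like object, chunk by chunk."""
    regex = re.compile(pattern)
    buf = ''
    while True:
        chunk = file_object.read(chunk_size)
        buf += chunk
        last_end = 0
        for m in regex.finditer(buf):
            # A match touching the end of the buffer may continue in the
            # next chunk, so hold it back unless we are at end of file.
            if chunk and m.end() == len(buf):
                break
            yield m.group()
            last_end = m.end()
        # keep only the unconsumed remainder for the next round
        buf = buf[last_end:]
        if not chunk:
            break
```

Since it only calls `read(size)`, the same function should work for any text-mode file-like object, for example `sys.stdin` or an `io.StringIO`. Text between matches is kept until the next match is consumed, so a very long run of non-matching input would still accumulate in the buffer.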

  • 2020-12-02 01:54

    # Python: read a token from a file.
    # Put the token on the first line of a token.txt file.

    with open("token.txt", "r") as f:  # I've opted to just save my token to a text file.
        token = f.readline().rstrip()  # drop the trailing newline
    ...
    
    print(token)
    