numpy.memmap for an array of strings?

Posted by 和自甴很熟 on 2019-12-21 17:57:28

Question


Is it possible to use numpy.memmap to map a large disk-based array of strings into memory?

I know it can be done for floats and suchlike, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution is free to dictate any reasonable file format.


Answer 1:


If all the strings have the same length, as suggested by the term "array", this is easily possible:

a = numpy.memmap("data", dtype="S10")

would be an example for strings of length 10.
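A minimal round-trip sketch of the fixed-length case, assuming a raw file with no header in which every item occupies exactly 10 bytes. The sample words and the temporary path are made up for illustration; strings longer than 10 bytes would be truncated by the "S10" dtype.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data; "S10" stores each item as exactly 10 bytes.
words = np.array([b"alpha", b"beta", b"gamma"], dtype="S10")

path = os.path.join(tempfile.mkdtemp(), "data")
words.tofile(path)  # raw bytes, no header: 3 items * 10 bytes each

# Map the file read-only; the length is inferred from the file size.
a = np.memmap(path, dtype="S10", mode="r")

first = a[0]                       # NUL padding is stripped on access
count = len(a)                     # number of 10-byte items in the file
file_size = os.path.getsize(path)  # bytes on disk
```

Only the items that are actually indexed get read from disk, so this works the same way for files much larger than memory.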

Edit: Since apparently the strings don't have the same length, you need to index the file to allow for O(1) item access. This requires reading the whole file once and storing the start indices of all strings in memory. Unfortunately, I don't think there is a pure NumPy way of indexing without creating an array the same size as the file in memory first. This array can be dropped after extracting the indices, though.
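One way to build such an index with NumPy might look like the sketch below. It assumes a non-empty file in which every string, including the last, is terminated by a single "\n" byte and encoded as UTF-8; the helper names build_line_index and get_string are made up for this example. As the answer says, the comparison creates a temporary boolean array as large as the file.

```python
import os
import tempfile

import numpy as np


def build_line_index(path):
    """Return the mapped bytes plus start and end offsets of every string."""
    raw = np.memmap(path, dtype=np.uint8, mode="r")
    # This comparison builds a boolean array the same size as the file;
    # it can be discarded once the offsets have been extracted.
    ends = np.flatnonzero(raw == ord("\n"))
    # Each string starts right after the previous newline; the first at 0.
    starts = np.concatenate(([0], ends[:-1] + 1))
    return raw, starts, ends


def get_string(raw, starts, ends, i):
    """O(1) access: slice the mapped bytes and decode only that item."""
    return raw[starts[i]:ends[i]].tobytes().decode("utf-8")


# Hypothetical sample file for demonstration.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")
with open(path, "wb") as f:
    f.write(b"Yep\nThis is a somewhat longer string!\nlast\n")

raw, starts, ends = build_line_index(path)
third = get_string(raw, starts, ends, 2)
```

For files too large for the temporary boolean array, the same scan could be run over slices of raw in chunks, adding each chunk's start position to the offsets it yields.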




Answer 2:


The most flexible option would be to switch to a database or some other more complex on-disk file structure.

However, there's probably some good reason that you'd rather keep things as a plain text file...

Because you have control of how the files are created, one option is to simply write out a second file that only contains the starting positions (in bytes) of each string in the other file.

This would require a bit more work, but you could essentially do something like this:

class IndexedText(object):
    """A text file of lines plus a sidecar file listing where each line starts."""

    def __init__(self, filename, mode='r'):
        if mode not in ['r', 'w', 'a']:
            raise ValueError('Only read, write, and append are supported')
        idxname = filename + 'idx'

        # Load any existing start positions before opening the index file in
        # the requested mode: a file opened with 'a' cannot be read.
        if mode != 'w':
            with open(idxname, 'r') as idxfile:
                self.indices = [int(line.strip()) for line in idxfile]
        else:
            self.indices = []

        self._mainfile = open(filename, mode)
        self._idxfile = open(idxname, mode)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self._mainfile.close()
        self._idxfile.close()

    def __getitem__(self, idx):
        position = self.indices[idx]
        self._mainfile.seek(position)
        # You might want to remove the automatic stripping...
        return self._mainfile.readline().rstrip('\n')

    def write(self, line):
        if not line.endswith('\n'):
            line += '\n'
        # Record where this line starts before writing it.
        position = self._mainfile.tell()
        self.indices.append(position)
        self._idxfile.write(str(position)+'\n')
        self._mainfile.write(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)


def main():
    with IndexedText('test.txt', 'w') as outfile:
        outfile.write('Yep')
        outfile.write('This is a somewhat longer string!')
        outfile.write('But we should be able to index this file easily')
        outfile.write('Without needing to read the entire thing in first')

    with IndexedText('test.txt', 'r') as infile:
        print(infile[2])
        print(infile[0])
        print(infile[3])

if __name__ == '__main__':
    main()
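To tie this back to numpy.memmap, a variation on the same idea (not part of the original answer) is to store the positions as binary int64 values, so that both the index and the string data can be memory-mapped and neither is read into memory up front. This is only a sketch: the names write_strings and MappedStrings and the file names are made up, the strings are stored as UTF-8 with no separators, negative indices are not supported, and the data file is assumed to be non-empty because an empty file cannot be mapped.

```python
import os
import tempfile

import numpy as np


def write_strings(data_path, index_path, strings):
    """Write UTF-8 strings back to back and save their boundaries as int64."""
    offsets = [0]
    with open(data_path, "wb") as data:
        for s in strings:
            encoded = s.encode("utf-8")
            data.write(encoded)
            offsets.append(offsets[-1] + len(encoded))
    # len(strings) + 1 boundaries: item i spans offsets[i]..offsets[i + 1].
    np.array(offsets, dtype=np.int64).tofile(index_path)


class MappedStrings:
    """Read-only access to the strings through two memory maps."""

    def __init__(self, data_path, index_path):
        self._data = np.memmap(data_path, dtype=np.uint8, mode="r")
        self._offsets = np.memmap(index_path, dtype=np.int64, mode="r")

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self._offsets[i], self._offsets[i + 1]
        return self._data[start:end].tobytes().decode("utf-8")


# Hypothetical file names in a temporary directory.
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "strings.bin")
index_path = os.path.join(tmpdir, "strings.idx")

write_strings(data_path, index_path, [
    "Yep",
    "This is a somewhat longer string!",
    "But we should be able to index this file easily",
    "Without needing to read the entire thing in first",
])

strings = MappedStrings(data_path, index_path)
third = strings[2]
```

Storing boundaries instead of newline-terminated lines means the strings themselves may contain newlines, at the cost of the data file no longer being a plain text file that can be read line by line.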


Source: https://stackoverflow.com/questions/5896747/numpy-memmap-for-an-array-of-strings
