Loading large file (25k entries) into dict is slow in Python?

Anonymous (unverified), submitted 2019-12-03 08:48:34

Question:

I have an S19-format file with about 25,000 lines.

Each line looks like this: S214 780010 00802000000010000000000A508CC78C 7A

There are no spaces in the actual file. The first part, 780010, is the address of the line, and I want it to be a dict key. I want the data part, 00802000000010000000000A508CC78C, to be the value for that key. I wrote my code like this:

def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}

    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)

    infile.close()

get_address_of_line() and get_data_of_line() are both simple string-slicing functions. get_line_number() iterates over self.all_lines and returns an int.

The problem is that initialization takes over a minute. Is the way I construct the dict wrong, or does Python just need that long to do this?

By the way, I'm new to Python :) so the code may look more like C/C++. Any advice on how to write more Pythonic code is appreciated :)

Answer 1:

This code should be tremendously faster than what you have now. EDIT: As @sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.

def __init__(self,filename):
    self.dict_by_address = {}

    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value

Some comments:

  • Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.

  • Best practice is to use open() rather than file(); Python 3.x removed file() altogether.

  • You can use the open file object as an iterator; each iteration yields one line of input. This is better than calling the .readlines() method, which slurps all the data into a list that you use once and then throw away. For a large input file that list costs memory, and in the worst case it can push the process into swap, which is slow. This version avoids building and deleting the giant list.

  • Then, having created a giant list of input lines, you use range() to make a big list of integers (in Python 2, range() returns a list). Again, it wastes time and memory to build a list, use it once, and then delete it. You can avoid this overhead with xrange(), but even better is to build the dictionary as you go, in the same loop that reads lines from the file.

  • It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
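The error handling suggested in the last point could look roughly like the sketch below. It assumes the space-separated layout from the question's example (not the real file, which has no spaces), and the function name parse_spaced_records and the skipped list are made up for illustration.

```python
import io


def parse_spaced_records(lines):
    """Build {address: data} from lines like 'S214 780010 <data> 7A'."""
    dict_by_address = {}
    skipped = []  # line numbers that did not have exactly four fields
    for line_number, line in enumerate(lines, start=1):
        try:
            # Unpacking raises ValueError unless split() yields four items.
            _, key, value, _ = line.split()
        except ValueError:
            skipped.append(line_number)
            continue
        dict_by_address[key] = value
    return dict_by_address, skipped


# In-memory stand-in for a file: one record, a blank line, a comment line.
sample = io.StringIO(
    "S214 780010 00802000000010000000000A508CC78C 7A\n"
    "\n"
    "; comment\n"
)
records, skipped = parse_spaced_records(sample)
print(records)
print(skipped)
```

Recording the skipped line numbers instead of silently dropping them makes it easier to notice when the input is not as regular as expected.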

EDIT: corrected version:

def __init__(self,filename):
    self.dict_by_address = {}

    with open(filename, 'r') as infile:
        for line in infile:
            # extract_address() and extract_data() stand in for your slicing functions
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
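The answer does not define extract_address() and extract_data(), so here is one possible sketch of them. The slice positions assume every data line is an S2 record laid out like the example in the question: "S2", a 2-digit byte count, a 6-digit address, the data, and a 2-digit checksum. The class name S19File and the decision to skip non-S2 lines are assumptions as well.

```python
import os
import tempfile


def extract_address(line):
    # Hypothetical helper: characters 4..9 hold the 6-digit address.
    return line[4:10]


def extract_data(line):
    # Hypothetical helper: drop the newline, then cut off the 2-digit checksum.
    return line.rstrip()[10:-2]


class S19File:
    def __init__(self, filename):
        self.dict_by_address = {}
        with open(filename, 'r') as infile:
            for line in infile:
                # Assumption: only S2 records carry the data we want.
                if not line.startswith('S2'):
                    continue
                self.dict_by_address[extract_address(line)] = extract_data(line)


# Write a one-record file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile('w', suffix='.s19', delete=False) as tmp:
    tmp.write('S21478001000802000000010000000000A508CC78C7A\n')
    path = tmp.name

s19 = S19File(path)
os.remove(path)
print(s19.dict_by_address)
```

If the file mixes record types with different address widths, fixed slice positions like these would need to vary per record type.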


Answer 2:

How about something like this? (I made a test file containing the single line S21478001000802000000010000000000A508CC78C7A, so you might have to adjust the slicing.)

>>> with open('test.test') as f:
...     dict_by_address = {line[4:10]:line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
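One thing to watch with the [10:-3] slice is that it relies on every line ending in a newline. If the last line of the file has none, one data character is lost. A possible variant, sketched here with a made-up load_records() function and a made-up second record, strips the line first and then removes only the two checksum digits:

```python
import io


def load_records(lines):
    # strip() removes the newline if there is one, so [10:-2] always
    # cuts exactly the two checksum digits; blank lines are skipped.
    stripped = (line.strip() for line in lines)
    return {line[4:10]: line[10:-2] for line in stripped if line}


# In-memory stand-in for the file. The second record is invented for
# illustration and deliberately has no trailing newline.
sample = io.StringIO(
    "S21478001000802000000010000000000A508CC78C7A\n"
    + "S214780020" + "F" * 32 + "63"
)
dict_by_address = load_records(sample)
print(dict_by_address)
```

With a real file, the same function can be called as load_records(f) inside a with open(...) as f: block.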

