Performance reading large SPSS file in pandas dataframe on Windows 7 (x64)

Question

I have a large SPSS-file (containing a little over 1 million records, with a little under 150 columns) that I want to convert to a Pandas DataFrame.

It takes a few minutes to convert the file to a list, then another couple of minutes to convert it to a DataFrame, then another few minutes to set the column headers.

Are there any possible optimizations that I'm missing?

import pandas as pd
import numpy as np
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True)  # this is fast
raw_data_list = list(raw_data)  # this is slow
data = pd.DataFrame(raw_data_list)  # this is slow
data = data.rename(columns=data.loc[0]).iloc[1:]  # setting column headers, this is slow too

Answer 1:

You can use rawMode=True to speed things up a bit, as in:

raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)

This way, datetime variables (if any) are not converted to ISO strings, SPSS $sysmis values are not converted to None, and a few other conversions are skipped as well.
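
Two follow-ups may be worth trying; the sketch below is only an illustration, not something taken from the savReaderWriter documentation. First, the header row can be pulled off the reader before the DataFrame is built, so the column names are passed in directly and the separate rename step goes away. Second, since raw mode leaves dates as plain numbers, they can be converted afterwards only for the columns that need it (SPSS stores dates as seconds since 1582-10-14). The helper names (split_header, decode_names, spss_to_datetime, to_dataframe) are made up for this example, and it assumes the header is still the first row when rawMode=True and that column names may come back as byte strings.

```python
from datetime import datetime, timedelta

# SPSS stores dates as seconds since the start of the Gregorian calendar.
SPSS_EPOCH = datetime(1582, 10, 14)


def split_header(rows):
    """Return (header, iterator over the remaining rows)."""
    it = iter(rows)
    header = next(it)  # assumption: the first row holds the column names
    return header, it


def decode_names(header, encoding="utf-8"):
    """Decode byte-string column names; leave str names untouched."""
    return [h.decode(encoding) if isinstance(h, bytes) else h for h in header]


def spss_to_datetime(seconds):
    """Convert a raw SPSS date value to a datetime; None stays None."""
    if seconds is None:
        return None
    return SPSS_EPOCH + timedelta(seconds=seconds)


def to_dataframe(rows):
    """Build a DataFrame whose column names come from the first row."""
    import pandas as pd  # imported here so the helpers above need only the stdlib
    header, records = split_header(rows)
    return pd.DataFrame(list(records), columns=decode_names(header))
```

With the reader from above this would be called as data = to_dataframe(raw_data), and a date column (say a hypothetical 'some_date') could then be converted with data['some_date'].map(spss_to_datetime). Keep in mind that in raw mode system-missing values are not None, so they would have to be filtered out before the date conversion.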

Source: https://stackoverflow.com/questions/25181147/performance-reading-large-spss-file-in-pandas-dataframe-on-windows-7-x64

Tags

python

pandas

spss
