Question
I am processing hundreds of thousands of rows of text data using Pandas DataFrames. Every so often (fewer than 5 per 100,000 rows) I hit an error on a row that I have chosen to drop. The error-handling function is as follows:
def unicodeHandle(datai):
    for i, row in enumerate(datai['LDTEXT']):
        print(i)
        #print(text)
        try:
            text = row.read()
            text.strip().split('[\W_]+')
            print(text)
        except UnicodeDecodeError as e:
            datai.drop(i, inplace=True)
            print('Error at index {}: {!r}'.format(i, row))
            print(e)
    return datai
The function works fine, and I have been using it for a few weeks.
The problem is that I never know when the error will occur, as the data comes from a DB that is constantly being added to (or I may pull different data). The point is that I must iterate through every row and run my error-testing function unicodeHandle in order to initialize my data. This process takes about five minutes, which gets a little annoying. I am trying to implement multiprocessing to speed up the loop. Via the web and various tutorials, I have come up with:
import multiprocessing as mp

def unicodeMP(datai):
    chunks = [datai[i::8] for i in range(8)]
    pool = mp.Pool(processes=8)
    results = pool.apply_async(unicodeHandle, chunks)
    while not results.ready():
        print("One Sec")
    return results.get()
if __name__ == "__main__":
    fast = unicodeMP(datai)
When I run the multiprocessing version, it takes the same amount of time as the regular loop, even though my CPU reports much higher utilization. In addition, the code raises the error as a normal exception instead of returning my cleaned DataFrame. What am I missing here?
How can I use multiprocessing for functions on DataFrames?
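(A side note not in the original post, but worth flagging: pool.apply_async(func, args) schedules a single call func(*args), so passing the list of eight chunks as args invokes unicodeHandle once with eight positional arguments rather than eight times with one chunk each, which would explain the plain error instead of a clean DataFrame. Below is a minimal sketch of chunked dispatch with pool.map, reusing the OP's unicodeHandle; note the positional drop(i) inside it would still need to be made index-aware for the non-contiguous indexes of interleaved chunks.)

import multiprocessing as mp
import pandas as pd

def unicodeMP(datai):
    # Eight interleaved row chunks, one per worker process.
    chunks = [datai[i::8] for i in range(8)]
    with mp.Pool(processes=8) as pool:
        # map() calls unicodeHandle(chunk) once per chunk, in parallel.
        results = pool.map(unicodeHandle, chunks)
    # Stitch the cleaned chunks back together in original row order.
    return pd.concat(results).sort_index()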
Answer 1:
You can try dask for multiprocessing a DataFrame:
import dask.dataframe as dd

partitions = 7  # cpu_cores - 1
ddf = dd.from_pandas(df, npartitions=partitions)
# map_partitions runs unicodeHandle on each pandas partition;
# the 'processes' scheduler executes the partitions in parallel.
clean = ddf.map_partitions(unicodeHandle).compute(scheduler='processes')
You can read more about dask here.
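(One caveat not in the original answer: map_partitions infers the output schema by running the function on an empty dummy partition, so if unicodeHandle cannot handle an empty frame, passing meta explicitly sidesteps the inference. A sketch, where df.iloc[:0] is an empty frame with the same columns and dtypes:)

# meta tells dask the expected output schema up front,
# so unicodeHandle is never called on an empty dummy partition.
clean = ddf.map_partitions(unicodeHandle, meta=df.iloc[:0]).compute(scheduler='processes')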
Source: https://stackoverflow.com/questions/59584238/python-multiprocessing-for-dataframe-operations-functions