问题
I have a simple dataframe with two columns.
+---------+-------+ | subject | score |
+---------+-------+ | wow | 0 |
+---------+-------+ | cool | 0 |
+---------+-------+ | hey | 0 |
+---------+-------+ | there | 0 |
+---------+-------+ | come on | 0 |
+---------+-------+ | welcome | 0 |
+---------+-------+
For every record in 'subject' column, I am calling a function and updating the results in column 'score' :
df['score'] = df['subject'].apply(find_score)
Here find_score is a function, which processes strings and returns a score :
def find_score (row):
# Imports the Google Cloud client library
from google.cloud import language
# Instantiates a client
language_client = language.Client()
import re
pre_text = re.sub('<[^>]*>', '', row)
text = re.sub(r'[^\w]', ' ', pre_text)
document = language_client.document_from_text(text)
# Detects the sentiment of the text
sentiment = document.analyze_sentiment().sentiment
print("Sentiment score - %f " % sentiment.score)
return sentiment.score
This works fine as expected but its quite slow as it processes the record one by one.
Is there a way, this can be parallelised ? without manually splitting the dataframe into smaller chunks ? Is there any library which does that automatically ?
Cheers
回答1:
The instantiation of language.Client
every time you call the find_score
function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:
# Instantiates a client
language_client = language.Client()
def find_score (row):
# Imports the Google Cloud client library
from google.cloud import language
import re
pre_text = re.sub('<[^>]*>', '', row)
text = re.sub(r'[^\w]', ' ', pre_text)
document = language_client.document_from_text(text)
# Detects the sentiment of the text
sentiment = document.analyze_sentiment().sentiment
print("Sentiment score - %f " % sentiment.score)
return sentiment.score
df['score'] = df['subject'].apply(find_score)
If you insist, you can use multiprocessing like this:
from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
来源:https://stackoverflow.com/questions/44171554/automatically-multiprocessing-a-function-apply-on-a-dataframe-column