Automatically multiprocessing a 'function apply' on a dataframe column

Submitted by 南楼画角 on 2020-01-07 04:43:23

Question


I have a simple dataframe with two columns.

+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+

For every record in the 'subject' column, I call a function and store the result in the 'score' column:

df['score'] = df['subject'].apply(find_score)

Here find_score is a function that processes a string and returns a score:

def find_score (row):
    # Imports the Google Cloud client library
    from google.cloud import language

    # Instantiates a client
    language_client = language.Client()

    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score) 

    return sentiment.score

This works as expected, but it's quite slow because it processes the records one by one.

Is there a way to parallelise this without manually splitting the dataframe into smaller chunks? Is there a library that does that automatically?

Cheers


Answer 1:


The instantiation of language.Client every time you call the find_score function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:

import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, outside the function, so every call reuses it
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace non-word characters with spaces
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score)

    return sentiment.score

df['score'] = df['subject'].apply(find_score)

If you insist, you can use multiprocessing like this:

from multiprocessing import Pool

# <Define functions and datasets here>

if __name__ == '__main__':
    # The guard is needed on platforms that spawn worker processes (e.g. Windows)
    with Pool(processes=8) as pool:  # or some number of your choice
        df['score'] = pool.map(find_score, df['subject'])
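
Since each call to find_score presumably spends most of its time waiting on a network request rather than computing, a thread pool may be enough and avoids the cost of starting separate processes. Here is a minimal sketch of that idea using only the standard library. The find_score below is a hypothetical stand-in (it just returns the length of the cleaned text) so the example runs without Google Cloud credentials, and score_all is a helper name I made up for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def find_score(text):
    # Hypothetical stand-in for the real sentiment call: strip HTML tags
    # and return the length of what is left as a fake "score".
    cleaned = re.sub(r'<[^>]*>', '', text)
    return float(len(cleaned))


def score_all(subjects, max_workers=8):
    # executor.map runs find_score concurrently across threads and
    # yields the results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(find_score, subjects))


subjects = ["wow", "cool", "<b>hey</b>"]
scores = score_all(subjects)
print(scores)
```

With a dataframe you could pass df['subject'] in place of the list and assign the returned list back to df['score'], the same way as with pool.map above. Whether this is actually faster depends on the API's rate limits and on whether the client can be shared safely across threads, so treat it as something to measure rather than a guaranteed win.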


Source: https://stackoverflow.com/questions/44171554/automatically-multiprocessing-a-function-apply-on-a-dataframe-column
