Separate keywords and @ mentions from dataset

Submitted by 我怕爱的太早我们不能终老 on 2019-12-12 03:28:35

Question


I have a huge dataset with several columns and about 10k rows in more than 100 CSV files. For now I am concerned with only one column, which holds messages, and I want to extract two parameters from it. I searched extensively and found two solutions that come close, but not close enough to solve the question here: ONE & TWO

Input: the column is named "Text", and every message is a separate row in a CSV.

"Let's Bounce!😉  #[message_1]

 Loving the energy & Microphonic Mayhem while…" #[message_2]

RT @IVijayboi: #[message_3]   @Bdutt@sardesairajdeep@rahulkanwal@abhisarsharma@ppbajpayi@Abpnewd@Ndtv@Aajtak#Jihadimedia@Ibn7 happy #PresstitutesDay

 "RT @RakeshKhatri23: MY LIFE #[message_4]

        WITHOUT YOU 

        IS

        LIKE 

        FLOWERS WITHOUT 

        FRAGRANCE 💞💞

        ~True Love~"


  Me & my baby ðŸ¶â¤ï¸ðŸ‘­ @ Home Sweet Home  #[message_5]

The input is a CSV file with several other columns, but I am interested only in this one. I want to separate the @name and #keyword from the input into new columns like:

expected output

text, mentions, keywords 
[message], NAN, NAN
[message], NAN, NAN
[message], @IVijayboi, #Jihadimedia  
           @Bdutt      #PresstitutesDay
           @sardesairajdeep 
           @rahulkanwal 
           @abhisarsharma 
           @ppbajpayi 
           @Abpnewd 
           @Ndtv 
           @Aajtak  
           @Ibn7

As the input shows, the first and second messages have no @ or #, so their column values are NAN, but the third message has 10 @ mentions and 2 # keywords.

In simple words: how do I separate the @ mentioned names and # keywords from the message into separate columns?


Answer 1:


I suspect you want to use a regular expression. I don't know the exact format that your @ mentions and # keywords are allowed to take, but I would guess that something of the form @([a-zA-Z0-9]+) would work. Because + is greedy, no trailing delimiter class is needed, and leaving it out means that back-to-back mentions such as @Bdutt@sardesairajdeep and a mention at the very end of the text are both matched.

#!/usr/bin/env python3
import re

test_string = """Text
"Let's Bounce!😉
Loving the energy & Microphonic Mayhem while…"
RT @IVijayboi: etc etc"""

# Capture the run of letters and digits that follows each "@".
mention_match = re.compile(r'@([a-zA-Z0-9]+)')
for match in mention_match.finditer(test_string):
    print(match.group(1))

# Same idea for "#" keywords.
hashtag_match = re.compile(r'#([a-zA-Z0-9]+)')
for match in hashtag_match.finditer(test_string):
    print(match.group(1))

Hopefully that gives you enough to get started with.
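
To go from individual matches to the mentions and keywords columns asked for in the question, one possible approach is sketched below using only the standard library. It rests on several assumptions that are not in the original post: names consist of letters, digits and underscores; multiple tags in one message are joined with spaces in a single cell; and rows without tags get the literal string "NAN". The helper names extract_tags and add_tag_columns are hypothetical, and the in-memory sample stands in for one of the real CSV files.

```python
import csv
import io
import re

# Assumption: names consist of letters, digits and underscores.
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")


def extract_tags(text):
    """Return (mentions, keywords) as two lists of strings for one message."""
    return MENTION_RE.findall(text), HASHTAG_RE.findall(text)


def add_tag_columns(rows, text_col="Text"):
    """Yield each row dict with added 'mentions' and 'keywords' fields.

    Multiple tags are joined with spaces; rows without any get "NAN".
    """
    for row in rows:
        mentions, keywords = extract_tags(row.get(text_col) or "")
        new_row = dict(row)
        new_row["mentions"] = " ".join(mentions) or "NAN"
        new_row["keywords"] = " ".join(keywords) or "NAN"
        yield new_row


# Small in-memory stand-in for one of the CSV files.
sample_csv = io.StringIO(
    "Text\n"
    '"Let\'s Bounce!"\n'
    '"RT @IVijayboi: @Bdutt@sardesairajdeep happy #PresstitutesDay"\n'
)
result = list(add_tag_columns(csv.DictReader(sample_csv)))
for row in result:
    print(row["mentions"], "|", row["keywords"])
```

For the real data, the same generator could be fed from csv.DictReader over each file in turn and written back out with csv.DictWriter. Note that a pattern this loose would also pick up things like the domain part of an email address, so it may need tightening for other datasets.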



Source: https://stackoverflow.com/questions/37224354/separate-keywords-and-mentions-from-dataset
