Count Frequency of words of a Text variable with Hive

Submitted by 自闭症网瘾萝莉.ら on 2020-01-24 20:41:05

Question


I have a column in which every row is a sentence. Example:

 -Row1 "Hey, how are you?"
 -Row2 "Hey, Who is there?"

I want the output to be the count grouped by word.

Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any thoughts on this?

Thanks!


Answer 1:


This is possible in Hive. Split by non-alpha characters and use lateral view+explode, then count words:

with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

select w.word, count(*) cnt
from
(
select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
)s lateral view explode(words) w as word
where w.word!=''
group by w.word;

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1
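
The where w.word!='' filter in the query above is there because a string that ends in punctuation leaves an empty token after the last delimiter. A minimal sketch of the same effect, using Python's re.split purely as a stand-in for illustration (it is an assumption that Hive's split returns the same tokens for this input, although the filter in the query suggests it does):

```python
import re

# Split on runs of non-letter characters, as the Hive query does.
tokens = re.split(r"[^a-zA-Z]+", "hey, how are you?")
print(tokens)  # ['hey', 'how', 'are', 'you', '']

# Drop the empty token left behind by the trailing "?".
words = [t for t in tokens if t != ""]
print(words)  # ['hey', 'how', 'are', 'you']
```

Without the filter, the empty string would be counted as a word of its own.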

Another method uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

select w.word, count(*) cnt
from
(
select sentences(lower(initial_string)) sentences from your_data
)d lateral view explode(sentences) s as sentence
   lateral view explode(s.sentence) w as word
group by w.word;

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

The sentences(string str, string lang, string locale) function tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).




Answer 2:


Another option is to do this outside Hive. You can read the data from Hive into a pandas DataFrame and do the processing there with Python. The question then becomes how to count word frequency in a DataFrame column.

Counting the Frequency of words in a pandas data frame
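
As a rough illustration of the Python side, here is a minimal sketch that counts words with the standard library's collections.Counter rather than pandas. It assumes the rows have already been fetched from Hive into a plain list of strings; the word_counts helper and the rows list are hypothetical names used only for this example.

```python
import re
from collections import Counter


def word_counts(rows):
    """Count how often each lowercased word occurs across all rows."""
    counts = Counter()
    for row in rows:
        # findall keeps only runs of letters, so no empty tokens appear.
        counts.update(re.findall(r"[a-z]+", row.lower()))
    return counts


# Sample rows taken from the question (assumed to be already loaded from Hive).
rows = ["Hey, how are you?", "Hey, Who is there?"]
counts = word_counts(rows)

for word, cnt in sorted(counts.items()):
    print(word, cnt)
```

With a pandas DataFrame, the same helper could be applied to the values of the text column once it is converted to a list.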



Source: https://stackoverflow.com/questions/59855489/count-frequency-of-words-of-a-text-variable-with-hive
