Count Frequency of words of a Text variable with Hive

Question

I have a text column in which every row is a sentence. Example:

 - Row 1: "Hey, how are you?"
 - Row 2: "Hey, Who is there?"

I want the output to be the number of occurrences, grouped by word.

Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any thoughts on this?

Thanks!

Answer 1:

This is possible in Hive. Split by non-alpha characters and use lateral view+explode, then count words:

with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

select w.word, count(*) cnt
from
(
select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
)s lateral view explode(words) w as word
where w.word!=''
group by w.word;

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

Another method uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

select w.word, count(*) cnt
from
(
select sentences(lower(initial_string)) sentences from your_data
)d lateral view explode(sentences) s as sentence
   lateral view explode(s.sentence) w as word
group by w.word;

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

The sentences(string str, string lang, string locale) function tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).

Answer 2:

Hive won't be able to do this alone. You can read the data from Hive into a pandas DataFrame and do the processing there with Python. The question then becomes how to count word frequency in a DataFrame column.

Counting the Frequency of words in a pandas data frame
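
For the Python route, a minimal sketch of the counting step might look like the following. It uses only the standard library rather than pandas, and the names rows and count_words are hypothetical: rows stands in for the text column once it has been pulled out of Hive (for example, a DataFrame column converted to a list).

```python
import re
from collections import Counter

# Hypothetical stand-in for the text column; in practice these rows
# would be read from Hive (e.g. a DataFrame column turned into a list).
rows = [
    "Hey, how are you?",
    "Hey, Who is there?",
]


def count_words(lines):
    """Count case-insensitive word frequencies across all lines."""
    counts = Counter()
    for line in lines:
        # Lowercase, split on runs of non-letters, and drop the empty
        # strings that the split leaves at the edges.
        words = [w for w in re.split(r"[^a-z]+", line.lower()) if w]
        counts.update(words)
    return counts


word_counts = count_words(rows)

# Print one "word count" pair per line, sorted alphabetically.
for word, cnt in sorted(word_counts.items()):
    print(word, cnt)
```

Splitting on non-letters treats punctuation and digits as separators, so the two sample rows yield "hey" twice and every other word once. Text with apostrophes or non-ASCII letters would need a different pattern.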

Source: https://stackoverflow.com/questions/59855489/count-frequency-of-words-of-a-text-variable-with-hive

Tags

Hadoop

text

Hive

counter

hiveql