Aggregating a data frame by n-gram frequency and associating the entries of other columns with each n-gram in R

Submitted by 人盡茶涼 on 2021-01-29 15:48:18

Question


I am trying to aggregate a data frame by 1-gram frequency (the approach extends to n-grams by changing n in the code below) and associate the entries of other columns with each 1-gram. My current approach is shown below. Are there any shortcuts or alternatives that produce the table shown at the very end of this question for the data frame given below?

The code and the results are shown below.

The chunk below sets up the environment, loads the libraries, and creates the data frame:

# Clear variables in the working environment 
rm(list = ls(all.names = TRUE))
gc()

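# load libraries
library("quanteda")
library("quanteda.textstats")  # provides textstat_frequency() as of quanteda 3.0; not needed on older versions
library("data.table")
library("tidyverse")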

# Dataframe
Data <- data.frame(Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302), 
                   Column2 = c(654231, 12347, -2365, 90000, 12897), 
                   Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'), 
                   Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'), 
                   Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect'))

Next, the corpus is built, the text is tokenized, and the n-grams (1-grams, i.e. unigrams, in this case) are computed:

# Corpus
Content <- corpus(Data, text_field = "Column4")
docnames(Content) <- seq_len(nrow(Data))

# Tokenization and Unigram
Tokens <- tokens(Content, what = "word") %>%  tokens_tolower() %>%  tokens_ngrams(n = 1)

Unigram <- textstat_frequency(dfm(Tokens), groups = docnames(Tokens)) %>% as.data.table()

setnames(Unigram, "group", "rownumber")

The Unigram result is:

The following shows the structure of Unigram:

str(Unigram)
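
As noted at the top, the same pipeline extends to n-grams by changing n. A minimal sketch of that for bigrams, assuming quanteda is installed and using two of the sentences from Column4 (the names `Sentences`, `Toks` and `Bigrams` are illustrative and not part of the code above):

```r
library("quanteda")

# Two sentences taken from Column4 above
Sentences <- c("I bought it", "The flower smells good")

# Same steps as the unigram pipeline, written without the pipe
Toks <- tokens(Sentences, what = "word")
Toks <- tokens_tolower(Toks)

# n = 2 forms bigrams; adjacent tokens are joined with "_" by default
Bigrams <- tokens_ngrams(Toks, n = 2)

print(as.list(Bigrams))
```

The resulting tokens object can be passed to dfm() and textstat_frequency() in the same way as the unigram tokens.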

Next, copy Data to DataFrame, add rownumber as a variable in it, and set that variable as numeric:

DataFrame <- Data

DataFrame <- add_column(DataFrame, rownumber = 1:nrow(Data), .before = "Column1")

DataFrame$rownumber <- as.numeric(DataFrame$rownumber)

Then convert Unigram to a data frame (UnigramDataFrame), set its rownumber column to numeric, and merge UnigramDataFrame with DataFrame:

UnigramDataFrame <- as.data.frame(Unigram)

UnigramDataFrame$rownumber <- as.numeric(UnigramDataFrame$rownumber)

MergeDF <- dplyr::left_join(UnigramDataFrame,DataFrame,by="rownumber")
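
The as.numeric() conversions are there so that the join key has the same type on both sides; current dplyr versions refuse to join a character key to a numeric one. A small sketch of this on toy data, assuming the document identifier arrives as character on the n-gram side (an assumption; str(Unigram) shows the actual type). The names `Left`, `Right` and `Joined` are hypothetical stand-ins for UnigramDataFrame, DataFrame and MergeDF:

```r
library("dplyr")

# Toy stand-ins: the key is character on one side, numeric on the other
Left <- data.frame(rownumber = c("1", "1", "2"),
                   feature   = c("i", "bought", "flower"),
                   stringsAsFactors = FALSE)
Right <- data.frame(rownumber = c(1, 2),
                    Column3   = c("A1", "B2"),
                    stringsAsFactors = FALSE)

# Align the key types before joining
Left$rownumber <- as.numeric(Left$rownumber)

# Each n-gram row picks up the columns of the document it came from
Joined <- left_join(Left, Right, by = "rownumber")
print(Joined)
```

If the key were a factor rather than character, as.numeric() would return the level codes instead of the labels, so as.numeric(as.character(x)) would be the safer conversion in that case.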

Finally, aggregate MergeDF by the frequency of each 1-gram/unigram (its first column), and associate Column3 and Column5 with it:

UnigramAgg <- MergeDF %>% group_by(feature) %>% summarise(Freq=n(), Column3=toString(Column3), Column5=toString(Column5))

Running the above produces the desired result:

The aim was to aggregate 1-grams by frequency and associate Column3 and Column5 with them, as shown above. The last two columns show which entries of Column3 and Column5 are associated with each 1-gram. For example, all the IDs (identification numbers) associated with every 1-gram can be gathered this way.
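
To make the aggregation step concrete, here is a minimal sketch of the group_by()/summarise() pattern on a hand-built toy table. The table `Toy` and its values are illustrative only and are not the output of the code above:

```r
library("dplyr")

# Toy merged table: one row per (document, 1-gram) pair
Toy <- data.frame(feature = c("bought", "it", "bought", "it", "flower"),
                  Column3 = c("A1", "A1", "E3", "E3", "B2"),
                  stringsAsFactors = FALSE)

# n() counts the rows per 1-gram; toString() collapses the
# associated Column3 entries into one comma-separated string
ToyAgg <- Toy %>%
  group_by(feature) %>%
  summarise(Freq = n(), Column3 = toString(Column3))

print(ToyAgg)
```

Note that n() counts rows, that is, the number of documents containing the 1-gram. If a word can occur more than once within a document, summing the frequency column returned by textstat_frequency() would give the total number of occurrences instead.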

Source: https://stackoverflow.com/questions/65740046/dataframe-aggregation-of-n-gram-their-frequency-and-associate-the-entries-of-ot
