Aggregating a data frame by n-gram frequency and associating entries from other columns with each n-gram in R

Question

I am trying to aggregate a data frame by 1-gram frequency (this extends to n-grams by changing n in the code below) and associate entries from other columns with each 1-gram. My current approach is shown below. Is there a shorter or alternative way to produce the table shown at the very end of this question from the data frame given below?

The code and the results are shown below.

The chunk below clears the environment, loads the libraries, and creates the data frame:

# Clear variables in the working environment 
rm(list = ls(all.names = TRUE))
gc()

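# Load libraries
library("quanteda")
# Note: in quanteda >= 3.0, textstat_frequency() is provided by the separate
# quanteda.textstats package; uncomment the next line on a recent version.
# library("quanteda.textstats")
library("data.table")
library("tidyverse")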

# Dataframe
Data <- data.frame(Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302), 
                   Column2 = c(654231, 12347, -2365, 90000, 12897), 
                   Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'), 
                   Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'), 
                   Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect'))

Next, the corpus is built, tokenized, and converted to n-grams (1-grams, i.e. unigrams, in this case):

# Corpus
Content <- corpus(Data, text_field = "Column4")
docnames(Content) <- seq_len(nrow(Data))

# Tokenization and Unigram
Tokens <- tokens(Content, what = "word") %>%  tokens_tolower() %>%  tokens_ngrams(n = 1)

Unigram <- textstat_frequency(dfm(Tokens), groups = docnames(Tokens)) %>% as.data.table()

setnames(Unigram, "group", "rownumber")

The Unigram result is:

The following shows the structure of Unigram:

str(Unigram)

Next, copy Data to DataFrame, add rownumber as a variable in it, and set that variable as numeric:

DataFrame <- Data

DataFrame <- add_column(DataFrame, rownumber = 1:nrow(Data), .before = "Column1")

DataFrame$rownumber <- as.numeric(DataFrame$rownumber)

Then convert Unigram to a data frame (UnigramDataFrame), make its rownumber column numeric, and merge UnigramDataFrame with DataFrame:

UnigramDataFrame <- as.data.frame(Unigram)

UnigramDataFrame$rownumber <- as.numeric(UnigramDataFrame$rownumber)

MergeDF <- dplyr::left_join(UnigramDataFrame,DataFrame,by="rownumber")
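The join step above can be illustrated on a tiny example. This is only a sketch in base R with made-up tables (toy_tokens, toy_docs and toy_merged are hypothetical names, and the key is assumed to start out as text in the per-token table); it shows why the key types are aligned first and how each token row picks up the columns of its source row:

```r
# Hypothetical per-token table (same role as UnigramDataFrame);
# rownumber is assumed to be stored as text here.
toy_tokens <- data.frame(
  feature   = c("i", "bought", "it", "the", "flower"),
  rownumber = c("1", "1", "1", "2", "2"),
  stringsAsFactors = FALSE
)

# Hypothetical per-document table (same role as DataFrame);
# rownumber is numeric.
toy_docs <- data.frame(
  rownumber = c(1, 2),
  Column3   = c("A1", "B2"),
  stringsAsFactors = FALSE
)

# Bring both keys to the same type before joining.
toy_tokens$rownumber <- as.numeric(toy_tokens$rownumber)

# all.x = TRUE keeps every token row, i.e. a left join.
toy_merged <- merge(toy_tokens, toy_docs, by = "rownumber", all.x = TRUE)
print(toy_merged)
```

Every token row is kept, and Column3 is repeated for each token that came from the same source row.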

Finally, aggregate MergeDF by unigram (its first column, feature) to get the frequency of each unigram, and collect the associated Column3 and Column5 entries alongside it:

UnigramAgg <- MergeDF %>% group_by(feature) %>% summarise(Freq=n(), Column3=toString(Column3), Column5=toString(Column5))

Running the above produces the desired result:

The aim is to aggregate the 1-grams with their frequencies and associate Column3 and Column5 with them, as shown above. The last two columns list the Column3 and Column5 entries of the rows in which each 1-gram occurs. For example, all the IDs (identification numbers) associated with each 1-gram can be gathered this way.
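To make the aggregation step concrete, here is a minimal base R sketch on a made-up long table shaped like MergeDF (toy, groups, toy_agg, DocFreq and TotalFreq are hypothetical names, and the values are invented). It also shows a point worth checking: counting rows per unigram, as n() does, gives the number of rows containing the unigram, which differs from the sum of the frequency column whenever a unigram occurs more than once in the same text.

```r
# Hypothetical long table: one row per (document, unigram) pair,
# mimicking the shape of MergeDF. Values are made up for illustration.
toy <- data.frame(
  feature   = c("bought", "it", "bought", "it", "it"),
  frequency = c(1, 1, 1, 2, 1),
  Column3   = c("A1", "A1", "E3", "E3", "C1"),
  stringsAsFactors = FALSE
)

# Split the row indices by unigram, then summarise each group.
groups <- split(seq_len(nrow(toy)), toy$feature)

toy_agg <- data.frame(
  feature   = names(groups),
  # Number of rows per unigram (what n() would count).
  DocFreq   = vapply(groups, length, integer(1)),
  # Total occurrences per unigram (sum of the frequency column).
  TotalFreq = vapply(groups, function(i) sum(toy$frequency[i]), numeric(1)),
  # Comma-separated IDs of the rows each unigram came from.
  Column3   = vapply(groups, function(i) toString(toy$Column3[i]), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(toy_agg)
```

In the example data of this question no unigram repeats within a single text, so both counts would coincide there.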

Source: https://stackoverflow.com/questions/65740046/dataframe-aggregation-of-n-gram-their-frequency-and-associate-the-entries-of-ot

Tags

nlp

aggregate

text-mining

n-gram