R (arules) Convert dataframe into transactions and remove NA

Question

I have a data frame that I want to convert into transactions data for market basket analysis with the arules package in R. I did some research online on converting a data frame to transactions, e.g. (How to prep transaction data into basket for arules) and (Transform csv into transactions for arules), but the result I got was different.

dput(df)

structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
Other = c(NA, NA, NA, NA, "Promo", NA)), 
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

Below is my dataframe structure

Transaction_ID  Fruits  Vegetables  Personal  Drink  Other
      A001        NA        NA       ToothP   Coff    NA
      A002       Apple      NA       ToothP    NA     NA
      A003      Orange      NA         NA     Coff    NA
      A004        NA      Potato     ToothB   Milk    NA
      A005       Pear       NA       ToothB   Milk   Promo
      A006      Grape      Yam         NA     Coff    NA

Class of each column

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
"character"    "character"    "character"    "character"    "character"    "character"

Convert dataframe to transaction data

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

Results I got

[1] {NA,NA,ToothP,Coff,NA}
[2] {Apple,NA,ToothP,NA,NA}
[3] {Orange,NA,NA,Coff,NA}
[4] {NA,Potato,ToothB,Milk,NA}
[5] {Pear,NA,ToothB,Milk,Promo}
[6] {Grape,Yam,NA,Coff,NA}

The data frame was converted to transactions successfully, but is there any way to remove the NA items? If they remain in the transaction list, NA will be treated as an item.

Answer 1:

Ogustari is right. Here is the complete code that also handles the transaction IDs.

library("arules")
library("dplyr")  ### for tbl_df
df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
  Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
  Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
  Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
  Other = c(NA, NA, NA, NA, "Promo", NA)), 
  .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

### remove transaction IDs
tid <- as.character(df[["Transaction_ID"]])
df <- df[,-1]

### make all columns factors
for(i in 1:ncol(df)) df[[i]] <- as.factor(df[[i]])

trans <- as(df, "transactions")

### set transactionIDs
transactionInfo(trans)[["transactionID"]] <- tid

inspect(trans)

    items                                                transactionID
[1] {Personal=ToothP,Drink=Coff}                         A001         
[2] {Fruits=Apple,Personal=ToothP}                       A002         
[3] {Fruits=Orange,Drink=Coff}                           A003         
[4] {Vegetables=Potato,Personal=ToothB,Drink=Milk}       A004         
[5] {Fruits=Pear,Personal=ToothB,Drink=Milk,Other=Promo} A005         
[6] {Fruits=Grape,Vegetables=Yam,Drink=Coff}             A006
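
If plain item names (without the "Column=" prefix) are preferred, one alternative is to build a list with one character vector of non-NA values per row and coerce that list instead. This is only a minimal sketch on a cut-down copy of the data; the names item_cols and baskets are made up for illustration, and it relies on arules using the names of the list as transaction IDs.

```r
library("arules")

# Cut-down copy of the question's data (three rows, two item columns)
df <- data.frame(
  Transaction_ID = c("A001", "A002", "A003"),
  Fruits = c(NA, "Apple", "Orange"),
  Drink = c("Coff", NA, "Coff"),
  stringsAsFactors = FALSE
)

# All columns except the ID column hold items
item_cols <- setdiff(names(df), "Transaction_ID")

# One character vector per row, with the NA entries dropped
baskets <- lapply(seq_len(nrow(df)), function(i) {
  row <- unlist(df[i, item_cols], use.names = FALSE)
  row[!is.na(row)]
})

# The list names are picked up as transaction IDs
names(baskets) <- df$Transaction_ID

trans <- as(baskets, "transactions")
inspect(trans)
```

Note that this drops the information about which column an item came from, so it only makes sense when item names are unique across columns.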

Answer 2:

I can propose this solution, but I do not know if it is the one you are looking for.

dput(df)

df <- data.frame(structure(list(Transaction_ID = as.factor(c("A001", "A002", "A003", "A004", "A005", "A006")), 
               Fruits = as.factor(c(NA, "Apple", "Orange", NA, "Pear", "Grape")), 
               Vegetables = as.factor(c(NA, NA, NA, "Potato", NA, "Yam")), 
               Personal = as.factor(c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA)), 
               Drink = as.factor(c("Coff", NA, "Coff", "Milk", "Milk", "Coff")), 
               Other = as.factor(c(NA, NA, NA, NA, "Promo", NA))), 
          .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
          class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L)))

Class of each column. Note that the classes are all "factor".

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
      "factor"       "factor"       "factor"       "factor"       "factor"       "factor"

Convert data frame to transaction data

data <- as(df, "transactions")
inspect(data)

The result I've got

     items                 transactionID
[1] {Transaction_ID=A001,              
     Personal=ToothP,                  
     Drink=Coff}                      1
[2] {Transaction_ID=A002,              
     Fruits=Apple,                     
     Personal=ToothP}                 2
[3] {Transaction_ID=A003,              
     Fruits=Orange,                    
     Drink=Coff}                      3
[4] {Transaction_ID=A004,              
     Vegetables=Potato,                
     Personal=ToothB,                  
     Drink=Milk}                      4
[5] {Transaction_ID=A005,              
     Fruits=Pear,                      
     Personal=ToothB,                  
     Drink=Milk,                       
     Other=Promo}                     5
[6] {Transaction_ID=A006,              
     Fruits=Grape,                     
     Vegetables=Yam,                   
     Drink=Coff}                      6

I found part of the solution here: convert data frame in r to transaction or an itemMatrix. Moreover, it seems that your command

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

only works for a data.frame containing only two columns.
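
To illustrate the two-column case: split(x, f) groups the values in x by the grouping vector f, so it fits data in "long" format with one row per (transaction ID, item) pair. A minimal sketch, assuming a hypothetical long-format data frame called long with made-up columns id and item:

```r
library("arules")

# Hypothetical long-format data: one row per purchased item
long <- data.frame(
  id = c("A001", "A001", "A002"),
  item = c("ToothP", "Coff", "Apple"),
  stringsAsFactors = FALSE
)

# split() returns a named list: one vector of items per transaction ID
baskets <- split(long$item, long$id)

trans <- as(baskets, "transactions")
inspect(trans)
```

Since long-format data only has rows for items that were actually bought, there are no NA entries to remove in this layout.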

Source: https://stackoverflow.com/questions/45773861/r-arules-convert-dataframe-into-transactions-and-remove-na

Tags

dataframe

transactions

arules