Match Dataframes Excluding Last Non-NA Value and disregarding order

有些话、适合烂在心里 提交于 2019-11-27 03:49:42

问题


I have two dataframes:

Partner<-c("Alpha","Beta","Zeta")
COL1<-c("A","C","M")
COL2<-c("B","D","K")
COL3<-c("C","F",NA)
COL4<-c("D",NA,NA)
df1<-data.frame(Partner,COL1,COL2,COL3,COL4)

lift<-c(9,10,11,12,12,23,12,24)
RULE1<-c("B","B","D","A","C","K","M","K")
RULE2<-c("A","A","C","B","A","M","T","M")
RULE3<-c("G","D","M","C" ,"M", "E",NA,NA)
RULE4<-c(NA,NA,"K","D" ,NA, NA,NA,NA)

df2<-data.frame(lift,RULE1,RULE2,RULE3,RULE4)

df1
Partner    COL1    COL2    COL3    COL4
Alpha         A       B       C       D
Beta          C       D       F      NA
Zeta          M       K      NA      NA

df2
lift    RULE1    RULE2     RULE3    RULE4
   9        B        A         G       NA
  10        B        A         D       NA
  11        D        C         M        K
  12        A        B         C        D
  12        C        A         M       NA
  23        K        M         E       NA
  12        M        T        NA       NA
  24        K        M        NA       NA

This is a market basket analysis. df1 is the customer/partner that bought each of the items listed: A, B, C...etc.

df2 are the recommendations associated with the items bought in the past.

The last value in each of the df2 rows represent the recommendation. So the preceding values in each row from the last non-NA value are the "baskets".

So for example in the first row of df2, it is stating: If B and A are bought together, recommend G.

I want to be able to figure out if each partner from df1 bought ALL the values in each row excluding the final value since that is the recommendation. Then add that recommendation to the end of each row of the new dataframe.

For example: For partner: Alpha, would it be good to recommend value G from the first row? Answer would be yes because they bought all the values from that row in df2 (A and B).

For partner: Beta, it would not be good to recommend value G because not all of the values from the first row of df2 are found in the Beta row.

Final Output:

Partner    COL1    COL2    COL3    COL4    lift   RULE1    RULE2    RULE3    RULE4   Does Last Non-NA Value Exist in Row?
Alpha         A       B       C       D       9       B        A        G       NA                                    No
Alpha         A       B       C       D      10       B        A        D       NA                                   Yes
Alpha         A       B       C       D      12       A        B        C        D                                   Yes
Alpha         A       B       C       D      12       C        A        M       NA                                    No
Zeta          M       K      NA      NA      23       K        M        E       NA                                    No
Zeta          M       K      NA      NA      12       M        T       NA       NA                                    No
Zeta          M       K      NA      NA      24       K        M       NA       NA                                   Yes

Written out results for clarity:

df3

row1 outputs "No" because G is not found in Alpha Partner and all values before G show up in Alpha Partner (B,A)

row2 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (B,A)

row3 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (A,B,C)

row4 outputs "No" because M is not found in Alpha Partner and all values before M show up in Alpha Partner (C,A)

row5 outputs "No" because E is not found in Zeta Partner and all values before E show up in Zeta Partner (K,M)

row6 outputs "No" because T is not found in Zeta Partner and all values before T show up in Zeta Partner (M)

row7 outputs "Yes" because M is found in Zeta Partner and all values before M show up in Zeta Partner (K)

I think that has to be a join or a match of some kind but can't figure out how to do it.

This would be extremely helpful if someone can help me out with this.

Thanks.

This was the attempt:

df1<-cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
d11  <- df1 %>% gather(Col, Value,starts_with("C"))           #Long
d11 <- d11 %>% na.omit() %>%group_by(df1_id) %>% slice(-n()) #remove last non NA
d22  <- df2 %>%  gather(Rule, Value,starts_with("R"))
res <- inner_join(d11,d22)
rm(d22)
rm(d11)
final<-cbind(df1[res$df1_id,],df2[res$df2_id,])
final$Exist <- apply(final, 1, FUN = function(x) 
c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != "" ])!=0) +1])

But this didn't work because it didn't take all of the values into account, only if one of them matched...not all.


回答1:


This is quite tricky because the purchases of n customers have to be compared to a set of m rules. Besides this, there are two points which add to the complexity:

  1. The last non-NA RULE column in df2 is semantically different from the others. Unfortunately, the given data structure doesn't reflect this. So, df2 is missing an explicite recommended column.

  2. Finally, it has to be determined whether a partner already has purchased the recommended item.

The approach below relies on melt(), dcast() and join operations of the data.table package for performance reasons. However, in order to avoid creation of cartesian crossproduct of n * m rows, a loop is used.

EDIT The dcast() has been moved out of the lapply() function.

Prepare data for n:m join

library(data.table)
# convert to data.table and add row numbers
# here, a copy is used insteasd of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]

# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
#   rnp  A  B  C  D  F  K  M
#1:   1  A  B  C  D NA NA NA
#2:   2 NA NA  C  D  F NA NA
#3:   3 NA NA NA NA NA  K  M


# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
#   rnr  A  B  C  D  K  M recommended
#1:   1  A  B NA NA NA NA           G
#2:   2  A  B NA NA NA NA           D
#3:   3 NA NA  C  D NA  M           K
#4:   4  A  B  C NA NA NA           D
#5:   5  A NA  C NA NA NA           M
#6:   6 NA NA NA NA  K  M           E
#7:   7 NA NA NA NA NA  M           T
#8:   8 NA NA NA NA  K NA           M

Combine rules and purchases

combi <- rbindlist(
  # implied loop over rules to find matching purchases for each rule
  lapply(seq_len(nrow(rules)), function(i) {
    # get col names except last col which is the recommendation
    cols <- lr[rnr == i, value[-.N]]
    # join single rule with all partners on relevant cols for this rule
    wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
  })
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)), 
      .SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
      ][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
#   rnp  A  B  C  D  F  K  M rnr recommended already_purchased
#1:   1  A  B  C  D NA NA NA   1           G                No
#2:   1  A  B  C  D NA NA NA   2           D               Yes
#3:   1  A  B  C  D NA NA NA   4           D               Yes
#4:   1  A  B  C  D NA NA NA   5           M                No
#5:   3 NA NA NA NA NA  K  M   6           E                No
#6:   3 NA NA NA NA NA  K  M   7           T                No
#7:   3 NA NA NA NA NA  K  M   8           M               Yes

In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done for each rule separately.

Essentially, we are done now. However, it doesn't look like the desired output.

Final joins

tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
#   Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommend already_purchased
#1:   Alpha    A    B    C    D    9     B     A     G    NA         G                No
#2:   Alpha    A    B    C    D   10     B     A     D    NA         D               Yes
#3:   Alpha    A    B    C    D   12     A     B     C     D         D               Yes
#4:   Alpha    A    B    C    D   12     C     A     M    NA         M                No
#5:    Zeta    M    K   NA   NA   23     K     M     E    NA         E                No
#6:    Zeta    M    K   NA   NA   12     M     T    NA    NA         T                No
#7:    Zeta    M    K   NA   NA   24     K     M    NA    NA         M               Yes


来源:https://stackoverflow.com/questions/43006044/match-dataframes-excluding-last-non-na-value-and-disregarding-order

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!