Question
I have three years of real transaction data for 700 consumers and 400 different products. I am trying to run a sequence analysis with the TraMineR package, following the instructions at http://analyzecore.com/2014/12/04/sequence-carts-in-depth-analysis-with-r/
Unfortunately I have encountered several problems:
- The end date ("to" parameter) of some purchases is the same as the start date of the next one. I worked around it by using only every second order, which worked, but I would like to keep all orders.
- When calling seqformat I got an error: Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : factor level [1058] is duplicated
- I tried to:
  - select only distinct orders - didn't work
  - select only some consumers - didn't work
  - shorten the product names - didn't work
  - select only part of the history - didn't work
It is also worth noting that the data come from a wholesaler, not a retailer, so consumers often buy only specific products repeatedly (for example, one consumer bought product "a12" and nothing else every working day for the past 3 years).
Maybe I should be using different package?
EDIT: Sorry for not providing the data and code. Thanks Gilbert for pointing it out. My sample data for 2 consumers, 8 days:
dftmp <- data.frame(Client = c('k622', 'k622', 'k71', 'k71', 'k71', 'k71'),
                    Date = c(6, 8, 1, 2, 6, 8),
                    Basket = c('a126;a293;a300;a362;a363;a364;a401;a402', 'a204;a301;a303;a364;a402', 'a113;a117;a133;a148;a18;a185;a22;a230;a238;a300;a360;a367;a386;a389;a403;a405', 'a22;a388', 'a194', 'a113;a146;a204;a230;a258;a303;a362;a386;a388;a389;a393;a395;a401;a402;a403;a405'),
                    to = c(7, 8, 1, 5, 7, 8))
The code I am using:
df.form <- seqformat(dftmp, id='Client', begin='Date', end='to', status='Basket',
                     from='SPELL', to='STS', process=FALSE)
df.seq <- seqdef(df.form, left='DEL', right='unknown', xtstep=10, void='unknown')
but I get the error:
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
factor level [9] is duplicated
Following your answer to "Error in levels for seqdef in R", I fixed the problem by changing void='unknown' to with.missing = T, but the outcome is unreadable, as I have 19318 different states.
I now think I should not be using this approach, as I am looking for something like "association rules in time" (if a client has bought product a1, he will probably buy product a2 a week later).
Answer 1:
Glad to see that you fixed the error yourself. Let me just say that seqdef has no with.missing argument, and that simply leaving void at its default value solves the problem.
Let me comment on the analysis you attempt to do.
From your example data, it appears that you attempt to create state sequences where the states would be the consumer baskets. However, from your description, it seems that you are interested in sequences of products. I first explain how you should proceed to have sequences of products, and then comment on the relevance of creating states from transactions.
Since you have transactions (events), your data should be time stamped events (TSE) and not spells. TraMineR can handle TSE data, but expects them as one distinct row (id, time, event) for each event, i.e., for each bought product if you want sequences of products. E.g., for customer k622 you should have 8 rows for date 6 and 5 rows for date 8. Once you have the data in such a TSE format you can create an event sequence object with seqecreate, and then use the different functions for event sequences (seqpcplot, seqefsub, ...).
You cannot create state sequences directly from your TSE data, because unlike simultaneous events, simultaneous states are not allowed. If you want to use the functions for state sequences, you can try to transform your event sequences into state sequences using the TSE_to_STS function from TraMineRextras. There are two issues here that you need to solve. First, you will have to determine what state a set of simultaneous transactions defines. Second, what will be the duration of the state you define? Moreover, the number of products you are considering is excessively high for state sequence analysis with TraMineR, especially for a visual exploration that would require 400 contrasting colors.
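The one-row-per-product (TSE) layout described above can be sketched in base R from the sample data in the question. This is only a sketch under assumptions: the column names client, time, event and id are illustrative and not part of the original data, the integer id is created on the assumption that seqecreate should be given integer ids, and the final call is skipped when TraMineR is not installed.

```r
# Sample data from the question: one row per order, products joined by ";"
dftmp <- data.frame(
  Client = c("k622", "k622", "k71", "k71", "k71", "k71"),
  Date   = c(6, 8, 1, 2, 6, 8),
  Basket = c("a126;a293;a300;a362;a363;a364;a401;a402",
             "a204;a301;a303;a364;a402",
             "a113;a117;a133;a148;a18;a185;a22;a230;a238;a300;a360;a367;a386;a389;a403;a405",
             "a22;a388",
             "a194",
             "a113;a146;a204;a230;a258;a303;a362;a386;a388;a389;a393;a395;a401;a402;a403;a405"),
  stringsAsFactors = FALSE
)

# Split each basket into its individual products
products <- strsplit(dftmp$Basket, ";", fixed = TRUE)
n_items  <- lengths(products)

# One row per (client, time, product): the TSE layout
# (column names here are illustrative)
tse <- data.frame(
  client = rep(dftmp$Client, times = n_items),
  time   = rep(dftmp$Date, times = n_items),
  event  = unlist(products),
  stringsAsFactors = FALSE
)

# Assumption: give seqecreate an integer id and rows sorted by id and time
tse$id <- as.integer(factor(tse$client))
tse <- tse[order(tse$id, tse$time), ]

# Build the event sequence object only if TraMineR is available
if (requireNamespace("TraMineR", quietly = TRUE)) {
  eseq <- TraMineR::seqecreate(id = tse$id,
                               timestamp = tse$time,
                               event = tse$event)
}
```

For customer k622 this yields 8 rows at date 6 and 5 rows at date 8. The resulting object could then be passed to the event sequence functions mentioned above, such as seqefsub.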
In conclusion, you have a very complicated task here and I agree with you that TraMineR is perhaps not the best suited package for what you want to do. At the very least, you should try to drastically aggregate your products into a reasonable number of categories.
Source: https://stackoverflow.com/questions/45555104/r-sequence-analysis-of-consumer-baskets