Sorry about the hopeless title..
I have a dataset that looks like:
|userId|movieId|rating|genre1|genre2|
|1 |13 |3.5 |1 |0 |
|1
Try
library(dplyr)
library(tidyr)
df %>%
  select(-(genre1:genre2)) %>%
  spread(userId, rating, fill = "")
Which gives:
#   movieId   1 2   3   4
# 1       4     3
# 2      13 3.5       4.5
# 3     412 2.5   2.5   5
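In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping, assuming a recent tidyr is installed; note that missing ratings come out as NA here instead of empty strings, so the rating columns stay numeric:

```r
library(dplyr)
library(tidyr)

# Same sample data as in the answer, without the genre columns
df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

# One row per movie, one column per user; unrated cells are NA
wide <- df %>%
  select(userId, movieId, rating) %>%
  pivot_wider(names_from = userId, values_from = rating)
```

If blanks are needed for display, they can be filled in afterwards, at the cost of turning the columns into character.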
Data
df <- structure(list(userId = c(1L, 1L, 2L, 3L, 4L, 4L), movieId = c(13L,
412L, 4L, 412L, 13L, 412L), rating = c(3.5, 2.5, 3, 2.5, 4.5,
5), genre1 = c(1L, 1L, 0L, 1L, 1L, 1L), genre2 = c(0L, 1L, 1L,
1L, 0L, 1L)), .Names = c("userId", "movieId", "rating", "genre1",
"genre2"), class = "data.frame", row.names = c(NA, -6L))
If you have many users and many movies, you could easily run out of memory building a dense matrix. For instance, with 1000 users and 1000 movies you end up with a matrix
of 1M entries, most of which will be missing (since not every user has seen every movie).
If your dataset is big, a sparseMatrix from the Matrix package is the way to go. If both the user and movie ids are sequential (i.e. they start at 1 and end at the number of distinct entries), building it is straightforward. Using @StevenBeaupré's data:
library(Matrix)
# rows are users, columns are movies, values are ratings
mat <- sparseMatrix(i = df$userId, j = df$movieId, x = df$rating)
If the id's are not sequential:
mat <- sparseMatrix(as.integer(factor(df$userId)),
                    as.integer(factor(df$movieId)),
                    x = df$rating)
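Once the ids are recoded through factor(), row and column positions no longer match the original ids. One way to keep the mapping is to store the factor levels as dimnames. This is only a sketch; the names users and movies are introduced here for illustration:

```r
library(Matrix)

df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

# Recode ids to 1..n and remember the original labels
users  <- factor(df$userId)
movies <- factor(df$movieId)

mat <- sparseMatrix(
  i = as.integer(users),
  j = as.integer(movies),
  x = df$rating,
  dims = c(nlevels(users), nlevels(movies)),
  dimnames = list(levels(users), levels(movies))
)

# Look up a rating by the original ids (as character labels)
mat["4", "412"]
```

The dimnames are character vectors, so lookups use "4" and "412" and not the integer ids.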
You can basically perform any matrix operation on a sparseMatrix too.
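As an illustration, here is a small sketch that computes the mean rating per user directly on the sparse matrix. It assumes that a stored rating is never exactly 0, because a zero cannot be told apart from a missing entry here:

```r
library(Matrix)

df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

mat <- sparseMatrix(i = df$userId, j = df$movieId, x = df$rating)

# Sum of ratings per user and number of rated movies per user
rating_sum   <- rowSums(mat)
rating_count <- rowSums(mat != 0)

# Mean over rated movies only (missing entries are not counted)
mean_rating <- rating_sum / rating_count
```

Both rowSums() calls work on the sparse representation, so the full users x movies matrix is never materialised.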