How to optimize this process?

Posted by 夙愿已清 on 2019-12-11 14:57:27

Question


I have a somewhat broad question, but I will try to make my intent as clear as possible so that people can make suggestions. I am trying to optimize a process. In general, I feed a function a data frame of values and generate a prediction from operations on specific columns; it is a custom function applied with sapply (code below). What I am doing is much too large to provide a meaningful reproducible example, so instead I will describe the inputs to the process. I know this will restrict how helpful answers can be, but I am interested in any ideas for reducing the time it takes to compute a prediction. Currently it takes about 10 seconds to generate one prediction (that is, to run the sapply for one row of the data frame).

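mean_rating <- function(df) {
  user  <- df$user
  movie <- df$movie

  # Translate the user id / movie id into a row / column index of dfm
  u_row <- which(U_lookup == user)[1]
  m_row <- which(M_lookup == movie)[1]

  # Row indices (into dfm) of the 100 nearest neighbours of this user
  knn_match  <- knn_txt[u_row, 1:100]
  knn_match1 <- as.numeric(unlist(knn_match))

  # Ratings of those neighbours, restricted to the column of the query movie
  dfm_test <- dfm[knn_match1, ]
  dfm_mov  <- dfm_test[, m_row]

  # The prediction is the mean of the neighbours' ratings for that movie
  mean(dfm_mov)
}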

test<-sapply(1:nrow(probe_test),function(x) mean_rating(probe_test[x,]))

Inputs: dfm is my main data matrix, with users in the rows and movies in the columns. It is very sparse.

> str(dfm)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:99072112] 378 1137 1755 1893 2359 3156 3423 4380 5103 6762 ...
  ..@ j       : int [1:99072112] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 480189 17770
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:99072112] 4 5 4 1 4 5 4 5 3 3 ...
  ..@ factors : list()
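One thing that may be worth checking, given that str(dfm) reports a dgTMatrix (triplet storage): compressed column storage is generally better suited to pulling out a single column than triplet form, so converting once up front might reduce the cost of each dfm[rows, col] lookup. The sketch below uses a tiny made-up matrix (dfm_t, dfm_c and nbrs are illustrative names, not objects from my session) and only shows that the conversion keeps the result unchanged; whether it actually helps at the real size is untested.

```r
library(Matrix)

# Tiny stand-in for dfm: 4 users x 3 movies in triplet form.
# The i and j slots are 0-based, as in the str(dfm) output above.
dfm_t <- new("dgTMatrix",
             i   = c(0L, 1L, 3L, 2L),
             j   = c(0L, 0L, 0L, 2L),
             x   = c(4, 5, 3, 1),
             Dim = c(4L, 3L))

# One-time conversion to compressed sparse column storage.
dfm_c <- as(dfm_t, "CsparseMatrix")

# Hypothetical neighbour rows for one user, and the query movie in column 1.
nbrs <- c(1L, 2L, 4L)

rating_t <- mean(dfm_t[nbrs, 1])  # same indexing as in mean_rating()
rating_c <- mean(dfm_c[nbrs, 1])  # identical call on the converted matrix

print(c(triplet = rating_t, csc = rating_c))
```

Timing both forms with system.time() on a realistic subset would show whether the conversion is worth it.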

probe_test is my test set, the set I am trying to predict for. The actual probe test contains approximately 1.4 million rows, but I am running it on a small subset first while I work on the run time. Each row is fed into my function.

> str(probe_test)
'data.frame':   6 obs. of  6 variables:
 $ X          : int  1 2 3 4 5 6
 $ movie      : int  1 1 1 1 1 1
 $ user       : int  1027056 1059319 1149588 1283744 1394012 1406595
 $ Rating     : int  3 3 4 3 5 4
 $ Rating_Date: Factor w/ 1929 levels "2000-01-06","2000-01-08",..: 1901 1847 1911 1312 1917 1803
 $ Indicator  : int  1 1 1 1 1 1

U_lookup is the lookup I use to convert between a user ID and the row of the matrix that user is in, since the user IDs are lost when the data is converted to a sparse matrix.

> str(U_lookup)
'data.frame':   480189 obs. of  1 variable:
 $ x: int  10 100000 1000004 1000027 1000033 1000035 1000038 1000051 1000053 1000057 ...

M_lookup is the lookup I use to convert between a movie ID and the column of the matrix that movie is in, for the same reason as above.

> str(M_lookup)
'data.frame':   17770 obs. of  1 variable:
 $ x: int  1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
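Both lookups could probably be resolved for every row of probe_test in one call with base R's match(), which returns the position of the first match, the same value that which(lookup == id)[1] gives for a single id. A minimal sketch on made-up toy data (the small U_lookup, M_lookup and probe_test below are stand-ins with invented values, not the real objects):

```r
# Toy stand-ins for the real objects (values are invented for illustration).
U_lookup   <- data.frame(x = c(10L, 100000L, 1000004L, 1000027L))
M_lookup   <- data.frame(x = c(1L, 10L, 100L))
probe_test <- data.frame(movie = c(1L, 100L, 10L),
                         user  = c(1000004L, 10L, 1000027L))

# Vectorised lookup: one call per lookup table covers all rows of probe_test.
u_rows <- match(probe_test$user,  U_lookup$x)
m_rows <- match(probe_test$movie, M_lookup$x)

# Per-row equivalent, as done inside mean_rating(), for comparison.
u_rows_loop <- sapply(probe_test$user, function(u) which(U_lookup == u)[1])

print(u_rows)
print(m_rows)
print(identical(u_rows, u_rows_loop))
```

With the indices precomputed this way, the per-row function would no longer need to scan all 480189 user ids for each prediction.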

knn_txt contains the 100 nearest neighbors for each row of dfm.

> str(knn_txt)
'data.frame':   480189 obs. of  200 variables:

Thank you for any advice you can provide to me.

Source: https://stackoverflow.com/questions/51370465/how-to-optimize-this-process
