plyr

split apply recombine, plyr, data.table in R

戏子无情 · Submitted on 2019-12-30 01:23:04

Question: I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The "apply" step runs a regression for each firm and returns the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?

Sample data:

```
dte,        id, val1, val2
2001-10-02,  1,   10,   25
2001-10-03,  1,   11,   24
2001-10-04,  1,   12,   23
2001-10-02,  2,   13,   22
```
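A data.table sketch of the non-aggregating apply step (assuming the data.table package is installed and that a val1 ~ val2 regression per firm is wanted; column names are taken from the sample data):

```r
library(data.table)

dt <- data.table(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04", "2001-10-02")),
  id   = c(1, 1, 1, 2),
  val1 = c(10, 11, 12, 13),
  val2 = c(25, 24, 23, 22)
)

# One lm() per firm; resid() returns one value per row of the group,
# so the result has as many rows as the input (no aggregation).
res <- dt[, .(dte, resid = resid(lm(val1 ~ val2))), by = id]
```

Because the j expression is evaluated per group inside data.table's grouping machinery rather than by building a new data frame for every firm, this typically scales much better than ddply when the number of firms is large.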

How to fill NA with median?

让人想犯罪 __ · Submitted on 2019-12-29 04:01:28

Question: Example data:

```r
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))  # 12 NAs so all columns have length 72
head(df)
#   years months      value
# 1  2005      1 -0.6264538
# 2  2005      2  0.1836433
# 3  2005      3 -0.8356286
# 4  2005      4  1.5952808
# 5  2005      5  0.3295078
# 6  2005      6 -0.8204684
```

Please tell me how I can replace the NA values in df$value with the median of the other months. "value" must contain the median of all previous values for the same month. That is, if the current month is May, "value" must …
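A base-R sketch, assuming "median of all previous values for the same month" means: for each NA, the median of that calendar month's non-missing values from earlier years:

```r
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# Fill each NA with the median of the same month in earlier years.
for (i in which(is.na(df$value))) {
  prev <- df$value[df$months == df$months[i] & df$years < df$years[i]]
  df$value[i] <- median(prev, na.rm = TRUE)
}
```

Since the NAs here are all in the last year, the loop only ever looks backwards at already-observed values; if NAs could occur mid-series, the na.rm = TRUE keeps the median well defined.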

Split Data Frame into Rows of Fixed Size

那年仲夏 · Submitted on 2019-12-28 23:37:52

Question: I have a bunch of data frames of varying lengths, ranging from approx. 15,000 to 500,000 rows. I would like to split each of them into smaller data frames of 300 rows each, for further processing. How can I do this? This question (Split up a dataframe by number of rows) provides a partial answer, but it doesn't work here because not all my data frames have lengths that are multiples of 300. I would greatly appreciate it if both a plyr and a non-plyr solution could …
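A base-R sketch that handles lengths that are not multiples of 300: ceiling(seq_len(n) / 300) assigns rows 1–300 to group 1, rows 301–600 to group 2, and so on, with any leftover rows forming a final, shorter chunk:

```r
split_rows <- function(df, chunk_size = 300) {
  # split() groups rows by the chunk index computed from the row number
  split(df, ceiling(seq_len(nrow(df)) / chunk_size))
}

df <- data.frame(x = seq_len(1000))
chunks <- split_rows(df)
length(chunks)      # 4
nrow(chunks[[4]])   # 100 (the leftover chunk)
```

The result is a named list of data frames, which can then be fed to lapply() (or plyr's llply) for the further processing step.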

doMC vs doSNOW vs doSMP vs doMPI: why aren't the various parallel backends for 'foreach' functionally equivalent?

余生长醉 · Submitted on 2019-12-28 05:16:26

Question: I've got a few test pieces of code that I've been running on various machines, always with the same results. I thought the philosophy behind the various do... packages was that they could be used interchangeably as a backend for foreach's %dopar%. Why is this not the case? For example, this code snippet works:

```r
library(plyr)
library(doMC)
registerDoMC()

x <- data.frame(V = c("X", "Y", "X", "Y", "Z"), Z = 1:5)
ddply(x, .(V), function(df) sum(df$Z), .parallel = TRUE)
```

While each of these code …
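The backends differ in how workers are created: doMC forks the current session (Unix only), so workers inherit every loaded package and object, while doSNOW and doMPI launch fresh worker processes that inherit nothing. A hedged sketch of making the same ddply call work under doSNOW, using plyr's .paropts argument (available in recent plyr versions) to export packages to the workers:

```r
library(plyr)
library(doSNOW)

cl <- makeCluster(2)   # fresh worker processes, nothing inherited
registerDoSNOW(cl)

x <- data.frame(V = c("X", "Y", "X", "Y", "Z"), Z = 1:5)

# Packages (and objects) used inside the loop must be shipped explicitly;
# under doMC this step is unnecessary because forked workers inherit them.
res <- ddply(x, .(V), function(df) sum(df$Z),
             .parallel = TRUE,
             .paropts = list(.packages = "plyr"))

stopCluster(cl)
```

This is why the backends are not functionally interchangeable: the code inside %dopar% must be self-sufficient on process-based backends, while fork-based backends paper over that requirement.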

Returning first row of group

六眼飞鱼酱① · Submitted on 2019-12-28 02:06:34

Question: I have a data frame consisting of an ID (the same for each element in a group), two datetimes, and the time interval between them. One of the datetime columns is my relevant time marker. Now I would like to get a subset of the data frame consisting of the earliest entry for each group. The entries (especially the time interval) need to stay untouched. My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry …
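A base-R sketch (the column names ID, start, and interval_s are hypothetical stand-ins for the question's columns): sort by ID and the relevant datetime, then keep the first row per ID with !duplicated(), which leaves every other column untouched:

```r
df <- data.frame(
  ID         = c("a", "a", "b", "b"),
  start      = as.POSIXct(c("2019-01-02 10:00", "2019-01-01 09:00",
                            "2019-01-03 08:00", "2019-01-04 07:00")),
  interval_s = c(60, 120, 30, 45)
)

df <- df[order(df$ID, df$start), ]     # sort by group, then by datetime
earliest <- df[!duplicated(df$ID), ]   # first (= earliest) row per ID
```

duplicated() marks every repeat of an ID after its first occurrence, so negating it selects exactly one row per group with all columns intact.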

Why is plyr so slow?

ⅰ亾dé卋堺 · Submitted on 2019-12-27 17:05:17

Question: I think I am using plyr incorrectly. Could someone please tell me if this is 'efficient' plyr code?

```r
require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss = sum(volume))
```

A little context: I have a few large aggregation problems and I noticed that they were each taking some time. Trying to solve the issues, I became interested in the performance of various aggregation procedures in R. I tested a few aggregation methods and found myself waiting around all day. When I finally …
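For comparison, a data.table sketch of the same aggregation (assuming the data.table package is installed). Much of plyr's cost here comes from materialising a full data frame for every price group; data.table's grouped j expression avoids that per-group overhead:

```r
library(data.table)

set.seed(1)
dd <- data.table(price  = sample(1:100, 1e5, replace = TRUE),
                 volume = runif(1e5))

# Equivalent of ddply(dd, .(price), summarise, ss = sum(volume))
res <- dd[, .(ss = sum(volume)), by = price]
```

On inputs of this size the data.table version usually completes in a fraction of the ddply time, which is consistent with the slowness described above.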

Why are my dplyr group_by & summarize not working properly? (name-collision with plyr)

旧城冷巷雨未停 · Submitted on 2019-12-27 10:37:42

Question: I have a data frame that looks like this:

```
# df
ID DRUG FED AUC0t Tmax Cmax
 1    1   0   100    5   20
 2    1   1   200    6   25
 3    0   1    NA    2   30
 4    0   0   150    6   65
```

and so on. I want to compute summary statistics on AUC0t, Tmax and Cmax by drug (DRUG) and fed status (FED). I use dplyr. For example, for the AUC:

```r
CI90lo <- function(x) quantile(x, probs = 0.05, na.rm = TRUE)
CI90hi <- function(x) quantile(x, probs = 0.95, na.rm = TRUE)

summary <- df %>%
  group_by(DRUG, FED) %>%
  summarize(mean = mean(AUC0t, na.rm = TRUE),
            low  = CI90lo(AUC0t),
            high …
```
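As the title hints, the usual cause is a name collision: if plyr is attached after dplyr, plyr::summarise masks dplyr::summarise, and plyr's version ignores group_by(), collapsing everything to a single row. A sketch of the explicit-namespace fix (the data values are a small stand-in for the question's df):

```r
library(plyr)    # attached first here, so dplyr's verbs win below
library(dplyr)

df <- data.frame(ID = 1:4, DRUG = c(1, 1, 0, 0), FED = c(0, 1, 1, 0),
                 AUC0t = c(100, 200, NA, 150))

# dplyr:: makes the grouped version unambiguous even when plyr masks it
out <- df %>%
  dplyr::group_by(DRUG, FED) %>%
  dplyr::summarise(mean = mean(AUC0t, na.rm = TRUE))
```

Loading plyr before dplyr (or never attaching plyr and calling plyr::ddply directly) avoids the masking entirely.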

Assigning Label based on quantile for every sub group

丶灬走出姿态 · Submitted on 2019-12-25 08:48:49

Question: My data.frame looks like this:

```
Region Store Sales
A      1     ***
A      2     ***
B      1     ***
B      2     ****
```

I want to label each store based on its sales performance: if a store's Sales is higher than the 75% quantile, assign "High", otherwise "Low". Applying ddply with

```r
R3 <- ddply(dat, .(REGION), function(x) quantile(x$Sales, na.rm = TRUE))
```

returns a data frame with all the quantile values for each region. I can join that frame with the original and do an if-else for each cluster, but I am sure it's not an efficient …
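A base-R sketch that skips the join: ave() computes the per-region 75% quantile and broadcasts it back onto every row, so the label becomes a single vectorised comparison (the Sales values are synthetic stand-ins for the question's ***):

```r
set.seed(42)
dat <- data.frame(Region = rep(c("A", "B"), each = 4),
                  Store  = rep(1:4, 2),
                  Sales  = runif(8, 100, 1000))

# Per-region 75% quantile, repeated for each row of that region:
q75 <- ave(dat$Sales, dat$Region,
           FUN = function(x) quantile(x, 0.75, na.rm = TRUE))
dat$Label <- ifelse(dat$Sales > q75, "High", "Low")
```

The same idea works in ddply via transform (one mutate-like call per region) rather than returning the quantile table and joining.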

ddply and ggplot - generating no plot [duplicate]

假如想象 · Submitted on 2019-12-25 06:43:51

Question: This question already has an answer here: ggplot's qplot does not execute on sourcing (1 answer). Closed 6 years ago.

I have data, given below:

```r
> dput(qq)
structure(list(SIC = c(50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
50, 50, 50, 50, 50, 50, 50, 50, 50, 50), AVGAT = c(380.251,
391.3885, 421.72, 431.83, 483.715, 600.0715, 698.5945, 733.814,
721.426, 706.0265, 698.41, 697.9565, 720.761, 855.5245, 1023.226,
1214.8215, 1369.7605, 1439.2765, 1602.3845, 1949.69), ADA = c(0 …
```
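The linked duplicate's answer, in sketch form: a ggplot object is only drawn when it is printed, and autoprinting happens only at the top level of an interactive session. Inside source(), a for loop, or a function body, the plot must be print()ed explicitly:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# In a sourced script or loop, `p` on a line by itself draws nothing;
# the object has to be printed explicitly to render the plot:
print(p)
```

The same applies to a plot built inside a ddply/loop pipeline: assigning or returning the ggplot object silently produces no output until print() is called on it.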

Error: only defined on a data frame with all numeric variables with ddply on large dataset

冷暖自知 · Submitted on 2019-12-25 03:53:56

Question: I'm trying to calculate sums and means on a very large dataset (~22,000 records) for several parameters (e.g. Er_Count, Mn_Count) by month, year, survey ID and grid ID. I tried this code initially to get overall sums:

```r
dlply(Effort_All, c("Er_Count", "Mn_Count", "Bp_Count"), sum)
```

and received the following error:

```
Error: only defined on a data frame with all numeric variables
```

Since I cannot even get overall sums, I am unable to get statistics by the specific variables either. Do I need to split the …
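The error arises because dlply here splits by the count columns themselves and then calls sum() on each resulting data-frame piece, which contains non-numeric columns. A sketch of the intended grouped sums and means using ddply with summarise (the grouping column names Month, Year, SurveyID and GridID are assumptions, and the data frame is a small synthetic stand-in):

```r
library(plyr)

set.seed(1)
Effort_All <- data.frame(          # synthetic stand-in; the real data has ~22,000 rows
  Month = rep(1:2, each = 4), Year = 2019,
  SurveyID = rep(c("S1", "S2"), 4), GridID = "G1",
  Er_Count = rpois(8, 3), Mn_Count = rpois(8, 1)
)

# Group by the identifier columns; summarise the count columns.
stats <- ddply(Effort_All, .(Month, Year, SurveyID, GridID), summarise,
               Er_sum  = sum(Er_Count), Er_mean = mean(Er_Count),
               Mn_sum  = sum(Mn_Count), Mn_mean = mean(Mn_Count))
```

The key change is that the grouping variables go in the .( ) split specification, while the numeric parameters appear only inside summarise.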