split-apply-combine

Applying multiple functions to each column in a data frame using aggregate

Submitted by ε祈祈猫儿з on 2019-12-06 01:55:22
When I need to apply multiple functions to multiple columns, aggregate by several grouping columns, and have the results bound into a data frame, I usually use aggregate() in the following manner:

    # bogus functions
    foo1 <- function(x){mean(x)*var(x)}
    foo2 <- function(x){mean(x)/var(x)}
    # for illustration purposes only
    npk$block <- as.numeric(npk$block)
    subdf <- aggregate(npk[,c("yield", "block")],
                       by = list(N = npk$N, P = npk$P),
                       FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})

Getting the results into a nicely ordered data frame is then achieved with:

    df <- do.call(data.frame, subdf)
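
For comparison, a minimal sketch of the same computation with dplyr; this is an assumed alternative, not part of the question above, and it requires dplyr >= 1.0 for across(). summarise() plus across() applies both functions to each selected column and builds the combined column names automatically:

    library(dplyr)  # assumption: dplyr >= 1.0 for across()

    npk %>%
      mutate(block = as.numeric(block)) %>%
      group_by(N, P) %>%
      summarise(across(c(yield, block),
                       list(col1 = ~ mean(.x) * var(.x),
                            col2 = ~ mean(.x) / var(.x))),
                .groups = "drop")

This yields ordinary columns (yield_col1, yield_col2, block_col1, block_col2) directly, so no do.call(data.frame, ...) flattening step is needed.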

Adding rows in `dplyr` output

Submitted by 大兔子大兔子 on 2019-12-04 05:30:42
In traditional plyr, returned rows are added automagically to the output even if they exceed the number of input rows for that grouping:

    set.seed(1)
    dat <- data.frame(x = runif(10), g = rep(letters[1:5], each = 2))

    > ddply(dat, .(g), function(df) df[c(1,1,1,2),])
                x g
    1  0.26550866 a
    2  0.26550866 a
    3  0.26550866 a
    4  0.37212390 a
    5  0.57285336 b
    6  0.57285336 b
    7  0.57285336 b
    8  0.90820779 b
    9  0.20168193 c
    10 0.20168193 c
    11 0.20168193 c
    12 0.89838968 c
    13 0.94467527 d
    14 0.94467527 d
    15 0.94467527 d
    16 0.66079779 d
    17 0.62911404 e
    18 0.62911404 e
    19 0.62911404 e
    20 0.06178627 e

I cannot figure out how to do the same thing with dplyr.
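
A hedged sketch of two dplyr idioms that reproduce this behaviour; slice() accepts repeated per-group row indices, and reframe() (an assumption beyond the question: it needs dplyr >= 1.1) explicitly allows results longer than the input group:

    library(dplyr)

    set.seed(1)
    dat <- data.frame(x = runif(10), g = rep(letters[1:5], each = 2))

    # slice() with repeated indices: four output rows per two-row group
    dat %>% group_by(g) %>% slice(c(1, 1, 1, 2)) %>% ungroup()

    # reframe() may return any number of rows per group
    dat %>% reframe(x = x[c(1, 1, 1, 2)], .by = g)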

Efficient conditional summing by multiple conditions in R

Submitted by 孤街醉人 on 2019-12-02 12:02:17
I'm struggling to find an efficient solution to the following problem: I have a large, heavily manipulated data frame with around 8 columns and 80,000 rows that includes multiple data types. I want to create a new data frame containing the sum of one column whenever conditions on the large data frame are met. Imagine the head of the original data frame looks like this; the column years.raw indicates that the company measured data for x years:

    > cbind(company.raw, years.raw, source, amount.inkg)
         company.raw years.raw source      amount.inkg
    [1,] "C1"        "1"       "Ink"       "5"
    [2,] "C1"        "1"       "Recycling" "2"
    [3
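
A hedged sketch of one efficient approach, assuming the columns above live in a proper data frame with amount.inkg converted to numeric (the cbind() above produces a character matrix); a grouped sum in a single pass avoids looping over 80,000 rows. The data frame below is a hypothetical reconstruction of the head shown:

    # hypothetical reconstruction of the rows shown above
    df <- data.frame(
      company.raw = c("C1", "C1"),
      years.raw   = c(1, 1),
      source      = c("Ink", "Recycling"),
      amount.inkg = c(5, 2)
    )

    # sum amount.inkg for every company/source combination in one pass
    aggregate(amount.inkg ~ company.raw + source, data = df, FUN = sum)

    # a single fixed condition: total Ink amount for company C1
    sum(df$amount.inkg[df$company.raw == "C1" & df$source == "Ink"])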

Find top deciles from dataframe by group

Submitted by 只谈情不闲聊 on 2019-12-02 05:32:41
I am attempting to create new variables using a function and lapply rather than working directly in the data with loops. I used to use Stata and would have solved this problem with a method similar to the one discussed here. Since naming variables programmatically is difficult, or at least awkward, in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging, and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some
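
A hedged sketch of computing a top-decile indicator by group with no loops and no post-hoc renaming; base R's ave() applies quantile() within each group and returns a vector aligned with the original rows. The data frame and column names (g, x) are hypothetical:

    set.seed(1)
    df <- data.frame(g = rep(c("a", "b"), each = 50), x = rnorm(100))

    # 90th-percentile cutoff computed within each group, recycled to row level
    df$cut90 <- ave(df$x, df$g, FUN = function(v) quantile(v, 0.9))
    df$top10 <- df$x >= df$cut90

    head(df[df$top10, ])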

Use dplyr's group_by to perform split-apply-combine

Submitted by 半世苍凉 on 2019-11-29 11:16:53
I am trying to use dplyr to do the following:

    tapply(iris$Petal.Length, iris$Species, shapiro.test)

I want to split Petal.Length by Species and apply a function, in this case shapiro.test. I read this SO question and quite a number of other pages. I am sort of able to split the variable into groups using do:

    iris %>% group_by(Species) %>% select(Petal.Length) %>% do(print(.$Petal.Length))
     [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2
    [16] 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6
    [31] 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9
    [46]
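
A hedged sketch of two ways to run shapiro.test per group with dplyr; do() keeps each htest object in a list-column, while summarise() can extract just the p-value. Only the grouping pattern comes from the question; the rest is an assumed idiom:

    library(dplyr)

    # list-column holding one htest object per species
    iris %>%
      group_by(Species) %>%
      do(test = shapiro.test(.$Petal.Length))

    # or stay tabular: pull out only the p-value per group
    iris %>%
      group_by(Species) %>%
      summarise(p.value = shapiro.test(Petal.Length)$p.value)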

python pandas, DF.groupby().agg(), column reference in agg()

Submitted by 南笙酒味 on 2019-11-28 03:20:00
On a concrete problem, say I have a DataFrame DF:

      word tag  count
    0    a   S     30
    1  the   S     20
    2    a   T     60
    3   an   T      5
    4  the   T     10

I want to find, for every "word", the "tag" that has the most "count". So the return would be something like:

      word tag  count
    1  the   S     20
    2    a   T     60
    3   an   T      5

I don't care about the count column or whether the order/index is original or messed up. Returning a dictionary {'the': 'S', ...} is just fine. I was hoping I could do

    DF.groupby(['word']).agg(lambda x: x['tag'][x['count'].argmax()])

but it doesn't work; I can't access column information. More abstractly, what does the function in agg() see as its argument?
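
A hedged sketch of a standard pandas idiom for this; agg() hands the aggregating function one column (a Series) at a time, which is why the lambda above cannot see 'tag', whereas idxmax() on the grouped count column returns row labels that index the whole frame:

    import pandas as pd

    DF = pd.DataFrame({"word":  ["a", "the", "a", "an", "the"],
                       "tag":   ["S", "S", "T", "T", "T"],
                       "count": [30, 20, 60, 5, 10]})

    # row label of the maximum count within each word group
    best = DF.loc[DF.groupby("word")["count"].idxmax()]

    # or as a dictionary {'the': 'S', ...}
    tag_by_word = best.set_index("word")["tag"].to_dict()
    print(tag_by_word)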

ddply + summarize for repeating same statistical function across large number of columns

Submitted by 北城以北 on 2019-11-28 03:07:04
OK, second R question in quick succession. My data:

                Timestamp  St_01  St_02 ...
    1 2008-02-08 00:00:00 26.020 25.840 ...
    2 2008-02-08 00:10:00 25.985 25.790 ...
    3 2008-02-08 00:20:00 25.930 25.765 ...
    4 2008-02-08 00:30:00 25.925 25.730 ...
    5 2008-02-08 00:40:00 25.975 25.695 ...
    ...

Normally I would use a combination of ddply and summarize to calculate ensembles (e.g. the mean for every hour across the whole year). For the data above, I would create a category such as hour (e.g. strptime(data$Timestamp,"%H") -> data$hour) and then use that category in ddply, like:

    ddply(data, "hour", summarize,
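
A hedged sketch of one standard plyr approach: numcolwise() lifts a statistic so it is applied to every numeric column at once, so no station column has to be named by hand. The toy data below is hypothetical, following the layout shown above:

    library(plyr)

    # toy data in the layout shown above (values hypothetical)
    data <- data.frame(
      Timestamp = seq(as.POSIXct("2008-02-08 00:00:00"),
                      by = "10 min", length.out = 12),
      St_01 = runif(12, 25, 26),
      St_02 = runif(12, 25, 26)
    )
    data$hour <- format(data$Timestamp, "%H")

    # numcolwise(mean) turns mean() into a function over all numeric columns
    hourly <- ddply(data, "hour", numcolwise(mean))
    head(hourly)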
