split-apply-combine

Applying multiple functions to each column in a data frame using aggregate

Submitted by ε祈祈猫儿з on 2019-12-06 01:55:22
When I need to apply multiple functions to multiple columns, aggregate by several grouping columns, and have the results bound into a data frame, I usually use aggregate() in the following manner:

    # bogus functions
    foo1 <- function(x){mean(x)*var(x)}
    foo2 <- function(x){mean(x)/var(x)}
    # for illustration purposes only
    npk$block <- as.numeric(npk$block)
    subdf <- aggregate(npk[,c("yield", "block")],
                       by = list(N = npk$N, P = npk$P),
                       FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})

Getting the results into a nicely ordered data frame is then achieved with:

    df <- do.call(data.frame, subdf)
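
For comparison, a minimal sketch of the same computation with dplyr; this is an assumed alternative, not part of the question above, and it requires dplyr >= 1.0 for across(). summarise() plus across() applies both functions to each selected column and builds the combined column names automatically:

    library(dplyr)  # assumption: dplyr >= 1.0 for across()

    npk %>%
      mutate(block = as.numeric(block)) %>%
      group_by(N, P) %>%
      summarise(across(c(yield, block),
                       list(col1 = ~ mean(.x) * var(.x),
                            col2 = ~ mean(.x) / var(.x))),
                .groups = "drop")

This yields ordinary columns (yield_col1, yield_col2, block_col1, block_col2) directly, so no do.call(data.frame, ...) flattening step is needed.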

Adding rows in `dplyr` output

Submitted by 大兔子大兔子 on 2019-12-04 05:30:42
In traditional plyr, returned rows are added automagically to the output even if they exceed the number of input rows for that grouping:

    set.seed(1)
    dat <- data.frame(x = runif(10), g = rep(letters[1:5], each = 2))

    > ddply(dat, .(g), function(df) df[c(1,1,1,2),])
                x g
    1  0.26550866 a
    2  0.26550866 a
    3  0.26550866 a
    4  0.37212390 a
    5  0.57285336 b
    6  0.57285336 b
    7  0.57285336 b
    8  0.90820779 b
    9  0.20168193 c
    10 0.20168193 c
    11 0.20168193 c
    12 0.89838968 c
    13 0.94467527 d
    14 0.94467527 d
    15 0.94467527 d
    16 0.66079779 d
    17 0.62911404 e
    18 0.62911404 e
    19 0.62911404 e
    20 0.06178627 e

I cannot figure out how to do the same thing with dplyr.
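
A hedged sketch of two dplyr idioms that reproduce this behaviour; slice() accepts repeated per-group row indices, and reframe() (an assumption beyond the question: it needs dplyr >= 1.1) explicitly allows results longer than the input group:

    library(dplyr)

    set.seed(1)
    dat <- data.frame(x = runif(10), g = rep(letters[1:5], each = 2))

    # slice() with repeated indices: four output rows per two-row group
    dat %>% group_by(g) %>% slice(c(1, 1, 1, 2)) %>% ungroup()

    # reframe() may return any number of rows per group
    dat %>% reframe(x = x[c(1, 1, 1, 2)], .by = g)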

Efficient conditional summing by multiple conditions in R

Submitted by 孤街醉人 on 2019-12-02 12:02:17
I'm struggling to find an efficient solution to the following problem: I have a large, heavily manipulated data frame with around 8 columns and 80,000 rows that includes multiple data types. I want to create a new data frame containing the sum of one column whenever conditions on the large data frame are met. Imagine the head of the original data frame looks like this; the column years.raw indicates that the company measured data for x years:

    > cbind(company.raw, years.raw, source, amount.inkg)
         company.raw years.raw source      amount.inkg
    [1,] "C1"        "1"       "Ink"       "5"
    [2,] "C1"        "1"       "Recycling" "2"
    [3
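
A hedged sketch of one efficient approach, assuming the columns above live in a proper data frame with amount.inkg converted to numeric (the cbind() above produces a character matrix); a grouped sum in a single pass avoids looping over 80,000 rows. The data frame below is a hypothetical reconstruction of the head shown:

    # hypothetical reconstruction of the rows shown above
    df <- data.frame(
      company.raw = c("C1", "C1"),
      years.raw   = c(1, 1),
      source      = c("Ink", "Recycling"),
      amount.inkg = c(5, 2)
    )

    # sum amount.inkg for every company/source combination in one pass
    aggregate(amount.inkg ~ company.raw + source, data = df, FUN = sum)

    # a single fixed condition: total Ink amount for company C1
    sum(df$amount.inkg[df$company.raw == "C1" & df$source == "Ink"])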

Find top deciles from dataframe by group

Submitted by 只谈情不闲聊 on 2019-12-02 05:32:41
I am attempting to create new variables using a function and lapply rather than working directly in the data with loops. I used to use Stata and would have solved this problem with a method similar to the one discussed here. Since naming variables programmatically is difficult, or at least awkward, in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging, and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some
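
A hedged sketch of computing a top-decile indicator by group with no loops and no post-hoc renaming; base R's ave() applies quantile() within each group and returns a vector aligned with the original rows. The data frame and column names (g, x) are hypothetical:

    set.seed(1)
    df <- data.frame(g = rep(c("a", "b"), each = 50), x = rnorm(100))

    # 90th-percentile cutoff computed within each group, recycled to row level
    df$cut90 <- ave(df$x, df$g, FUN = function(v) quantile(v, 0.9))
    df$top10 <- df$x >= df$cut90

    head(df[df$top10, ])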

Use dplyr's group_by to perform split-apply-combine

Submitted by 半世苍凉 on 2019-11-29 11:16:53
I am trying to use dplyr to do the following:

    tapply(iris$Petal.Length, iris$Species, shapiro.test)

I want to split Petal.Length by Species and apply a function, in this case shapiro.test. I read this SO question and quite a number of other pages. I am sort of able to split the variable into groups using do:

    iris %>% group_by(Species) %>% select(Petal.Length) %>% do(print(.$Petal.Length))
     [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2
    [16] 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6
    [31] 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9
    [46]
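
A hedged sketch of two ways to run shapiro.test per group with dplyr; do() keeps each htest object in a list-column, while summarise() can extract just the p-value. Only the grouping pattern comes from the question; the rest is an assumed idiom:

    library(dplyr)

    # list-column holding one htest object per species
    iris %>%
      group_by(Species) %>%
      do(test = shapiro.test(.$Petal.Length))

    # or stay tabular: pull out only the p-value per group
    iris %>%
      group_by(Species) %>%
      summarise(p.value = shapiro.test(Petal.Length)$p.value)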

python pandas, DF.groupby().agg(), column reference in agg()

Submitted by 南笙酒味 on 2019-11-28 03:20:00
On a concrete problem, say I have a DataFrame DF:

      word tag  count
    0    a   S     30
    1  the   S     20
    2    a   T     60
    3   an   T      5
    4  the   T     10

I want to find, for every "word", the "tag" that has the most "count". So the return would be something like:

      word tag  count
    1  the   S     20
    2    a   T     60
    3   an   T      5

I don't care about the count column or whether the order/index is original or messed up. Returning a dictionary {'the': 'S', ...} is just fine. I was hoping I could do

    DF.groupby(['word']).agg(lambda x: x['tag'][x['count'].argmax()])

but it doesn't work; I can't access column information. More abstractly, what does the function in agg() see as its argument?
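
A hedged sketch of a standard pandas idiom for this; agg() hands the aggregating function one column (a Series) at a time, which is why the lambda above cannot see 'tag', whereas idxmax() on the grouped count column returns row labels that index the whole frame:

    import pandas as pd

    DF = pd.DataFrame({"word":  ["a", "the", "a", "an", "the"],
                       "tag":   ["S", "S", "T", "T", "T"],
                       "count": [30, 20, 60, 5, 10]})

    # row label of the maximum count within each word group
    best = DF.loc[DF.groupby("word")["count"].idxmax()]

    # or as a dictionary {'the': 'S', ...}
    tag_by_word = best.set_index("word")["tag"].to_dict()
    print(tag_by_word)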

ddply + summarize for repeating same statistical function across large number of columns

Submitted by 北城以北 on 2019-11-28 03:07:04
OK, second R question in quick succession. My data:

                Timestamp  St_01  St_02 ...
    1 2008-02-08 00:00:00 26.020 25.840 ...
    2 2008-02-08 00:10:00 25.985 25.790 ...
    3 2008-02-08 00:20:00 25.930 25.765 ...
    4 2008-02-08 00:30:00 25.925 25.730 ...
    5 2008-02-08 00:40:00 25.975 25.695 ...
    ...

Normally I would use a combination of ddply and summarize to calculate ensembles (e.g. the mean for every hour across the whole year). For the data above, I would create a category such as hour (e.g. strptime(data$Timestamp,"%H") -> data$hour) and then use that category in ddply, like:

    ddply(data, "hour", summarize,
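
A hedged sketch of one standard plyr approach: numcolwise() lifts a statistic so it is applied to every numeric column at once, so no station column has to be named by hand. The toy data below is hypothetical, following the layout shown above:

    library(plyr)

    # toy data in the layout shown above (values hypothetical)
    data <- data.frame(
      Timestamp = seq(as.POSIXct("2008-02-08 00:00:00"),
                      by = "10 min", length.out = 12),
      St_01 = runif(12, 25, 26),
      St_02 = runif(12, 25, 26)
    )
    data$hour <- format(data$Timestamp, "%H")

    # numcolwise(mean) turns mean() into a function over all numeric columns
    hourly <- ddply(data, "hour", numcolwise(mean))
    head(hourly)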
