Calculating subtotals in R

后端 未结 7 1697
北荒
北荒 2021-02-15 11:28

I have a data frame with 900,000 rows and 11 columns in R. The column names and types are as follows:

column name: date / mcode / mname / ycode / yname / yissue          


        
相关标签:
7条回答
  • 2021-02-15 11:49

    Or the plyr library, which is easily extensible to other data classes:

    > library(plyr)
    > result.2 <- ddply(df$a, .(df$b), sum)
    > result.2
      df.b V1
    1 down 30
    2   up 25
    
    0 讨论(0)
  • 2021-02-15 11:54

    Google wasn't super helpful when I tried to find an answer to a similar question. I thought I would share my solution below using the library(janitor) package with split(), and purrr::map_df().

    My use case was to run a script that would grab CC expenses from many different people to be reviewed by a person.

    
    library(janitor)
    library(purrr)
    library(dplyr)
    
    mtcars %>% 
      split(.[,"cyl"]) %>% ## splits each change in cyl into a list of dataframes 
      map_df(., janitor::adorn_totals)
    
       mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
      22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
      24.4   4  146.7   62  3.69  3.190  20.00  1  0    4    2
      22.8   4  140.8   95  3.92  3.150  22.90  1  0    4    2
      32.4   4   78.7   66  4.08  2.200  19.47  1  1    4    1
      30.4   4   75.7   52  4.93  1.615  18.52  1  1    4    2
      33.9   4   71.1   65  4.22  1.835  19.90  1  1    4    1
      21.5   4  120.1   97  3.70  2.465  20.01  1  0    3    1
      27.3   4   79.0   66  4.08  1.935  18.90  1  1    4    1
        26   4  120.3   91  4.43  2.140  16.70  0  1    5    2
      30.4   4   95.1  113  3.77  1.513  16.90  1  1    5    2
      21.4   4  121.0  109  4.11  2.780  18.60  1  1    4    2
     Total  44 1156.5  909 44.78 25.143 210.51 10  8   45   17
        21   6  160.0  110  3.90  2.620  16.46  0  1    4    4
        21   6  160.0  110  3.90  2.875  17.02  0  1    4    4
      21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
      18.1   6  225.0  105  2.76  3.460  20.22  1  0    3    1
      19.2   6  167.6  123  3.92  3.440  18.30  1  0    4    4
      17.8   6  167.6  123  3.92  3.440  18.90  1  0    4    4
      19.7   6  145.0  175  3.62  2.770  15.50  0  1    5    6
     Total  42 1283.2  856 25.10 21.820 125.84  4  3   27   24
      18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
      14.3   8  360.0  245  3.21  3.570  15.84  0  0    3    4
      16.4   8  275.8  180  3.07  4.070  17.40  0  0    3    3
      17.3   8  275.8  180  3.07  3.730  17.60  0  0    3    3
      15.2   8  275.8  180  3.07  3.780  18.00  0  0    3    3
      10.4   8  472.0  205  2.93  5.250  17.98  0  0    3    4
      10.4   8  460.0  215  3.00  5.424  17.82  0  0    3    4
      14.7   8  440.0  230  3.23  5.345  17.42  0  0    3    4
      15.5   8  318.0  150  2.76  3.520  16.87  0  0    3    2
      15.2   8  304.0  150  3.15  3.435  17.30  0  0    3    2
      13.3   8  350.0  245  3.73  3.840  15.41  0  0    3    4
      19.2   8  400.0  175  3.08  3.845  17.05  0  0    3    2
      15.8   8  351.0  264  4.22  3.170  14.50  0  1    5    4
        15   8  301.0  335  3.54  3.570  14.60  0  1    5    8
     Total 112 4943.4 2929 45.21 55.989 234.81  0  2   46   49
    
    
    # if you're sending the output to be reviewed by a person, add a row! 
    
    mtcars %>% 
      split(.[,"cyl"]) %>% 
      map_df(., ~janitor::adorn_totals(.x) %>% 
               dplyr::add_row()) %>% 
      write.csv(., "demo.csv")
    
    
    0 讨论(0)
  • 2021-02-15 11:57

    if your data is large and speed matters, i would recommend using the R function rowsum, which is a lot faster. i applied the 3 methods (f1 = aggregate, f2 = ddply, f3 = tapply) suggested in the answers to compare it with f4 = rowsum and here is what i find:

       test replications elapsed relative
    4 f4()          100   0.033     1.00
    3 f3()          100   0.046     1.39
    1 f1()          100   0.165     5.00
    2 f2()          100   0.605    18.33
    

    i have added my code below if someone wants to explore in more detail.

    library(plyr);
    library(rbenchmark);
    
    val  = rnorm(50);
    name = rep(letters[1:5], each = 10);
    data = data.frame(val, name);
    
    f1 = function(){aggregate(data$val, by=list(data$name), FUN=sum)}
    f2 = function(){ddply(data, .(name), summarise, sum = sum(val))}
    f3 = function(){tapply(data$val, data$name, sum)}
    f4 = function(){rowsum(x = data$val, group = data$name)}
    
    benchmark(f1(), f2(), f3(), f4(),
              columns=c("test", "replications", "elapsed", "relative"),
              order="relative", replications=100)
    
    0 讨论(0)
  • 2021-02-15 11:59

    You can also use xtabs or tapply:

    xtabs(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data)
    
    tapply(data$bsent, data$yname, sum)
    
    0 讨论(0)
  • You can use aggregate

    For instance, say that you have

    val = rnorm(50)
    name = rep(letters[1:5], each=10)
    data <- data.frame(val, name)
    

    Then you can do

    aggregate(data$val, by=list(data$name), FUN=sum)
    
    0 讨论(0)
  • 2021-02-15 12:01

    OK. Assuming your data are in a data frame named foo:

    > head(foo)
                 date mcode      mname ycode yname   yissue bsent breturn tsent
    417572 2010/07/28 45740 ENDPOINT A  5772  XMAG 20100800     7       0     7
    417573 2010/07/31 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
    417574 2010/08/04 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
    417575 2010/08/14 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
    417576 2010/08/26 45740 ENDPOINT A  5772  XMAG 20100800     0       4     0
    417577 2010/07/28 45741 ENDPOINT L  5772  XMAG 20100800     2       0     2
           treturn csales
    417572       0      0
    417573       0      1
    417574       0      1
    417575       0      1
    417576       0      0
    417577       0      0
    

    Then this will do the aggregation of the numeric columns in your data:

    > aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo, 
    +           FUN = sum)
      yname bsent breturn tsent treturn csales
    1  XMAG    14       8    14       0      6
    2  YMAG    11       6    11       6      5
    

    That was using the snippet of data you included in your Q. I used the formula interface to aggregate(), which is a bit nicer in this instance because you don't need all the foo$ bits on the variable names you wish the aggregate. If you have missing data (NA)in your full data set, then you'll need add an extra argument na.rm = TRUE which will get passed to sum(), like so:

    > aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo, 
    +           FUN = sum, na.rm = TRUE)
    
    0 讨论(0)
提交回复
热议问题