Summarize data in R | 易学教程

问题

I have a dataset which contains weekly sale of various products by outlet. Here is how the data looks like:

Store ID    Week ID Item Code   Sales in $
253422  191 41130   2.95
272568  188 41130   2.95
272568  188 41160   2.95
272568  189 41130   2.95
272568  189 41160   2.95
272568  190 41160   2.95
217460  188 41110   2.95
217460  188 41130   5.9
217460  188 41160   5.9
217460  189 41110   11.8
217460  189 41130   8.85
217460  189 41160   11.8
217460  191 41130   5.95
217460  191 41160   8.93

This is a very large dataset and I would like to generate a summary output which gives me the ITEM wise total sales and the number of stores in which the item is present. I tried the following, but that doesn't work because I get a store count which is repeated due to the repetition of weeks in the dataset:

dataset %>% group_by(Store ID) %>% summarize(count(Item Code))

Any help is highly appreciated. Thanks

回答1:

You can do this with aggregate

## Recreate your data
df = read.table(text="'Store ID'    'Week ID' 'Item Code'   'Sales in Dollars'
253422  191 41130   2.95
272568  188 41130   2.95
272568  188 41160   2.95
272568  189 41130   2.95
272568  189 41160   2.95
272568  190 41160   2.95
217460  188 41110   2.95
217460  188 41130   5.9
217460  188 41160   5.9
217460  189 41110   11.8
217460  189 41130   8.85
217460  189 41160   11.8
217460  191 41130   5.95
217460  191 41160   8.93",
header=TRUE)

aggregate(df$Sales.in.Dollars, list(df$Item.Code), sum)

  Group.1     x
1   41110 14.75
2   41130 29.55
3   41160 35.48

StoreItems = unique(df[,c(1,3)])
aggregate(StoreItems$Store.ID, list(StoreItems$Item.Code), length)

  Group.1 x
1   41110 1
2   41130 3
3   41160 2

回答2:

Here's a way to do it using dplyr

library(dplyr)

df <- tibble::tribble(
  ~store_id, ~week_id, ~item_code, ~sales,
  253422L,     191L,     41130L,   2.95,
  272568L,     188L,     41130L,   2.95,
  272568L,     188L,     41160L,   2.95,
  272568L,     189L,     41130L,   2.95,
  272568L,     189L,     41160L,   2.95,
  272568L,     190L,     41160L,   2.95,
  217460L,     188L,     41110L,   2.95,
  217460L,     188L,     41130L,    5.9,
  217460L,     188L,     41160L,    5.9,
  217460L,     189L,     41110L,   11.8,
  217460L,     189L,     41130L,   8.85,
  217460L,     189L,     41160L,   11.8,
  217460L,     191L,     41130L,   5.95,
  217460L,     191L,     41160L,   8.93
)

df %>% 
  group_by(item_code) %>% 
  summarise(total_sales = sum(sales),
            stores = n_distinct(store_id))

#> # A tibble: 3 x 3
#>   item_code total_sales stores
#>       <int>       <dbl>  <int>
#> 1     41110       14.75      1
#> 2     41130       29.55      3
#> 3     41160       35.48      2

来源：https://stackoverflow.com/questions/44188515/summarize-data-in-r

标签

dataframe

count

dplyr

summarize