Question
Data frame have includes a few thousand vectors that follow a naming pattern. Each vector name includes a noun, then either _a, _b, or _c. Below are the first 10 vars and obs:
id turtle_a banana_a castle_a turtle_b banana_b castle_b turtle_c banana_c castle_c
A -0.58 -0.88 -0.56 -0.53 -0.32 -0.42 -0.52 -0.89 -0.72
B NA NA NA -0.84 -0.36 -0.26 NA NA NA
C 0.00 -0.43 -0.75 -0.35 -0.88 -0.14 -0.26 -0.15 -0.81
D -0.81 -0.63 -0.77 -0.82 -0.83 -0.50 -0.77 -0.25 -0.07
E -0.25 -0.33 -0.09 -0.51 -0.27 -0.81 -0.06 -0.23 -0.97
F -0.80 -0.88 -0.05 NA NA NA NA NA NA
G -0.25 -0.76 -0.21 NA NA NA NA NA NA
H -0.47 -0.10 -0.67 -0.46 -0.71 -0.24 -0.76 -0.04 -0.11
I -0.15 -0.34 -0.57 -0.40 -0.14 -0.49 NA NA NA
J -0.65 -0.86 -0.37 -0.67 -0.81 -0.63 NA NA NA
Data frame want is the mean across all columns for every set of variables in a noun group. For example, averaging turtle_a, turtle_b, and turtle_c for id=A equals -0.54. Here's what want looks like if I just do it for the handful of noun groups in the example.
id turtle_m banana_m castle_m
A -0.54 -0.70 -0.57
B -0.84 -0.36 -0.26
C -0.20 -0.49 -0.57
D -0.80 -0.57 -0.45
E -0.27 -0.28 -0.62
F -0.80 -0.88 -0.05
G -0.25 -0.76 -0.21
H -0.56 -0.29 -0.34
I -0.27 -0.24 -0.53
J -0.66 -0.83 -0.50
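As a quick check of how the cells of want are computed, here is a minimal sketch that reproduces two turtle values by hand. The variable names are made up for illustration. It assumes missing values are simply dropped before averaging, which is what row B suggests.

```r
# Row A: all three turtle columns are present
turtle_A <- c(turtle_a = -0.58, turtle_b = -0.53, turtle_c = -0.52)
m_A <- round(mean(turtle_A), 2)

# Row B: only turtle_b is present, so na.rm = TRUE averages that single value
turtle_B <- c(turtle_a = NA, turtle_b = -0.84, turtle_c = NA)
m_B <- mean(turtle_B, na.rm = TRUE)

print(c(A = m_A, B = m_B))
```

Both results agree with the turtle_m column above for id A and B.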
Options so far:
- convert to long, summarize with a group_by() function in dplyr, and transpose back to wide.
- re-sort the vectors so the noun groups appear next to each other, and write a loop that computes means across columns, taking three-column steps at each iteration.
It seems like summarize_at or summarize_all could be used more effectively than either of my current options, but I'm not sure how to use them in a way that will dynamically group variables by naming convention.
Any thoughts?
Answer 1:
We can use split.default to split the columns based on a substring of the column names, loop over the resulting list with sapply, applying rowMeans to each element, and then cbind the result with the first column:
out <- cbind(df1[1], sapply(split.default(df1[-1],
sub("_.*", "", names(df1)[-1])), rowMeans, na.rm = TRUE))
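To make the steps easier to follow, here is a sketch of the same approach unrolled on three rows of the question's data (A, B and F), typed in by hand. The intermediate names nouns, groups and means are introduced only for illustration, and the _m suffix is added to match the desired output rather than being part of the one-liner above.

```r
# Rows A, B and F from the question, rebuilt by hand
df1 <- data.frame(
  id = c("A", "B", "F"),
  turtle_a = c(-0.58, NA, -0.80),
  banana_a = c(-0.88, NA, -0.88),
  castle_a = c(-0.56, NA, -0.05),
  turtle_b = c(-0.53, -0.84, NA),
  banana_b = c(-0.32, -0.36, NA),
  castle_b = c(-0.42, -0.26, NA),
  turtle_c = c(-0.52, NA, NA),
  banana_c = c(-0.89, NA, NA),
  castle_c = c(-0.72, NA, NA)
)

# Grouping key: the noun, i.e. everything before the first underscore
nouns <- sub("_.*", "", names(df1)[-1])

# split.default() splits column-wise, giving one data frame per noun
groups <- split.default(df1[-1], nouns)

# Row means within each noun group, ignoring missing values
means <- sapply(groups, rowMeans, na.rm = TRUE)

# Rename to the *_m convention used in the desired output
colnames(means) <- paste0(colnames(means), "_m")

out <- cbind(df1[1], round(means, 2))
print(out)
```

Note that split.default orders the groups alphabetically by noun, so the columns come out as banana_m, castle_m, turtle_m rather than in their original order. A row with no observed values in a group would give NaN, because rowMeans with na.rm = TRUE has nothing left to average.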
Or we can use pivot_longer:
library(dplyr)
library(tidyr)
df1 %>%
    pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
    group_by(id) %>%
    summarise(across(turtle:castle, mean, na.rm = TRUE))
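With a few thousand nouns, naming the first and last noun column in across may be inconvenient. The following is a possible variant, not the answer's own code: it selects every column except the suffix column and uses the .names argument of across to add the _m suffix. It assumes dplyr 1.0 or later and that nouns contain no underscore themselves, since names_sep = "_" would otherwise split them into too many pieces. The sample data are rows A and B from the question, typed in by hand.

```r
library(dplyr)
library(tidyr)

# Rows A and B from the question, rebuilt by hand
df1 <- data.frame(
  id = c("A", "B"),
  turtle_a = c(-0.58, NA), banana_a = c(-0.88, NA), castle_a = c(-0.56, NA),
  turtle_b = c(-0.53, -0.84), banana_b = c(-0.32, -0.36), castle_b = c(-0.42, -0.26),
  turtle_c = c(-0.52, NA), banana_c = c(-0.89, NA), castle_c = c(-0.72, NA)
)

out <- df1 %>%
  # ".value" turns each noun into its own column; the a/b/c suffix goes to "group"
  pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
  group_by(id) %>%
  # Average every noun column (everything except the suffix column) per id
  summarise(across(-group, ~ mean(.x, na.rm = TRUE), .names = "{.col}_m"))

print(out)
```

Here the noun columns keep their order of first appearance (turtle_m, banana_m, castle_m), and the result is a tibble with one row per id.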
Source: https://stackoverflow.com/questions/62722119/summary-stats-across-columns-where-column-names-indicate-groups