问题
I have like below mentioned dataframe:
Records:
ID Remarks Value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14
By utilizing the above mentioned dataframe, I want to add new column Status
in the existing dataframe.
Where if the Remarks
is ABC, AAB or ABB than status would be TRUE
and for XYZ and ZZX it should be FALSE
.
I am using below mentioned method for that but it didn't work.
Records$Status<-ifelse(Records$Remarks %in% ("ABC","AAB","ABB"),"TRUE",
ifelse(Records$Remarks %in%
("XYZ","ZZX"),"FALSE"))
And, bases on the Status
i want to derive following output:
ID TRUE FALSE Sum
1 2 1 37
2 1 1 26
回答1:
Records$Status<-ifelse(Records$Remarks %in% c("ABC","AAB","ABB"),TRUE,
ifelse(Records$Remarks %in%
c("XYZ","ZZX"),FALSE, NA))
You need to enclose your lists of strings with c()
, and add an "else" condition for the second ifelse (but see Roman's answer below for a better way of doing this with case_when
). (Also note that here I changed the "TRUE"
and "FALSE"
(as character class) into TRUE
and FALSE
(the logical class).
For the summary (using dplyr
):
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Status), falses=sum(!Status), sum=sum(Value))
# A tibble: 2 x 4
ID trues falses sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Of course, if you don't really need the intermediate Status column but just want the summary table, you can skip the first step altogether:
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Remarks %in% c("ABC","AAB","ABB")),
falses=sum(Remarks %in% c("XYZ","ZZX")),
sum=sum(Value))
回答2:
Since it makes sense to use dplyr
for your second question (see @iod's answer) it is also a good opportunity to use the package's very straightforward case_when()
function for the first part.
Records %>%
mutate(Status = case_when(Remarks %in% c("ABC", "AAB", "ABB") ~ TRUE,
Remarks %in% c("XYZ", "ZZX") ~ FALSE,
TRUE ~ NA))
ID Remarks Value Status
1 1 ABC 10 TRUE
2 1 AAB 12 TRUE
3 1 ZZX 15 FALSE
4 2 XYZ 12 FALSE
5 2 ABB 14 TRUE
回答3:
This approach will scale to a large number of remarks.
Load the data and prepare a matching data frame
The second data frame makes a matching between remarks and their TRUE or FALSE value.
library(readr)
library(dplyr)
library(tidyr)
dtf <- read_table("id remarks value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14")
truefalse <- data_frame(remarks = c("ABC", "AAB", "ABB", "ZZX", "XYZ"),
tf = c(TRUE, TRUE, TRUE, FALSE, FALSE))
Group by id and summarise
This is the format as asked in the question
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id) %>%
summarise(true = sum(tf),
false = sum(!tf),
value = sum(value))
# A tibble: 2 x 4
id true false value
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Alternative proposal: group by id, tf and summarise
This option retains more details on the spread of value
along the grouping variables id
and tf
.
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id, tf) %>%
summarise(n = n(),
value = sum(value))
# A tibble: 4 x 4
# Groups: id [?]
id tf n value
<int> <lgl> <int> <int>
1 1 FALSE 1 15
2 1 TRUE 2 22
3 2 FALSE 1 12
4 2 TRUE 1 14
回答4:
In most cases, life is easier and lines are shorter without ifelse
:
# short version
df$Status <- df$Remarks %in% c("ABC","AAB","ABB")
This version is OK for most purposes but it has shortcomings. Status
will be FALSE
if Remarks
is NA
or, say "garbage"
but one might want it to be NA
in these cases and FALSE
only if Remarks %in% c("XYZ", "ZZX")
. So one can add and multiply the conditions and finally convert it to logical
:
df$Status <- as.logical(with(df,
Remarks %in% c("ABC","AAB","ABB") +
! Remarks %in% c("XYZ","ZZX") ))
And the summary table with base R:
aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
Umm... perhaps some formatting would be useful:
t1 <- aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
t1 <- t1[, c(1,3,2)]
colnames(t1) <- c("ID", "", "Sum")
t1
# ID FALSE TRUE Sum
# 1 1 1 2 37
# 2 2 1 1 26
回答5:
This one returns correct result, only if there are two mentioned groups ("ABC", "AAB", "ABB"
vs "XYZ","ZZX", ...
). For me @iod's solution, is more R
-like, but I've tried to avoid ifelse
, and do it another way:
Code:
library(tidyverse)
dt %>%
group_by(ID, Status = grepl("^A[AB][CB]$", Remarks)) %>%
summarise(N = n(), Sum = sum(Value)) %>%
spread(Status, N) %>%
summarize_all(sum, na.rm = T) %>% # data still groupped by ID
select("ID", "TRUE", "FALSE", "Sum")
# A tibble: 2 x 4
ID `TRUE` `FALSE` Sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Data:
dt <- structure(
list(ID = c(1L, 1L, 1L, 2L, 2L),
Remarks = c("ABC", "AAB", "ZZX", "XYZ", "ABB"),
Value = c(10L, 12L, 15L, 12L, 14L)),
.Names = c("ID", "Remarks", "Value"), class = "data.frame", row.names = c(NA, -5L)
)
来源:https://stackoverflow.com/questions/53064595/how-to-check-multiple-values-using-if-condition