Question
This is my sample dataset:
vector1 <-
data.frame(
"name" = "a",
"age" = 10,
"fruit" = c("orange", "cherry", "apple"),
"count" = c(1, 1, 1),
"tag" = c(1, 1, 2)
)
vector2 <-
data.frame(
"name" = "b",
"age" = 33,
"fruit" = c("apple", "mango"),
"count" = c(1, 1),
"tag" = c(2, 2)
)
vector3 <-
data.frame(
"name" = "c",
"age" = 58,
"fruit" = c("cherry", "apple"),
"count" = c(1, 1),
"tag" = c(1, 1)
)
list <- list(vector1, vector2, vector3)
print(list)
This is my test:
default <- c("cherry",
"orange",
"apple",
"mango")
for (num in 1:length(list)) {
  #print(list[[num]])
  list[[num]] <- rbind(
    list[[num]],
    data.frame(
      "name" = list[[num]]$name,
      "age" = list[[num]]$age,
      "fruit" = setdiff(default, list[[num]]$fruit),#add missed value
      "count" = 0,
      "tag" = 1 #not found solutions
    )
  )
  print(paste0("--------------", num, "--------"))
  print(list)
}
#print(list)
I'm trying to find which fruits are missing from each data frame, checked separately for each value of tag. For example, the first data frame has tags 1 and 2. If the rows with tag 1 do not include one of the default fruits, such as apple or mango, the missing fruit should be added to the data frame with a count of 0. The expected output looks like this:
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 apple 0 1
6 a 10 mango 0 2
7 a 10 orange 0 2
8 a 10 cherry 0 2
When I step through the loop, I also see that the first iteration adds mango 3 times, and I can't work out why it doesn't add each missing value just once. The full output looks like this:
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 mango 0 1
6 a 10 mango 0 1
[[2]]
name age fruit count tag
1 b 33 apple 1 2
2 b 33 mango 1 2
3 b 33 cherry 0 1
4 b 33 orange 0 1
[[3]]
name age fruit count tag
1 c 58 cherry 1 1
2 c 58 apple 1 1
3 c 58 orange 0 1
4 c 58 mango 0 1
Can anyone help me and suggest a simple method or another approach? Should I use the sqldf function to add the 0 values? Is there a simpler way to solve my problem?
Answer 1:
A solution using dplyr and tidyr. We can use complete to expand the data frame and set the fill value for count to 0.
Notice that I changed your list name from list to fruit_list, because it is bad practice to name an object after a built-in R function such as list. Also notice that I set stringsAsFactors = FALSE when creating the example data frames because I don't want factor columns. Finally, I used lapply instead of a for loop to iterate over the list elements.
library(dplyr)
library(tidyr)
fruit_list2 <- lapply(fruit_list, function(x){
  x2 <- x %>%
    complete(name, age, fruit = default, tag = c(1, 2), fill = list(count = 0)) %>%
    select(name, age, fruit, count, tag) %>%
    arrange(tag, fruit) %>%
    as.data.frame()
  return(x2)
})
fruit_list2
# [[1]]
# name age fruit count tag
# 1 a 10 apple 0 1
# 2 a 10 cherry 1 1
# 3 a 10 mango 0 1
# 4 a 10 orange 1 1
# 5 a 10 apple 1 2
# 6 a 10 cherry 0 2
# 7 a 10 mango 0 2
# 8 a 10 orange 0 2
#
# [[2]]
# name age fruit count tag
# 1 b 33 apple 0 1
# 2 b 33 cherry 0 1
# 3 b 33 mango 0 1
# 4 b 33 orange 0 1
# 5 b 33 apple 1 2
# 6 b 33 cherry 0 2
# 7 b 33 mango 1 2
# 8 b 33 orange 0 2
#
# [[3]]
# name age fruit count tag
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 0 1
# 4 c 58 orange 0 1
# 5 c 58 apple 0 2
# 6 c 58 cherry 0 2
# 7 c 58 mango 0 2
# 8 c 58 orange 0 2
DATA
vector1 <-
data.frame(
"name" = "a",
"age" = 10,
"fruit" = c("orange", "cherry", "apple"),
"count" = c(1, 1, 1),
"tag" = c(1, 1, 2),
stringsAsFactors = FALSE
)
vector2 <-
data.frame(
"name" = "b",
"age" = 33,
"fruit" = c("apple", "mango"),
"count" = c(1, 1),
"tag" = c(2, 2),
stringsAsFactors = FALSE
)
vector3 <-
data.frame(
"name" = "c",
"age" = 58,
"fruit" = c("cherry", "apple"),
"count" = c(1, 1),
"tag" = c(1, 1),
stringsAsFactors = FALSE
)
fruit_list <- list(vector1, vector2, vector3)
default <- c("cherry", "orange", "apple", "mango")
Answer 2:
Consider base R methods (lapply, expand.grid, transform, rbind, aggregate) that append all possible fruit and tag combinations to each data frame and keep the maximum count for each combination.
new_list <- lapply(list, function(df) {
  fruit_tag_df <- transform(expand.grid(fruit=c("apple", "cherry", "mango", "orange"),
                                        tag=c(1,2)),
                            name = df$name[1],
                            age = df$age[1],
                            count = 0)
  aggregate(.~name + age + fruit + tag, rbind(df, fruit_tag_df), FUN=max)
})
Output
new_list
# [[1]]
# name age fruit tag count
# 1 a 10 apple 1 0
# 2 a 10 cherry 1 1
# 3 a 10 orange 1 1
# 4 a 10 mango 1 0
# 5 a 10 apple 2 1
# 6 a 10 cherry 2 0
# 7 a 10 orange 2 0
# 8 a 10 mango 2 0
# [[2]]
# name age fruit tag count
# 1 b 33 apple 1 0
# 2 b 33 mango 1 0
# 3 b 33 cherry 1 0
# 4 b 33 orange 1 0
# 5 b 33 apple 2 1
# 6 b 33 mango 2 1
# 7 b 33 cherry 2 0
# 8 b 33 orange 2 0
# [[3]]
# name age fruit tag count
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 1 0
# 4 c 58 orange 1 0
# 5 c 58 apple 2 0
# 6 c 58 cherry 2 0
# 7 c 58 mango 2 0
# 8 c 58 orange 2 0
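A side note on the [1] subscripts in df$name[1] and df$age[1]: they likely also explain the repeated mango rows in the question. The sketch below is a minimal illustration on a toy copy of the first data frame (the names recycled and fixed are made up for this example). data.frame() recycles shorter columns to the length of the longest one, so passing the full three-element name column next to a single missing fruit yields three identical rows.

```r
# Toy copy of the first data frame from the question (three rows)
df <- data.frame(
  name = "a",
  age = 10,
  fruit = c("orange", "cherry", "apple"),
  stringsAsFactors = FALSE
)
default <- c("cherry", "orange", "apple", "mango")

# Only one default fruit is absent from this data frame
missing_fruit <- setdiff(default, df$fruit)

# df$name has length 3, so the single missing fruit is recycled to 3 rows
recycled <- data.frame(name = df$name, fruit = missing_fruit,
                       stringsAsFactors = FALSE)
nrow(recycled)

# Using only the first name gives one row per missing fruit
fixed <- data.frame(name = df$name[1], fruit = missing_fruit,
                    stringsAsFactors = FALSE)
nrow(fixed)
```

In the second and third data frames the number of existing rows happens to equal the number of missing fruits, which is why the loop in the question appears to work for them.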
Answer 3:
The OP has asked to complete each data.frame in list so that all combinations of the default fruits and tags 1:2 appear in the result, with count set to 0 for the additional rows. As a result, each data.frame should consist of at least 4 x 2 = 8 rows.
I want to propose two different approaches:
- Using lapply() and the CJ() (cross join) function from data.table to return a list.
- Combining the separate data.frames in list into one large data.table using rbindlist() and applying the required transformations to the whole data.table.
Using lapply() and CJ()
library(data.table)
lapply(list, function(x) setDT(x)[
  CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
  on = .(name, age, fruit, tag)][
    is.na(count), count := 0][order(-count, tag)]
)
[[1]]
name age fruit count tag
1: a 10 cherry 1 1
2: a 10 orange 1 1
3: a 10 apple 1 2
4: a 10 apple 0 1
5: a 10 mango 0 1
6: a 10 cherry 0 2
7: a 10 mango 0 2
8: a 10 orange 0 2
[[2]]
name age fruit count tag
1: b 33 apple 1 2
2: b 33 mango 1 2
3: b 33 apple 0 1
4: b 33 cherry 0 1
5: b 33 mango 0 1
6: b 33 orange 0 1
7: b 33 cherry 0 2
8: b 33 orange 0 2
[[3]]
name age fruit count tag
1: c 58 apple 1 1
2: c 58 cherry 1 1
3: c 58 mango 0 1
4: c 58 orange 0 1
5: c 58 apple 0 2
6: c 58 cherry 0 2
7: c 58 mango 0 2
8: c 58 orange 0 2
Ordering by count and tag is not required but makes it easier to compare the result with the OP's expected output.
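To make the mechanics of the join easier to follow, here is a rough base R sketch of the same idea for a single data frame. It is not the data.table code above but a swapped-in equivalent: expand.grid() stands in for CJ(), merge() with all.x = TRUE stands in for the join, and the names grid and completed are made up for this illustration.

```r
# Toy copy of the first data frame from the question
df <- data.frame(
  name = "a",
  age = 10,
  fruit = c("orange", "cherry", "apple"),
  count = c(1, 1, 1),
  tag = c(1, 1, 2),
  stringsAsFactors = FALSE
)
default <- c("cherry", "orange", "apple", "mango")

# All fruit x tag combinations for this person (4 x 2 = 8 rows)
grid <- expand.grid(
  name = unique(df$name),
  age = unique(df$age),
  fruit = default,
  tag = c(1, 2),
  stringsAsFactors = FALSE
)

# Left join: keep every grid row and attach count where a match exists
completed <- merge(grid, df, by = c("name", "age", "fruit", "tag"), all.x = TRUE)

# Combinations without a match come back with NA in count; set them to 0
completed$count[is.na(completed$count)] <- 0

# Same ordering as in the data.table result: existing rows first, then by tag
completed <- completed[order(-completed$count, completed$tag), ]
completed
```

The three original rows keep their count of 1, and the five combinations that were absent are added with a count of 0.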
Creating one large data.table
Instead of a list of data.frames with identical structure we can use one large data.table where the origin of each row can be identified by an id column.
Indeed, the OP has asked other questions ("using lapply function and list in r" and "how to loop the dataframe using sqldf?") where he asked for help in handling a list of data.frames. G. Grothendieck had already suggested using rbind to bind the rows together.
The rbindlist() function has the idcol parameter, which identifies the origin of each row:
library(data.table)
rbindlist(list, idcol = "df")
df name age fruit count tag
1: 1 a 10 orange 1 1
2: 1 a 10 cherry 1 1
3: 1 a 10 apple 1 2
4: 2 b 33 apple 1 2
5: 2 b 33 mango 1 2
6: 3 c 58 cherry 1 1
7: 3 c 58 apple 1 1
Note that df contains the number of the source data.frame in list (or the names of the list elements if list is named).
Now, we can apply the above solution by grouping over df:
rbindlist(list, idcol = "df")[, .SD[
CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
on = .(name, age, fruit, tag)], by = df][
is.na(count), count := 0][order(df, -count, tag)]
df name age fruit count tag
1: 1 a 10 cherry 1 1
2: 1 a 10 orange 1 1
3: 1 a 10 apple 1 2
4: 1 a 10 apple 0 1
5: 1 a 10 mango 0 1
6: 1 a 10 cherry 0 2
7: 1 a 10 mango 0 2
8: 1 a 10 orange 0 2
9: 2 b 33 apple 1 2
10: 2 b 33 mango 1 2
11: 2 b 33 apple 0 1
12: 2 b 33 cherry 0 1
13: 2 b 33 mango 0 1
14: 2 b 33 orange 0 1
15: 2 b 33 cherry 0 2
16: 2 b 33 orange 0 2
17: 3 c 58 apple 1 1
18: 3 c 58 cherry 1 1
19: 3 c 58 mango 0 1
20: 3 c 58 orange 0 1
21: 3 c 58 apple 0 2
22: 3 c 58 cherry 0 2
23: 3 c 58 mango 0 2
24: 3 c 58 orange 0 2
df name age fruit count tag
Source: https://stackoverflow.com/questions/48043372/add-missed-value-based-on-the-value-of-the-column-in-r