Question
All,
I was hoping someone could find a solution to an issue of mine that isn't necessarily causing headaches, but, as of right now, invites the possibility for human error in creating a data set for a project on which I'm working.
The data set I'm using right now is a directed dyad-year (A vs. B, B vs. A) data set for select pairs of countries for every year between 1950 and 2010. Some countries, like A in my example, will be paired with every country in the world and every country will be paired with it. Some countries, like B and C in my example, will be paired with just a few countries. Some pairs will have missing data, which I don't show in my example.
What I would like to do is use R to find the maximum value of a given column, for a given country, in a given year, and insert that value into another data frame. Hopefully this illustration will clarify what I would like to do.
country1 country2 year x1 x2 x3 x4
A        B        2000 50 30  1 20
A        C        2000 70  2  5 90
A        D        2000 10 90 20 30
A        E        2000 95 10 10  5
A        F        2000 10 10 10  0
A        G        2000  5  5  0  0
A        H        2000 10 30 25 40
........................................
B        A        1998  5 10 30  2
B        D        1998 30  6  9  0
B        I        1998 10  9  7  0
........................................
C        A        2005 10 15  2  6
C        D        2005 90  0  0 40
C        X        2005 49 90  5  0
Say, for example, that I'm interested in Country A in the year 2000. I want to know its max value of x1 in 2000 (which is 95, in its pairing with Country E). I also want to know its max values of x2, x3, and x4 in any pairing in that year (which are 90, 25, and 90, with Country D, Country H, and Country C respectively).
The same follows for Country B in 1998, and Country C in 2005.
After isolating the max value of those columns for a given country in a given year, I'd like to dump those values into a dataframe, like this.
country year x1max x2max x3max x4max
A       2000    95    90    25    90
B       1998    30    10    30     2
C       2005    90    90     5    40
I'm flexible on this part. It might just be easiest to dump those max values for each country into their own one-row data frames (1x6, matching the layout above), and then use rbind to stack them together.
Does anyone have any advice on how to proceed? It'd save me the hassle of having to do it manually, which, more than anything, invites the possibility of human error.
Reproducible code follows, though, since my question does hinge on isolating a particular year for a particular country (e.g. 2000 for Country A instead of 2001), I'm not sure the reproducible code is necessarily helpful. I hope it is, or, at least, that my question is clear.
country1 <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
country2 <- c("B","C","D","E","F","G","H","A","D","I","A","D","X")
year <- c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005)
x1 <- c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49)
x2 <- c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90)
x3 <- c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5)
x4 <- c(20, 90, 30, 5, 0, 0, 40, 2, 0, 0, 6, 40, 0)
Data <- data.frame(country1 = country1, country2 = country2, year = year, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
Data
Answer 1:
It sounds like you're just looking for aggregate:
> aggregate(cbind(x1, x2, x3, x4) ~ country1 + year, Data, max)
  country1 year x1 x2 x3 x4
1        B 1998 30 10 30  2
2        A 2000 95 90 25 90
3        C 2005 90 90  5 40
It's not very clear from your question how you want to proceed from there though....
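If the goal is the table from the question, and the real data has missing values as the question mentions, one possible next step is sketched below. This is only a sketch under assumptions: the NA placed in the A-E row is made up for illustration, and the names maxes and value_cols are hypothetical. Note that the formula method of aggregate drops any row containing an NA by default (na.action = na.omit), which would silently lose the x1 value of 95 in that row.

```r
# Sample data from the question.
Data <- data.frame(
  country1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  country2 = c("B", "C", "D", "E", "F", "G", "H", "A", "D", "I", "A", "D", "X"),
  year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005),
  x1 = c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49),
  x2 = c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90),
  x3 = c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5),
  x4 = c(20, 90, 30, 5, 0, 0, 40, 2, 0, 0, 6, 40, 0)
)

# Hypothetical missing value (not in the original data): x2 for the A-E pair.
Data$x2[4] <- NA

# na.action = na.pass keeps rows that contain NA;
# na.rm = TRUE is forwarded to max() so NAs are skipped column by column.
maxes <- aggregate(cbind(x1, x2, x3, x4) ~ country1 + year, data = Data,
                   FUN = max, na.rm = TRUE, na.action = na.pass)

# Rename the columns to match the layout requested in the question.
value_cols <- c("x1", "x2", "x3", "x4")
names(maxes)[match(value_cols, names(maxes))] <- paste0(value_cols, "max")
names(maxes)[names(maxes) == "country1"] <- "country"

# Sort by country and year, then reset the row names.
maxes <- maxes[order(maxes$country, maxes$year), ]
rownames(maxes) <- NULL
maxes
```

One caveat: if every value of a column is NA within a group, max(..., na.rm = TRUE) returns -Inf with a warning, so such groups may need separate handling.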
Answer 2:
You can also use ddply from the plyr package, assuming your sample data is stored in data.
data<-structure(list(country1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
country2 = structure(c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 4L,
9L, 1L, 4L, 10L), .Label = c("A", "B", "C", "D", "E", "F",
"G", "H", "I", "X"), class = "factor"), year = c(2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 1998L, 1998L, 1998L, 2005L,
2005L, 2005L), x1 = c(50L, 70L, 10L, 95L, 10L, 5L, 10L, 5L,
30L, 10L, 10L, 90L, 49L), x2 = c(30L, 2L, 90L, 10L, 10L,
5L, 30L, 10L, 6L, 9L, 15L, 0L, 90L), x3 = c(1L, 5L, 20L,
10L, 10L, 0L, 25L, 30L, 9L, 7L, 2L, 0L, 5L), x4 = c(20L,
90L, 30L, 5L, 0L, 0L, 40L, 2L, 0L, 0L, 6L, 40L, 0L)), .Names = c("country1",
"country2", "year", "x1", "x2", "x3", "x4"), class = "data.frame", row.names = c(NA,
-13L))
install.packages("plyr")
library(plyr)
ddply(data,.(country1,year),numcolwise(max))
  country1 year x1 x2 x3 x4
1        A 2000 95 90 25 90
2        B 1998 30 10 30  2
3        C 2005 90 90  5 40
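For comparison, the approach floated in the question (one one-row data frame per country-year, stacked with rbind) can be sketched in base R without any extra package. This is a rough illustration rather than a recommended method; the names value_cols, groups, rows, and result are hypothetical.

```r
# Sample data from the question.
Data <- data.frame(
  country1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  country2 = c("B", "C", "D", "E", "F", "G", "H", "A", "D", "I", "A", "D", "X"),
  year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005),
  x1 = c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49),
  x2 = c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90),
  x3 = c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5),
  x4 = c(20, 90, 30, 5, 0, 0, 40, 2, 0, 0, 6, 40, 0)
)

value_cols <- c("x1", "x2", "x3", "x4")

# One data frame per country-year; drop = TRUE skips combinations with no rows.
groups <- split(Data, list(Data$country1, Data$year), drop = TRUE)

# Build a one-row data frame of column maxima for each group.
rows <- lapply(groups, function(g) {
  maxima <- lapply(g[value_cols], max, na.rm = TRUE)
  names(maxima) <- paste0(value_cols, "max")
  data.frame(country = g$country1[1], year = g$year[1], maxima)
})

# Stack the one-row data frames, sort, and reset the row names.
result <- do.call(rbind, rows)
result <- result[order(result$country, result$year), ]
rownames(result) <- NULL
result
```

This does the same work as the ddply call, just with the split, apply, and combine steps written out by hand.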
Answer 3:
If you know SQL, you could use the sqldf function from the sqldf package:
http://cran.r-project.org/web/packages/sqldf/index.html
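library(sqldf)
# Group by country as well as year, so that countries sharing a year are not merged.
df <- sqldf("select country1, year, max(x1) as x1max, max(x2) as x2max, max(x3) as x3max, max(x4) as x4max from Data group by country1, year order by country1, year")
df
  country1 year x1max x2max x3max x4max
1        A 2000    95    90    25    90
2        B 1998    30    10    30     2
3        C 2005    90    90     5    40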
Source: https://stackoverflow.com/questions/17539696/finding-maximum-value-of-one-column-by-group-and-inserting-value-into-another