Finding maximum value of one column (by group) and inserting value into another data frame in R

Question

All,

I'm hoping someone can suggest a solution to a problem that isn't causing headaches yet, but that currently invites human error as I build a data set for a project I'm working on.

The data set I'm using right now is a directed dyad-year (A vs. B, B vs. A) data set for select pairs of countries for every year between 1950 and 2010. Some countries, like A in my example, will be paired with every country in the world and every country will be paired with it. Some countries, like B and C in my example, will be paired with just a few countries. Some pairs will have missing data, which I don't show in my example.

What I would like to do is use R to find the maximum value of a given column, for a given country, in a given year, and insert that value into another data frame. Hopefully this illustration will clarify what I would like to do.

country1 country2 year    x1   x2   x3   x4
   A        B     2000    50   30   1    20
   A        C     2000    70    2   5    90
   A        D     2000    10   90   20   30
   A        E     2000    95   10   10   5
   A        F     2000    10   10   10   0
   A        G     2000    5     5   0    0
   A        H     2000    10   30   25   40

  ........................................

  B        A      1998    5    10   30   2
  B        D      1998    30   6    9    0
  B        I      1998    10   9    7    0

  ........................................

  C        A      2005    10   15   2    6
  C        D      2005    90   0    0    40
  C        X      2005    49   90   5    0

Say, for example, that I'm interested in Country A in the year 2000. I want to know its maximum value of x1 in 2000 (which is 95, in its pairing with Country E). I also want to know its maximum values of x2, x3, and x4 across all pairings in that year (which are 90, 25, and 90, with Country D, Country H, and Country C respectively).

The same follows for Country B in 1998, and Country C in 2005.

After isolating the max value of those columns for a given country in a given year, I'd like to dump those values into a dataframe, like this.

country   year    x1max    x2max    x3max    x4max
  A       2000      95       90       25       90
  B       1998      30       10       30        2
  C       2005      90       90        5       40

I'm flexible on this part. It might just be easiest to dump those max values for each country into their own data frames of dimensions 1x6 (country, year, and the four maxima), and then use rbind to stack them together.
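The stack-with-rbind idea described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the lookup table `wanted` and the helper vector `value_cols` are hypothetical names that do not appear in the original post, and `na.rm = TRUE` is added on the assumption that missing values should simply be skipped.

```r
# Sample data from the question
Data <- data.frame(
  country1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  country2 = c("B", "C", "D", "E", "F", "G", "H", "A", "D", "I", "A", "D", "X"),
  year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005),
  x1 = c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49),
  x2 = c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90),
  x3 = c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5),
  x4 = c(20, 90, 30, 5, 0, 0, 40, 2, 0, 0, 6, 40, 0),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: the country-years of interest
wanted <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000, 1998, 2005),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the columns to take maxima over
value_cols <- c("x1", "x2", "x3", "x4")

# Build one single-row data frame per requested country-year
pieces <- lapply(seq_len(nrow(wanted)), function(i) {
  rows <- Data$country1 == wanted$country[i] & Data$year == wanted$year[i]
  maxes <- sapply(Data[rows, value_cols], max, na.rm = TRUE)
  names(maxes) <- paste0(value_cols, "max")
  data.frame(country = wanted$country[i], year = wanted$year[i],
             as.list(maxes), stringsAsFactors = FALSE)
})

# Stack the single-row pieces into one data frame
result <- do.call(rbind, pieces)
result
```

Listing the country-years in a lookup table keeps the selection explicit and in one place, which is the part most prone to error when done by hand.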

Does anyone have any advice on how to proceed? It'd save me the hassle of having to do it manually, which, more than anything, invites the possibility of human error.

Reproducible code follows, though, since my question does hinge on isolating a particular year for a particular country (e.g. 2000 for Country A instead of 2001), I'm not sure the reproducible code is necessarily helpful. I hope it is, or, at least, that my question is clear.

country1 <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
country2 <- c("B","C","D","E","F","G","H","A","D","I","A","D","X")
year <- c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005)
x1 <- c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49)
x2 <- c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90)
x3 <- c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5)
x4 <- c(20, 90, 30, 5, 0,0,40,2,0,0,6,40,0)

Data <- data.frame(country1 = country1, country2 = country2, year = year, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
Data

Answer 1:

It sounds like you're just looking for aggregate:

> aggregate(cbind(x1, x2, x3, x4) ~ country1 + year, Data, max)
  country1 year x1 x2 x3 x4
1        B 1998 30 10 30  2
2        A 2000 95 90 25 90
3        C 2005 90 90  5 40

It's not very clear from your question how you want to proceed from there though....
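If the goal is the table sketched in the question, one possible follow-up is to store the result and rename its columns. The question also mentions missing data, and the formula interface of aggregate drops any row with an NA in one of the listed variables by default (`na.action = na.omit`), so the sketch below uses the data-frame interface and passes `na.rm = TRUE` on to `max` instead. The names `value_cols` and `maxes` are illustrative, not from the original answer, and whether NAs should be skipped this way is an assumption about the real data.

```r
# Sample data from the question
Data <- data.frame(
  country1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  country2 = c("B", "C", "D", "E", "F", "G", "H", "A", "D", "I", "A", "D", "X"),
  year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 1998, 1998, 1998, 2005, 2005, 2005),
  x1 = c(50, 70, 10, 95, 10, 5, 10, 5, 30, 10, 10, 90, 49),
  x2 = c(30, 2, 90, 10, 10, 5, 30, 10, 6, 9, 15, 0, 90),
  x3 = c(1, 5, 20, 10, 10, 0, 25, 30, 9, 7, 2, 0, 5),
  x4 = c(20, 90, 30, 5, 0, 0, 40, 2, 0, 0, 6, 40, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the columns to take maxima over
value_cols <- c("x1", "x2", "x3", "x4")

# Data-frame interface of aggregate: extra arguments (na.rm = TRUE) go to max()
maxes <- aggregate(Data[value_cols],
                   by = Data[c("country1", "year")],
                   FUN = max, na.rm = TRUE)

# Rename to match the layout requested in the question
names(maxes) <- c("country", "year", paste0(value_cols, "max"))

# Sort by country and reset the row names for a tidy printout
maxes <- maxes[order(maxes$country, maxes$year), ]
rownames(maxes) <- NULL
maxes
```

Note that if every value in a group is NA, `max(..., na.rm = TRUE)` returns `-Inf` with a warning, so such groups may need separate handling.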

Answer 2:

You can also use ddply from the plyr package, assuming your sample data is stored in data.

data<-structure(list(country1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    country2 = structure(c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 4L, 
    9L, 1L, 4L, 10L), .Label = c("A", "B", "C", "D", "E", "F", 
    "G", "H", "I", "X"), class = "factor"), year = c(2000L, 2000L, 
    2000L, 2000L, 2000L, 2000L, 2000L, 1998L, 1998L, 1998L, 2005L, 
    2005L, 2005L), x1 = c(50L, 70L, 10L, 95L, 10L, 5L, 10L, 5L, 
    30L, 10L, 10L, 90L, 49L), x2 = c(30L, 2L, 90L, 10L, 10L, 
    5L, 30L, 10L, 6L, 9L, 15L, 0L, 90L), x3 = c(1L, 5L, 20L, 
    10L, 10L, 0L, 25L, 30L, 9L, 7L, 2L, 0L, 5L), x4 = c(20L, 
    90L, 30L, 5L, 0L, 0L, 40L, 2L, 0L, 0L, 6L, 40L, 0L)), .Names = c("country1", 
"country2", "year", "x1", "x2", "x3", "x4"), class = "data.frame", row.names = c(NA, 
-13L))

install.packages("plyr")
library(plyr)
ddply(data,.(country1,year),numcolwise(max))

  country1 year x1 x2 x3 x4
1        A 2000 95 90 25 90
2        B 1998 30 10 30  2
3        C 2005 90 90  5 40

Answer 3:

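If you know SQL, you could use the sqldf function from the sqldf package: http://cran.r-project.org/web/packages/sqldf/index.html. Group by country1 as well as year, so that different countries observed in the same year are not pooled together.

library(sqldf)
df <- sqldf("select country1, year, max(x1), max(x2), max(x3), max(x4) from Data group by country1, year order by country1, year")
df
  country1 year max(x1) max(x2) max(x3) max(x4)
1        A 2000      95      90      25      90
2        B 1998      30      10      30       2
3        C 2005      90      90       5      40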

Source: https://stackoverflow.com/questions/17539696/finding-maximum-value-of-one-column-by-group-and-inserting-value-into-another

Tags

data-manipulation