How to find balanced panel data in R (aka, how to find which entries in panel are complete over given window)

I have a big panel of data from Compustat. To it I am adding some hand-collected data (seriously hand-collected from a stack of old books). But I don't want to hand-collect for the entire panel, only a randomly selected subset. To find the larger set (from which I'm randomly selecting) I would like to start with the balanced panel from Compustat.

I see the plm library for working with unbalanced panels, but I would like to keep it balanced. Is there a clean way to do this short of searching for and throwing out firms (individuals in panelspeak) that don't run the sample period? Thanks!

After a second thought, there is a much easier way for doing this.

Look at this:

data.with.only.complete.subjects.data <- function(xx, subject.column, number.of.observation.a.subject.should.have)
{
    subjects <- xx[,subject.column]
    num.of.observations.per.subject <- table(subjects)
    subjects.to.keep <- names(num.of.observations.per.subject)[num.of.observations.per.subject == number.of.observation.a.subject.should.have]

    subset.by.me <- subjects %in%   subjects.to.keep

    new.xx <- xx[subset.by.me ,]

    return(new.xx)
}

xx <- data.frame(subject = rep(1:4, each = 3),
            observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2,5),]

data.with.only.complete.subjects.data(xx.mis , 1, 3)

Looking at it now, I lost the formatting on some of the data, but I can figure that out later. Here's my attempt at taking the balanced portion of the panel:

    > data <- read.csv("223601533.csv")
> head(data)
  gvkey indfmt  datafmt consol popsrc fyear fyr datadate exchg         isin
1  2721   INDL HIST_STD      C      I  2000  12 20001231   264 JP3242800005
2  2721   INDL HIST_STD      C      I  2001  12 20011231   264 JP3242800005
3  2721   INDL HIST_STD      C      I  2002  12 20021231   264 JP3242800005
4  2721   INDL HIST_STD      C      I  2003  12 20031231   264 JP3242800005
5  2721   INDL HIST_STD      C      I  2004  12 20041231   264 JP3242800005
6  2721   INDL HIST_STD      C      I  2005  12 20051231   264 JP3242800005
    sedol      conm costat fic
1 6172323 CANON INC      A JPN
2 6172323 CANON INC      A JPN
3 6172323 CANON INC      A JPN
4 6172323 CANON INC      A JPN
5 6172323 CANON INC      A JPN
6 6172323 CANON INC      A JPN
> 
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> data.bal <- as.data.frame(matrix(NA, nrow=nt.bal, ncol=ncol(data)))
> colnames(data.bal) <- colnames(data)
> 
> for(i in 1:length(pot.obs)) {
+   last.row <- i * mode.num.obs
+   first.row <- last.row - (mode.num.obs - 1)
+   data.bal[first.row:last.row, ] <- subset(data, gvkey == pot.obs[i])
+ }
> 
> head(data.bal)
  gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin sedol conm
1  2721      2       1      1      1  2000  12 20001231   264  875   359  331
2  2721      2       1      1      1  2001  12 20011231   264  875   359  331
3  2721      2       1      1      1  2002  12 20021231   264  875   359  331
4  2721      2       1      1      1  2003  12 20031231   264  875   359  331
5  2721      2       1      1      1  2004  12 20041231   264  875   359  331
6  2721      2       1      1      1  2005  12 20051231   264  875   359  331
  costat fic
1      1   1
2      1   1
3      1   1
4      1   1
5      1   1
6      1   1
>

Update: I think this solution is less good then the other one I posted above, but I am leaving it as an example of a solution - which is not so good :) *

Hi Rishard,

It's a bit difficult with out some sample data to help.

But it sound like you could reshape your data using "melt" and "cast" from the "reshape" package. Doing that will enable you to find where you have too few observation per subject, and then use that information to subset your data.

Here is an example code of how this can be done:

xx <- data.frame(subject = rep(1:4, each = 3),
            observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2,5),]

require(reshape)


num.of.obs.per.subject <- cast(xx.mis, subject ~.)
the.number <- num.of.obs.per.subject[,2]
subjects.to.keep <- num.of.obs.per.subject[,1] [the.number  == 3]

ss.index.of.who.to.keep <- xx.mis $subject %in% subjects.to.keep 

xx.to.work.with <- xx.mis[ss.index.of.who.to.keep ,]


xx.to.work.with

Cheers,

Tal

> # read data
> file.in <- "243815928.csv"
> data <- read.csv(file.in)
> 
> # find which gvkeys run the entire sample period
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> 
> # create new df w/o firms that don't run the whole sample period
> pot.obs.index <- which(data$gvkey %in% pot.obs)
> data.bal <- data[pot.obs.index, ]
> 
> # write data to csv file
> file.out <- paste(substr(file.in, 1, (nchar(file.in)-4)), "sorted.csv", sep="")
> write.csv(data.bal, file.out)

来源：https://stackoverflow.com/questions/3096495/how-to-find-balanced-panel-data-in-r-aka-how-to-find-which-entries-in-panel-ar

标签

economics