I have a big panel of data from Compustat. To it I am adding some hand-collected data (seriously hand-collected from a stack of old books). But I don't want to hand-collect for the entire panel, only a randomly selected subset. To find the larger set (from which I'm randomly selecting) I would like to start with the balanced panel from Compustat.
I see the plm library for working with unbalanced panels, but I would like to keep mine balanced. Is there a clean way to do this, short of searching for and throwing out firms (individuals in panelspeak) that don't run the whole sample period? Thanks!
On second thought, there is a much easier way to do this.
Look at this:
data.with.only.complete.subjects.data <- function(xx, subject.column, number.of.observation.a.subject.should.have)
{
  subjects <- xx[, subject.column]
  # count how many rows each subject has
  num.of.observations.per.subject <- table(subjects)
  # keep the subjects whose row count matches the expected number
  subjects.to.keep <- names(num.of.observations.per.subject)[num.of.observations.per.subject == number.of.observation.a.subject.should.have]
  subset.by.me <- subjects %in% subjects.to.keep
  new.xx <- xx[subset.by.me, ]
  return(new.xx)
}
# toy data: 4 subjects with 3 observations each
xx <- data.frame(subject = rep(1:4, each = 3),
                 observation.per.subject = rep(rep(1:3), 4))
# drop one row from subject 1 and one from subject 2
xx.mis <- xx[-c(2, 5), ]
data.with.only.complete.subjects.data(xx.mis, 1, 3)
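The same filter can also be sketched without building a lookup table, by attaching a per-row count with base R's ave(). The helper name keep.complete.subjects and its argument names are hypothetical, made up for this illustration; the toy data is the same as above.

```r
# Hypothetical helper: keep only subjects that have exactly n.expected rows.
keep.complete.subjects <- function(df, id.col, n.expected) {
  ids <- df[[id.col]]
  # ave() returns, for every row, the number of rows that share that row's id
  rows.per.id <- ave(seq_along(ids), ids, FUN = length)
  df[rows.per.id == n.expected, , drop = FALSE]
}

xx <- data.frame(subject = rep(1:4, each = 3),
                 observation.per.subject = rep(1:3, 4))
xx.mis <- xx[-c(2, 5), ]   # subjects 1 and 2 each lose one row
balanced <- keep.complete.subjects(xx.mis, "subject", 3)
unique(balanced$subject)   # subjects 3 and 4 remain
```

Because the count is computed per row, the original row order and column types are left untouched.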
Looking at it now, I lost the formatting on some of the data (the text columns come back as integer codes in the output below), but I can figure that out later. Here's my attempt at taking the balanced portion of the panel:
> data <- read.csv("223601533.csv")
> head(data)
gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin
1 2721 INDL HIST_STD C I 2000 12 20001231 264 JP3242800005
2 2721 INDL HIST_STD C I 2001 12 20011231 264 JP3242800005
3 2721 INDL HIST_STD C I 2002 12 20021231 264 JP3242800005
4 2721 INDL HIST_STD C I 2003 12 20031231 264 JP3242800005
5 2721 INDL HIST_STD C I 2004 12 20041231 264 JP3242800005
6 2721 INDL HIST_STD C I 2005 12 20051231 264 JP3242800005
sedol conm costat fic
1 6172323 CANON INC A JPN
2 6172323 CANON INC A JPN
3 6172323 CANON INC A JPN
4 6172323 CANON INC A JPN
5 6172323 CANON INC A JPN
6 6172323 CANON INC A JPN
>
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> data.bal <- as.data.frame(matrix(NA, nrow=nt.bal, ncol=ncol(data)))
> colnames(data.bal) <- colnames(data)
>
> for(i in 1:length(pot.obs)) {
+ last.row <- i * mode.num.obs
+ first.row <- last.row - (mode.num.obs - 1)
+ data.bal[first.row:last.row, ] <- subset(data, gvkey == pot.obs[i])
+ }
>
> head(data.bal)
gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin sedol conm
1 2721 2 1 1 1 2000 12 20001231 264 875 359 331
2 2721 2 1 1 1 2001 12 20011231 264 875 359 331
3 2721 2 1 1 1 2002 12 20021231 264 875 359 331
4 2721 2 1 1 1 2003 12 20031231 264 875 359 331
5 2721 2 1 1 1 2004 12 20041231 264 875 359 331
6 2721 2 1 1 1 2005 12 20051231 264 875 359 331
costat fic
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
>
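The integer codes in the output above are probably a side effect of copying rows into a pre-allocated data frame of NA values. A minimal sketch of the effect, assuming the text columns were read in as factors (the read.csv default before R 4.0); the data frame df and its columns are made up for illustration:

```r
# Made-up example data with one factor column
df <- data.frame(id = c(1, 1, 2, 2),
                 name = factor(c("a", "a", "b", "b")))

# Pre-allocate with NA, as in the loop above, then copy rows in
blank <- as.data.frame(matrix(NA, nrow = 2, ncol = 2))
colnames(blank) <- colnames(df)
blank[1:2, ] <- df[df$id == 2, ]
is.factor(blank$name)   # FALSE: the labels are gone, only the codes are left

# Subsetting the original data frame directly keeps the column types
kept <- df[df$id == 2, ]
is.factor(kept$name)    # TRUE
```

Indexing the original data frame, instead of filling an empty one, avoids the problem entirely.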
Update: I think this solution is worse than the other one I posted above, but I am leaving it here as an example of a solution that is not so good :)
Hi Rishard,
It's a bit difficult to help without some sample data.
But it sounds like you could reshape your data using "melt" and "cast" from the "reshape" package. Doing that will let you find the subjects with too few observations, and then use that information to subset your data.
Here is some example code showing how this can be done:
xx <- data.frame(subject = rep(1:4, each = 3),
                 observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2, 5), ]
require(reshape)
# count the rows per subject (length is stated explicitly to avoid the default-aggregation warning)
num.of.obs.per.subject <- cast(xx.mis, subject ~ ., fun.aggregate = length)
the.number <- num.of.obs.per.subject[, 2]
# keep the subjects that have all 3 observations
subjects.to.keep <- num.of.obs.per.subject[, 1][the.number == 3]
ss.index.of.who.to.keep <- xx.mis$subject %in% subjects.to.keep
xx.to.work.with <- xx.mis[ss.index.of.who.to.keep, ]
xx.to.work.with
Cheers,
Tal
> # read data
> file.in <- "243815928.csv"
> data <- read.csv(file.in)
>
> # find which gvkeys run the entire sample period
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
>
> # create new df w/o firms that don't run the whole sample period
> pot.obs.index <- which(data$gvkey %in% pot.obs)
> data.bal <- data[pot.obs.index, ]
>
> # write data to csv file
> file.out <- paste(substr(file.in, 1, (nchar(file.in)-4)), "sorted.csv", sep="")
> write.csv(data.bal, file.out)
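The approach above keeps the firms whose row count equals the most common row count. A stricter check, sketched here on a made-up toy panel, keeps only firms observed in every fiscal year that appears in the data, and then draws the random subset of firms mentioned in the question. The object names and the sample size of 2 are assumptions for illustration, and a year in which no firm reports at all would go unnoticed.

```r
# Toy panel standing in for the Compustat extract (made-up values)
panel <- data.frame(
  gvkey = c(1001, 1001, 1001, 1002, 1002, 1003, 1003, 1003, 1004, 1004, 1004),
  fyear = c(2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)

all.years <- sort(unique(panel$fyear))
# Number of distinct fiscal years observed for each firm
years.per.firm <- tapply(panel$fyear, panel$gvkey,
                         function(y) length(unique(y)))
# Firms that appear in every year of the sample period
full.firms <- names(years.per.firm)[years.per.firm == length(all.years)]
panel.bal <- panel[as.character(panel$gvkey) %in% full.firms, ]

# Draw a reproducible random subset of firms for hand collection
set.seed(1)
sampled.firms <- sample(full.firms, size = 2)
panel.sample <- panel.bal[as.character(panel.bal$gvkey) %in% sampled.firms, ]
```

Sampling whole firms rather than individual rows keeps the selected subset balanced as well.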
Source: https://stackoverflow.com/questions/3096495/how-to-find-balanced-panel-data-in-r-aka-how-to-find-which-entries-in-panel-ar