Drop variable in panel data in R conditional based on a defined number of consecutive observations

我只是一个虾纸丫 submitted on 2019-12-10 22:23:13

Question


I am quite new to R. My problem is as follows:

I have a set of panel data organised as time series like this (only part is shown):

Week_Starting    Team A            Team B      Team C   Team D              
2010-01-02         1                   2           3        4
2010-01-09         2                  40           1        5
2010-01-16        15                <NA>           4       11
2010-01-23        25                <NA>           7       18
2010-01-30        38                <NA>           9       29
2010-02-06      <NA>                <NA>          12       34
2010-02-13      <NA>                <NA>          16       40
2010-02-20      <NA>                <NA>          20     <NA>
2010-02-27      <NA>                <NA>          15       28
2010-03-06      <NA>                <NA>          20     <NA>
2010-03-13      <NA>                <NA>          24     <NA>
2010-03-20      <NA>                <NA>          24     <NA>
2010-03-27      <NA>                <NA>          21     <NA>
2010-04-03      <NA>                <NA>          27     <NA>
2010-04-10      <NA>                <NA>          24     <NA>
2010-04-17      <NA>                <NA>          25     <NA>
2010-04-24      <NA>                <NA>          35     <NA>
2010-05-01      <NA>                <NA>          40     <NA>
2010-05-08      <NA>                <NA>          32     <NA>
2010-05-15      <NA>                <NA>        <NA>     <NA>
2010-05-22      <NA>                <NA>          39     <NA>

It would be pointless to use Team B, for example, because it has too many missing observations. The ranking system does not provide data for rankings below 40. So I want to clean the data by dropping columns (variables) that do not have a minimum of 8 weeks of continuous observations (Teams A, B and D in this example). Team D does not meet the requirement because of a gap in the week starting 2010-02-20. Bear in mind that I have over 1000 columns.

I tried the approach from "Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R" before, but it doesn't give me what I want, and unfortunately I am not skilled enough to modify the code to suit my needs.

Some possible solutions I can think of:

1) Subset the part of each variable that has 8 or more continuous observations.

2) Set an observation to NA if it does not belong to a continuous run of at least 8 non-NA observations, then drop the columns that contain only NA, because columns that do not meet the 8-week minimum will be left with only NA values (I hope you get what I mean).

Thanks in advance for any help, comments and other suggestions! :)

Edit:

Just out of interest, if the data is organised in long format, would it be more difficult to do the same thing?

# Using MrFlick's data frame (melt() here is assumed to come from the reshape2 package)
library(reshape2)

melt(dd, id = "Week_Starting")

       Week_Starting variable value
    1     2010-01-02   Team_A     1
    2     2010-01-09   Team_A     2
    3     2010-01-16   Team_A    15
    4     2010-01-23   Team_A    25
    5     2010-01-30   Team_A    38
    6     2010-02-06   Team_A    NA
    7     2010-02-13   Team_A    NA
    8     2010-02-20   Team_A    NA
    9     2010-02-27   Team_A    NA
    10    2010-03-06   Team_A    NA
    11    2010-03-13   Team_A    NA
    12    2010-03-20   Team_A    NA
    13    2010-03-27   Team_A    NA
    14    2010-04-03   Team_A    NA
    15    2010-04-10   Team_A    NA
    16    2010-04-17   Team_A    NA
    17    2010-04-24   Team_A    NA
    18    2010-05-01   Team_A    NA
    19    2010-05-08   Team_A    NA
    20    2010-05-15   Team_A    NA
    21    2010-05-22   Team_A    NA
    22    2010-01-02   Team_B     2
    23    2010-01-09   Team_B    40
    24    2010-01-16   Team_B    NA
    25    2010-01-23   Team_B    NA
    26    2010-01-30   Team_B    NA
    27    2010-02-06   Team_B    NA
    28    2010-02-13   Team_B    NA
    29    2010-02-20   Team_B    NA
    30    2010-02-27   Team_B    NA
    31    2010-03-06   Team_B    NA
    32    2010-03-13   Team_B    NA
    33    2010-03-20   Team_B    NA
    34    2010-03-27   Team_B    NA
    35    2010-04-03   Team_B    NA
    36    2010-04-10   Team_B    NA
    37    2010-04-17   Team_B    NA
    38    2010-04-24   Team_B    NA
    39    2010-05-01   Team_B    NA
    40    2010-05-08   Team_B    NA
    41    2010-05-15   Team_B    NA
    42    2010-05-22   Team_B    NA
    43    2010-01-02   Team_C     3
    44    2010-01-09   Team_C     1
    45    2010-01-16   Team_C     4
    46    2010-01-23   Team_C     7
    47    2010-01-30   Team_C     9
    48    2010-02-06   Team_C    12
    49    2010-02-13   Team_C    16
    50    2010-02-20   Team_C    20
    51    2010-02-27   Team_C    15
    52    2010-03-06   Team_C    20
    53    2010-03-13   Team_C    24
    54    2010-03-20   Team_C    24
    55    2010-03-27   Team_C    21
    56    2010-04-03   Team_C    27
    57    2010-04-10   Team_C    24
    58    2010-04-17   Team_C    25
    59    2010-04-24   Team_C    35
    60    2010-05-01   Team_C    40
    61    2010-05-08   Team_C    32
    62    2010-05-15   Team_C    NA
    63    2010-05-22   Team_C    39
    64    2010-01-02   Team_D     4
    65    2010-01-09   Team_D     5
    66    2010-01-16   Team_D    11
    67    2010-01-23   Team_D    18
    68    2010-01-30   Team_D    29
    69    2010-02-06   Team_D    34
    70    2010-02-13   Team_D    40
    71    2010-02-20   Team_D    NA
    72    2010-02-27   Team_D    28
    73    2010-03-06   Team_D    NA
    74    2010-03-13   Team_D    NA
    75    2010-03-20   Team_D    NA
    76    2010-03-27   Team_D    NA
    77    2010-04-03   Team_D    NA
    78    2010-04-10   Team_D    NA
    79    2010-04-17   Team_D    NA
    80    2010-04-24   Team_D    NA
    81    2010-05-01   Team_D    NA
    82    2010-05-08   Team_D    NA
    83    2010-05-15   Team_D    NA
    84    2010-05-22   Team_D    NA

Any suggestions? :)


Answer 1:


You can do this using rle to calculate the lengths of runs of non-NA values. First, here's a data.frame with your data that you can copy/paste.

dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02", 
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06", 
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13", 
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17", 
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L, 
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L, 
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA, 
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting", 
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA, 
-21L))

Now we define a function that calculates the longest run of non-NA values in a vector:

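consecnonNA <- function(x) {
    rr <- rle(is.na(x))               # runs of NA / non-NA values
    runs <- rr$lengths[!rr$values]    # keep only the lengths of the non-NA runs
    if (length(runs) == 0) 0L else max(runs)   # an all-NA column has no run, so return 0
}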
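To see what rle is doing here, the following small sketch applies it to the Team_D values from the data above. The variable names x, rr and longest are only illustrative.

```r
# Team_D values copied from the data frame above
x <- c(4, 5, 11, 18, 29, 34, 40, NA, 28, rep(NA, 12))

# rle() on the NA indicator returns alternating runs of FALSE (observed) and TRUE (missing)
rr <- rle(is.na(x))
rr$lengths   # 7 1 1 12
rr$values    # FALSE TRUE FALSE TRUE

# The longest run of observed values is the largest length among the FALSE runs
longest <- max(rr$lengths[!rr$values])
longest      # 7
```

Team_D has seven observed weeks before its first gap, which is why it falls short of the 8-week requirement.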

We can calculate this value for each of the columns and return the names of the columns that have at least 8 consecutive weeks of observations:

atleast <- function(i) {function(x) x>=i}
hasatleast8 <- names(Filter(atleast(8), sapply(dd[,-1], consecnonNA)))

And then we can subset with

dd[, c("Week_Starting", hasatleast8), drop = FALSE]
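As for the long format asked about in the edit, the same idea should carry over by computing the longest run per team instead of per column. The following is a minimal sketch in base R, assuming the column names Week_Starting, variable and value from the melt output. The data (Team_X, Team_Y) and the helper name longest_run are made up for illustration and are not the poster's data.

```r
# Hypothetical long-format data: two teams, ten weeks each
long <- data.frame(
  Week_Starting = rep(seq(as.Date("2010-01-02"), by = "week", length.out = 10), times = 2),
  variable      = rep(c("Team_X", "Team_Y"), each = 10),
  value         = c(1:9, NA,               # Team_X: one run of 9 observed weeks
                    1:4, NA, 6:8, NA, 10)  # Team_Y: longest run is 4 weeks
)

# Longest run of non-NA values in a vector (0 if everything is NA)
longest_run <- function(x) {
  rr <- rle(is.na(x))
  runs <- rr$lengths[!rr$values]
  if (length(runs) == 0) 0L else max(runs)
}

# Runs only make sense if rows are in week order within each team
long <- long[order(long$variable, long$Week_Starting), ]

# One longest-run value per team, then keep the rows of teams with at least 8 weeks
run_by_team <- tapply(long$value, long$variable, longest_run)
keep <- names(run_by_team)[run_by_team >= 8]
long_kept <- long[long$variable %in% keep, ]
```

In this sketch only Team_X survives the filter. Sorting first matters because rle only sees the order of the rows, not the dates themselves.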


Source: https://stackoverflow.com/questions/24600716/drop-variable-in-panel-data-in-r-conditional-based-on-a-defined-number-of-consec
