Question
I am quite new to R. My problem is as follows:
I have a set of panel data organised as time series like this (only part is shown):
Week_Starting Team A Team B Team C Team D
2010-01-02 1 2 3 4
2010-01-09 2 40 1 5
2010-01-16 15 <NA> 4 11
2010-01-23 25 <NA> 7 18
2010-01-30 38 <NA> 9 29
2010-02-06 <NA> <NA> 12 34
2010-02-13 <NA> <NA> 16 40
2010-02-20 <NA> <NA> 20 <NA>
2010-02-27 <NA> <NA> 15 28
2010-03-06 <NA> <NA> 20 <NA>
2010-03-13 <NA> <NA> 24 <NA>
2010-03-20 <NA> <NA> 24 <NA>
2010-03-27 <NA> <NA> 21 <NA>
2010-04-03 <NA> <NA> 27 <NA>
2010-04-10 <NA> <NA> 24 <NA>
2010-04-17 <NA> <NA> 25 <NA>
2010-04-24 <NA> <NA> 35 <NA>
2010-05-01 <NA> <NA> 40 <NA>
2010-05-08 <NA> <NA> 32 <NA>
2010-05-15 <NA> <NA> <NA> <NA>
2010-05-22 <NA> <NA> 39 <NA>
It would be pointless to use Team B, for example, because it has too many missing observations. The ranking system does not provide data for rankings below 40. So I want to clean the data by dropping columns (variables) that do not have a minimum of 8 weeks of continuous observations (Teams A, B and D in this example). Team D does not meet the requirement because of the gap in the week starting 2010-02-20. Bear in mind that I have over 1000 columns.
I tried "Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R" before, but it doesn't give me what I want, and unfortunately I am not skilled enough to modify the code to suit my needs.
Some possible solutions I can think of:
1) subset the part of each variable that has 8 or more continuous observations
2) set the observation values to NA in any continuous run shorter than 8 observations, then drop the columns that contain only NA, because columns that do not meet the 8-week minimum will then have only NA values (I hope you get what I mean)
Thanks in advance for any help, comments and other suggestions! :)
Edit:
Just out of interest, if the data is organised in long format, would it be more difficult to do the same thing?
# Using MrFlick's data frame dd; melt() here is assumed to come from the reshape2 package
library(reshape2)
melt(dd, id = "Week_Starting")
Week_Starting variable value
1 2010-01-02 Team_A 1
2 2010-01-09 Team_A 2
3 2010-01-16 Team_A 15
4 2010-01-23 Team_A 25
5 2010-01-30 Team_A 38
6 2010-02-06 Team_A NA
7 2010-02-13 Team_A NA
8 2010-02-20 Team_A NA
9 2010-02-27 Team_A NA
10 2010-03-06 Team_A NA
11 2010-03-13 Team_A NA
12 2010-03-20 Team_A NA
13 2010-03-27 Team_A NA
14 2010-04-03 Team_A NA
15 2010-04-10 Team_A NA
16 2010-04-17 Team_A NA
17 2010-04-24 Team_A NA
18 2010-05-01 Team_A NA
19 2010-05-08 Team_A NA
20 2010-05-15 Team_A NA
21 2010-05-22 Team_A NA
22 2010-01-02 Team_B 2
23 2010-01-09 Team_B 40
24 2010-01-16 Team_B NA
25 2010-01-23 Team_B NA
26 2010-01-30 Team_B NA
27 2010-02-06 Team_B NA
28 2010-02-13 Team_B NA
29 2010-02-20 Team_B NA
30 2010-02-27 Team_B NA
31 2010-03-06 Team_B NA
32 2010-03-13 Team_B NA
33 2010-03-20 Team_B NA
34 2010-03-27 Team_B NA
35 2010-04-03 Team_B NA
36 2010-04-10 Team_B NA
37 2010-04-17 Team_B NA
38 2010-04-24 Team_B NA
39 2010-05-01 Team_B NA
40 2010-05-08 Team_B NA
41 2010-05-15 Team_B NA
42 2010-05-22 Team_B NA
43 2010-01-02 Team_C 3
44 2010-01-09 Team_C 1
45 2010-01-16 Team_C 4
46 2010-01-23 Team_C 7
47 2010-01-30 Team_C 9
48 2010-02-06 Team_C 12
49 2010-02-13 Team_C 16
50 2010-02-20 Team_C 20
51 2010-02-27 Team_C 15
52 2010-03-06 Team_C 20
53 2010-03-13 Team_C 24
54 2010-03-20 Team_C 24
55 2010-03-27 Team_C 21
56 2010-04-03 Team_C 27
57 2010-04-10 Team_C 24
58 2010-04-17 Team_C 25
59 2010-04-24 Team_C 35
60 2010-05-01 Team_C 40
61 2010-05-08 Team_C 32
62 2010-05-15 Team_C NA
63 2010-05-22 Team_C 39
64 2010-01-02 Team_D 4
65 2010-01-09 Team_D 5
66 2010-01-16 Team_D 11
67 2010-01-23 Team_D 18
68 2010-01-30 Team_D 29
69 2010-02-06 Team_D 34
70 2010-02-13 Team_D 40
71 2010-02-20 Team_D NA
72 2010-02-27 Team_D 28
73 2010-03-06 Team_D NA
74 2010-03-13 Team_D NA
75 2010-03-20 Team_D NA
76 2010-03-27 Team_D NA
77 2010-04-03 Team_D NA
78 2010-04-10 Team_D NA
79 2010-04-17 Team_D NA
80 2010-04-24 Team_D NA
81 2010-05-01 Team_D NA
82 2010-05-08 Team_D NA
83 2010-05-15 Team_D NA
84 2010-05-22 Team_D NA
Any suggestions? :)
Answer 1:
You can do this using rle to calculate the lengths of runs of non-NA values. First, here's a data.frame of your data that you can copy and paste.
dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02",
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06",
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13",
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17",
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L,
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L,
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA,
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting",
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA,
-21L))
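Now we define a function that calculates the longest run of non-NA values in a vector:
consecnonNA <- function(x) {
  # Run-length encode the NA mask: TRUE runs are gaps, FALSE runs are observations
  rr <- rle(is.na(x))
  # Longest non-NA run; the leading 0 covers an all-NA column,
  # where max() of an empty vector would otherwise warn and return -Inf
  max(0L, rr$lengths[!rr$values])
}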
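As a quick check, here is a minimal sketch that applies the function to the Team C and Team D values from the question. The function is restated (with a guard for all-NA input) so the snippet runs on its own; the vector names team_c and team_d are made up for illustration.

```r
# Longest run of non-NA values in a vector (0 if the vector is all NA)
consecnonNA <- function(x) {
  rr <- rle(is.na(x))
  max(0L, rr$lengths[!rr$values])
}

# Values copied from the question; team_c and team_d are illustrative names
team_c <- c(3, 1, 4, 7, 9, 12, 16, 20, 15, 20, 24, 24, 21, 27, 24, 25,
            35, 40, 32, NA, 39)
team_d <- c(4, 5, 11, 18, 29, 34, 40, NA, 28, rep(NA, 12))

consecnonNA(team_c)  # 19: the first 19 weeks are all observed
consecnonNA(team_d)  # 7: the gap at 2010-02-20 ends the run after 7 weeks
```

Team D falls one week short of the 8-week threshold, which matches the question's description.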
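We can calculate this value for each of the columns and return the names of those columns that have at least 8 consecutive weeks:
# atleast(i) returns a predicate that tests x >= i
atleast <- function(i) { function(x) x >= i }
# Longest non-NA run per team column (column 1 is the date); keep names with a run >= 8
hasatleast8 <- names(Filter(atleast(8), sapply(dd[, -1], consecnonNA)))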
And then we can subset with
dd[, c("Week_Starting", hasatleast8), drop = FALSE]
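On the follow-up about long format: the same idea should carry over by computing the longest run per team instead of per column. The following is only a sketch in base R, assuming the column names variable and value produced by melt(); the small data frame long and the helper longest_run are made up for illustration and are not part of the original answer.

```r
# Hypothetical long-format data: two teams, ten weeks each
long <- data.frame(
  Week_Starting = rep(seq(as.Date("2010-01-02"), by = "week", length.out = 10),
                      times = 2),
  variable = rep(c("Team_C", "Team_D"), each = 10),
  value = c(3, 1, 4, 7, 9, 12, 16, 20, 15, 20,
            4, 5, 11, 18, 29, 34, 40, NA, 28, NA)
)

# Longest run of non-NA values in a vector (0 if the vector is all NA)
longest_run <- function(x) {
  rr <- rle(is.na(x))
  max(0L, rr$lengths[!rr$values])
}

# Sort by team and week so that runs are measured in calendar order
long <- long[order(long$variable, long$Week_Starting), ]

# Longest non-NA run for each team, then keep only teams with a run of 8 or more
runs <- tapply(long$value, long$variable, longest_run)
keep <- names(runs)[runs >= 8]
long_kept <- long[long$variable %in% keep, ]
```

Here only Team_C survives the filter. The sort step matters in long format because the run lengths depend on row order within each team.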
Source: https://stackoverflow.com/questions/24600716/drop-variable-in-panel-data-in-r-conditional-based-on-a-defined-number-of-consec