Dealing with Messy Dates

南方客 2021-01-31 08:18

I hope you didn't think I was asking for relationship advice.

Infrequently, I have to offer survey respondents the ability to specify when an event occurred. What resu

5 Answers
  • 2021-01-31 08:56

    This may be one of the few cases where a tool other than R is the best choice. I know that some Perl modules have already been developed to parse messy-looking dates. One of them, DateTime::Format::Natural::Lang::EN, can parse strings like "1st tuesday last november". I seem to remember another module that could understand things like "the second tuesday after the first Monday in February".

    There is also a tool at http://www.datasciencetoolkit.org/ that grabs what looks like dates in text and converts them to a standard format.

  • 2021-01-31 09:00

    My sympathy that your date didn't turn out as pretty as expected. ;-)

    I have constructed a (still partial) solution along the lines suggested by @Rguy.

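    (Please note that this code still has a bug: it doesn't always return the correct time. It doesn't always capture all of the digits before the colon, so it sometimes returns 1:00 when the time is 11:00. The likely cause is the leading ^.* in the pattern that the helper builds: it is greedy and consumes as many characters as it can before the capture group starts.)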

    First, construct a helper function that wraps around gsub and grep. This function takes a character vector as one of its arguments and collapses this into a single string separated by |. The effect of this is to allow you to easily pass multiple patterns to be matched by a regex:

    find.pattern <- function(x, pattern_list){
      pattern <- paste(pattern_list, collapse="|")
      ret <- gsub(paste("^.*(", pattern, ").*", sep=""), "\\1", x, ignore.case=TRUE)
      ret[ret==x] <- NA 
      ret2 <- grepl(paste("^(", pattern, ")$", sep=""), x, ignore.case=TRUE)
      ret[ret2] <- x[ret2] 
      ret
    }
    

    Next, use some built-in variable names to construct a vector of months and abbreviations:

    all.month <- c(month.name, month.abb)
    

    Finally, construct a data frame with different extracts:

    ret <- data.frame(
        data = dat, 
        date1 = find.pattern(dat, "\\d+/\\d+/\\d+"),
        date2 = find.pattern(dat, 
          paste(all.month, "\\s*\\d+[(th)|,]*\\s{0,3}[(2010)|(2011)]*", collapse="|", sep="")),
        year = find.pattern(dat, c(2010, 2011)),
        month = find.pattern(dat, month.abb), #Use base R variable called month.abb for month names
        hour = find.pattern(dat, c("\\d+[\\.:h]\\d+", "12 noon")),
        ampm = find.pattern(dat, c("am", "pm"))
    )
    

    The results:

    head(ret, 50)
                          data  date1        date2 year month  hour ampm
    20   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am
    21   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am
    22     Mar 18, 2011 9:33am   <NA> Mar 18, 2011 2011   Mar  9:33   am
    23     Mar 18, 2011 9:27am   <NA> Mar 18, 2011 2011   Mar  9:27   am
    24                      df   <NA>         <NA> <NA>  <NA>  <NA> <NA>
    25                      fg   <NA>         <NA> <NA>  <NA>  <NA> <NA>
    26                   12:16   <NA>         <NA> <NA>  <NA> 12:16 <NA>
    27                    9:50   <NA>         <NA> <NA>  <NA>  9:50 <NA>
    28   Feb 8, 2011 / 12:20pm   <NA>  Feb 8, 2011 2011   Feb  2:20   pm
    29         8:34 am  2/4/11 2/4/11         <NA> <NA>  <NA>  8:34   am
    30     Jan 31, 2011 2:50pm   <NA> Jan 31, 2011 2011   Jan  2:50   pm
    31     Jan 31, 2011 2:45pm   <NA> Jan 31, 2011 2011   Jan  2:45   pm
    32     Jan 31, 2011 2:38pm   <NA> Jan 31, 2011 2011   Jan  2:38   pm
    33     Jan 31, 2011 2:26pm   <NA> Jan 31, 2011 2011   Jan  2:26   pm
    34                   11h09   <NA>         <NA> <NA>  <NA> 11h09 <NA>
    35                11:00 am   <NA>         <NA> <NA>  <NA>  1:00   am
    36                 1h02 pm   <NA>         <NA> <NA>  <NA>  1h02   pm
    37                   10h03   <NA>         <NA> <NA>  <NA> 10h03 <NA>
    38                    2h10   <NA>         <NA> <NA>  <NA>  2h10 <NA>
    39 Jan 13, 2011 9:50am Van   <NA> Jan 13, 2011 2011   Jan  9:50   am
    40            Jan 12, 2011   <NA> Jan 12, 2011 2011   Jan  <NA> <NA>
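
    The truncated hours in rows 28 and 35 (2:20 for 12:20, 1:00 for 11:00) are consistent with the leading ^.* in find.pattern being greedy: it takes as many characters as it can and leaves only one digit for \\d+ to match. Here is a minimal sketch of one way around that, assuming it is acceptable to keep only the first match in each string. It pulls the match out with base R's regexpr and regmatches instead of gsub. The helper name extract.first and the sample vector times are made up for illustration and are not part of the code above.

    ```r
    # Hypothetical helper: return the first match of any pattern, or NA if none.
    extract.first <- function(x, pattern_list) {
      pattern <- paste(pattern_list, collapse = "|")
      m <- regexpr(pattern, x, ignore.case = TRUE)
      ret <- rep(NA_character_, length(x))
      # regmatches() returns one substring per string that matched
      ret[as.vector(m) != -1] <- regmatches(x, m)
      ret
    }

    # Made-up sample inputs modelled on the rows above
    times <- c("11:00 am", "Feb 8, 2011 / 12:20pm", "10h03", "df")

    # Greedy prefix, as in find.pattern: the hour loses its leading digit
    greedy <- gsub("^.*(\\d+[\\.:h]\\d+).*", "\\1", times)

    # Direct extraction keeps every digit of the hour
    direct <- extract.first(times, c("\\d+[\\.:h]\\d+", "12 noon"))

    data.frame(times, greedy, direct)
    ```

    On these inputs the gsub call keeps only one digit of the hour, while extract.first returns the full time and NA for strings with no match.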
    
  • 2021-01-31 09:06

    Wolfram Alpha (http://www.wolframalpha.com/) is definitely a great tool for this kind of work.

    At the least, it successfully interprets some of the messy input in your data. It would be worth trying.

    I'm not sure whether the site is suitable for an extremely large dataset, but if the data is not too large, it will be useful.

    It is not difficult to write an automated script that sends a query, gets the result, and parses it, although I'm not sure whether the site allows such usage.

  • 2021-01-31 09:20

    I'm not going to try to write the function right now, but I have an idea that might work.

    Search each string for a 4-digit number to call the year.
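
    A minimal sketch of that year step, assuming the responses sit in a character vector called test (the sample values here are made up) and that only years starting with 19 or 20 are plausible:

    ```r
    # Made-up sample responses; `test` stands in for the real data
    test <- c("Mar 18, 2011 9:33am", "April 4th around 10am", "8:34 am  2/4/11")

    # A stand-alone run of four digits starting with 19 or 20 (an assumption)
    m <- regexpr("\\b(19|20)\\d{2}\\b", test, perl = TRUE)

    year <- rep(NA_integer_, length(test))   # NA where no year is found
    year[as.vector(m) != -1] <- as.integer(regmatches(test, m))
    year
    ```

    Two-digit years such as the 11 in 2/4/11 are deliberately left as NA here, since they cannot be told apart from days or months without more context.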

    Use grep to search each string for the three-letter abbreviations of the months. It seems almost all of your data (at least above) has an identifier like that. I'd store the value that is found in a "months" vector and put blanks wherever no value is found. Here's a really ugly version of the code (I'll make this more efficient later and add the case where the month isn't capitalized!)

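    # test is the character vector of raw responses
    mos <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    # One logical vector per month: TRUE where the string contains that abbreviation
    blah <- lapply(1:12, function(i) grepl(mos[i], test))
    # Inspect which strings matched each month
    lapply(blah, function(i) which(i))
    # 0 means no month was found
    months <- rep(0, length(test))
    for (i in 1:12) {
      months[blah[[i]]] <- i
    }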
    
    
       months
      [1]  5  0  0  4  0  4  4  4  4  4  4  4  4  4  4  4  4  0  4  4  4  3  3  0  0  0  0  2  0  1
     [31]  1  1  1  0  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0 12 12 12 12  0
     [61]  0  0  0  0 12 12 12 12  0 12 12 12 12 12 12 12 12 12  0  0  0 12 12 12 12 11 11  0 11 11
     [91] 11  0 11  0 11  0 11  0  0 11 11 11  0 11  0 11 11 11  0 11 11 11 11  0 11  0  0  0 10 10
    [121] 10  0 10 10 10  0  0 10 10 10  0  0  0  0  0 10 10  0  0 10 10 10 10  0 10  0 10  0  0  0
    [151] 10  0 10 10 10 10 10  9  9  9  9  8  0  0 
    

    The "day" most commonly follows the month word immediately. So if there is a one- or two-digit number right after the month (which is a character string), extract that number and call it the day.
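
    One way the day step could look, as a sketch only. It assumes a made-up sample vector test, capitalized month names, and that the day is the one or two digits right after the month word. The variable names are illustrative.

    ```r
    # Made-up sample responses; `test` stands in for the real data
    test <- c("Mar 18, 2011 9:33am", "April 4th around 10am", "12:16")

    # Month abbreviation, the rest of the month name, an optional period
    month.part <- paste0("(?:", paste(month.abb, collapse = "|"), ")[a-z]*\\.?")
    # ...then optional spaces and one or two digits not followed by another digit
    pattern <- paste0(month.part, "\\s*(\\d{1,2})(?!\\d)")

    has.day <- grepl(pattern, test, perl = TRUE)
    day <- rep(NA_integer_, length(test))   # NA where no day is found
    day[has.day] <- as.integer(
      sub(paste0("^.*?", pattern, ".*$"), "\\1", test[has.day], perl = TRUE)
    )
    day
    ```

    The negative lookahead keeps a year that directly follows the month (as in "Mar 2011") from being read as a day.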

    Times most commonly have the ":" or "." symbol in them, so search each string for that character. If it is found in a string, create a "Time" vector with all of the digits immediately before and after that character (in theory, including 2 before and 2 after should not cause a problem). Put blanks wherever the symbol is not present. It would be nice if all of the data were definitely confined to a <12 hour period, because then you wouldn't have to worry about AM and PM. If not, maybe search the string for "AM" and "PM" as well.
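
    A rough sketch of the time step, assuming a made-up sample vector test. The helper first.match is hypothetical and simply wraps base R's regexpr and regmatches. Note that a dotted date such as 2.4.11 would also match the time pattern, so this is only a starting point.

    ```r
    # Made-up sample responses; `test` stands in for the real data
    test <- c("Mar 18, 2011 9:33am", "Feb 8, 2011 / 12:20pm", "10.03", "Jan 12, 2011")

    # Hypothetical helper: first match of `pattern` in each string, NA if none
    first.match <- function(x, pattern) {
      m <- regexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
      out <- rep(NA_character_, length(x))
      out[as.vector(m) != -1] <- regmatches(x, m)
      out
    }

    # One or two digits, then ":" or ".", then exactly two digits
    time <- first.match(test, "\\d{1,2}[:.]\\d{2}")

    # "am" or "pm" that is not preceded by a letter and ends at a word boundary
    ampm <- tolower(first.match(test, "(?<![a-z])[ap]m\\b"))

    data.frame(test, time, ampm)
    ```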

    Then, try to convert the strings which have all four of the above to POSIXct. You'll have to enter the ones that don't convert manually, of course. I think it would take me a few hours to code the function described above, and depending on the variability and size of your dataset it may or may not be worth the effort. Also, there is some risk of incorrect outputs, so adding an acceptable time range would help to avoid that.
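
    The conversion and range check might be sketched like this. The data frame parts holds made-up components that are assumed to have been extracted already, and the accepted window of 2010 to 2011 is an assumption for illustration. ISOdatetime is base R and returns NA when a component is missing.

    ```r
    # Made-up, already-extracted components for four responses
    parts <- data.frame(
      year  = c(2011, 2011, NA, 2001),
      month = c(3, 2, 1, 1),
      day   = c(18, 8, 12, 31),
      time  = c("9:33", "12:20", "2:50", "2:50"),
      ampm  = c("am", "pm", "pm", "pm"),
      stringsAsFactors = FALSE
    )

    # Split the time into hour and minute
    hm <- strsplit(parts$time, "[:.]")
    hour <- as.integer(sapply(hm, `[`, 1))
    minute <- as.integer(sapply(hm, `[`, 2))

    # 12-hour to 24-hour clock; leave the hour alone when am/pm is missing
    hour24 <- ifelse(is.na(parts$ampm), hour,
                     hour %% 12 + ifelse(parts$ampm == "pm", 12, 0))

    parsed <- ISOdatetime(parts$year, parts$month, parts$day,
                          hour24, minute, 0, tz = "UTC")

    # Assumed acceptable window; anything outside it is treated as suspect
    lower <- as.POSIXct("2010-01-01 00:00", tz = "UTC")
    upper <- as.POSIXct("2011-12-31 23:59", tz = "UTC")
    ok <- !is.na(parsed) & parsed >= lower & parsed <= upper

    parts$parsed <- parsed
    parts[!ok, ]   # rows left for manual entry
    ```

    Here the third row fails because the year is missing and the fourth because 2001 falls outside the assumed window, so both would be left for manual entry.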

    In summary, it sounds like you're going to have to code a function with a whole lot of exceptions and then end up hand-coding a good portion of the times anyway. I hope someone can provide a better solution for you, though.

    Good Luck!

  • 2021-01-31 09:21

    Others have already addressed standard approaches and packages. I'll take a different perspective. Using regular expressions and fixed formats will get you most of the way. For the rest, I'd simply approach it as I would any problem in "pattern matching": statistical methods or machine learning. You've already specified the date and time ranges, and the timestamp of the logs is also informative. By extracting a lot of text features (this is where regular expressions would prove useful), you could then try to map to times of interest.

    There are only three things to do for getting this working:

    1. Feature extraction
    2. Training set generation
    3. Build & deploy models

    Build and deploy models? Let me introduce you to my friend R and the machine learning task view. :) The basic models to explore include multinomial models (take a look at glmnet), decision trees, and support vector machines. You might use decision trees and SVMs as inputs for a multinomial model (and the SVMs might not be necessary after all). To be honest, this part is nebulous: one could do this modeling as disconnected date components or as a process of refinements, e.g. get the year, if possible, then the minutes (because the range is much larger than for hours, days, months), then day of month, and finally hours and months. Essentially, I'd aim for trying to identify "parts of time" (analogous to parts of speech) for the numerical/string components.

    Feature extraction: I'd try splits with colons, commas, slashes, dashes, periods, etc. Anything that is not a numeric value. I would then create data sets based on the features in order and in any order (i.e. an indicator value of features seen, ignoring the positions).
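
    As a rough illustration of that feature extraction step, the sketch below splits each response on separators, labels every token with a coarse type, and then builds both an ordered feature string and an order-free indicator matrix. The sample vector test, the helper classify, and the type labels are all made up for this example.

    ```r
    # Made-up sample responses; `test` stands in for the real data
    test <- c("Mar 18, 2011 9:33am", "8:34 am  2/4/11", "11h09")

    # Split on runs of whitespace, colons, commas, slashes, periods, and dashes
    tokens <- strsplit(test, "[[:space:]:,/.-]+")

    # Hypothetical helper: give each token a coarse type label
    classify <- function(tok) {
      type <- rep("other", length(tok))
      type[grepl("^[0-9]{1,2}$", tok)] <- "num1or2"
      type[grepl("^[0-9]{4}$", tok)] <- "num4"
      is.word <- grepl("^[A-Za-z]+$", tok)
      type[is.word & tolower(substr(tok, 1, 3)) %in% tolower(month.abb)] <- "month"
      type[grepl("^[ap]m$", tok, ignore.case = TRUE)] <- "ampm"
      type
    }
    types <- lapply(tokens, classify)

    # Features in order: one string of type labels per response
    ordered <- sapply(types, paste, collapse = " ")

    # Features in any order: 1 if the type was seen at all, else 0
    feature.names <- c("num1or2", "num4", "month", "ampm", "other")
    indicators <- t(sapply(types, function(ty) as.integer(feature.names %in% ty)))
    colnames(indicators) <- feature.names

    ordered
    indicators
    ```

    Either representation could then serve as input to the models mentioned above, once a labelled training set exists.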

    Training data: Amazon's Mechanical Turk.

    Or, you know what, just ignore all of that programming and statistical mumbo jumbo and send everything to Mechanical Turk. :)
