Dealing with Messy Dates

南方客 2021-01-31 08:18

I hope you didn't think I was asking for relationship advice.

Infrequently, I have to offer survey respondents the ability to specify when an event occurred. What resu

5 answers
  •  挽巷 (OP)
    2021-01-31 09:21

    Others have already addressed standard approaches and packages. I'll take a different perspective. Using regular expressions and fixed formats will get you most of the way. For the rest, I'd simply approach it as I would any problem in "pattern matching": statistical methods or machine learning. You've already specified the date and time ranges, and the timestamp of the logs is also informative. By extracting a lot of text features (this is where regular expressions would prove useful), you could then try to map to times of interest.
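
    As a rough illustration of the "fixed formats" part, here is a minimal Python sketch that tries a short list of known formats in turn. The format list and the name `parse_fixed` are assumptions made for the example, not something taken from the question; anything that falls through to `None` is what the statistical approach would have to handle.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of formats; extend it with whatever the survey data contains.
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%d %B %Y",
    "%B %d, %Y",
]


def parse_fixed(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no known format fits."""
    cleaned = " ".join(text.split())  # trim and collapse runs of whitespace
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None
```

    Note that `%B` matches month names in the current locale, so this assumes English month names.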

    Only three steps are needed to get this working:

    1. Feature extraction
    2. Training set generation
    3. Build & deploy models

    Build and deploy models? Let me introduce you to my friend R and the machine learning task view. :) The basic models to explore include multinomial models (take a look at glmnet), decision trees, and support vector machines. You might use decision trees and SVMs as inputs for a multinomial model (and the SVMs might not be necessary after all). To be honest, this part is nebulous: one could model the date components independently of each other, or as a sequence of refinements, e.g. get the year, if possible, then the minutes (because their range is much larger than for hours, days, months), then the day of month, and finally hours and months. Essentially, I'd aim to identify "parts of time" (analogous to parts of speech) for the numerical/string components.
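
    To make the "parts of time" idea concrete, here is a small Python sketch (not a model, just the range reasoning) that lists which roles a numeric token could play. The name `candidate_parts` and the year window of 1900 to 2100 are assumptions for illustration, and two-digit years are deliberately ignored.

```python
def candidate_parts(token: str) -> set:
    """Return the date/time roles a numeric token could plausibly fill."""
    value = int(token)
    parts = set()
    # Assumed year window; adjust to the date range of the survey.
    if len(token) == 4 and 1900 <= value <= 2100:
        parts.add("year")
    if len(token) <= 2:
        if 1 <= value <= 12:
            parts.add("month")
        if 1 <= value <= 31:
            parts.add("day")
        if 0 <= value <= 23:
            parts.add("hour")
        if 0 <= value <= 59:
            parts.add("minute")
    return parts
```

    A token such as "45" can only be a minute, which is why the wider minute range helps pin components down early; a token such as "08" stays ambiguous and needs context or a trained model to resolve.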

    Feature extraction: I'd try splits with colons, commas, slashes, dashes, periods, etc. Anything that is not a numeric value. I would then create data sets based on the features in order and in any order (i.e. an indicator value of features seen, ignoring the positions).
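
    A minimal Python sketch of that feature extraction, under my own assumptions about labelling: digit runs become `D<length>`, letter runs become `ALPHA`, and each separator character is kept as its own feature. The function names are hypothetical. One function keeps the features in order, the other reduces them to a position-free set of features seen.

```python
import re

# Digit runs, letter runs, or any single non-space separator character.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]")


def label(token: str) -> str:
    """Map a raw token to a coarse feature label."""
    if token.isdigit():
        return f"D{len(token)}"  # e.g. "2021" -> "D4"
    if token.isalpha():
        return "ALPHA"  # month names, "am"/"pm", and so on
    return token  # separators such as ":", "/", "-", ","


def ordered_features(text: str) -> list:
    """Feature labels in the order they appear (whitespace is dropped)."""
    return [label(t) for t in TOKEN_RE.findall(text)]


def indicator_features(text: str) -> set:
    """Set of feature labels seen, ignoring position and repetition."""
    return set(ordered_features(text))
```

    For example, `ordered_features("2021-01-31 08:18")` gives `["D4", "-", "D2", "-", "D2", "D2", ":", "D2"]`, while `indicator_features` on the same string gives the set containing `D4`, `D2`, `-` and `:`.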

    Training data: Amazon's Mechanical Turk.

    Or, you know what, just ignore all of that programming and statistical mumbo jumbo and send everything to Mechanical Turk. :)
