Plotting the frequency of string matches over time in R

问题

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)

id      when            time        day month   year    handle  what
UK1.1   Sat Feb 20 2016 12:34:02    20  2       2016    dave    Great goal by #lfc
UK1.2   Sat Feb 20 2016 15:12:42    20  2       2016    john    Can't wait for the weekend 
UK1.3   Sat Mar 01 2016 12:09:21    1   3       2016    smith   Generic boring tweet

Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.

I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.

The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.

So basically, I'd like to make plots like these, where frequency of string x is plotted against time

Thanks!

回答1:

Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:

library(dplyr)
library(lubridate)

# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000), 
                 what = sample(LETTERS, 10000, replace=TRUE))

tweet.summary = dat %>% group_by(day = date(time)) %>%  # To summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)            

tweet.summary

          day total.tweets A.tweets      pct.A B.tweets      pct.B
1  2016-01-01           28        3 0.10714286        0 0.00000000
2  2016-01-02           27        0 0.00000000        1 0.03703704
3  2016-01-03           28        4 0.14285714        1 0.03571429
4  2016-01-04           27        2 0.07407407        2 0.07407407
...

Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:

library(ggplot2)
library(reshape2)
library(scales)

ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")

Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.

# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016, 
                  time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))

# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-", 
                                         sprintf("%02d",month),"-", 
                                         sprintf("%02d", day)," ", 
                                         time)))

# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-", 
                                      sprintf("%02d",month),"-", 
                                      sprintf("%02d", day))))

dat2

   day month year  time          posix.date       date
1   28    10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2   22     6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3    3     4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4   15     8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5    6     2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6    2    12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7    4    11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8   12     3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9   24     5 2016 08:47 2016-05-24 08:47:00 2016-05-24 
10  27     1 2016 04:22 2016-01-27 04:22:00 2016-01-27

You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).

来源：https://stackoverflow.com/questions/37167148/plotting-the-frequency-of-string-matches-over-time-in-r

标签

time

plot

grep

frequency