Roman Cheplyaka

Group data by month in R

Published on February 22, 2017; tags: rstats

I often analyze time series data in R — things like daily expenses or webserver statistics. And just as often I want to aggregate the data by month to see longer-term patterns.

Doing this in base R is a bit awkward, and the internet is littered with working but terrible solutions using strftime or similar date-to-string conversions.

The real gem is the function floor_date from the lubridate package. I found out about it from this old StackOverflow answer by Hadley Wickham. As the name implies, it can be used to round each date down to the month boundary, so that dates in the same month are rounded down to the same date.

Let’s look at an example.

library(dplyr)
library(lubridate)
set.seed(2017)
options(digits=4)

Say these are your daily expenses for 2016:

(expenses <- data_frame(
  date=seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by=1),
  amount=rgamma(length(date), shape = 2, scale = 20)))
## # A tibble: 366 × 2
##          date amount
##        <date>  <dbl>
## 1  2016-01-01  75.42
## 2  2016-01-02  28.14
## 3  2016-01-03  51.12
## 4  2016-01-04  26.12
## 5  2016-01-05  42.09
## 6  2016-01-06  40.99
## 7  2016-01-07  45.59
## 8  2016-01-08  57.55
## 9  2016-01-09  22.98
## 10 2016-01-10  14.50
## # ... with 356 more rows

Then you can summarize them by month like this:

expenses %>% group_by(month=floor_date(date, "month")) %>%
   summarize(amount=sum(amount))
## # A tibble: 12 × 2
##         month amount
##        <date>  <dbl>
## 1  2016-01-01 1200.9
## 2  2016-02-01 1002.9
## 3  2016-03-01 1237.6
## 4  2016-04-01 1120.8
## 5  2016-05-01 1276.6
## 6  2016-06-01 1404.6
## 7  2016-07-01  972.6
## 8  2016-08-01 1245.7
## 9  2016-09-01 1020.2
## 10 2016-10-01  986.2
## 11 2016-11-01 1106.3
## 12 2016-12-01 1235.2

floor_date lets you round dates to various time periods from seconds to years and also multiples of these periods, e.g.

expenses %>% group_by(month=floor_date(date, "14 days")) %>%
   summarize(amount=sum(amount))
## # A tibble: 36 × 2
##         month amount
##        <date>  <dbl>
## 1  2016-01-01  550.0
## 2  2016-01-15  462.1
## 3  2016-01-29  188.8
## 4  2016-02-01  568.2
## 5  2016-02-15  335.1
## 6  2016-02-29   99.6
## 7  2016-03-01  675.7
## 8  2016-03-15  459.8
## 9  2016-03-29  102.2
## 10 2016-04-01  458.3
## # ... with 26 more rows