Dates and times are very common in data analysis, not least in time-series analysis. The bad news is that with different numbers of days in each month, leap years, leap seconds,[33] and time zones, they can be fairly awful to deal with programmatically. The good news is that R has a wide range of capabilities for dealing with times and dates. While these concepts are fairly fundamental to R programming, they’ve been left until now because some of the best ways of using them appear in add-on packages. As you begin reading this chapter, you may feel an awkward sensation that the code is grating on you. At that point, we’ll seek lubrication from the lubridate package, which makes your date-time code more readable.
After reading this chapter, you should:
POSIXct, POSIXlt, and Date
lubridate package
There are three date and time classes that come with R: POSIXct, POSIXlt, and Date.
POSIX dates and times are classic R: brilliantly thorough in their implementation, navigating all sorts of obscure technical issues, but with awful Unixy names that make everything seem more complicated than it really is.
The two standard date-time classes in R are POSIXct and POSIXlt. (I said the names were awful!) POSIX is a set of standards that defines compliance with Unix, including how dates and times should be specified. ct is short for “calendar time,” and the POSIXct class stores dates as the number of seconds since the start of 1970, in the Coordinated Universal Time (UTC) zone.[34] POSIXlt stores dates as a list, with components for seconds, minutes, hours, day of month, etc. POSIXct is best for storing dates and calculating with them, whereas POSIXlt is best for extracting specific parts of a date.
The function Sys.time returns the current date and time in POSIXct form:
(now_ct <- Sys.time())
## [1] "2013-07-17 22:47:01 BST"
The class of now_ct has two elements. It is a POSIXct variable, and POSIXct is inherited from the class POSIXt:
class(now_ct)
## [1] "POSIXct" "POSIXt"
When a date is printed, you just see a formatted version of it, so it isn’t obvious how the date is stored. By using unclass, we can see that it is indeed just a number:
unclass(now_ct)
## [1] 1.374e+09
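To make the “seconds since 1970” idea concrete, here is a minimal sketch that builds a POSIXct value from a raw number and then recovers that number. The variable name epoch_plus_one_day is made up for illustration, and the time zone is fixed to UTC so that the result does not depend on your locale.

```r
# Build a POSIXct value from a count of seconds since 1970-01-01 00:00:00 UTC.
# 86400 seconds is exactly one day.
epoch_plus_one_day <- as.POSIXct(86400, origin = "1970-01-01", tz = "UTC")
format(epoch_plus_one_day, "%Y-%m-%d %H:%M:%S")  # "1970-01-02 00:00:00"

# Going the other way recovers the underlying number.
as.numeric(epoch_plus_one_day)  # 86400
```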
When printed, the POSIXlt date looks exactly the same, but underneath the storage mechanism is very different:
(now_lt <- as.POSIXlt(now_ct))
## [1] "2013-07-17 22:47:01 BST"
class(now_lt)
## [1] "POSIXlt" "POSIXt"
unclass(now_lt)
## $sec
## [1] 1.19
##
## $min
## [1] 47
##
## $hour
## [1] 22
##
## $mday
## [1] 17
##
## $mon
## [1] 6
##
## $year
## [1] 113
##
## $wday
## [1] 3
##
## $yday
## [1] 197
##
## $isdst
## [1] 1
##
## attr(,"tzone")
## [1] ""    "GMT" "BST"
You can use list indexing to access individual components of a POSIXlt date:
now_lt$sec
## [1] 1.19
now_lt[["min"]]
## [1] 47
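The components of a POSIXlt object follow C conventions, which can be surprising: months count from zero and years count from 1900. A small sketch, using a fixed time rather than the current time so that the values are reproducible (the variable name lt is just for illustration):

```r
# A fixed instant, matching the date used in the printed output above.
lt <- as.POSIXlt("2013-07-17 22:47:01", tz = "UTC")

lt$mon + 1      # 7: mon counts from 0, so July is stored as 6
lt$year + 1900  # 2013: year counts from 1900, so 2013 is stored as 113
lt$wday         # 3: days of the week count from Sunday = 0, so this is Wednesday
lt$yday         # 197: day of the year, counting from 0
```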
The third date class in base R is slightly better-named: it is the Date class. This stores dates as the number of days since the start of 1970.[35] The Date class is best used when you don’t care about the time of day. Fractional days are possible (and can be generated by calculating a mean Date, for example), but the POSIX classes are better for those situations:
(now_date <- as.Date(now_ct))
## [1] "2013-07-17"
class(now_date)
## [1] "Date"
unclass(now_date)
## [1] 15903
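As a rough illustration of the day-count storage and of how fractional days can arise, the following sketch converts a Date to a number and back, then averages two consecutive dates. The names d and midpoint are illustrative only.

```r
d <- as.Date("2013-07-17")
as.numeric(d)                          # 15903 days since 1970-01-01
as.Date(15903, origin = "1970-01-01")  # back to "2013-07-17"

# Averaging two dates can land between whole days.
midpoint <- mean(as.Date(c("2013-07-17", "2013-07-18")))
as.numeric(midpoint)                   # 15903.5
```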
There are lots of other date and time classes scattered through other R packages. If you have a choice of which date-time class to use, you should usually stick to one of the three base classes (POSIXct, POSIXlt, and Date), but you need to be aware of the other classes if you are using other people’s code that may depend upon them.
Other date and time classes from add-on packages include date, dates, chron, yearmon, yearqtr, timeDate, ti, and jul.
Many text file formats for data don’t explicitly support specific date types. For example, in a CSV file, each value is just a string. In order to access date functionality in R, you must convert your date strings into variables of one of the date classes. Likewise, to write back to CSV, you must convert the dates back into strings.
When we read in dates from a text or spreadsheet file, they will typically be stored as a character vector or factor. To convert them to dates, we need to parse these strings. This can be done with another appallingly named function, strptime (short for “string parse time”), which returns POSIXlt dates. (There are as.POSIXct and as.POSIXlt functions too. If you call them on character inputs, then they are just wrappers around strptime.) To parse the dates, you must tell strptime which bits of the string correspond to which bits of the date. The date format is specified using a string, with components specified with a percent symbol followed by a letter. For example, the day of the month as a number is specified as %d. These components can be combined with other fixed characters (such as colons in times, or dashes and slashes in dates) to form a full specification. The time zone specification varies depending upon your operating system. It can get complicated, so the minutiae are discussed later, but you usually want "UTC" for universal time or "" to use the time zone in your current locale (as determined from your operating system’s locale settings).
In the following example, %H is the hour (24-hour system), %M is the minute, %S is the second, %m is the number of the month, %d (as previously discussed) is the day of the month, and %Y is the four-digit year. The complete list of component specifiers varies from system to system. See the ?strptime help page for the details:
moon_landings_str <- c(
  "20:17:40 20/07/1969",
  "06:54:35 19/11/1969",
  "09:18:11 05/02/1971",
  "22:16:29 30/07/1971",
  "02:23:35 21/04/1972",
  "19:54:57 11/12/1972"
)
(moon_landings_lt <- strptime(moon_landings_str, "%H:%M:%S %d/%m/%Y", tz = "UTC"))
## [1] "1969-07-20 20:17:40 UTC" "1969-11-19 06:54:35 UTC"
## [3] "1971-02-05 09:18:11 UTC" "1971-07-30 22:16:29 UTC"
## [5] "1972-04-21 02:23:35 UTC" "1972-12-11 19:54:57 UTC"
If a string does not match the format in the format string, it takes the value NA. For example, specifying dashes instead of slashes makes the parsing fail:
strptime(moon_landings_str, "%H:%M:%S %d-%m-%Y", tz = "UTC")
## [1] NA NA NA NA NA NA
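Because failed parses come back silently as NA, it can be worth checking for them straight after parsing. The following is only a sketch with made-up input (raw_times, parsed, and failed are illustrative names); the second string deliberately uses a different layout from the format string.

```r
# Two strings: the first matches the format, the second does not.
raw_times <- c("20:17:40 20/07/1969", "06:54:35 1969-11-19")
parsed <- strptime(raw_times, "%H:%M:%S %d/%m/%Y", tz = "UTC")

# is.na works on date-times, so the offending inputs are easy to pull out.
failed <- is.na(parsed)
raw_times[failed]  # "06:54:35 1969-11-19"
```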
The opposite problem of parsing is turning a date variable into a string, that is, formatting it. In this case, we use the same system for specifying a format string, but now we call strftime (“string format time”) to reverse the parsing operation. In case you struggle to remember the name strftime, these days the format function will also happily format dates in a nearly identical manner to strftime.
In the following example, %I is the hour (12-hour system), %p is the AM/PM indicator, %A is the full name of the day of the week, and %B is the full name of the month. strftime works with both POSIXct and POSIXlt inputs:
strftime(now_ct, "It's %I:%M%p on %A %d %B, %Y.")
## [1] "It's 10:47PM on Wednesday 17 July, 2013."
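Since parsing and formatting share the same specifiers, a date can be written out and read back with one format string. Here is a minimal round-trip sketch; the names landing, as_text, and round_trip are illustrative, and only numeric specifiers are used so that the result does not depend on the language of your locale.

```r
landing <- as.POSIXct("1969-07-20 20:17:40", tz = "UTC")

# Format the date-time as text...
as_text <- format(landing, "%d/%m/%Y %H:%M:%S")  # "20/07/1969 20:17:40"

# ...and parse it back with the same format string.
round_trip <- as.POSIXct(as_text, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
as.numeric(landing) == as.numeric(round_trip)    # TRUE
```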
Time zones are horrible, complicated things from a programming perspective. Countries often have several, and change the boundaries when some (but not all) switch to daylight savings time. Many time zones have abbreviated names, but they often aren’t unique. For example, “EST” can refer to “Eastern Standard Time” in the United States, Canada, or Australia.
You can specify a time zone when parsing a date string (with strptime) and change it again when you format it (with strftime). During parsing, if you don’t specify a time zone (the default is ""), R will give the dates a default time zone. This is the value returned by Sys.timezone, which is in turn guessed from your operating system locale settings. You can see the OS date-time settings with Sys.getlocale("LC_TIME").
The easiest way to avoid the time zone mess is to always record and then analyze your times in the UTC zone. If you can achieve this, congratulations! You are very lucky. For everyone else (those who deal with other people’s data, for example), the easiest-to-read and most portable way of specifying time zones is to use the Olson form, which is “Continent/City” or similar:
strftime(now_ct, tz = "America/Los_Angeles")
## [1] "2013-07-17 14:47:01"
strftime(now_ct, tz = "Africa/Brazzaville")
## [1] "2013-07-17 22:47:01"
strftime(now_ct, tz = "Asia/Kolkata")
## [1] "2013-07-18 03:17:01"
strftime(now_ct, tz = "Australia/Adelaide")
## [1] "2013-07-18 07:17:01"
A list of possible Olson time zones is shipped with R in the file returned by file.path(R.home("share"), "zoneinfo", "zone.tab"). (That’s a file called zone.tab in a folder called zoneinfo inside the share directory where you installed R.) The lubridate package described later in this chapter provides convenient access to this file.
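To see Olson names at work on a reproducible value, the sketch below formats one fixed instant in two zones and checks a zone name against base R's OlsonNames function. The variable instant is illustrative, and it is assumed that your R installation has a time zone database available.

```r
# One fixed instant, stored in UTC.
instant <- as.POSIXct("2013-07-17 21:47:01", tz = "UTC")

# The same instant, displayed on clocks in two different zones.
format(instant, tz = "America/Los_Angeles")  # "2013-07-17 14:47:01"
format(instant, tz = "Asia/Kolkata")         # "2013-07-18 03:17:01"

# Check that a name is one that R recognizes before relying on it.
"Asia/Kolkata" %in% OlsonNames()
```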
The next most reliable method is to give a manual offset from UTC, in the form "UTC+n" or "UTC-n". Negative times are east of UTC, and positive times are west. The manual nature at least makes it clear how the times are altered, but you have to manually do the daylight savings corrections too, so this method should be used with care. Recent versions of R will warn that the time zone is unknown, but will perform the offset correctly:
strftime(now_ct, tz = "UTC-5")
## Warning: unknown timezone 'UTC-5'
## [1] "2013-07-18 02:47:01"
strftime(now_ct, tz = "GMT-5") #same
## Warning: unknown timezone 'GMT-5'
## [1] "2013-07-18 02:47:01"
strftime(now_ct, tz = "-5") #same, if supported on your OS
## Warning: unknown timezone '-5'
## [1] "2013-07-18 02:47:01"
strftime(now_ct, tz = "UTC+2:30")
## Warning: unknown timezone 'UTC+2:30'
## [1] "2013-07-17 19:17:01"
The third method of specifying time zones is to use an abbreviation: either three letters or three letters, a number, and three more letters. This method is the last resort, for three reasons. First, abbreviations are harder to read, and thus more prone to errors. Second, as previously mentioned, they aren’t unique, so you may not get the time zone that you think you have. Finally, different operating systems support different sets of abbreviations. In particular, the Windows OS’s knowledge of time zone abbreviations is patchy:
strftime(now_ct, tz = "EST") #Canadian Eastern Standard Time
## [1] "2013-07-17 16:47:01"
strftime(now_ct, tz = "PST8PDT") #Pacific Standard Time w/ daylight savings
## [1] "2013-07-17 14:47:01"
One last word of warning about time zones: strftime ignores time zone changes for POSIXlt dates. It is best to explicitly convert your dates to POSIXct before printing:
strftime(now_ct, tz = "Asia/Tokyo")
## [1] "2013-07-18 06:47:01"
strftime(now_lt, tz = "Asia/Tokyo") #no zone change!
## [1] "2013-07-17 22:47:01"
strftime(as.POSIXct(now_lt), tz = "Asia/Tokyo")
## [1] "2013-07-18 06:47:01"
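Another last warning (really the last one!): if you call the concatenation function, c, with a POSIXlt argument, it will change the time zone to your local time zone. Calling c on a POSIXct argument, by contrast, will strip its time zone attribute completely. (Most other functions will assume that the date is now local, but be careful!) This describes R as it behaved when this chapter was written; more recent releases appear to keep the time zone of POSIXct arguments when they all share the same one, so check the ?DateTimeClasses help page for your version.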
R supports arithmetic with each of the three base classes. Adding a number to a POSIX date shifts it by that many seconds. Adding a number to a Date shifts it by that many days:
now_ct + 86400 #Tomorrow. I wonder what the world will be like!
## [1] "2013-07-18 22:47:01 BST"
now_lt + 86400 #Same behavior for POSIXlt
## [1] "2013-07-18 22:47:01 BST"
now_date + 1 #Date arithmetic is in days
## [1] "2013-07-18"
Adding two dates together doesn’t make much sense, and throws an error. Subtraction is supported, and calculates the difference between the two dates. The behavior is the same for all three date types. In the following example, note that as.Date will automatically parse dates of the form %Y-%m-%d or %Y/%m/%d, if you don’t specify a format:
the_start_of_time <- as.Date("1970-01-01") #according to POSIX
the_end_of_time <- as.Date("2012-12-21") #according to Mayan conspiracy theorists
(all_time <- the_end_of_time - the_start_of_time)
## Time difference of 15695 days
We can use the now (hopefully) familiar combination of class and unclass to see how the difference in time is stored:
class(all_time)
## [1] "difftime"
unclass(all_time)
## [1] 15695
## attr(,"units")
## [1] "days"
The difference has class difftime, and the value is stored as a number with a unit attribute of days. Days were automatically chosen as the “most sensible” unit due to the difference between the times. Differences shorter than one day are given in hours, minutes, or seconds, as appropriate. For more control over the units, you can use the difftime function:
difftime(the_end_of_time, the_start_of_time, units = "secs")
## Time difference of 1.356e+09 secs
difftime(the_end_of_time, the_start_of_time, units = "weeks")
## Time difference of 2242 weeks
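When a time difference feeds into further calculations, you often want a plain number in known units rather than a difftime object. One possible approach, sketched here with illustrative variable names (start, end, gap), is to convert with as.numeric and state the units explicitly:

```r
start <- as.Date("1970-01-01")
end <- as.Date("2012-12-21")
gap <- difftime(end, start, units = "weeks")

as.numeric(gap)                  # about 2242.14, in the units the object carries
as.numeric(gap, units = "days")  # 15695, converted on the way out

# The units of an existing difftime can also be changed in place.
units(gap) <- "hours"
as.numeric(gap)                  # 376680
```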
The seq function for generating sequences also works with dates. This can be particularly useful for creating test datasets of artificial dates. The choice of units in the by argument differs between the POSIX and Date types. See the ?seq.POSIXt and ?seq.Date help pages for the choices in each case:
seq(the_start_of_time, the_end_of_time, by = "1 year")
## [1] "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01" "1974-01-01"
## [6] "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
## [11] "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01" "1984-01-01"
## [16] "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01" "1989-01-01"
## [21] "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01" "1994-01-01"
## [26] "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01"
## [31] "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01" "2004-01-01"
## [36] "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01"
## [41] "2010-01-01" "2011-01-01" "2012-01-01"
seq(the_start_of_time, the_end_of_time, by = "500 days") #of Summer
## [1] "1970-01-01" "1971-05-16" "1972-09-27" "1974-02-09" "1975-06-24"
## [6] "1976-11-05" "1978-03-20" "1979-08-02" "1980-12-14" "1982-04-28"
## [11] "1983-09-10" "1985-01-22" "1986-06-06" "1987-10-19" "1989-03-02"
## [16] "1990-07-15" "1991-11-27" "1993-04-10" "1994-08-23" "1996-01-05"
## [21] "1997-05-19" "1998-10-01" "2000-02-13" "2001-06-27" "2002-11-09"
## [26] "2004-03-23" "2005-08-05" "2006-12-18" "2008-05-01" "2009-09-13"
## [31] "2011-01-26" "2012-06-09"
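As a further sketch of date sequences, you can fix the number of elements with length.out instead of giving an end date, and POSIXct sequences accept sub-day steps. The names first_of_month and six_hourly are illustrative.

```r
# Four month starts, specified by count rather than by end date.
first_of_month <- seq(as.Date("2013-01-01"), by = "month", length.out = 4)
format(first_of_month)
# "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-01"

# POSIXct sequences can step in units smaller than a day.
six_hourly <- seq(
  as.POSIXct("2013-07-17 00:00:00", tz = "UTC"),
  by = "6 hours",
  length.out = 4
)
format(six_hourly, "%H:%M")
# "00:00" "06:00" "12:00" "18:00"
```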
Many other base functions allow manipulation of dates. You can repeat them, round them, and cut them. You can also calculate summary statistics with mean and summary. Many of the possibilities can be seen with methods(class = "POSIXt") and methods(class = "Date"), although some other functions will handle dates without having specific date methods.
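A brief sketch of two of these functions on made-up data: cut groups dates into calendar months, which is handy for counting, and round snaps a date-time to the nearest hour. The variable names (dates, by_month, counts, rounded) are illustrative.

```r
dates <- as.Date(c("2013-01-15", "2013-01-28", "2013-02-03", "2013-03-30"))

# cut returns a factor whose levels are labeled by the first day of each month.
by_month <- cut(dates, "month")
counts <- table(by_month)
counts  # two dates in January, one in February, one in March

# round works on date-times; here 22:47:01 is rounded to the nearest hour.
rounded <- round(as.POSIXct("2013-07-17 22:47:01", tz = "UTC"), "hours")
format(rounded, "%H:%M")  # "23:00"
```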
If you’ve become disheartened with dates and are considering skipping the rest of the chapter, do not fear! Help is at hand. lubridate, as the name suggests, adds some much-needed lubrication to the process of date manipulation. It doesn’t add many new features over base R, but it makes your code more readable, and helps you avoid having to think too much.
To replace strptime, lubridate has a variety of parsing functions with predetermined formats. ymd accepts dates in the form year, month, day. There is some flexibility in the specification: several common separators like hyphens, forward and backward slashes, colons, and spaces can be used;[36] months can be specified by number or by full or abbreviated name; and the day of the week can optionally be included. The real beauty is that different elements in the same vector can have different formats (as long as the year is followed by the month, which is followed by the day):
library(lubridate)
## Attaching package: 'lubridate'
## The following object is masked from 'package:chron':
##
##     days, hours, minutes, seconds, years
john_harrison_birth_date <- c( #He invented the marine chronometer
  "1693-03 24",
  "1693/03\\24",
  "Tuesday+1693.03*24"
)
ymd(john_harrison_birth_date) #All the same
## [1] "1693-03-24 UTC" "1693-03-24 UTC" "1693-03-24 UTC"
The important thing to remember with ymd is to get the elements of the date in the right order. If your date data is in a different form, then lubridate provides other functions (ydm, mdy, myd, dmy, and dym) to use instead. Each of these functions has relatives that allow the specification of times as well, so you get ymd_h, ymd_hm, and ymd_hms, as well as the equivalents for the other five date orderings. If your dates aren’t in any of these formats, then the lower-level parse_date_time lets you give a more exact specification.
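Here is a short sketch of a few of these relatives, assuming lubridate is installed. The variable names are illustrative, and the results are compared as formatted strings because the class returned by the date-only parsers has differed between lubridate versions.

```r
library(lubridate)

# Day-month-year order.
birth <- dmy("24/03/1693")
format(birth)  # "1693-03-24"

# Month-day-year order with a time attached.
landing <- mdy_hms("07/20/1969 20:17:40")
format(landing, "%Y-%m-%d %H:%M:%S")  # "1969-07-20 20:17:40"

# parse_date_time takes the component order as a string.
touchdown <- parse_date_time("1969-07-20 20:17", orders = "ymd HM")
format(touchdown, "%Y-%m-%d %H:%M")   # "1969-07-20 20:17"
```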
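All the parsing functions in lubridate return POSIXct dates and have a default time zone of UTC. Be warned: these behaviors are different from base R’s strptime! (Although usually more convenient.) This describes the lubridate version used for this chapter; in more recent releases the date-only parsers such as ymd may return Date objects instead, so check class on your own results. In lubridate terminology, these individual dates are “instants.”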
For formatting dates, lubridate provides stamp, which lets you specify a format in a more human-readable manner. You specify an example date, and it returns a function that you can call to format your dates:
date_format_function <- stamp("A moon landing occurred on Monday 01 January 1900 at 18:00:00.")
## Multiple formats matched: "A moon landing occurred on %A %m January %d%y
## at %H:%M:%OS"(1), "A moon landing occurred on %A %m January %Y at
## %d:%H:%M."(1), "A moon landing occurred on %A %d %B %Y at %H:%M:%S."(1)
## Using: "A moon landing occurred on %A %d %B %Y at %H:%M:%S."
date_format_function(moon_landings_lt)
## [1] "A moon landing occurred on Sunday 20 July 1969 at 20:17:40."
For dealing with ranges of times, lubridate has three different variable types. “Durations” specify time spans as multiples of seconds, so a duration of a day is always 86,400 seconds (60 * 60 * 24), and a duration of a year is always 31,536,000 seconds (86,400 * 365). This makes it easy to specify ranges of dates that are exactly evenly spaced, but leap years and daylight savings time put them out of sync from clock time. In the following example, notice that the date slips back one day every time there is a leap year. today gives today’s date:
(duration_one_to_ten_years <- dyears(1:10))
## [1] "31536000s (~365 days)" "63072000s (~730 days)"
## [3] "94608000s (~1095 days)" "126144000s (~1460 days)"
## [5] "157680000s (~1825 days)" "189216000s (~2190 days)"
## [7] "220752000s (~2555 days)" "252288000s (~2920 days)"
## [9] "283824000s (~3285 days)" "315360000s (~3650 days)"
today() + duration_one_to_ten_years
## [1] "2014-07-17" "2015-07-17" "2016-07-16" "2017-07-16" "2018-07-16"
## [6] "2019-07-16" "2020-07-15" "2021-07-15" "2022-07-15" "2023-07-15"
Other functions for creating durations are dseconds, dminutes, and so forth, as well as new_duration (named duration in more recent lubridate releases) for mixed-component specification.
“Periods” specify time spans according to clock time. That means that their exact length isn’t apparent until you add them to an instant. For example, a period of one year can be 365 or 366 days, depending upon whether or not it is a leap year. In the following example, notice that the date stays the same across leap years:
(period_one_to_ten_years <- years(1:10))
## [1] "1y 0m 0d 0H 0M 0S" "2y 0m 0d 0H 0M 0S" "3y 0m 0d 0H 0M 0S"
## [4] "4y 0m 0d 0H 0M 0S" "5y 0m 0d 0H 0M 0S" "6y 0m 0d 0H 0M 0S"
## [7] "7y 0m 0d 0H 0M 0S" "8y 0m 0d 0H 0M 0S" "9y 0m 0d 0H 0M 0S"
## [10] "10y 0m 0d 0H 0M 0S"
today() + period_one_to_ten_years
## [1] "2014-07-17" "2015-07-17" "2016-07-17" "2017-07-17" "2018-07-17"
## [6] "2019-07-17" "2020-07-17" "2021-07-17" "2022-07-17" "2023-07-17"
In addition to years, you can create periods with seconds, minutes, etc., as well as new_period (named period in more recent lubridate releases) for mixed-component specification.
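The contrast between durations and periods is easiest to see from a fixed starting date in a leap year. This is a sketch only, assuming lubridate is installed; start is an illustrative name, and ddays(365) is used for the duration because it is an unambiguous number of seconds.

```r
library(lubridate)

start <- ymd("2012-01-01")  # 2012 is a leap year, so it has 366 days

# A duration of 365 days is a fixed number of seconds: it falls a day short.
format(start + ddays(365))  # "2012-12-31"

# A period of one year follows the calendar: it lands on the same date.
format(start + years(1))    # "2013-01-01"
```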
“Intervals” are defined by the instants at their beginning and end. They aren’t much use on their own; they are most commonly used for specifying durations and periods when you know the start and end dates (rather than how long they should last). They can also be used for converting between durations and periods. For example, given a duration of one year, direct conversion to a period can only be estimated, since periods of a year can be 365 or 366 days (possibly plus a few leap seconds, and possibly plus or minus an hour or two if the rules for daylight savings change):
a_year <- dyears(1) #exactly 60*60*24*365 seconds
as.period(a_year) #only an estimate
## estimate only: convert durations to intervals for accuracy
## [1] "1y 0m 0d 0H 0M 0S"
If we know the start (or end) date of the duration, we can use an interval as an intermediary to convert exactly from the duration to the period:
start_date <- ymd("2016-02-28")
(interval_over_leap_year <- interval(start_date, start_date + a_year))
## [1] 2016-02-28 UTC--2017-02-27 UTC
as.period(interval_over_leap_year)
## [1] "11m 30d 0H 0M 0S"
Intervals also have some convenience operators, namely %--% for defining intervals and %within% for checking if a date is contained within an interval:
ymd("2016-02-28") %--% ymd("2016-03-01") #another way to specify interval
## [1] 2016-02-28 UTC--2016-03-01 UTC
ymd("2016-02-29") %within% interval_over_leap_year
## [1] TRUE
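As a self-contained sketch of interval handling (assuming lubridate is installed, with leap_span as an illustrative name), the following builds a short interval across a leap day, tests membership, and measures its exact length in seconds with int_length:

```r
library(lubridate)

# An interval spanning 29 February 2016.
leap_span <- interval(ymd("2016-02-28"), ymd("2016-03-01"))

ymd("2016-02-29") %within% leap_span  # TRUE
ymd("2016-03-02") %within% leap_span  # FALSE

# Two whole days, measured in seconds.
int_length(leap_span)                 # 172800
```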
For dealing with time zones, with_tz lets you change the time zone of a date without having to print it (unlike strftime). It also correctly handles POSIXlt dates (again, unlike strftime):
with_tz(now_lt, tz = "America/Los_Angeles")
## [1] "2013-07-17 14:47:01 PDT"
with_tz(now_lt, tz = "Africa/Brazzaville")
## [1] "2013-07-17 22:47:01 WAT"
with_tz(now_lt, tz = "Asia/Kolkata")
## [1] "2013-07-18 03:17:01 IST"
with_tz(now_lt, tz = "Australia/Adelaide")
## [1] "2013-07-18 07:17:01 CST"
force_tz is a variant of with_tz used for updating incorrect time zones.
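The difference between the two functions is easy to miss: with_tz keeps the instant and changes the clock reading, while force_tz keeps the clock reading and changes the instant. A minimal sketch on a fixed value, assuming lubridate is installed (meeting, same_instant, and relabelled are illustrative names):

```r
library(lubridate)

meeting <- ymd_hms("2013-07-17 21:47:01", tz = "UTC")

# Same moment in time, displayed on a Tokyo clock (UTC+9).
same_instant <- with_tz(meeting, "Asia/Tokyo")
format(same_instant, "%Y-%m-%d %H:%M:%S")         # "2013-07-18 06:47:01"
as.numeric(same_instant) == as.numeric(meeting)   # TRUE

# Same clock reading, relabeled as Tokyo time: a different moment.
relabelled <- force_tz(meeting, "Asia/Tokyo")
format(relabelled, "%Y-%m-%d %H:%M:%S")           # "2013-07-17 21:47:01"
as.numeric(meeting) - as.numeric(relabelled)      # 32400 seconds, i.e., 9 hours
```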
olson_time_zones returns a list of all the Olson-style time zone names that R knows about, either alphabetically or by longitude:
head(olson_time_zones())
## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Bamako"
head(olson_time_zones("longitude"))
## [1] "Pacific/Midway" "America/Adak" "Pacific/Chatham"
## [4] "Pacific/Wallis" "Pacific/Tongatapu" "Pacific/Enderbury"
Some other utilities are available for arithmetic with dates, particularly floor_date and ceiling_date:
floor_date(today(), "year")
## [1] "2013-01-01"
ceiling_date(today(), "year")
## [1] "2014-01-01"
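The same two functions accept other units as well. As a sketch with a fixed date rather than today (assuming lubridate is installed; d is an illustrative name), rounding to the month looks like this:

```r
library(lubridate)

d <- ymd("2013-07-17")

format(floor_date(d, "month"))    # "2013-07-01": start of the current month
format(ceiling_date(d, "month"))  # "2013-08-01": start of the next month
```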
POSIXct, POSIXlt, and Date.
strptime.
strftime.
lubridate package makes working with dates a bit easier.
POSIXct and Date dates?
POSIXct date one hour into the future?
lubridate package, consider two intervals starting on January 1, 2016. Which will finish later, a duration of one year or a period of one year?
Parse the birth dates of the Beatles, and print them in the form “AbbreviatedWeekday DayOfMonth AbbreviatedMonthName TwoDigitYear” (for example, “Wed 09 Oct 40”). Their dates of birth are given in the following table.
Beatle | Birth date |
---|---|
Ringo Starr | 1940-07-07 |
John Lennon | 1940-10-09 |
Paul McCartney | 1942-06-18 |
George Harrison | 1943-02-25 |
[10]
?Sys.timezone help page demonstrate how to do this. Find the name of the time zone for your location. [10]
Write a function that accepts a date as an input and returns the astrological sign of the zodiac corresponding to that date. The date ranges for each sign are given in the following table. [15]
Zodiac sign | Start date | End date |
---|---|---|
Aries | March 21 | April 19 |
Taurus | April 20 | May 20 |
Gemini | May 21 | June 20 |
Cancer | June 21 | July 22 |
Leo | July 23 | August 22 |
Virgo | August 23 | September 22 |
Libra | September 23 | October 22 |
Scorpio | October 23 | November 21 |
Sagittarius | November 22 | December 21 |
Capricorn | December 22 | January 19 |
Aquarius | January 20 | February 18 |
Pisces | February 19 | March 20 |
[33] The spin of the Earth is slowing down, so it takes slightly longer than 86,400 seconds for a day to happen. This is especially obvious when you are waiting for payday. Leap seconds have been added since 1972 to correct for this. Type .leap.seconds to see when they have happened.
[34] UTC’s acronym is the wrong way around to make it match other universal time standards (UT0, UT1, etc.). It is essentially identical to (civil) Greenwich Mean Time (GMT), except that Greenwich Mean Time isn’t a scientific standard, and the British government can’t change UTC.
[35] Researchers of historical data might like to note that dates are always in the Gregorian calendar, so you need to double-check your code for anything before 1752.
[36] In fact, most punctuation is allowed.