In this section, we'll cover a concept closely related to vectors—time series. A time series is a sequence of values, each associated with a time index. For convenience, the values are usually ordered from the earliest to latest. The time difference between consecutive time indices can be fixed (in which case we have a regular time series) or variable (in which case we have an irregular time series), although an irregular time series can also be considered as a regular time series with missing data. For example, daily rainfall amounts in New York or Dollar to Euro currency exchange rates for the period of January 1, 2014 to January 15, 2014 would comprise two different time series.
Following its definition, the simplest way to represent a time series would be to have a separate vector of data values and a separate vector of time, with the same length, with each element of the data values vector corresponding to the respective element in the time vector. The only thing you need to learn in order to do this in R is to represent time, which is the topic of the present section.
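As a minimal sketch of this two-vector representation, the following uses made-up rainfall amounts (the values and the names rain_time and rain_mm are hypothetical, chosen only for illustration). The time indices are kept as plain character values here, since proper date handling is the subject of the rest of this section.

```r
# Hypothetical example: five days of rainfall in mm (values are made up)
rain_time = c("2014-01-01", "2014-01-02", "2014-01-03",
              "2014-01-04", "2014-01-05")
rain_mm = c(0, 2.5, 10.1, 0, 0.3)

# The two vectors must have the same length
length(rain_time) == length(rain_mm)

# The i-th data value belongs to the i-th time index
rain_time[3]
rain_mm[3]
```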
Several special classes to represent time series exist in R. Basically, such classes encompass the time and data values parts of a time series within a single object. For example, ts
, zoo
, and xts
are different time series classes in R. The ts
class is defined in the base packages, whereas the zoo
and xts
classes are defined in the contributed packages of the same respective names. The concept of working with packages in R will be introduced in the next chapter.
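For a first impression of what such a class looks like, here is a small sketch using the base ts class with made-up monthly values (the data and the object name s are hypothetical). It only illustrates that the time information travels together with the data values; it is not needed for the rest of this section.

```r
# Hypothetical monthly values for January to June 2014 (made up)
s = ts(c(12, 15, 11, 18, 20, 17), start = c(2014, 1), frequency = 12)

class(s)       # the object is of class "ts"
start(s)       # first time index: year 2014, period 1
end(s)         # last time index: year 2014, period 6
frequency(s)   # number of observations per year
```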
Working with time series objects has certain advantages, such as the ability to use specialized functions (for example, linear or spline interpolation of missing values in a time series using a single function call) or the guarantee that every object satisfies the class rules (for example, the number of data values and time indices in a time series must be equal). For the purposes of this book, we will stick to the basic manual representation of a time series. This way, we will have a chance to gain a better understanding of R's general principles, and moving on to specialized time series classes should be straightforward for interested readers. There are numerous resources devoted to time series analysis with R; for example, Introductory Time Series with R by Paul S.P. Cowpertwait and Andrew V. Metcalfe (Springer, 2009) provides an excellent applied introduction to the subject.
You are now going to learn how to use dates in R using our very first real-world example. We are going to use the comma separated values (CSV) file named 338284.csv
, which was downloaded from the National Oceanic and Atmospheric Administration (NOAA) National Climatic Data Center. This file contains daily rainfall and temperature data from a meteorological station at the Albuquerque International Airport, New Mexico, from March 1, 1931 to May 15, 2014.
A CSV file is used to store plain tabular data with no additional features that are common in spreadsheet files such as XLS. This is how the file looks when opened in Excel:
The following three lines of code read the file into R and assign the values in the DATE
and TMAX
columns to two separate vectors named time
(since the data in the DATE
column represents time) and tmax
(which stands for maximum temperature). This involves operations on tables, which will be explained in the next chapter. They are provided here only for completeness:
> dat = read.csv("C:\\Data\\338284.csv", stringsAsFactors = FALSE)
> time = dat$DATE
> tmax = dat$TMAX
The important point is that we now have two vectors to work with, time
and tmax
, as an exercise summarizing most of the topics we dealt with in this chapter.
Dates can be represented in R (as in many other types of software) using a special format. This allows certain special operations (such as finding the time difference between two dates) to be performed, which is not possible when dates are represented by simply using characters. There are several classes for date and time data in R. The simplest class (and the only one we will use in this book) is called Date
, and it is used to represent calendar dates. Other classes exist to represent coarser time units (for example, months) or finer ones (for example, a date plus the time of day).
Note that the Date
and factor
objects are not vectors in R terminology since they have additional attributes not present in the vector class. However, from the user's perspective, working with them often follows the same principles as seen in vectors. For example, creating subsets of Date
objects works the same way as creating subsets of vectors.
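The following sketch illustrates both halves of this note on a small, hypothetical vector of dates (the name d and its values are made up; the as.Date function used to create it is described later in this section).

```r
# Hypothetical vector of four dates
d = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04"))

# A Date object carries a class attribute, so is.vector() returns FALSE
attributes(d)
is.vector(d)

# Subsetting by index and by logical condition works as with vectors
d[2:3]
d[d > as.Date("2014-01-02")]
length(d)
```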
For example, the Sys.Date
and Sys.time
functions return the current date or date plus the time of day, respectively. The object returned by Sys.Date
belongs to class Date
, while the object returned by Sys.time
is an object of a different class (POSIXct
). Let's take a look at the following examples:
> x = Sys.Date()
> x
[1] "2014-05-22"
> class(x)
[1] "Date"
> y = Sys.time()
> y
[1] "2014-05-22 10:04:56 IDT"
> class(y)
[1] "POSIXct" "POSIXt"
As we can see in the first half of the previous example, a Date
object is printed the same way as a character vector holding the value "2014-05-22"
would. However, as already mentioned, we can conduct calculations involving time intervals with the Date
class, which makes it worthwhile to represent dates in such a specialized format. For example, we can tell what date it will be seven days from today or what the date was 1,000 days ago:
> x + 7
[1] "2014-05-29"
> x - 1000
[1] "2011-08-26"
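The time difference between two dates, mentioned earlier as one of the operations the Date class makes possible, can be sketched in the same spirit. The two dates below are fixed values taken from the output above rather than the current date, and the names d1, d2, and delta are only illustrative.

```r
# Two fixed dates (taken from the example output above)
d1 = as.Date("2011-08-26")
d2 = as.Date("2014-05-22")

# Subtracting two Date objects gives a "difftime" object, in days
delta = d2 - d1
delta
class(delta)

# Convert to a plain number of days when needed
as.numeric(delta)

# Dates can also be compared with the usual operators
d1 < d2
```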
We can switch between the character vector and Date
classes, using the as.character
and as.Date
functions. For example, we can convert our Date
object x
to a character vector using as.character
:
> x = as.character(x)
> x
[1] "2014-05-22"
> class(x)
[1] "character"
We can convert the character vector back to Date
using as.Date
:
> x = as.Date(x)
> x
[1] "2014-05-22"
> class(x)
[1] "Date"
We can create a sequence of consecutive dates using seq
, since this function accepts Date
objects as well:
> seq(from = as.Date("2013-01-01"),
+ to = as.Date("2013-02-01"),
+ by = 3)
[1] "2013-01-01" "2013-01-04" "2013-01-07" "2013-01-10"
[5] "2013-01-13" "2013-01-16" "2013-01-19" "2013-01-22"
[9] "2013-01-25" "2013-01-28" "2013-01-31"
This gives us dates spaced three days apart, starting on January 1, 2013 and ending on January 31, 2013, which is the last date in the sequence that does not exceed February 1, 2013.
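As a further sketch, not used elsewhere in this section, the step of a date sequence can also be given as a character string such as "week" or "month", and the end date can be replaced by a desired number of values using length.out. The names monthly and weekly are hypothetical.

```r
# Step given as a character string: the first day of each month
monthly = seq(from = as.Date("2013-01-01"),
              to = as.Date("2013-04-01"),
              by = "month")
monthly

# A fixed number of dates instead of an end date: five weekly dates
weekly = seq(from = as.Date("2013-01-01"), by = "week", length.out = 5)
weekly
```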
The preceding conversions from character to date required no extra arguments because the "2014-05-22"
configuration is a default one. This way, the as.Date
function knew that the first four characters in "2014-05-22"
represent the year, the next two characters (following a hyphen) represent the month, and the last two characters represent the day. When we have characters representing a date in a different configuration, we need to use the format parameter of as.Date
, where we specify the encoding types of the elements, their order, and the characters separating them (if any).
The common encoding types of the year, month, and day elements, and their respective symbols in R, are summarized in the following table:
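Symbol | Meaning |
---|---|
%d | Day of the month as a number (for example, 15) |
%m | Month as a number (for example, 08) |
%b | Abbreviated month name (for example, Aug) |
%B | Full month name (for example, August) |
%y | Year without the century, two digits (for example, 14) |
%Y | Year with the century, four digits (for example, 2014) |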
Using this symbology, along with the format parameter of the as.Date
function, we can convert character values of other formats to dates. Let's take a look at the following examples:
> as.Date("07/Aug/12")
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date("07/Aug/12", format = "%d/%b/%y")
[1] "2012-08-07"
> as.Date("2012-August-07")
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date("2012-August-07", format = "%Y-%B-%d")
[1] "2012-08-07"
In each of these two example pairs, the first expression resulted in an error since we were trying to convert a character value of a non-standard date format to a Date
without specifying the format, while the second expression worked since we did specify the format.
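A few more variations are sketched below, using made-up input strings and illustrative names (a, b, bad, and several). The sketch sticks to numeric month codes, because matching month names with %b and %B depends on the language settings (locale) of the R session. Note also that when a format is supplied but does not match the input, the result is NA rather than an error.

```r
# Numeric day, month, and four-digit year separated by dots
a = as.Date("07.08.2012", format = "%d.%m.%Y")
a

# Two-digit year with slashes
b = as.Date("07/08/12", format = "%d/%m/%y")
b

# A supplied format that does not match the input gives NA, not an error
bad = as.Date("2012-08-07", format = "%d/%m/%Y")
bad

# The conversion works element by element on character vectors
several = as.Date(c("01.01.2012", "15.02.2012"), format = "%d.%m.%Y")
several
```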
Once we have a Date
object, we can extract one or two (or all) of its three elements (year, month, and day), and encode them as we wish using the format
function, specifying the required format the same way as shown earlier. Note that the results are no longer Date
objects, but character vectors:
> d = as.Date("1955-11-30")
> d
[1] "1955-11-30"
> format(d, "%d")
[1] "30"
> format(d, "%B")
[1] "November"
> format(d, "%Y")
[1] "1955"
> format(d, "%m/%Y")
[1] "11/1955"
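Because format returns character values, an extracted element such as the year needs a further conversion before it can be used in calculations. The following sketch, on a hypothetical vector of three dates (the names dd, yr, and yr_num are illustrative), shows one way to do this and to select dates by month.

```r
# Hypothetical vector of three dates
dd = as.Date(c("1955-11-30", "1955-12-01", "1956-01-15"))

# format() returns character values, even for numeric-looking elements
yr = format(dd, "%Y")
class(yr)

# Convert to numeric to do arithmetic or numeric comparisons
yr_num = as.numeric(yr)
yr_num + 1

# Select the dates that fall in December, whatever the year
dd[format(dd, "%m") == "12"]
```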
We are now ready to proceed with our example involving the time
and tmax
vectors. First, we can find out that both vectors are numeric (integers, numbers without a fractional component, to be precise) as follows:
> class(time)
[1] "integer"
> class(tmax)
[1] "integer"
Then, let's see what the values of these vectors look like by printing the first 10 values from each one of them:
> time[1:10]
[1] 19310301 19310302 19310303 19310304 19310305 19310306
[7] 19310307 19310308 19310309 19310310
> tmax[1:10]
[1] 72 133 178 183 111 67 78 83 139 156
The time
vector contains dates in the %Y%m%d
configuration (year, month, and day indicated by full numeric values, without separating characters). Therefore, we can convert it to a Date
object, as follows:
> time = as.Date(as.character(time), format = "%Y%m%d")
> time[1:10]
[1] "1931-03-01" "1931-03-02" "1931-03-03" "1931-03-04"
[5] "1931-03-05" "1931-03-06" "1931-03-07" "1931-03-08"
[9] "1931-03-09" "1931-03-10"
> class(time)
[1] "Date"
Note that we first needed to convert the time
vector from numeric to character since the as.Date
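function applies the format argument to character values (a number would instead be interpreted as a count of days from an origin date). Now that time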
is a vector of dates, we have more freedom to treat the data as a time series.
Looking into the documentation on climatic data from NOAA (which is also provided on the book's website), we can see that the temperature is provided in tenths of a degree Celsius, with missing values marked as -9999
. First, we will convert the -9999
values to NA
by selecting the respective subset and making an assignment:
> tmax[tmax == -9999] = NA
Then, to convert the data into degrees Celsius units, we will divide each of the values by 10
:
> tmax = tmax / 10
> tmax[1:10]
[1] 7.2 13.3 17.8 18.3 11.1 6.7 7.8 8.3 13.9 15.6
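The order of these two steps matters: had we divided first, the missing-value marker would have become -999.9 and the comparison with -9999 would no longer find it. The following self-contained sketch repeats both steps on a short, made-up vector (raw and deg are hypothetical names) and shows how to count the missing values afterwards.

```r
# Hypothetical raw readings in tenths of a degree; -9999 marks missing data
raw = c(72, 133, -9999, 183, -9999, 67)

# Mark the missing values first, then rescale
raw[raw == -9999] = NA
deg = raw / 10
deg

# Count the missing values and summarize the remaining ones
sum(is.na(deg))
mean(deg, na.rm = TRUE)
```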
Now, let's check the range of values each vector contains:
> range(time)
[1] "1931-03-01" "2014-05-15"
> range(tmax, na.rm = TRUE)
[1] -14.4 41.7
This means that the range of the measured maximum daily temperatures from March 1, 1931 to May 15, 2014 was -14.4 to 41.7 degrees Celsius.
Regarding the dates of measurement, looking at the first few values of the time
vector (or at the original CSV file in a spreadsheet, for that matter), it seems that the days are consecutive. However, we may want to make sure that all days of the respective period are indeed present in the file. We can do this by comparing a consecutive sequence all_dates
covering the time period from March 1, 1931 to May 15, 2014 with our time
vector:
> range_t = range(time)
> all_dates = seq(range_t[1], range_t[length(range_t)], 1)
> length(all_dates)
[1] 30392
> length(time)
[1] 30391
This already indicates that we have an incomplete agreement. Our time
vector contains 30391
values, while there are 30392
dates during the time period from March 1, 1931 to May 15, 2014. Therefore, the CSV file is missing at least one date.
We will next check how many dates (and which ones) are missing. First, we will verify that, indeed, not all dates appear in the time
vector using the %in%
operator (asking for each element in all_dates
whether it appears in the time
vector) and the all
function (asking whether all of the values in the resulting logical vector are TRUE
).
> all(all_dates %in% time)
[1] FALSE
The answer is no; at least one of the dates in the range of March 1, 1931 to May 15, 2014 is indeed missing from the time
vector. The next question would be which one is missing, or which ones are missing? We can get the indices of the dates that appear in all_dates
but not in time
with the which
function:
> which(!(all_dates %in% time))
[1] 5499
The missing date is the 5499th element of the all_dates
vector. Its value is as follows:
> all_dates[which(!(all_dates %in% time))]
[1] "1946-03-20"
Manually examining the CSV file in spreadsheet software will confirm that the date March 20, 1946 was indeed skipped for some reason.
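An alternative check, not the one used in the text, is to look at the differences between consecutive dates: any difference larger than one day marks a gap. The sketch below uses a short, made-up vector of dates (obs, gaps, and i are hypothetical names) and assumes that the dates are sorted and contain no duplicates.

```r
# Hypothetical sorted dates with 2014-01-04 and 2014-01-05 missing
obs = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
                "2014-01-06", "2014-01-07"))

# Differences between consecutive dates, as plain numbers of days
gaps = as.numeric(diff(obs))
gaps

# Positions after which at least one date is missing
i = which(gaps > 1)
obs[i]        # last date before the gap
obs[i + 1]    # first date after the gap
gaps[i] - 1   # number of missing dates in the gap
```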
Another interesting question we can ask is on what day the highest temperature (which was 41.7 degrees Celsius, as we saw earlier) was observed:
> max(tmax, na.rm = TRUE)
[1] 41.7
> time[which.max(tmax)]
[1] "1994-06-26"
The highest maximum daily temperature was observed on June 26, 1994.
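One point worth keeping in mind is that which.max returns only the first position at which the maximum occurs. If the maximum may have been reached on several days, comparing the values against the maximum returns all of them. A minimal sketch with made-up values (days and temp are hypothetical names):

```r
# Hypothetical series in which the maximum value occurs twice
days = as.Date(c("2014-07-01", "2014-07-02", "2014-07-03", "2014-07-04"))
temp = c(35.2, 38.9, NA, 38.9)

# which.max skips NA values and returns only the first maximum position
days[which.max(temp)]

# Comparing against the maximum returns every matching date
days[which(temp == max(temp, na.rm = TRUE))]
```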
If we are interested in a particular subset of the time series, say the period from December 31, 2005 to January 1, 2014, we could create a subset of the dates in that period based on the time
vector and a respective subset of data values based on the tmax
vector. We can do this in two steps. First, we will create a logical vector, w
, pointing at those dates we would like to keep:
> w = time > as.Date("2005-12-31") & time < as.Date("2014-1-1")
To find out the ratio between the number of days we would like to keep in the subset and the number of days in the complete series, we can type the following expression:
> sum(w) / length(w)
[1] 0.09614689
The subset we are interested in (dates after December 31, 2005 and before January 1, 2014) holds about 9.6 percent of the data, since the proportion of TRUE
values in the logical vector w
is 0.096
(remember that before summing a logical vector, it is converted to a numeric one with ones instead of TRUE
and zeroes instead of FALSE
).
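The same conversion from logical to numeric values can be seen on a short, made-up logical vector (w_demo is a hypothetical name). Since the mean of a vector of ones and zeroes is the proportion of ones, mean gives the same result as dividing the sum by the length.

```r
# Hypothetical logical vector
w_demo = c(TRUE, FALSE, FALSE, TRUE, FALSE)

# TRUE counts as 1 and FALSE counts as 0
sum(w_demo)                    # number of TRUE values
sum(w_demo) / length(w_demo)   # proportion of TRUE values
mean(w_demo)                   # the same proportion in one call
```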
Secondly, we will use the w
vector to create subsets of both the time
and tmax
vectors:
> time = time[w]
> tmax = tmax[w]
Note that the selection was non-inclusive of the end dates since we used the >
and <
operators:
> range(time)
[1] "2006-01-01" "2013-12-31"
If we wanted to include the first and last dates (December 31, 2005 and January 1, 2014), we would instead use the >=
and <=
operators.
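The inclusive selection can be sketched on a short, made-up series spanning the turn of the year (days, vals, and w_incl are hypothetical names, and the end dates are chosen only to keep the example small):

```r
# Hypothetical four-day series around the end of 2005 (values are made up)
days = seq(as.Date("2005-12-30"), as.Date("2006-01-02"), by = 1)
vals = c(10.5, 11.0, 9.8, 12.1)

# Inclusive selection: both end dates are kept
w_incl = days >= as.Date("2005-12-31") & days <= as.Date("2006-01-01")
w_incl

# The same logical vector subsets the dates and the data values
days[w_incl]
vals[w_incl]
```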