Chapter 3. Inspecting Variables and Your Workspace

So far, we’ve run some calculations and assigned some variables. In this chapter, we’ll find out ways to examine the properties of those variables and to manipulate the user workspace that contains them.

Chapter Goals

After reading this chapter, you should:

  • Know what a class is, and the names of some common classes
  • Know how to convert a variable from one type to another
  • Be able to inspect variables to find useful information about them
  • Be able to manipulate the user workspace

Classes

All variables in R have a class, which tells you what kinds of variables they are. For example, most numbers have class numeric (see the next section for the other types), and logical values have class logical. Actually, being picky about it, vectors of numbers are numeric and vectors of logical values are logical, since R has no scalar types. The “smallest” data type in R is a vector.

You can find out what the class of a variable is using class(my_variable):

class(c(TRUE, FALSE))
## [1] "logical"

It’s worth being aware that as well as a class, all variables also have an internal storage type (accessed via typeof), a mode (see mode), and a storage mode (storage.mode). If this sounds complicated, don’t worry! Types, modes, and storage modes mostly exist for legacy purposes, so in practice you should only ever need to use an object’s class (at least until you join the R Core Team). Appendix A has a reference table showing the relationships between class, type, and (storage) mode for many sorts of variables. Don’t bother memorizing it, and don’t worry if you don’t recognize some of the classes. It is simply worth browsing the table to see which things are related to which other things.
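As a quick illustration of how these four properties can differ, here is a small sketch that inspects an integer vector and a built-in function. The variable name whole_numbers is only an illustrative choice:

```r
whole_numbers <- 1:3
class(whole_numbers)
## [1] "integer"
typeof(whole_numbers)
## [1] "integer"
mode(whole_numbers)          #mode lumps integers and doubles together
## [1] "numeric"
storage.mode(whole_numbers)
## [1] "integer"
class(sin)
## [1] "function"
typeof(sin)                  #functions have several internal types
## [1] "builtin"
```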

From now on, to make things easier, I’m going to use “class” and “type” synonymously (except where noted).

Different Types of Numbers

All the variables that we created in the previous chapter were numbers, but R contains three different classes of numeric variable: numeric for floating point values; integer for, ahem, integers; and complex for complex numbers. We can tell which is which by examining the class of the variable:

class(sqrt(1:10))
## [1] "numeric"
class(3 + 1i)      #"i" creates imaginary components of complex numbers
## [1] "complex"
class(1)           #although this is a whole number, it has class numeric
## [1] "numeric"
class(1L)          #add a suffix of "L" to make the number an integer
## [1] "integer"
class(0.5:4.5)     #the colon operator returns a value that is numeric...
## [1] "numeric"
class(1:5)         #unless all its values are whole numbers
## [1] "integer"

Note that as of the time of writing, all floating point numbers are 64-bit numbers (“double precision”), even when R is installed on a 32-bit operating system, and 32-bit (“single precision”) numbers don’t exist.

Typing .Machine gives you some information about the properties of R’s numbers. Although the values can, in theory, change from machine to machine, for most builds most of the values are the same. Many of the values returned by .Machine need never concern you. It’s worth knowing that the largest floating point number that R can represent at full precision is about 1.8e308. This is big enough for everyday purposes, but a lot smaller than infinity! The smallest positive number that can be represented at full precision is about 2.2e-308. Integers can take values up to 2 ^ 31 - 1, which is a little over two billion (or down to -2 ^ 31 + 1).[8]
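The individual limits can be pulled out of .Machine by name. The following is a minimal sketch; the printed values assume a typical build and R's default number of printed digits, so your output may be formatted differently:

```r
.Machine$integer.max         #largest integer, 2 ^ 31 - 1
## [1] 2147483647
.Machine$double.xmax         #largest floating point number
## [1] 1.797693e+308
.Machine$double.xmin         #smallest positive number at full precision
## [1] 2.225074e-308
2 * .Machine$double.xmax     #going past the largest value gives infinity
## [1] Inf
```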

The only other value of much interest is ε, the smallest positive floating point number such that |ε + 1| != 1. That’s a fancy way of saying how close two numbers can be so that R knows that they are different. It’s about 2.2e-16. The default tolerance of all.equal, which you use to compare two numeric vectors, is derived from this value: it is the square root of ε, or about 1.5e-8.
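To see ε in action, the sketch below adds it (and half of it) to 1, then tries all.equal on two pairs of nearby numbers. The name eps is just a convenience for this example, and the first printed value assumes R's default number of digits:

```r
eps <- .Machine$double.eps
eps
## [1] 2.220446e-16
1 + eps == 1                      #R can tell these two apart
## [1] FALSE
1 + eps / 2 == 1                  #but not these two
## [1] TRUE
isTRUE(all.equal(1, 1 + 1e-10))   #inside all.equal's default tolerance
## [1] TRUE
isTRUE(all.equal(1, 1 + 1e-6))    #outside it
## [1] FALSE
```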

In fact, all of this is even easier than you think, since it is perfectly possible to get away with not (knowingly) using integers. R is designed so that anywhere an integer is needed—indexing a vector, for example—a floating point “numeric” number can be used just as well.
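For instance, indexing with a floating point number gives the same result as indexing with an integer. This is a small sketch using a made-up vector named primes:

```r
primes <- c(2, 3, 5, 7, 11)
class(2)                     #a floating point index...
## [1] "numeric"
primes[2]                    #...works just like an integer one
## [1] 3
primes[2L]
## [1] 3
identical(primes[2], primes[2L])
## [1] TRUE
```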

Other Common Classes

In addition to the three numeric classes and the logical class that we’ve seen already, there are three more classes of vectors: character for storing text, factors for storing categorical data, and the rarer raw for storing binary data.

In this next example, we create a character vector using the c function, just like we did for numeric vectors. The class of a character vector is character:

class(c("she", "sells", "seashells", "on", "the", "sea", "shore"))
## [1] "character"

Note that unlike some languages, R doesn’t distinguish between whole strings and individual characters: a string containing one character is treated the same as any other string. Unlike with some lower-level languages, you don’t need to worry about terminating your strings with a null character (\0). In fact, it is an error to try to include such a character in your strings.

In many programming languages, categorical data would be represented by integers. For example, gender could be represented as 1 for females and 2 for males. A slightly better solution would be to treat gender as a character variable with the choices “female” and “male.” This is still semantically rather dubious, though, since categorical data is a different concept to plain old text. R has a more sophisticated solution that combines both these ideas in a semantically correct class—factors are integers with labels:

(gender <- factor(c("male", "female", "female", "male", "female")))
## [1] male   female female male   female
## Levels: female male

The contents of the factor look much like their character equivalent—you get readable labels for each value. Those labels are confined to specific values (in this case “female” and “male”) known as the levels of the factor:

levels(gender)
## [1] "female" "male"
nlevels(gender)
## [1] 2

Notice that even though “male” is the first value in gender, the first level is “female.” By default, factor levels are assigned alphabetically.
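If the alphabetical default isn’t what you want, one option is to pass the levels explicitly when creating the factor. This sketch uses the levels argument of factor; the variable name gender_ordered is made up for the example:

```r
gender_ordered <- factor(
  c("male", "female", "female", "male", "female"),
  levels = c("male", "female")   #put "male" first, overriding the default
)
levels(gender_ordered)
## [1] "male"   "female"
```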

Underneath the bonnet,[9] the factor values are stored as integers rather than characters. You can see this more clearly by calling as.integer:

as.integer(gender)
## [1] 2 1 1 2 1

This use of integers for storage makes factors very memory-efficient compared to character text, at least when there are lots of repeated strings, as there are here. If we exaggerate the situation by generating 10,000 random genders (using the sample function to sample the strings “female” and “male” 10,000 times with replacement), we can see that a factor containing the values takes up less memory than the character equivalent. In the following code, sample returns a character vector (which we convert into a factor using as.factor), and object.size returns the memory allocation for each object:

gender_char <- sample(c("female", "male"), 10000, replace = TRUE)
gender_fac <- as.factor(gender_char)
object.size(gender_char)
## 80136 bytes
object.size(gender_fac)
## 40512 bytes

Note

Variables take up different amounts of memory on 32-bit and 64-bit systems, so object.size will return different values in each case.

For manipulating the contents of factor levels (a common case would be cleaning up names, so all your men have the value “male” rather than “Male”) it is typically best to convert the factors to strings, in order to take advantage of string manipulation functions. You can do this in the obvious way, using as.character:

as.character(gender)
## [1] "male"   "female" "female" "male"   "female"
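The cleanup described above might look something like the following sketch. The messy data is invented for the example, and tolower is just one of several string functions that could be used for this job:

```r
messy <- factor(c("male", "Male", "female", "MALE", "female"))
nlevels(messy)               #inconsistent capitalization creates extra levels
## [1] 4
#convert to strings, tidy them up, then convert back to a factor
clean <- factor(tolower(as.character(messy)))
levels(clean)
## [1] "female" "male"
```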

There is much more to learn about both character vectors and factors; they will be covered in depth in Chapter 7.

The raw class stores vectors of “raw” bytes.[10] Each byte is represented by a two-digit hexadecimal value. They are primarily used for storing the contents of imported binary files, and as such are reasonably rare. The integers 0 to 255 can be converted to raw using as.raw. Fractional and imaginary parts are discarded, and numbers outside this range are treated as 0. For strings, as.raw doesn’t work; you must use charToRaw instead:

as.raw(1:17)
##  [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11
as.raw(c(pi, 1 + 1i, -1, 256))
## Warning: imaginary parts discarded in coercion
## Warning: out-of-range values treated as 0 in coercion to raw
## [1] 03 01 00 00
(sushi <- charToRaw("Fish!"))
## [1] 46 69 73 68 21
class(sushi)
## [1] "raw"
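Going the other way is also possible. As a brief sketch, rawToChar reverses charToRaw, and as.integer shows the same bytes as decimal numbers:

```r
sushi <- charToRaw("Fish!")
rawToChar(sushi)             #convert the bytes back into a string
## [1] "Fish!"
as.integer(sushi)            #the same bytes, shown in decimal
## [1]  70 105 115 104  33
```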

As well as the vector classes that we’ve seen so far, there are many other types of variables; we’ll spend the next few chapters looking at them.

Arrays contain multidimensional data, and matrices (via the matrix class) are the special case of two-dimensional arrays. They will be discussed in Chapter 4.

So far, all these variable types need to contain the same type of thing. For example, a character vector or array must contain all strings, and a logical vector or array must contain only logical values. Lists are flexible in that each item in them can be a different type, including other lists. Data frames are what happens when a matrix and a list have a baby. Like matrices, they are rectangular, and as in lists, each column can have a different type. They are ideal for storing spreadsheet-like data. Lists and data frames are discussed in Chapter 5.

The preceding classes are all for storing data. Environments store the variables that store the data. As well as storing data, we clearly want to do things with it, and for that we need functions. We’ve already seen some functions, like sin and exp. In fact, operators like + are secretly functions too! Environments and functions will be discussed further in Chapter 6.

Chapter 7 discusses strings and factors in more detail, along with some options for storing dates and times.

There are some other types in R that are a little more complicated to understand, and we’ll leave these until later. Formulae will be discussed in Chapter 15, and calls and expressions will be discussed in the section Magic in Chapter 16. Classes will be discussed again in more depth in the section Object-Oriented Programming.

Checking and Changing Classes

Calling the class function is useful to interactively examine our variables at the command prompt, but if we want to test an object’s type in our scripts, it is better to use the is function, or one of its class-specific variants. In a typical situation, our test will look something like:

if(!is(x, "some_class"))
{
  #some corrective measure
}
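As a concrete (if contrived) instance of this pattern, the sketch below assumes that a value has arrived as text and needs to be a number. The name user_input and the choice of as.numeric as the corrective measure are illustrative assumptions:

```r
library(methods)             #provides is; it is usually loaded already
user_input <- "42"
if(!is(user_input, "numeric"))
{
  #corrective measure: convert the value to a number
  user_input <- as.numeric(user_input)
}
class(user_input)
## [1] "numeric"
user_input
## [1] 42
```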

Most of the common classes have their own is.* functions, and calling these is usually a little bit more efficient than using the general is function. For example:

is.character("red lorry, yellow lorry")
## [1] TRUE
is.logical(FALSE)
## [1] TRUE
is.list(list(a = 1, b = 2))
## [1] TRUE

We can see a complete list of all the is functions in the base package using:

ls(pattern = "^is", baseenv())
##  [1] "is.array"              "is.atomic"
##  [3] "is.call"               "is.character"
##  [5] "is.complex"            "is.data.frame"
##  [7] "is.double"             "is.element"
##  [9] "is.environment"        "is.expression"
## [11] "is.factor"             "is.finite"
## [13] "is.function"           "is.infinite"
## [15] "is.integer"            "is.language"
## [17] "is.list"               "is.loaded"
## [19] "is.logical"            "is.matrix"
## [21] "is.na"                 "is.na.data.frame"
## [23] "is.na.numeric_version" "is.na.POSIXlt"
## [25] "is.na<-"               "is.na<-.default"
## [27] "is.na<-.factor"        "is.name"
## [29] "is.nan"                "is.null"
## [31] "is.numeric"            "is.numeric.Date"
## [33] "is.numeric.difftime"   "is.numeric.POSIXt"
## [35] "is.numeric_version"    "is.object"
## [37] "is.ordered"            "is.package_version"
## [39] "is.pairlist"           "is.primitive"
## [41] "is.qr"                 "is.R"
## [43] "is.raw"                "is.recursive"
## [45] "is.single"             "is.symbol"
## [47] "is.table"              "is.unsorted"
## [49] "is.vector"             "isatty"
## [51] "isBaseNamespace"       "isdebugged"
## [53] "isIncomplete"          "isNamespace"
## [55] "isOpen"                "isRestart"
## [57] "isS4"                  "isSeekable"
## [59] "isSymmetric"           "isSymmetric.matrix"
## [61] "isTRUE"

In the preceding example, ls lists variable names, "^is" is a regular expression that means “match strings that begin with ‘is’,” and baseenv is a function that simply returns the environment of the base package. Don’t worry what that means right now, since environments are quite an advanced topic; we’ll return to them in Chapter 6.

The assertive package[11] contains more is functions with a consistent naming scheme.

One small oddity is that is.numeric returns TRUE for integers as well as floating point values. If we want to test for only floating point numbers, then we must use is.double. However, this isn’t usually necessary, as R is designed so that floating point and integer values can be used more or less interchangeably. In the following examples, note that adding an L suffix makes the number into an integer:

is.numeric(1)
## [1] TRUE
is.numeric(1L)
## [1] TRUE
is.integer(1)
## [1] FALSE
is.integer(1L)
## [1] TRUE
is.double(1)
## [1] TRUE
is.double(1L)
## [1] FALSE

Sometimes we may wish to change the type of an object. This is called casting, and most is* functions have a corresponding as* function to achieve it. The specialized as* functions should be used over plain as when available, since they are usually more efficient, and often contain extra logic specific to that class. For example, when converting a string to a number, as.numeric is slightly more efficient than plain as, but either can be used:

x <- "123.456"
as(x, "numeric")
## [1] 123.5
as.numeric(x)
## [1] 123.5

Note

The number of decimal places that R prints for numbers depends upon your R setup. You can set a global default using options(digits = n), where n is between 1 and 22. Further control of printing numbers is discussed in Chapter 7.
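To see that only the display is affected, and not the stored value, you can ask print for a specific number of significant digits. This sketch uses the digits argument of print rather than changing the global option:

```r
x <- as.numeric("123.456")
print(x, digits = 4)         #four significant digits
## [1] 123.5
print(x, digits = 7)
## [1] 123.456
x == 123.456                 #printing never changes the stored value
## [1] TRUE
```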

In this next example, however, note that when converting a vector into a data frame (a variable for spreadsheet-like data), the general as function throws an error:

y <- c(2, 12, 343, 34997)       #See http://oeis.org/A192892
as(y, "data.frame")             #throws an error
as.data.frame(y)                #works
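If you want to try this in a script without the error halting it, one approach is to wrap the failing call in tryCatch (error handling is a later topic, so treat this as a sketch). The replacement message is made up for the example:

```r
library(methods)             #provides as; it is usually loaded already
y <- c(2, 12, 343, 34997)
#tryCatch intercepts the error so that the script keeps running
as_result <- tryCatch(
  as(y, "data.frame"),
  error = function(e) "as() could not convert y"
)
as_result
## [1] "as() could not convert y"
dim(as.data.frame(y))        #the specialized function gives 4 rows, 1 column
## [1] 4 1
```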

Note

In general, the class-specific variants should always be used over standard as, if they are available.

It is also possible to change the type of an object by directly assigning it a new class, though this isn’t recommended (class assignment has a different use; see the section Object-Oriented Programming):

x <- "123.456"
class(x) <- "numeric"
x
## [1] 123.5
is.numeric(x)
## [1] TRUE

Examining Variables

Whenever we’ve typed a calculation or the name of a variable at the console, the result has been printed. This happens because R implicitly calls the print method of the object.

Note

As a side note on terminology: “method” and “function” are basically interchangeable. Functions in R are sometimes called methods in an object-oriented context. There are different versions of the print function for different types of object, making matrices print differently from vectors, which is why I said “print method” here.

So, typing 1 + 1 at the command prompt does the same thing as print(1 + 1).

Inside loops or functions,[12] the automatic printing doesn’t happen, so we have to explicitly call print:

ulams_spiral <- c(1, 8, 23, 46, 77)  #See http://oeis.org/A033951
for(i in ulams_spiral) i             #uh-oh, the values aren't printed
for(i in ulams_spiral) print(i)
## [1] 1
## [1] 8
## [1] 23
## [1] 46
## [1] 77

This is also true on some systems if you run R from a terminal rather than using a GUI or IDE. In this case you will always need to explicitly call the print function.

Most print functions are built upon calls to the lower-level cat function. You should almost never have to call cat directly (print and message are the user-level equivalents), but it is worth knowing in case you ever need to write your own print function.[13]
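To give a flavor of what writing your own print function involves, here is a minimal sketch. The temperature class is hypothetical and exists only for this illustration; the function is named print.temperature so that R uses it for objects of that class:

```r
#a hypothetical class, created only for this illustration
temp <- structure(21.5, class = "temperature")
print.temperature <- function(x, ...)
{
  #cat writes its arguments to the console, separated by spaces
  cat("Temperature:", unclass(x), "degrees C\n")
  invisible(x)               #return the object without printing it again
}
print(temp)
## Temperature: 21.5 degrees C
```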

Note

Both the c and cat functions are short for concatenate, though they perform quite different tasks! cat is named after a Unix command.

As well as viewing the printout of a variable, it is often helpful to see some sort of summary of the object. The summary function does just that, giving appropriate information for different data types. Numeric variables are summarized as mean, median, and some quantiles. Here, the runif function generates 30 random numbers that are uniformly distributed between 0 and 1:

num <- runif(30)
summary(num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0211  0.2960  0.5060  0.5290  0.7810  0.9920

Categorical and logical vectors are summarized by the counts of each value. In this next example, letters is a built-in constant that contains the lowercase values from “a” to “z” (LETTERS contains the uppercase equivalents, “A” to “Z”). Here, letters[1:5] uses indexing to restrict the letters to “a” to “e.” The sample function randomly samples these values, with replacement, 30 times:

fac <- factor(sample(letters[1:5], 30, replace = TRUE))
summary(fac)
## a b c d e
## 6 7 5 9 3
bool <- sample(c(TRUE, FALSE, NA), 30, replace = TRUE)
summary(bool)
##    Mode   FALSE    TRUE    NA's
## logical      12      11       7

Multidimensional objects, like matrices and data frames, are summarized by column (we’ll look at these in more detail in the next two chapters). The data frame dfr that we create here is quite large to display, having 30 rows. For large objects like this,[14] the head function can be used to display only the first few rows (six by default):

dfr <- data.frame(num, fac, bool)
head(dfr)
##       num fac  bool
## 1 0.47316   b    NA
## 2 0.56782   d FALSE
## 3 0.46205   d FALSE
## 4 0.02114   b  TRUE
## 5 0.27963   a  TRUE
## 6 0.46690   a  TRUE

The summary function for data frames works like calling summary on each individual column:

summary(dfr)
##       num         fac      bool
##  Min.   :0.0211   a:6   Mode :logical
##  1st Qu.:0.2958   b:7   FALSE:12
##  Median :0.5061   c:5   TRUE :11
##  Mean   :0.5285   d:9   NA's :7
##  3rd Qu.:0.7808   e:3
##  Max.   :0.9916

Similarly, the str function shows the object’s structure. It isn’t that interesting for vectors (since they are so simple), but str is exceedingly useful for data frames and nested lists:

str(num)
##  num [1:30] 0.4732 0.5678 0.462 0.0211 0.2796 ...
str(dfr)
## 'data.frame':        30 obs. of  3 variables:
##  $ num : num  0.4732 0.5678 0.462 0.0211 0.2796 ...
##  $ fac : Factor w/ 5 levels "a","b","c","d",..: 2 4 4 2 1 1 4 2 1 4 ...
##  $ bool: logi  NA FALSE FALSE TRUE TRUE TRUE ...
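Nested lists are where str really earns its keep. Lists aren’t covered until Chapter 5, so treat the following as a preview; the list named nested is invented for the example:

```r
nested <- list(a = 1, b = list(c = "x", d = TRUE))
str(nested)                  #indentation shows the nesting
## List of 2
##  $ a: num 1
##  $ b:List of 2
##   ..$ c: chr "x"
##   ..$ d: logi TRUE
```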

As mentioned previously, each class typically has its own print method that controls how it is displayed to the console. Sometimes this printing obscures its internal structure, or omits useful information. The unclass function can be used to bypass this, letting you see how a variable is constructed. For example, calling unclass on a factor reveals that it is just an integer vector, with an attribute called levels:

unclass(fac)
##  [1] 2 4 4 2 1 1 4 2 1 4 3 3 1 5 4 5 1 5 1 2 2 3 4 2 4 3 4 2 3 4
## attr(,"levels")
## [1] "a" "b" "c" "d" "e"

We’ll look into attributes later on, but for now, it is useful to know that the attributes function gives you a list of all the attributes belonging to an object:

attributes(fac)
## $levels
## [1] "a" "b" "c" "d" "e"
##
## $class
## [1] "factor"

For visualizing two-dimensional variables such as matrices and data frames, the View function (notice the capital “V”) displays the variable as a (read-only) spreadsheet. The edit and fix functions work similarly to View, but let us manually change data values. While this may sound more useful, it is usually a supremely awful idea to edit data in this way, since we lose all traceability of where our data came from. It is almost always better to edit data programmatically:

View(dfr)               #no changes allowed
new_dfr <- edit(dfr)    #changes stored in new_dfr
fix(dfr)                #changes stored in dfr
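By way of contrast, a programmatic edit leaves a record of exactly what was changed. The sketch below uses a small made-up data frame and indexing syntax that is explained properly in Chapter 5:

```r
#a small, made-up data frame with one obviously bad value
scores <- data.frame(name = c("ann", "bob", "cat"), score = c(10, -1, 7))
scores$score[scores$score < 0] <- NA   #replace impossible scores with NA
scores
##   name score
## 1  ann    10
## 2  bob    NA
## 3  cat     7
```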

A useful trick is to view the first few rows of a data frame by combining View with head:

View(head(dfr, 50))     #view first 50 rows

The Workspace

While we’re working, it’s often nice to know the names of the variables that we’ve created and what they contain. To list the names of existing variables, use the function ls. This is named after the equivalent Unix command, and follows the same convention: by default, variable names that begin with a . are hidden. To see them, pass the argument all.names = TRUE:

#Create some variables to find
peach <- 1
plum <- "fruity"
pear <- TRUE
ls()
##  [1] "a_vector"          "all_true"          "bool"
##  [4] "dfr"               "fac"               "fname"
##  [7] "gender"            "gender_char"       "gender_fac"
## [10] "i"                 "input"             "my_local_variable"
## [13] "none_true"         "num"               "output"
## [16] "peach"             "pear"              "plum"
## [19] "remove_package"    "some_true"         "sushi"
## [22] "ulams_spiral"      "x"                 "xy"
## [25] "y"                 "z"                 "zz"
ls(pattern = "ea")
## [1] "peach" "pear"
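The hiding of names that start with a dot can be checked directly. In this sketch, .secret_plum is a made-up variable name:

```r
.secret_plum <- "hidden"                   #the leading dot hides this variable
".secret_plum" %in% ls()                   #not listed by default
## [1] FALSE
".secret_plum" %in% ls(all.names = TRUE)   #listed when we ask for all names
## [1] TRUE
```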

For more information about our workspace, we can see the structure of our variables using ls.str. This is, as you might expect, a combination of the ls and str functions, and is very useful during debugging sessions (see Debugging in Chapter 16). browseEnv provides a similar capability, but displays its output in an HTML page in our web browser:

browseEnv()
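To try ls.str without being swamped by output, you can restrict it with the same pattern argument that ls accepts. This is a minimal sketch; the name workspace_structure is an illustrative choice, and the exact layout of the printout isn’t shown because it can vary:

```r
peach <- 1
plum <- "fruity"
#names and structure of the variables whose names begin with "p"
workspace_structure <- ls.str(pattern = "^p")
workspace_structure          #prints one line per matching variable
```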

After working for a while, especially while exploring data, our workspace can become quite cluttered. We can clean it up by using the rm function to remove variables:

rm(peach, plum, pear)
rm(list = ls())       #Removes everything that ls lists. Use with caution!
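A less drastic option is to combine rm with the pattern argument of ls, so that only some variables are removed. The variable names in this sketch are made up:

```r
tmp_a <- 1
tmp_b <- 2
keep_me <- 3
rm(list = ls(pattern = "^tmp_"))   #remove only variables starting with "tmp_"
exists("tmp_a")
## [1] FALSE
exists("keep_me")
## [1] TRUE
```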

Summary

  • All variables have a class.
  • You test if an object has a particular class using the is function, or one of its class-specific variants.
  • You can change the class of an object using the as function, or one of its class-specific variants.
  • There are several functions that let you inspect variables, including summary, head, str, unclass, attributes, and View.
  • ls lists the names of your variables and ls.str lists them along with their structure.
  • rm removes your variables.

Test Your Knowledge: Quiz

Question 3-1
What are the names of the three built-in classes of numbers?
Question 3-2
What function would you call to find out the number of levels of a factor?
Question 3-3
How might you convert the string “6.283185” to a number?
Question 3-4
Name at least three functions for inspecting the contents of a variable.
Question 3-5
How would you remove all the variables in the user workspace?

Test Your Knowledge: Exercises

Exercise 3-1
Find the class, type, mode, and storage mode of the following values: Inf, NA, NaN, "". [5]
Exercise 3-2
Randomly generate 1,000 pets, from the choices “dog,” “cat,” “hamster,” and “goldfish,” with equal probability of each being chosen. Display the first few values of the resultant variable, and count the number of each type of pet. [5]
Exercise 3-3
Create some variables named after vegetables. List the names of all the variables in the user workspace that contain the letter “a.” [5]


[8] If these limits aren’t good enough for you, higher-precision values are available via the Rmpfr package, and very large numbers are available in the brobdingnag package. These are fairly niche requirements, though; the three built-in classes of R numbers should be fine for almost all purposes.

[9] Or hood, if you prefer.

[10] It is unclear what a cooked byte would entail.

[11] Disclosure: I wrote it.

[12] Except for the value being returned from the function.

[13] Like in Exercise 16-3, perhaps.

[14] These days, 30 rows isn’t usually considered to be “big data,” but it’s still a screenful when printed.
