Chapter 3. An Introduction to Data Processing in R

This chapter covers basic techniques for reorganizing data (cleaning, processing, and so on). These techniques are key when developing web applications with R and Shiny because, unlike traditional commercial web application software, R and Shiny make it possible to perform any operation on your data and, consequently, to display it exactly as imagined. By presenting several useful tools, this chapter will help the reader gain skills in data manipulation.

The chapter is divided into the following seven sections:

  • Sorting elements
  • Basic summary functions
  • grep and regular expressions
  • The apply-like functions
  • The plyr package
  • The data.table package
  • The reshape2 package

Sorting elements

Base R (that is, the set of packages that comes by default when installing R) provides two main functions for ordering elements: sort() and order().

  • sort(): This function returns the passed vector in increasing order (the default) or in decreasing order:
    > vector1 <- c(2,5,3,4,1)
    > sort(vector1)
    [1] 1 2 3 4 5
    

    If the vector passed is of the character type, the function returns it in alphabetical order and if it is logical, it will first return the FALSE elements and then the TRUE elements:

    > sort(c(T,T,F,F))
    [1] FALSE FALSE  TRUE  TRUE
    
  • order(): This returns the index numbers of the elements, arranged according to their values:
    > vector1 <- c(2,5,3,4,1)
    > order(vector1)
    [1] 5 1 3 4 2
    

    In the preceding example, for the vector1 object, the function returns the fifth element first, then the first, then the third, and so on. For character or logical vectors, the criterion is the same as in sort(). Both functions also accept the decreasing argument, which reverses the order:

    > sort(vector1,decreasing=T)
    [1] 5 4 3 2 1
    

    Tip

    To return elements in decreasing order in both the sort() and order() functions, include decreasing=T in the function call, as the default is decreasing=F, which means increasing order.
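The behavior of both functions on character vectors, and the effect of the decreasing argument, can be checked with a short sketch. The vectors fruits and scores are made up for this illustration and are not part of the chapter's data:

```r
# Hypothetical example vectors (assumptions, not from the chapter)
fruits <- c("pear", "apple", "fig")
scores <- c(2, 5, 3, 4, 1)

# Character vectors are returned in alphabetical order
sorted.fruits <- sort(fruits)

# order() returns the positions that would sort the vector
fruit.positions <- order(fruits)

# decreasing = TRUE reverses the ordering in both functions
scores.desc <- sort(scores, decreasing = TRUE)
score.positions.desc <- order(scores, decreasing = TRUE)

print(sorted.fruits)         # "apple" "fig" "pear"
print(fruit.positions)       # 2 3 1
print(scores.desc)           # 5 4 3 2 1
print(score.positions.desc)  # 2 4 3 1 5
```

Writing TRUE and FALSE in full instead of T and F is a cautious choice here, since T and F are ordinary variables that can be reassigned.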

sort() versus order()

To obtain the same result as sort(object) with order(), index the object by the output of order():

> vector1[order(vector1)]
[1] 1 2 3 4 5

As explained in the previous chapter, the elements of a vector can be accessed by index numbers, and as order() returns the index numbers according to their value, indexing a vector by its order() output will result in an ordered vector.
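One practical consequence of working with index numbers is that the same order() output can rearrange a second, parallel vector. The following is a minimal sketch, assuming two made-up vectors (people and ages) in which the element at each position describes the same person:

```r
# Hypothetical parallel vectors: ages[i] belongs to people[i]
people <- c("Ana", "Luis", "Eva", "Tom")
ages <- c(34, 21, 45, 29)

# Positions of the elements of ages, from youngest to oldest
idx <- order(ages)

ages.sorted <- ages[idx]      # equivalent to sort(ages)
people.by.age <- people[idx]  # names rearranged to stay aligned with ages

print(idx)            # 2 4 1 3
print(ages.sorted)    # 21 29 34 45
print(people.by.age)  # "Luis" "Tom" "Ana" "Eva"
```

sort(ages) alone would return the ordered ages but would lose the link to the names, which is why order() is the better fit for this kind of task.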

Unlike sort(), order() can handle multiple input vectors, in which case the ordering criteria are applied in the order in which the vectors are passed. For example, if there is a tie in the ordering by the first criterion, the second vector is used to break it:

> vector1 <- c(2,2,3,3,1)
> vector2 <- c(2,5,4,3,1)
> order(vector1,vector2,decreasing = c(T,F))
[1] 4 3 1 2 5

Note that with multiple vectors, a logical vector can be passed to the decreasing argument, with one value per vector being ordered. R supports this through the radix ordering method, which it selects automatically for short numeric vectors such as these. If the logical vector is shorter than the number of vectors being ordered, it is recycled, as happens with element selection and as explained in Chapter 2, First Steps towards Programming in R.
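Tie-breaking with mixed directions can be sketched as follows. The vectors team and points are hypothetical, and the sketch assumes a reasonably recent version of R in which order() accepts the method argument. For numeric keys, negating a vector is a common alternative way to reverse its direction:

```r
# Hypothetical example data (assumptions, not from the chapter)
team <- c(1, 2, 1, 2)
points <- c(30, 10, 20, 40)

# Both keys increasing: ties in team are broken by points
both.increasing <- order(team, points)

# team decreasing, points increasing, one flag per vector;
# the "radix" method is the one that accepts a flag per vector
mixed.radix <- order(team, points, decreasing = c(TRUE, FALSE),
                     method = "radix")

# For numeric keys, negating a key reverses its direction
mixed.negated <- order(-team, points)

print(both.increasing)  # 3 1 2 4
print(mixed.radix)      # 2 4 3 1
print(mixed.negated)    # 2 4 3 1
```

The negation approach only works for numeric keys, whereas the decreasing vector states the intent explicitly for each key.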

The order() function is particularly useful for ordering matrices or data frames: the rows are indexed by the output of order() applied to one or more of the columns:

> data(iris)
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> iris.ordered <- iris[order(iris$Sepal.Length, iris$Sepal.Width),]

Note

Iris is a dataset that comes from Sir Ronald Fisher's 1936 investigation of different species of Iris. It is considered a standard dataset and ships with R in the datasets package, which is loaded by default. It can be loaded just by typing data(iris).

After loading the iris data frame object, a new iris.ordered object is created. It is the same dataset as iris but ordered by Sepal.Length and, within equal values of Sepal.Length, by Sepal.Width: the order() function returns the corresponding row indexes, which are then applied to the iris dataset. Note that, as it is a data frame, the indexing has a comma that separates rows from columns (in the case of data frames, observations from variables). As there is nothing after the comma, R returns all the variables.
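The result of ordering the data frame can be inspected with a few standard base R calls. This is only a sketch; the iris.desc object is a hypothetical addition that shows the decreasing variant on a single column:

```r
data(iris)

# Rows ordered by Sepal.Length, ties broken by Sepal.Width
iris.ordered <- iris[order(iris$Sepal.Length, iris$Sepal.Width), ]

# Same number of rows and columns as the original: 150 and 5
print(dim(iris.ordered))

# First rows of a subset of columns; the row names still show
# the positions the observations had in the original dataset
print(head(iris.ordered[, c("Sepal.Length", "Sepal.Width", "Species")]))

# Hypothetical variant: largest Sepal.Length first
iris.desc <- iris[order(iris$Sepal.Length, decreasing = TRUE), ]
print(head(iris.desc$Sepal.Length))
```

Because only the rows are rearranged, every observation keeps its values in all five variables.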

Tip

Unlike in many other languages, dots are allowed in variable, column, and row names in R.

In conclusion, order() is a very useful function, above all because it is the standard way to order data frames and matrices in base R.
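The same row-indexing idea applies to matrices. The following sketch uses a small made-up matrix m with named columns; the names id and score are assumptions chosen for illustration:

```r
# Hypothetical 4 x 2 matrix with named columns
m <- cbind(id = c(3, 1, 4, 2), score = c(10, 40, 20, 30))

# Reorder the rows by the id column; nothing after the comma
# means that all columns are kept
m.by.id <- m[order(m[, "id"]), ]

# Reorder the rows by score, highest first
m.by.score <- m[order(m[, "score"], decreasing = TRUE), ]

print(m.by.id)
print(m.by.score)
```

In m.by.id the id column reads 1, 2, 3, 4, and in m.by.score the score column reads 40, 30, 20, 10, with the other column following along in each case.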
