Manipulating data

Before you can extract some information from your data, you need to understand how the data is stored. First, you need to understand data structures in R.

Scalars and vectors are the most basic data structures. In R terminology, you analyze a dataset. A dataset consists of rows with cases or observations to analyze and columns representing the variables or attributes of the cases. This definition of a dataset looks like a SQL Server table. However, R does not work with tables in the relational sense. For example, in a relational, the order of rows and columns is not defined. In order to get a value, you need the column name and the key of the row. However, in R, you can use the position of a cell for most of the data structures. You have already seen this position reference for vectors.

In this section, you will learn about the data structures in R and the basic manipulation of datasets, including:

  • Arrays and matrices
  • Factors
  • Data frames
  • Lists
  • Creating new variables
  • Recoding variables
  • Dealing with missing values
  • Type conversions
  • Merging datasets

Introducing data structures in R

A matrix is a two-dimensional array of values of the same type, or mode. You generate matrices from vectors with the matrix() function. Columns and rows can have labels. You can generate a matrix from a vector by rows or by columns (default). The following code shows some matrices and the difference of generation by rows or by columns:

x = c(1,2,3,4,5,6); x;          
Y = array(x, dim=c(2,3)); Y;    
Z = matrix(x,2,3,byrow=F); Z 
U = matrix(x,2,3,byrow=T); U;  # A matrix - fill by rows 
rnames = c("Row1", "Row2"); 
cnames = c("Col1", "Col2", "Col3"); 
V = matrix(x,2,3,byrow=T, dimnames = list(rnames, cnames)); V;   

The first line generates and shows a one-dimensional vector. The second line creates a two-dimensional array, which is the same as a matrix, with two rows and three columns. The matrix is filled from a vector column by column. The third row actually uses the matrix() function to create a matrix, and fill it by columns. The matrix is equivalent to the previous one. The fourth row fills the matrix by rows. The fifth and sixth rows define row and column names. The last row again creates a matrix filled by rows; however, this time it adds row and column names in a list of two vectors. Here are the results:

[1] 1 2 3 4 5 6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
     Col1 Col2 Col3
Row1    1    2    3
Row2    4    5    6

You can see the difference between filling by rows or by columns. The following code shows how you can refer to matrix elements by position, or even by name, if you've named columns and rows:

U[1,]; 
U[1,c(2,3)]; 
U[,c(2,3)]; 
V[,c("Col2", "Col3")]; 

The results are:

[1] 1 2 3
[1] 2 3
 [,1] [,2]
[1,]    2    3
[2,]    5    6
 Col2 Col3
Row1    2    3
Row2    5    6

As you can see from the matrix examples, a matrix is just a two-dimensional array. You generate arrays with the array() function. This function again accepts a vector of values as the first input parameter, then a vector specifying the number of elements on dimensions, and then a list of vectors for the names of the dimensions' elements. An array is filled by columns, then by rows, then by the third dimension (let's call it pages), and so on. Here is an example that generates a three-dimensional array:

rnames = c("Row1", "Row2"); 
cnames = c("Col1", "Col2", "Col3"); 
pnames = c("Page1", "Page2", "Page3"); 
Y = array(1:18, dim=c(2,3,3), dimnames = list(rnames, cnames,  pnames)); Y;

The result is as follows:

, , Page1
     Col1 Col2 Col3
Row1    1    3    5
Row2    2    4    6
, , Page2
     Col1 Col2 Col3
Row1    7    9   11
Row2    8   10   12
, , Page3
     Col1 Col2 Col3
Row1   13   15   17
Row2   14   16   18

Variables can store discrete or continuous values. Discrete values can be nominal, or categorical, where they represent labels only, or ordinal, where there is an intrinsic order in the values. In R, factors represent nominal and ordinal variables. Levels of a factor represent distinct values. You create factors from vectors with the factor() function. It is important to properly determine the factors because advanced data mining and machine learning algorithms treat discrete and continuous variables differently. Here are some examples of factors:

x = c("good", "moderate", "good", "bad", "bad", "good"); 
y = factor(x); y;   
z = factor(x, order=TRUE); z; 
w = factor(x, order=TRUE,  
           levels=c("bad", "moderate","good")); w; 

The first line defines a vector of six values denoting whether the observed person was in a good, moderate or bad mood. The second line generates a factor from the vector and shows it. The third line generates an ordinal variable. Note the results—the order is defined alphabetically. The last commands in the last two lines generate another ordinal variable from the same vector, just this time specifying the order explicitly. Here are the results:

[1] good     moderate good     bad      bad      good    
Levels: bad good moderate
[1] good     moderate good     bad      bad      good    
Levels: bad < good < moderate
[1] good     moderate good     bad      bad      good    
Levels: bad < moderate < good

Lists are the most complex data structures in R. Lists are ordered collections of different data structures. You typically do not work with them a lot. You need to know them because some functions return multiple results, or complex results, packed in a list, and you need to extract specific parts. You create lists with the list() function. You refer to objects of a list by position, using the index number enclosed in double parentheses. If an element is a vector or a matrix, you can additionally use the position of a value in a vector enclosed in single parentheses. Here is an example:

L = list(name1="ABC", name2="DEF", 
         no.children=2, children.ages=c(3,6)); 
L; 
L[[1]]; 
L[[4]]; 
L[[4]][2]; 

The example produces the following result:

$name1
[1] "ABC"
$name2
[1] "DEF"
$no.children
[1] 2
$children.ages
[1] 3 6
[1] "ABC"
[1] 3 6
[1] 6

Finally, the most important data structure is a data frame. Most of the time, you analyze data stored in a data frame. Data frames are matrices where each variable can be of a different type. Remember, a variable is stored in a column, and all values of a single variable must be of the same type. Data frames are very similar to SQL Server tables. However, they are still matrices, meaning that you can refer to the elements by position, and that they are ordered. You create a data frame with the data.frame() function from multiple vectors of the same length. Here is an example of generating a data frame:

CategoryId = c(1,2,3,4); 
CategoryName = c("Bikes", "Components", "Clothing", "Accessories"); 
ProductCategories = data.frame(CategoryId, CategoryName); 
ProductCategories; 

The result is:

 CategoryId CategoryName
1          1        Bikes
2          2   Components
3          3     Clothing
4          4  Accessories

Most of the time, you get a data frame from your data source, for example from a SQL Server database. The results of the earlier example that was reading from SQL Server can be actually stored in a data frame. You can also enter the data manually, or read it from many other sources, including text files, Excel, and many more. The following code retrieves the data from a comma-separated values (CSV) file in a data frame, and then displays the first four columns for the first five rows of the data frame. The CSV file is provided in the download for the accompanying code for this book, as described in the preface of the book:

TM = read.table("C:\SQL2016DevGuide\Chapter13_TM.csv", 
                sep=",", header=TRUE, row.names = "CustomerKey", 
                stringsAsFactors = TRUE); 
TM[1:5,1:4]; 

The code specifies that the first row of the file holds column names (header), and that the CustomerKey column represents the row names (or row identifications). If you are interested, the data comes from the dbo.vTargetMail view from the AdventureWorksDW2014 demo SQL Server database you can download from Microsoft CodePlex at https://msftdbprodsamples.codeplex.com/. The first five rows are presented here:

      MaritalStatus Gender TotalChildren NumberChildrenAtHome
11000             M      M             2                    0
11001             S      M             3                    3
11002             M      M             3                    3
11003             S      F             0                    0
11004             S      F             5                    5

Getting sorted with data management

After you've imported the data to analyze in a data frame, you need to prepare it for further analysis. There are quite a few possibilities of how to retrieve values from a data frame. You can refer to the data by position, or by using the subscript notation. However, if you use a single subscript only, then you retrieve columns defined by the subscript and all rows. The same is true if you use a vector of column names only, without specifying the rows. If you specify two indexes or index ranges, then the first one is used for rows and the second one for columns. The following code shows these options:

TM[1:2];                              # Two columns 
TM[c("MaritalStatus", "Gender")];     # Two columns 
TM[1:3,1:2];                          # Three rows, two columns 
TM[1:3,c("MaritalStatus", "Gender")]; 

The first command returns all rows and two columns only. The second command produces the same result. The third command returns three rows, again for the MaritalStatus and Gender columns only. The fourth row produces the same result as the third one.

The most common notation is using the data frame name and column name, separated by the dollar ($) sign, like TM$Gender. You can also avoid excessive writing of the data frame name by using the attach() or with() functions. With the attach() function, you add the data frame to the search path that R uses to find the objects. You can refer to a variable name directly without the data frame name if the variable name is unique in the search path. The detach() function removes the data frame from the search path, to avoid possible ambiguity with duplicate variable names later. The with() function allows you to name the data frame only once, and then use the variables in a set of statements enclosed in {} brackets inside the body of the function. The following code shows these approaches:

table(TM$MaritalStatus, TM$Gender); 
attach(TM); 
table(MaritalStatus, Gender); 
detach(TM); 
with(TM, 
     {table(MaritalStatus, Gender)}); 

The first line produces a cross-tabulation of the MaritalStatus and Gender. Please note the dataframe$variable notation. The second line adds the data frame to the search path. The third command produces the same cross-tabulation as the first one, however, this time referring to variable names only. The fourth command removes the data frame from the search path. The last command uses the with() function to allow you to define the data frame name only once and then use only variable names in the commands inside the function. Note that, because there is only one command in the function, the brackets {} can be omitted. All three cross-tabulations return the same result:

             Gender
MaritalStatus    F    M
            M 4745 5266
            S 4388 4085

Sometimes, you get numeric categorical variable values and you want to use character labels. The factor() function can help you here. For example, in the TM data frame, there is the BikeBuyer variable. For this variable, 0 means the person never purchased a bike and 1 means this person is a bike buyer. The following code shows you how to add labels to the numerical values:

table(TM$BikeBuyer, TM$Gender); 
TM$BikeBuyer <- factor(TM$BikeBuyer, 
                       levels = c(0,1), 
                       labels = c("No","Yes")); 
table(TM$BikeBuyer, TM$Gender); 

The results are shown here. Note that, the second time, the labels for the values of the BikeBuyer variable are used:

     F    M
0 4536 4816
1 4597 4535
       F    M
No  4536 4816
Yes 4597 4535

You can easily get the metadata about your objects. Some useful functions that give you information about your objects include the following:

  • class(): This function returns the type of object
  • names(): This function returns the names of the components, such as variable names in a data frame
  • length(): This function returns the number of elements, for example, the number of variables in a data frame
  • dim(): This function returns the dimensionality of an object, for example, the number of rows and columns in a data frame
  • str(): This function gives details about the structure of an object

Here are examples of using these metadata functions:

class(TM); 
names(TM); 
length(TM); 
dim(TM); 
str(TM); 

The results are as follows:

 [1] "data.frame"
 [1] "MaritalStatus"        "Gender"               "TotalChildren"       
 [4] "NumberChildrenAtHome" "Education"            "Occupation"          
 [7] "HouseOwnerFlag"       "NumberCarsOwned"      "CommuteDistance"     
[10] "Region"               "BikeBuyer"            "YearlyIncome"        
[13] "Age"                 
 [1] 13
[1] 18484    13
'data.frame':     18484 obs. of  13 variables:
$ MaritalStatus       : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2 ...
$ Gender              : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 2 ...
$ TotalChildren       : int  2 3 3 0 5 0 0 3 4 0 ...

Note that the CustomerKey column is not listed among the thirteen columns of the data frame, because when the data was imported, this column was set to row names. In addition, only string variables were converted to factors. In the abbreviated result of the last command, you can see that the TotalChildren is an integer and not a factor, although it can occupy only values from zero to five.

Many times, calculated variables are much more meaningful for an analysis than just the base ones you read from your data source. For example, in medicine, the body mass index (BMI), defined as weight divided by the square of the height, is much more meaningful than the base variables of height and weight it is derived from. You can add new variables to a data frame, recode continuous values to a list of discrete values, change the data type of variable values, and more. The following example uses the within() function, which is similar to the with() function. It's just that it allows updates of a data frame, to add a new variable MartitalStatusInt, derived from the MaritalStatus as an integer. This variable tells us the number of additional people in the household of the case observed:

TM <- within(TM, { 
  MaritalStatusInt <- NA 
  MaritalStatusInt[MaritalStatus == "S"] <- 0 
  MaritalStatusInt[MaritalStatus == "M"] <- 1 
}); 
str(TM); 

In the body of the function, firstly the new variable is defined as missing. Then, the MaritalStatus values are used to define the number of additional persons in the household; if the person in the case observed is married, then the value is 1, if the person is single, then 0. The last line of the code shows the new structure of the data frame. The abbreviated structure is:

'data.frame':    18484 obs. of  14 variables:
$ MaritalStatus       : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2 ...
...
$ MaritalStatusInt    : num  1 0 1 0 0 0 0 1 0 0 ...

You can see that the new variable values are correct; however, the mode is defined as numeric. You can change the data type with one of the as.targettype() functions, where targettype() is a placeholder for the actual target type function, as the following example shows:

TM$MaritalStatusInt <- as.integer(TM$MaritalStatusInt); 
str(TM); 

Now, the abbreviated structure of the TM data frame shows that the mode of the new column is integer.

In the next example, a new variable is added to the data frame, just as a simple calculation. The new variable, HouseholdNumber, is used to define the total number of people in the household of the person in the case observed. The calculation summarizes the number of children at home plus one if the person is married plus one for the person herself/himself. Finally, the mode is changed to an integer:

TM$HouseholdNumber = as.integer( 
  1 + TM$MaritalStatusInt + TM$NumberChildrenAtHome); 
str(TM); 

The structure of the data frame shows that the calculation is correct:

'data.frame':    18484 obs. of  15 variables:
$ MaritalStatus       : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2 ...
...
$ MaritalStatusInt    : int  1 0 1 0 0 0 0 1 0 0 ...
$ HouseholdNumber     : int  2 4 5 1 6 1 1 5 5 1 ...

On many occasions, you have to deal with missing values, denoted with a literal NA. R treats missing values as completely unknown. This influences the results of the calculations. For example, adding a missing value to a known integer produces a missing value. You have to decide how to deal with missing data. You can exclude the rows with missing data completely, you can recode the missing values to a predefined value, or you can exclude the missing values from a single calculation. The following code defines a vector of six values; however, the last value is missing. You can use the is.na() function to check for each value, whether it is missing or not. Then, the code tries to calculate the mean value for all values of the vector. The last line of code tries to calculate the mean again, just this time by disregarding the missing value by using the na.rm = TRUE option. This option is available in most numeric functions and simply removes the missing values from the calculation:

x <- c(1,2,3,4,5,NA); 
is.na(x); 
mean(x); 
mean(x, na.rm = TRUE); 

The results of the previous code are here:

[1] FALSE FALSE FALSE FALSE FALSE  TRUE
[1] NA
[1] 3

You can see that the is.na() function evaluated each element separately, and returned a vector of the same dimensionality as the vector checked for the missing values. The mean() function returned a missing value when in the calculation a missing value was present, and the result you might have expected when the missing values were removed.

Frequently, you need to merge two datasets, or to define a new dataset as a projection of an existing one. Merging is similar to joining two tables in SQL Server, and a projection means selecting a subset of variables only. The merge() function joins two data frames based on a common identification of each case. Of course, the identification must be unique. The following code shows how to do the projection:

TM = read.table("C:\SQL2016DevGuide\Chapter13_TM.csv", 
                sep=",", header=TRUE, 
                stringsAsFactors = TRUE); 
TM[1:3,1:3]; 
cols1 <- c("CustomerKey", "MaritalStatus"); 
TM1 <- TM[cols1]; 
cols2 <- c("CustomerKey", "Gender"); 
TM2 <- TM[cols2]; 
TM1[1:3, 1:2]; 
TM2[1:3, 1:2]; 

The code first re-reads the TM data frame, just this time without using the CustomerKey column for the row names. This column must be available in the data frame, because this is the unique identification of each case. Then the code defines the columns for the two projection data frames and shows the first three rows of each new data frame, as you can see in the results of the last two commands:

  CustomerKey MaritalStatus
1       11000             M
2       11001             S
3       11002             M
  CustomerKey Gender
1       11000      M
2       11001      M
3       11002      M

Now, let's join the two new datasets.

TM3 <- merge(TM1, TM2, by = "CustomerKey"); 
TM3[1:3, 1:3]; 

The results show that the join was done correctly:

  CustomerKey MaritalStatus Gender
1       11000             M      M
2       11001             S      M
3       11002             M      M

A data frame is a matrix. Sort order is important. Instead of merging two data frames by columns, you can bind them by columns. However, you need to be sure that both data frames are sorted in the same way; otherwise you might bind variables from one case with variables from another case. The following code binds two data frames by columns:

TM4 <- cbind(TM1, TM2); 
TM4[1:3, 1:4]; 

The results show that, unlike the merge() function, the cbind() function did not use the CustomerKey for a common identification. It has blindly bound columns case by case, and preserved all variables from both source data frames. That's why the CustomerKey column appears twice in the result:

  CustomerKey MaritalStatus CustomerKey.1 Gender
1       11000             M         11000      M
2       11001             S         11001      M
3       11002             M         11002      M

You can also use the rbind() function to bind two data frames by rows. This is equal to the union of two rowsets in SQL Server. The following code shows how to filter a dataset by creating two new datasets, each one with two rows and two columns only. Then the code uses the rbind() function to union both data frames:

TM1 <- TM[TM$CustomerKey < 11002, cols1]; 
TM2 <- TM[TM$CustomerKey > 29481, cols1]; 
TM5 <- rbind(TM1, TM2); 
TM5; 

The results are here:

      CustomerKey MaritalStatus
1           11000             M
2           11001             S
18483       29482             M
18484       29483             M

Finally, what happens if you want to bind two data frames by columns but you are not sure about the ordering? Of course, you can sort the data. The following code shows how to create a new data frame from the TM data frame, this time sorted by the Age column descending. Note the usage of the minus (-) sign in the order() function to achieve the descending sort:

TMSortedByAge <- TM[order(-TM$Age),c("CustomerKey", "Age")]; 
TMSortedByAge[1:5,1:2]; 

The result is shown here:

     CustomerKey Age
1726       12725  99
5456       16455  98
3842       14841  97
3993       14992  97
7035       18034  97
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset