Creating transactional data

In the world of the Internet of Things, you receive a ton of data. As you monitor devices for anomalies or failures, let's say you get some fault codes. How would you put the raw data into something meaningful for analysis in R? Well, here's a case study. We'll put together a random dataset and turn it into the proper form for use with R's arules package. Here's the dataframe:

> set.seed(270)

> faults <- data.frame(
    serialNumber = sample(1:20, 100, replace = TRUE),
    faultCode = paste("fc", sample(1:12, 100, replace = TRUE), sep = ""),
    stringsAsFactors = TRUE  # R >= 4.0 no longer converts strings to factors by default
  )

This gives us up to 20 different serial numbers, which identify the monitored devices that have had faults. Each device can report any of 12 different fault codes. One limitation of association analysis as we're doing it is that the order of the transactions isn't taken into account. Let's assume that isn't an issue in this example and proceed. First, because the data was generated at random, some serial number and fault code pairs repeat, so we will remove the duplicates:

> faults <- unique(faults)
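
To see what the deduplication does, here is a minimal sketch on a small made-up data frame (the `demo` values are hypothetical and not part of the dataset above). It counts the repeated rows first and then removes them, which matters because the later reshaping step expects at most one row per serial number and fault code pair:

```r
# Small hypothetical long-format sample with one repeated row
demo <- data.frame(
  serialNumber = c(1L, 1L, 2L, 1L),
  faultCode = c("fc3", "fc7", "fc3", "fc3")
)

# duplicated() flags every row that repeats an earlier row
n_dupes <- sum(duplicated(demo))

# unique() keeps the first occurrence of each serialNumber/faultCode pair
demo_unique <- unique(demo)

print(n_dupes)
print(demo_unique)
```

Running the same `sum(duplicated(faults))` check on your own data before calling `unique()` tells you how many rows you are about to drop.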

The structure of the dataframe before turning it into transactions is critical. The identifier column needs to be an integer. So, if you have a customer or equipment identifier such as 123abc, you must convert it to an integer. Then, the item of interest must be a factor. Here, we confirm that we have the proper dataframe structure:

> str(faults)
'data.frame': 80 obs. of 2 variables:
$ serialNumber: int 9 8 1 18 11 20 2 16 10 20 ...
$ faultCode : Factor w/ 12 levels "fc1","fc10","fc11",..: 2 5 1 12 1 3 6 10 11 1 ...
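
If your identifiers are alphanumeric, like the 123abc example mentioned earlier, one possible way to obtain integers is to go through a factor. The following is a sketch under that assumption; the `ids` vector and the `id_lookup` table are hypothetical names introduced only for illustration:

```r
# Hypothetical alphanumeric equipment IDs (not from the dataset above)
ids <- c("123abc", "77xyz", "123abc", "9qrs")

# factor() assigns one level per distinct ID; as.integer() returns the level index
id_levels <- factor(ids)
id_int <- as.integer(id_levels)

# Keep a lookup table so the integer codes can be mapped back to the original IDs
id_lookup <- data.frame(
  serialNumber = seq_along(levels(id_levels)),
  originalId = levels(id_levels)
)

print(id_int)
print(id_lookup)
```

The integer codes carry no meaning beyond identity, so keeping the lookup table around is what lets you report results against the original identifiers.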

Notice that this data is in the long format, which is usually how it's produced. As such, create a column where all values are TRUE and use the tidyr package (part of the tidyverse) to reshape the data into the wide format:

> faults$indicator <- TRUE

> faults_wide <- tidyr::spread(faults, key = faultCode, value = indicator)

We now have a dataframe with the associated faults labeled as TRUE for each item of interest. Next, turn the data into a matrix while dropping the ID:

> faults_matrix <- as.matrix(faults_wide[,-1])

Any fault a device never reported shows up as a missing value (NA) after reshaping. You must turn these into explicit logical values, so let's make them FALSE:

> faults_matrix[is.na(faults_matrix)] <- FALSE
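
As a cross-check on the reshaping steps, a similar logical matrix can be built in base R with `table()`. This is an alternative sketch rather than the method used above, and the `demo` data frame is a hypothetical stand-in for `faults`:

```r
# Hypothetical long-format sample: three devices, three fault codes
demo <- data.frame(
  serialNumber = c(1L, 1L, 2L, 3L),
  faultCode = factor(c("fc1", "fc2", "fc2", "fc3"))
)

# table() counts each serialNumber/faultCode pair; unclass() drops the table class
counts <- unclass(table(demo$serialNumber, demo$faultCode))

# A count above zero means the device reported that fault at least once
demo_matrix <- counts > 0

print(demo_matrix)
```

Because absent pairs are counted as zero, this route never produces NA values, so comparing its result with the reshaped matrix is a quick way to catch mistakes in the NA replacement.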

Finally, we can turn this data into the transactions class:

> faults_transactions <- as(faults_matrix, "transactions")
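
The item frequencies can also be sanity-checked directly from the logical matrix in base R. The sketch below assumes relative frequency, meaning the share of transactions that contain each item, and uses a small hypothetical `demo_matrix` in place of `faults_matrix`:

```r
# Hypothetical logical matrix: four devices (rows) by three fault codes (columns)
demo_matrix <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  FALSE,
    TRUE,  TRUE,  TRUE,
    FALSE, TRUE,  FALSE),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("fc1", "fc2", "fc3"))
)

# colMeans() on a logical matrix gives the share of rows containing each item
item_freq <- sort(colMeans(demo_matrix), decreasing = TRUE)

print(item_freq)
```

Applying `head(sort(colMeans(faults_matrix), decreasing = TRUE), 10)` to the real matrix should give numbers you can compare against a relative frequency plot of the top 10 items.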

To confirm it all worked, create a plot of the top 10 item frequency:

> arules::itemFrequencyPlot(faults_transactions, topN = 10)

The output of the preceding code is as follows:

Success! Following the preceding process will get you from raw data to the appropriate structure. Next, we'll transition to an example using data from the arules package itself; the approach it demonstrates can be applied to any analysis you want.
