Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.
We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:
There are several techniques and tools used to clean data. We will examine the following approaches:
In addition, we will briefly examine several image enhancement techniques.
There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.
We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to provide the reader with insights into how it can be done. Sometimes this will use core Java string classes; at other times it may use specialized libraries.
These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.
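As a quick illustration of the core-string-class approach, the following sketch cleans a single hypothetical value using only methods of java.lang.String; the input value and the cleaning rules (trim, collapse whitespace, normalize case) are assumptions made for the example:

```java
public class StringCleaningDemo {
    // Clean a raw value: trim the ends, collapse internal whitespace runs,
    // and normalize the case. The rules are illustrative, not prescriptive.
    public static String clean(String raw) {
        return raw.trim()
                  .replaceAll("\\s+", " ")
                  .toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("  New   YORK \t"));   // prints "new york"
    }
}
```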
The basic text-based tasks include:
In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.
Data comes in all types of forms. We will examine the more commonly used formats and show how they can be extracted from various data sources. Before we can clean data it needs to be extracted from a data source such as a file. In this section, we will build upon the introduction to data formats found in Chapter 2, Data Acquisition, and show how to extract all or part of a dataset. For example, from an HTML page we may want to extract only the text without markup. Or perhaps we are only interested in its figures.
These data formats can be quite complex. The intent of this section is to illustrate the basic techniques commonly used with each data format. Full treatment of a specific data format is beyond the scope of this book. Specifically, we will introduce how CSV data, spreadsheets, PDF files, and JSON can be processed from Java.
There are many other file types not addressed here. For example, jsoup is useful for parsing HTML documents. Since we introduced how this is done in the Web scraping in Java section of Chapter 2, Data Acquisition, we will not duplicate the effort here.
A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to utilize this type of data in our analysis efforts. When we deal with CSV data, there are several issues, including escaped data and embedded commas.
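The embedded-comma problem is easy to demonstrate with a hypothetical row. The sketch below first shows how a naive split miscounts the fields, then uses a quote-aware regular expression as a partial workaround. Note that this regex handles only simple quoted fields; it does not cope with escaped quotes or embedded line breaks, which is one reason to prefer a dedicated CSV library:

```java
public class EmbeddedCommaDemo {
    public static void main(String[] args) {
        // A made-up CSV row whose second field contains an embedded comma.
        String row = "1001,\"Austin, TX\",78701";

        // A naive split breaks the quoted field into two tokens.
        String[] naive = row.split(",");
        System.out.println(naive.length);   // 4 tokens instead of 3

        // Split only on commas followed by an even number of quotes,
        // i.e., commas that are outside any quoted field.
        String[] better = row.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        System.out.println(better.length);  // 3 tokens
    }
}
```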
We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class, to read in tokens, and the String class's split method, to separate the data and store it in the array. Next, we will explore using the third-party library OpenCSV, which offers a more efficient technique.
However, the first approach may only be appropriate for quick and dirty processing of data. We will discuss each of these techniques since they are useful in different situations.
We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean, and the solutions shown next take into account the possibility of jagged arrays.
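Two behaviors of the split method are worth noting before we rely on it. Rows with different numbers of commas produce arrays of different lengths, hence the jagged array, and trailing empty fields are silently discarded unless a negative limit argument is supplied. A short demonstration using made-up rows:

```java
public class SplitGotchasDemo {
    public static void main(String[] args) {
        // Rows with different column counts produce a jagged structure.
        String[] shortRow = "12345,45".split(",");
        String[] fullRow  = "12345,45,89,65.55".split(",");
        System.out.println(shortRow.length + " vs " + fullRow.length);  // 2 vs 4

        // split discards trailing empty tokens by default...
        System.out.println("a,b,,".split(",").length);      // 2
        // ...unless a negative limit is passed.
        System.out.println("a,b,,".split(",", -1).length);  // 4
    }
}
```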
First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList since we will not always know how many rows our data contains.
try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    ArrayList<String> list = new ArrayList<String>();
    while (csvData.hasNext()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}
The list is converted to an array using the toArray method. This version of the method takes a String array as an argument so that the method knows what type of array to create. A two-dimensional array is then created to hold the CSV data.
String[] tempArray = list.toArray(new String[1]);
String[][] csvArray = new String[tempArray.length][];
The split method is used to create an array of Strings for each row. This array is assigned to a row of csvArray.
for (int i = 0; i < tempArray.length; i++) {
    csvArray[i] = tempArray[i].split(",");
}
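Putting these steps together, a minimal self-contained sketch of the Scanner/split approach might look like the following. It writes a small stand-in for Demographics.csv to a temporary file so the example can run anywhere; the column names and values are invented for the illustration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CsvToArrayDemo {
    // Read a CSV file into a (possibly jagged) two-dimensional array.
    public static String[][] readCsv(File file) throws IOException {
        List<String> list = new ArrayList<>();
        try (Scanner csvData = new Scanner(file)) {
            while (csvData.hasNextLine()) {
                list.add(csvData.nextLine());
            }
        }
        String[][] csvArray = new String[list.size()][];
        for (int i = 0; i < list.size(); i++) {
            csvArray[i] = list.get(i).split(",");
        }
        return csvArray;
    }

    public static void main(String[] args) throws IOException {
        // A tiny stand-in for Demographics.csv, written to a temporary file.
        Path tmp = Files.createTempFile("demo", ".csv");
        Files.write(tmp, List.of("ZIP,COUNT", "10001,21102", "10002,81410"));
        String[][] data = readCsv(tmp.toFile());
        System.out.println(data.length + " rows, " + data[0].length + " columns");
        Files.delete(tmp);
    }
}
```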
Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.
First, we need to create an instance of the CSVReader class. Notice that the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. If we want to read the entire file at one time, we use the readAll method.
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
List<String[]> holdData = dataReader.readAll();
We can then process the data as we did above; each element of the resulting list is already an array of Strings representing one row, so no further splitting is required. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually, but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
String[] nextLine;
while ((nextLine = dataReader.readNext()) != null) {
    for (String token : nextLine) {
        out.println(token);
    }
}
dataReader.close();
We can now clean or otherwise process the array.
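For instance, a first cleaning pass might simply trim stray whitespace from every token of the two-dimensional array. A minimal sketch, using invented sample values:

```java
public class ArrayCleaningDemo {
    // Trim surrounding whitespace from every token, in place.
    public static void trimAll(String[][] data) {
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < data[i].length; j++) {
                data[i][j] = data[i][j].trim();
            }
        }
    }

    public static void main(String[] args) {
        String[][] data = { {" 12345 ", " 45"}, {"23456 ", "78"} };
        trimAll(data);
        System.out.println(data[0][0]);  // prints "12345"
    }
}
```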
Spreadsheets have proven to be a very popular tool for processing numeric and textual data. Due to the wealth of information that has been stored in spreadsheets over the past decades, knowing how to extract information from spreadsheets enables us to take advantage of this widely available data source. In this section, we will demonstrate how this is accomplished using the Apache POI API.
OpenOffice also supports a spreadsheet application. OpenOffice documents are stored in XML format, which makes them readily accessible using XML parsing technologies. However, the Apache ODF Toolkit (http://incubator.apache.org/odftoolkit/) provides a means of accessing data within a document without knowing the format of the OpenOffice document. This is currently an incubator project and is not fully mature. There are a number of other APIs that can assist in processing OpenOffice documents, as detailed on the Open Document Format (ODF) for developers (http://www.opendocumentformat.org/developers/) page.
Apache POI (http://poi.apache.org/index.html) is a set of APIs providing access to many Microsoft products including Excel and Word. It consists of a series of components designed to access a specific Microsoft product. An overview of these components is found at http://poi.apache.org/overview.html.
In this section, we will demonstrate how to read a simple Excel spreadsheet using the XSSF component to access Excel 2007+ spreadsheets. The Javadocs for the Apache POI API are found at https://poi.apache.org/apidocs/index.html.
We will use a simple Excel spreadsheet consisting of a series of rows containing an ID along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet follows:
ID    | Minimum | Maximum | Average
12345 | 45      | 89      | 65.55
23456 | 78      | 96      | 86.75
34567 | 56      | 89      | 67.44
45678 | 86      | 99      | 95.67
We start with a try-with-resources block to handle any IOExceptions that may occur:
try (FileInputStream file = new FileInputStream(new File("Sample.xlsx"))) {
    ...
} catch (IOException e) {
    // Handle exceptions
}
An instance of the XSSFWorkbook class is created using the spreadsheet. Since a workbook may consist of multiple spreadsheets, we select the first one using the getSheetAt method.
XSSFWorkbook workbook = new XSSFWorkbook(file);
XSSFSheet sheet = workbook.getSheetAt(0);
The next step is to iterate through the rows, and then each column, of the spreadsheet:
for (Row row : sheet) {
    for (Cell cell : row) {
        ...
    }
    out.println();
}
Each cell of the spreadsheet may use a different format. We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell. In this example, we are only working with numeric and text data.
switch (cell.getCellType()) {
    case Cell.CELL_TYPE_NUMERIC:
        out.print(cell.getNumericCellValue() + " ");
        break;
    case Cell.CELL_TYPE_STRING:
        out.print(cell.getStringCellValue() + " ");
        break;
}
When executed we get the following output:
ID Minimum Maximum Average
12345.0 45.0 89.0 65.55
23456.0 78.0 96.0 86.75
34567.0 56.0 89.0 67.44
45678.0 86.0 99.0 95.67
POI supports other more sophisticated classes and methods to extract data.
There are several APIs supporting the extraction of text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source API that allows Java programmers to work with PDF documents. In this section, we will illustrate how to extract simple text from a PDF document. The Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.
This is a simple PDF file. It consists of several bullets:
• Line 1
• Line 2
• Line 3
This is the end of the document.
A try block is used to catch IOExceptions. The PDDocument class will represent the PDF document being processed. Its load method will load in the PDF file specified by the File object:
try {
    PDDocument document = PDDocument.load(new File("PDF File.pdf"));
    ...
} catch (IOException e) {
    // Handle exceptions
}
Once loaded, the PDFTextStripper class's getText method will extract the text from the file. The text is then displayed as shown here:
PDFTextStripper Tstripper = new PDFTextStripper();
String documentText = Tstripper.getText(document);
System.out.println(documentText);
The output of this example follows. Notice that the bullets are returned as question marks.
This is a simple PDF file. It consists of several bullets:
? Line 1
? Line 2
? Line 3
This is the end of the document.
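Since those question marks are artifacts of the extraction, a simple cleaning step can normalize them. The sketch below assumes that a leading "? " marks each bullet line and replaces it with a dash; the sample string stands in for the text returned by getText:

```java
public class BulletCleanupDemo {
    // Replace a leading "? " on each line (a bullet lost in extraction)
    // with a conventional dash bullet.
    public static String clean(String extracted) {
        return extracted.replaceAll("(?m)^\\? ", "- ");
    }

    public static void main(String[] args) {
        // Stand-in for text extracted by PDFTextStripper.
        String extracted = "? Line 1\n? Line 2\n? Line 3";
        System.out.println(clean(extracted));
    }
}
```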
This is a brief introduction to the use of PDFBox. It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.
In Chapter 2, Data Acquisition, we learned that certain YouTube searches return JSON-formatted results. Specifically, the SearchResult class holds information relating to a specific search. In that section we illustrated how to use YouTube-specific techniques to extract information. In this section, we will illustrate how to extract JSON information using the Jackson JSON implementation.
JSON libraries support three models for processing data: a streaming API, a tree model, and data binding to Java classes.
We will illustrate the first two approaches. The first approach is more efficient and is used when a large amount of data is processed. The second technique is convenient, but the data must not be too large. The third technique is useful when it is more convenient to use specific Java classes to process data. For example, if the JSON data represents an address, then a specific Java address class can be defined to hold and process the data.
There are several Java libraries that support JSON processing.
We will use the Jackson Project (https://github.com/FasterXML/jackson). Documentation is found at https://github.com/FasterXML/jackson-docs. We will use two JSON files to demonstrate how it can be used. The first file, Person.json, is shown next, where data for a single person is stored. It consists of four fields, where the last field is an array of location information.
{
    "firstname":"Smith",
    "lastname":"Peter",
    "phone":8475552222,
    "address":["100 Main Street","Corpus","Oklahoma"]
}
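Before turning to Jackson, it may help to see how little of this structure a plain regular expression can recover. The toy sketch below pulls out only the scalar name/value pairs and ignores the address array entirely; it is not a JSON parser and is shown only to motivate using a real one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonScalarDemo {
    // Extract only "name":value pairs whose value is a quoted string or a
    // number. Arrays, nested objects, escapes, etc. are deliberately ignored.
    public static List<String> scalarFields(String json) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern
                .compile("\"(\\w+)\"\\s*:\\s*(\"[^\"]*\"|\\d+)")
                .matcher(json);
        while (m.find()) {
            results.add(m.group(1) + " : " + m.group(2).replace("\"", ""));
        }
        return results;
    }

    public static void main(String[] args) {
        // The Person.json content as a string literal.
        String json = "{ \"firstname\":\"Smith\", \"lastname\":\"Peter\", "
                + "\"phone\":8475552222, "
                + "\"address\":[\"100 Main Street\",\"Corpus\",\"Oklahoma\"] }";
        scalarFields(json).forEach(System.out::println);
    }
}
```

Only the firstname, lastname, and phone fields are recovered; the address array is lost, which is exactly the kind of structure a real parser handles for us.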
The code sequence that follows shows how to extract the values for each of the fields. Within the try-catch block, a JsonFactory instance is created, which then creates a JsonParser instance based on the Person.json file.
try {
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(new File("Person.json"));
    ...
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}
The nextToken method returns a token and advances the parser to the next one; the JsonParser object keeps track of the current token. The getCurrentName method returns the field name for the current token. The while loop terminates when the last token of the object is reached.
while (parser.nextToken() != JsonToken.END_OBJECT) {
    String token = parser.getCurrentName();
    ...
}
The body of the loop consists of a series of if statements that process each field by its name. Since the address field is an array, another loop extracts each of its elements until the ending array token is reached.
if ("firstname".equals(token)) {
    parser.nextToken();
    String fname = parser.getText();
    out.println("firstname : " + fname);
}
if ("lastname".equals(token)) {
    parser.nextToken();
    String lname = parser.getText();
    out.println("lastname : " + lname);
}
if ("phone".equals(token)) {
    parser.nextToken();
    long phone = parser.getLongValue();
    out.println("phone : " + phone);
}
if ("address".equals(token)) {
    out.println("address :");
    parser.nextToken();
    while (parser.nextToken() != JsonToken.END_ARRAY) {
        out.println(parser.getText());
    }
}
The output of this example follows:
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
However, JSON objects are frequently more complex than the previous example. Here, a Persons.json file consists of an array of three persons:
{
    "persons": {
        "groupname": "school",
        "person": [
            {"firstname":"Smith", "lastname":"Peter", "phone":8475552222,
                "address":["100 Main Street","Corpus","Oklahoma"] },
            {"firstname":"King", "lastname":"Sarah", "phone":8475551111,
                "address":["200 Main Street","Corpus","Oklahoma"] },
            {"firstname":"Frost", "lastname":"Nathan", "phone":8475553333,
                "address":["300 Main Street","Corpus","Oklahoma"] }
        ]
    }
}
To process this file, we use a similar set of code as shown previously. We create the parser and then enter a loop as before:
try {
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(new File("Persons.json"));
    while (parser.nextToken() != JsonToken.END_OBJECT) {
        String token = parser.getCurrentName();
        ...
    }
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}
However, we need to find the persons field and then extract each of its elements. The groupname field is extracted and displayed as shown here:
if ("persons".equals(token)) {
    JsonToken jsonToken = parser.nextToken();
    jsonToken = parser.nextToken();
    token = parser.getCurrentName();
    if ("groupname".equals(token)) {
        parser.nextToken();
        String groupname = parser.getText();
        out.println("Group : " + groupname);
        ...
    }
}
Next, we find the person field and call a parsePerson method to better organize the code:
parser.nextToken();
token = parser.getCurrentName();
if ("person".equals(token)) {
    out.println("Found person");
    parsePerson(parser);
}
The parsePerson method follows. It is very similar to the process used in the first example.
public void parsePerson(JsonParser parser) throws IOException {
    while (parser.nextToken() != JsonToken.END_ARRAY) {
        String token = parser.getCurrentName();
        if ("firstname".equals(token)) {
            parser.nextToken();
            String fname = parser.getText();
            out.println("firstname : " + fname);
        }
        if ("lastname".equals(token)) {
            parser.nextToken();
            String lname = parser.getText();
            out.println("lastname : " + lname);
        }
        if ("phone".equals(token)) {
            parser.nextToken();
            long phone = parser.getLongValue();
            out.println("phone : " + phone);
        }
        if ("address".equals(token)) {
            out.println("address :");
            parser.nextToken();
            while (parser.nextToken() != JsonToken.END_ARRAY) {
                out.println(parser.getText());
            }
        }
    }
}
The output follows:
Group : school
Found person
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
firstname : King
lastname : Sarah
phone : 8475551111
address :
200 Main Street
Corpus
Oklahoma
firstname : Frost
lastname : Nathan
phone : 8475553333
address :
300 Main Street
Corpus
Oklahoma
The second approach is to use the tree model. An ObjectMapper instance is used to create a JsonNode instance from the Persons.json file. The fieldNames method returns an Iterator, allowing us to process each element of the file.
try {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode node = mapper.readTree(new File("Persons.json"));
    Iterator<String> fieldNames = node.fieldNames();
    while (fieldNames.hasNext()) {
        ...
        fieldNames.next();
    }
} catch (IOException ex) {
    // Handle exceptions
}
Since the JSON file contains a persons field, we will obtain a JsonNode instance representing the field and then iterate over each of its elements.
JsonNode personsNode = node.get("persons");
Iterator<JsonNode> elements = personsNode.iterator();
while (elements.hasNext()) {
    ...
}
Each element is processed one at a time. If the element type is a string, we assume that it is the groupname field.
JsonNode element = elements.next();
JsonNodeType nodeType = element.getNodeType();
if (nodeType == JsonNodeType.STRING) {
    out.println("Group: " + element.textValue());
}
If the element is an array, we assume it contains a series of persons, where each person is processed by the parsePerson method:
if (nodeType == JsonNodeType.ARRAY) {
    Iterator<JsonNode> fields = element.iterator();
    while (fields.hasNext()) {
        parsePerson(fields.next());
    }
}
The parsePerson method is shown next:
public void parsePerson(JsonNode node) {
    Iterator<JsonNode> fields = node.iterator();
    while (fields.hasNext()) {
        JsonNode subNode = fields.next();
        out.println(subNode.asText());
    }
}
The output follows:
Group: school
Smith
Peter
8475552222
King
Sarah
8475551111
Frost
Nathan
8475553333
There is much more to JSON than we are able to illustrate here. However, this should give you an idea of how this type of data can be handled.