Strings are central to text processing, so using a good string library is important. Unfortunately, the java.lang.String
class has some limitations. To address these limitations, you can either implement your own special string functions as needed or you can use a third-party library.
Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a simple code sequence to implement some functionality, but to do things right, you will need to test it. Third-party libraries have already been tested and have been used on hundreds of projects. They provide a more efficient way of processing text.
There are several text processing APIs in addition to those found in Java. We will demonstrate several of these:
Java provides considerable support for cleaning text data, including methods in the String
class. These methods are ideal for simple text cleaning and small amounts of data, but can also be efficient with larger, complex datasets. We will demonstrate several String
class methods in a moment. Some of the most helpful String
class methods are summarized in the following table:
Method Name | Return Type | Description
trim | String | Removes leading and trailing blank spaces
toUpperCase, toLowerCase | String | Changes the casing of the entire string
replaceAll | String | Replaces all occurrences of a character sequence within the string
contains | boolean | Determines whether a given character sequence exists within the string
compareTo | int | Compares two strings lexicographically and returns an integer representing their relationship
matches | boolean | Determines whether the string matches a given regular expression
join | String | Combines two or more strings with a specified delimiter
split | String[] | Separates elements of a given string using a specified delimiter
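The following sketch exercises several of the methods summarized above (the class and helper names are illustrative, not part of the standard API):

```java
// Demonstrates several java.lang.String methods from the preceding table.
public class StringMethodsDemo {

    // Normalizes a string: trims whitespace and lowercases it.
    static String normalize(String s) {
        return s.trim().toLowerCase();
    }

    public static void main(String[] args) {
        String text = "  Hello, World!  ";
        System.out.println(normalize(text));                  // hello, world!
        System.out.println(text.contains("World"));           // true
        System.out.println("apple".compareTo("banana") < 0);  // true
        System.out.println("abc123".matches("\\w+"));         // true
        System.out.println(String.join("-", "a", "b", "c"));  // a-b-c
        String[] parts = "one,two,three".split(",");
        System.out.println(parts.length);                     // 3
    }
}
```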
Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.
A regular expression is simply a string itself. For example, the string Hello, my name is Sally
can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \w
will match any text that starts with Hello, my name is
and ends with a word character.
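The distinction between matching an entire string and finding a pattern within it can be sketched with java.util.regex (the class and helper names here are our own). Note that the String class's matches method requires the whole string to match, while Matcher's find method looks for the pattern anywhere in the text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Shows full-string matching versus finding a pattern within text.
public class RegexIntroDemo {

    // Returns true if the pattern occurs anywhere in the text.
    static boolean found(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find();
    }

    public static void main(String[] args) {
        String text = "Hello, my name is Sally";
        // \w in the expression must be double-escaped as \\w in a Java string.
        System.out.println(found("Hello, my name is \\w", text));   // true
        // matches() requires the entire string to match the expression.
        System.out.println(text.matches("Hello, my name is \\w+")); // true
        System.out.println(text.matches("Hello, my name is \\w"));  // false
    }
}
```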
We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note each must be double-escaped when used in a Java application.
Option | Description
\d | Any digit: 0-9
\D | Any non-digit
\s | Any whitespace character
\S | Any non-whitespace character
\w | Any word character (including digits): A-Z, a-z, 0-9, and underscore
\W | Any non-word character
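A quick sketch exercising each option from the table, with the required double escaping (the class and helper names are illustrative):

```java
import java.util.regex.Pattern;

// Exercises the character-class options from the preceding table.
// Each option is double-escaped when written as a Java string literal.
public class RegexClassesDemo {

    // Returns true if the pattern occurs anywhere in the text.
    static boolean find(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(find("\\d", "abc123")); // true  - contains a digit
        System.out.println(find("\\D", "12345"));  // false - digits only
        System.out.println(find("\\s", "a b"));    // true  - contains whitespace
        System.out.println(find("\\S", "   "));    // false - whitespace only
        System.out.println(find("\\w", "!!a!!"));  // true  - contains a word character
        System.out.println(find("\\W", "abc123")); // false - word characters only
    }
}
```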
The size and source of text data vary widely from application to application, but the methods used to transform the data remain the same. You may actually need to read data from a file but, for simplicity's sake, we will be using a string containing the opening sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will be assumed to be as shown next:
String dirtyText = "Call me Ishmael. Some years ago- never mind how";
dirtyText += " long precisely - having little or no money in my purse,";
dirtyText += " and nothing particular to interest me on shore, I thought";
dirtyText += " I would sail about a little and see the watery part of the world.";
Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.
StringTokenizer
was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String
class's split
method is considered more efficient. While it does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer
class that splits a string on spaces:
StringTokenizer tokenizer = new StringTokenizer(dirtyText, " ");
while (tokenizer.hasMoreTokens()) {
    out.print(tokenizer.nextToken() + " ");
}
When we set the dirtyText
variable to hold our text from Moby Dick, shown previously, we get the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely...
StreamTokenizer
is another core Java tokenizer. StreamTokenizer
grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer
or the split
method. The String
class split
method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings and you can only specify one delimiter for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
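Because the split delimiter is itself a regular expression, a single delimiter expression can still match runs of several characters. A minimal sketch (the class and helper names are our own):

```java
import java.util.Arrays;

// Demonstrates String.split with a regular-expression delimiter.
public class SplitDemo {

    // Splits on one or more non-word characters; the one delimiter
    // expression applies to the entire string.
    static String[] splitWords(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael.";
        System.out.println(Arrays.toString(splitWords(text)));
    }
}
```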
The Scanner
class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
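A minimal sketch of using Scanner to parse tokens of different data types from a string (the sumInts helper is our own):

```java
import java.util.Scanner;

// Uses Scanner to parse tokens of different types from a string.
public class ScannerParseDemo {

    // Sums every integer token found in the input, skipping other tokens.
    static int sumInts(String data) {
        int sum = 0;
        try (Scanner scanner = new Scanner(data)) {
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    sum += scanner.nextInt(); // parse the token as an int
                } else {
                    scanner.next();           // skip non-numeric tokens
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("12 apples and 30 oranges")); // 42
    }
}
```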
Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer
. This class provides more advanced support than the standard StringTokenizer
class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer
:
StrTokenizer tokenizer = new StrTokenizer(text);
while (tokenizer.hasNext()) {
    out.print(tokenizer.next() + " ");
}
This operates in a similar fashion to StringTokenizer
and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text,",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse and nothing particular to interest me on shore I thought I would sail about a little and see the watery part of the world.
Notice how the text is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher
object.
Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner
class and the Splitter
class. Tokenization is accomplished in Guava using its Splitter
class's split
method. The following is a simple example:
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults();
Iterable<String> words = simpleSplit.split(dirtyText);
for (String token : words) {
    out.print(token);
}
This splits the text on commas and produces output like our last example. We can modify the parameter of the on
method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
LingPipe is a linguistic toolkit for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory
interface. We implement a LingPipe IndoEuropeanTokenizerFactory
tokenizer in the Simple text cleaning section.
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, missing information, or contain extraneous data. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.
We will use the string shown before from Moby Dick to demonstrate some of the basic String
class methods. Notice the use of the toLowerCase
and trim
methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll
method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
while (dirtyText.contains("  ")) {
    dirtyText = dirtyText.replaceAll("  ", " ");
}
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely
Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String
array. The split
method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \W
, which represents anything that is not a word character:
out.println(dirtyText);
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", "");
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+");
for (String clean : cleanText) {
    out.print(clean + " ");
}
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join
method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join
method joins every word in the array words
and inserts a space between each word:
out.println(dirtyText);
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
String cleanText = String.join(" ", words);
out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText);
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
String cleanText = Joiner.on(" ").skipNulls().join(words);
out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner
object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList
using the Arrays
class's asList
method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String
class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while (readStop.hasNextLine()) {
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if (words.contains(stopWord)) {
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory
class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer
method uses a char
array, so we call toCharArray
against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:
text = text.toLowerCase().trim();
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE;
fact = new EnglishStopTokenizerFactory(fact);
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length());
for (String word : tok) {
    out.print(word + " ");
}
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if (dirtyText.contains(toFind)) {
    String[] words = dirtyText.split(" ");
    for (String word : words) {
        if (word.equals(toFind)) {
            count++;
        }
    }
    out.println("Found " + toFind + " " + count + " times in the text.");
}
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
Scanner textLine = new Scanner(dirtyText);
out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
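When both the count and the positions of matches matter, a Pattern and Matcher loop is a core-Java alternative. The following is a sketch (the CountOccurrences class and the word-boundary anchors are our own additions):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts whole-word occurrences and reports where each match starts.
public class CountOccurrences {

    static int count(String text, String word) {
        // \b anchors restrict matches to whole words; Pattern.quote
        // treats the search word as a literal.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
            System.out.println("Found at index " + matcher.start());
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "i thought i would sail";
        System.out.println("Found " + count(text, "i") + " times");
    }
}
```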
It can also be more efficient to search an entire file using a BufferedReader
. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader
object from our path and process our file as long as the next line is not empty:
String path = "C://MobyDick.txt";
try {
    String textLine = "";
    int line = 0;
    toFind = toFind.toLowerCase().trim();
    BufferedReader textToClean = new BufferedReader(new FileReader(path));
    while ((textLine = textToClean.readLine()) != null) {
        line++;
        if (textLine.toLowerCase().trim().contains(toFind)) {
            out.println("Found " + toFind + " in " + textLine);
        }
    }
    textToClean.close();
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains
method. If we find the text, we call the replaceAll
method to modify our string:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if (text.contains(toFind)) {
    text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
    out.println(text);
}
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text);
out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
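One core-Java alternative is a word-boundary regular expression, which confines the match to standalone occurrences regardless of surrounding spaces or punctuation. A minimal sketch (the class and helper names are our own):

```java
// Replaces only whole-word occurrences using \b word-boundary anchors,
// so "me" inside "Some" is left alone.
public class WholeWordReplace {

    static String replaceWord(String text, String word, String replacement) {
        return text.replaceAll("\\b" + word + "\\b", replacement);
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago";
        System.out.println(replaceWord(text, "me", "X"));
        // Call X Ishmael. Some years ago
    }
}
```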
The StringUtils
class also provides a replacePattern
method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:
out.println(text);
text = StringUtils.replacePattern(text, "\\W\\s", " ");
out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely -
Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher
class. CharMatcher
not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace
method to simply replace all instances of the word me
with a single space. This will produce series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom
method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call    Ishmael. So   years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before the data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended; in other situations, we may be completely lost as to what is being conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every approach in this chapter. For example, however, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
double sum = 0;
for (double d : tempList) {
    sum += d;
}
out.printf("The average temperature is %1$,.2f", sum / 12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0;
tempList[0] = 0;
for (double d : tempList) {
    sum += d;
}
out.printf("The average temperature is %1$,.2f", sum / 12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
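The alternative just described can be sketched as follows. This is a minimal illustration assuming, as in the temperature example, that zero marks a missing reading; the class and helper names are our own:

```java
// Imputes missing (zero) readings with the mean of the known readings.
// Treating zero as "missing" is an assumption that only holds for data
// where zero is never a valid value.
public class MeanImputation {

    static double[] impute(double[] values) {
        double sum = 0;
        int known = 0;
        for (double v : values) {
            if (v != 0) {      // zero marks a missing reading here
                sum += v;
                known++;
            }
        }
        double mean = sum / known;
        double[] result = values.clone();
        for (int i = 0; i < result.length; i++) {
            if (result[i] == 0) {
                result[i] = mean; // substitute the representative value
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] temps = {0, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] fixed = impute(temps);
        System.out.printf("Imputed first month: %.2f%n", fixed[0]);
    }
}
```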
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = "";
String[] nameList = {"Amy", "Bob", "Sally", "Sue", "Don", "Rick", null, "Betsy"};
Optional<String> tempName;
for (String name : nameList) {
    tempName = Optional.ofNullable(name);
    useName = tempName.orElse("DEFAULT");
    out.println("Name to use = " + useName);
}
We first created a variable called useName
to hold the name we will actually print out. We also created an instance of the Optional
class called tempName
. We will use this to test whether a value in the array is null or not. We then loop through our array and create and call the Optional
class ofNullable
method. This method tests whether a particular value is null or not. On the next line, we call the orElse
method to either assign a value from the array to useName
or, if the element is null, assign DEFAULT
. Our output follows:
Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
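A short sketch of a few of those other methods, map, orElseGet, and isPresent (the class and helper names are our own):

```java
import java.util.Optional;

// Sketch of additional Optional methods useful with possibly-null data.
public class OptionalDemo {

    static String describe(String name) {
        Optional<String> maybe = Optional.ofNullable(name);
        // map transforms the value only when one is present;
        // orElseGet supplies a lazily computed default otherwise.
        return maybe.map(String::toUpperCase)
                    .orElseGet(() -> "DEFAULT");
    }

    public static void main(String[] args) {
        System.out.println(describe("Amy"));                // AMY
        System.out.println(describe(null));                 // DEFAULT
        System.out.println(Optional.of("Bob").isPresent()); // true
    }
}
```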
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(Arrays.asList(nums));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12
and 46
:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
out.println("SubSet of List: " + partNumsList.toString());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123]
SubSet of List: [12, 14, 34, 44]
Another option is to use the stream
method in conjunction with the skip
method. The stream
method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList
as in the previous example, but this time we will specify how many elements to skip with the skip
method. We will also use the collect
method to create a new Set
to hold the new elements:
out.println("Original List: " + numsList.toString());
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
Set<Integer> partNumsList = fullNumsList
    .stream()
    .skip(5)
    .collect(toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C://text.txt"))) {
    br.lines()
      .filter(s -> !s.equals(""))
      .forEach(s -> out.println(s));
} catch (IOException ex) {
    // Handle exceptions
}
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator
variable compareInts
. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare
method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) ->
    Integer.compare(first, second);
We can now call the sort
method as we did previously:
Collections.sort(numsList, compareInts);
out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) ->
    first.compareTo(second);
Collections.sort(wordsList, compareWords);
out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections
class to perform basic sorting on String
and integer data. For this example, wordsList
and numsList
are both ArrayList
and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo")
    .collect(Collectors.toList());
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44)
    .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString());
Collections.sort(wordsList);
out.println("Ascending Word List: " + wordsList.toString());
out.println("Original Integer List: " + numsList.toString());
Collections.sort(numsList);
out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString());
Collections.reverse(numsList);
out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference, which can be used anywhere a lambda expression is expected:
out.println("Original Integer List: " + numsList.toString());
Comparator<Integer> basicOrder = Integer::compare;
Comparator<Integer> descendOrder = basicOrder.reversed();
Collections.sort(numsList, descendOrder);
out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
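For a simple descending sort, the same result can also be obtained without writing a comparator by hand. This sketch is an illustrative addition using only standard library calls; it relies on Collections.reverseOrder and Comparator.reverseOrder, each of which returns a comparator imposing the reverse of natural ordering:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DescendingSortExample {
    public static void main(String[] args) {
        List<Integer> numsList = new ArrayList<>(
                Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));

        // Collections.reverseOrder() supplies a ready-made descending
        // comparator, so no explicit lambda is required
        Collections.sort(numsList, Collections.reverseOrder());
        System.out.println("Descending Integer List: " + numsList);

        // The same result using the List.sort default method (Java 8+)
        // together with Comparator.reverseOrder()
        numsList.sort(Comparator.reverseOrder());
        System.out.println("Descending Integer List: " + numsList);
    }
}
```

Both calls print the list as [123, 87, 52, 46, 44, 34, 14, 12].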
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dog:
ArrayList<Dog> dogs = new ArrayList<Dog>();
dogs.add(new Dog("Zoey", 8));
dogs.add(new Dog("Roxie", 10));
dogs.add(new Dog("Kylie", 7));
dogs.add(new Dog("Shorty", 14));
dogs.add(new Dog("Ginger", 7));
dogs.add(new Dog("Penny", 7));
out.println("Name " + " Age");
for (Dog d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
Our output should resemble:
Name  Age
Zoey  8
Roxie  10
Kylie  7
Shorty  14
Ginger  7
Penny  7
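The Dog class itself is not shown in the text. The following is a minimal sketch of what it might look like, consistent with the constructor and accessor calls used in these examples; the exact details are assumptions:

```java
// Minimal sketch of the Dog class assumed by the sorting examples;
// the original source is not shown, so this layout is illustrative
public class Dog {
    private final String name;
    private final int age;

    public Dog(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }
}
```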
Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dog::getName).thenComparing(Dog::getAge));
out.println("Name " + " Age");
for (Dog d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
Our output follows:
Name  Age
Ginger  7
Kylie  7
Penny  7
Roxie  10
Shorty  14
Zoey  8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dog::getAge).thenComparing(Dog::getName));
out.println("Name " + " Age");
for (Dog d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
And our output is:
Name  Age
Ginger  7
Kylie  7
Penny  7
Zoey  8
Roxie  10
Shorty  14
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating-point data. In the next example, we will demonstrate how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate) {
    try {
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    } catch (NumberFormatException e) {
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234");
validateInt("Ishmael");
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
The Apache Commons library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text) {
    IntegerValidator intValidator = IntegerValidator.getInstance();
    if (intValidator.isValid(text)) {
        return text + " is a valid integer";
    } else {
        return text + " is not a valid integer";
    }
}
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare a number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to verify only that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat) {
    try {
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        Date test = format.parse(theDate);
        if (format.format(test).equals(theDate)) {
            return theDate + " is a valid date";
        } else {
            return theDate + " is not a valid date";
        }
    } catch (ParseException e) {
        return theDate + " is not a valid date";
    }
}
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy";
out.println(validateDate("12/12/1982", dateFormat));
out.println(validateDate("12/12/82", dateFormat));
out.println(validateDate("Ishmael", dateFormat));
The output follows:
12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
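If we want to be less restrictive, one approach (a sketch of our own, not part of the validateDate method above) is to try a list of acceptable formats and accept the date if any one of them matches exactly. The isValidDate helper name and the format list are illustrative assumptions; setLenient(false) keeps SimpleDateFormat from silently accepting malformed field values:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class MultiFormatDateValidator {
    // Hypothetical helper: returns true if the text parses cleanly
    // under any of the candidate formats
    public static boolean isValidDate(String theDate, String... formats) {
        for (String fmt : formats) {
            SimpleDateFormat format = new SimpleDateFormat(fmt);
            format.setLenient(false); // reject values like month 13
            try {
                Date parsed = format.parse(theDate);
                // round-trip check guards against partial matches
                if (format.format(parsed).equals(theDate)) {
                    return true;
                }
            } catch (ParseException e) {
                // this format did not match; try the next one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("12/12/1982", "MM/dd/yyyy", "MM/dd/yy")); // true
        System.out.println(isValidDate("12/12/82", "MM/dd/yyyy", "MM/dd/yy"));   // true
        System.out.println(isValidDate("Ishmael", "MM/dd/yyyy", "MM/dd/yy"));    // false
    }
}
```

With this version, the two-digit date that our stricter method rejected is now accepted, while nonsense input is still refused.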
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be a valid e-mail address:
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
            + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
            + "|(([a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,}))$";
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if (matcher.matches()) {
        return email + " is a valid email address";
    } else {
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("[email protected]"));
out.println(validateEmail("[email protected]"));
out.println(validateEmail("myEmail"));
The output follows:
[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address
The JavaMail API also provides support for validating e-mail addresses. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address:
public static String validateEmailStandard(String email) {
    try {
        InternetAddress testEmail = new InternetAddress(email);
        testEmail.validate();
        return email + " is a valid email address";
    } catch (AddressException e) {
        return email + " is not a valid email address";
    }
}
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
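When local addresses should be rejected, one simple workaround (a hypothetical sketch of our own, not part of the InternetAddress API) is to add an extra check that the portion after the @ symbol contains a dot-separated domain:

```java
public class DomainCheck {
    // Hypothetical helper: rejects addresses whose domain part has no
    // interior dot, which filters out local addresses such as myEmail@mail
    public static boolean hasDottedDomain(String email) {
        int at = email.lastIndexOf('@');
        if (at < 1 || at == email.length() - 1) {
            return false; // no @, empty local part, or empty domain
        }
        String domain = email.substring(at + 1);
        // require a dot that is neither the first nor the last character
        int dot = domain.indexOf('.');
        return dot > 0 && dot < domain.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(hasDottedDomain("myEmail@mail"));     // false
        System.out.println(hasDottedDomain("myEmail@mail.com")); // true
    }
}
```

Such a check would be applied after the validate call succeeds, as a second filter.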
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email) {
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if (eValidator.isValid(email)) {
        return email + " is a valid email address.";
    } else {
        return email + " is not a valid email address.";
    }
}
Postal codes are generally formatted according to country-specific or local requirements. For this reason, regular expressions are useful because they can accommodate whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip) {
    String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
    Pattern pattern = Pattern.compile(zipRegex);
    Matcher matcher = pattern.matcher(zip);
    if (matcher.matches()) {
        out.println(zip + " is a valid zip code");
    } else {
        out.println(zip + " is not a valid zip code");
    }
}
We make the following method calls to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \p{L} provides this flexibility. We also allow spaces, periods, commas, apostrophes, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name) {
    String nameRegex = "^[\\p{L}\\s.,'-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if (matcher.matches()) {
        out.println(name + " is a valid name");
    } else {
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr.");
validateName("Bobby Smith the 4th");
validateName("Albrecht Müller");
validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
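As mentioned above, performing string cleaning before matching can simplify the regular expression needed. The following sketch (the NameCleaner class and isValidName helper are illustrative assumptions, not from the text) trims the input and collapses repeated whitespace before applying the same kind of Unicode-aware pattern:

```java
import java.util.regex.Pattern;

public class NameCleaner {
    // Same style of pattern as in the text: Unicode letters plus
    // spaces, periods, commas, apostrophes, and hyphens
    private static final Pattern NAME_PATTERN =
            Pattern.compile("^[\\p{L}\\s.,'-]+$");

    // Hypothetical helper: normalize whitespace, then validate
    public static boolean isValidName(String name) {
        String cleaned = name.trim().replaceAll("\\s+", " ");
        return NAME_PATTERN.matcher(cleaned).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("  Bobby   Smith, Jr.  ")); // true
        System.out.println(isValidName("Bobby Smith the 4th"));    // false
    }
}
```

Because the cleaning step removes stray leading, trailing, and repeated spaces, the pattern itself does not need to account for them.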