Chapter 9. Files and Regular Expressions

In this chapter, you will learn how to carry out common file processing tasks, such as reading all lines or words from a file or reading a file containing numbers.

Chapter highlights:

• Source.fromFile(...).getLines.toArray yields all lines of a file.

• Source.fromFile(...).mkString yields the file contents as a string.

• To convert a string into a number, use the toInt or toDouble method.

• Use the Java PrintWriter to write text files.

• "regex".r is a Regex object.

• Use """...""" if your regular expression contains backslashes or quotes.

• If a regex pattern has groups, you can extract their contents using the syntax for (regex(var1, ...,varn) <- string).

9.1 Reading Lines

To read all lines from a file, call the getLines method on a scala.io.Source object:

import scala.io.Source
val source = Source.fromFile("myfile.txt", "UTF-8")
  // The first argument can be a string or a java.io.File
  // You can omit the encoding if you know that the file uses
  // the default platform encoding
val lineIterator = source.getLines

The result is an iterator (see Chapter 13). You can use it to process the lines one at a time:

for (l <- lineIterator) process(l) // process stands for your own function

Or you can put the lines into an array or array buffer by applying the toArray or toBuffer method to the iterator:

val lines = source.getLines.toArray

Sometimes, you just want to read an entire file into a string. That’s even simpler:

val contents = source.mkString


Caution

Call close when you are done using the Source object.
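One way to make sure that close is always called is a try/finally block. The following is a minimal sketch, not the only way to do it. It assumes a throwaway temporary file that is created just for the demonstration; the file name prefix and the contents are made up.

```scala
import java.nio.file.Files
import scala.io.Source

// Create a small temporary file so that the example is self-contained
val path = Files.createTempFile("lines-demo", ".txt")
Files.write(path, "first\nsecond\nthird\n".getBytes("UTF-8"))

val source = Source.fromFile(path.toFile, "UTF-8")
val lines =
  try source.getLines().toArray // Read everything before the source is closed
  finally source.close()        // Runs even if reading throws an exception

Files.delete(path)
```

Note that the lines are copied into an array inside the try block. An iterator obtained from getLines is of no use after the source has been closed. In Scala 2.13 and later, scala.util.Using offers the same guarantee more compactly.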


9.2 Reading Characters

To read individual characters from a file, you can use a Source object directly as an iterator since the Source class extends Iterator[Char]:

for (c <- source) process(c) // process stands for your own function

If you want to be able to peek at a character without consuming it (like istream::peek in C++ or a PushbackReader in Java), call the buffered method on the source object. Then you can peek at the next input character with the head method without consuming it.

val source = Source.fromFile("myfile.txt", "UTF-8")
val iter = source.buffered
while (iter.hasNext) {
  if (iter.head is nice)
    process iter.next
  else
    ...
}
source.close()

Alternatively, if your file isn’t large, you can just read it into a string and process that:

val contents = source.mkString
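To make the peeking technique concrete, here is a small sketch that collects runs of digits from a character stream. It uses Source.fromString in place of a file, and the input text and the names numbers and sb are chosen only for illustration.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Source.fromString stands in for a file here
val source = Source.fromString("abc123def45")
val iter = source.buffered
val numbers = ArrayBuffer[String]()
while (iter.hasNext) {
  if (iter.head.isDigit) {
    // Peeking lets us stop at the first non-digit without consuming it
    val sb = new StringBuilder
    while (iter.hasNext && iter.head.isDigit) sb += iter.next()
    numbers += sb.toString
  } else {
    iter.next() // Skip a character that is not part of a number
  }
}
source.close()
```

Without head, the inner loop would have to consume the character that ends a number in order to discover that the number has ended.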

9.3 Reading Tokens and Numbers

Here is a quick-and-dirty way of reading all whitespace-separated tokens in a source:

val tokens = source.mkString.split("\\s+")

To convert a string into a number, use the toInt or toDouble method. For example, if you have a file containing floating-point numbers, you can read them all into an array by

val numbers = for (w <- tokens) yield w.toDouble

or

val numbers = tokens.map(_.toDouble)
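Here is the same approach as a self-contained sketch, with Source.fromString standing in for a file and made-up sample input. One detail is easy to miss: if the text starts with whitespace, split yields an empty first token, and toDouble rejects it. Trimming the text first is one way to avoid that.

```scala
import scala.io.Source

// The sample input starts with spaces and mixes spaces, tabs, and newlines
val source = Source.fromString("  3.5 1.25\n-2\t10\n")
val text = source.mkString
source.close()

val tokens = text.trim.split("\\s+") // trim avoids an empty first token
val numbers = tokens.map(_.toDouble)
```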


Tip

Remember—you can always use the java.util.Scanner class to process a file that contains a mixture of text and numbers.
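As a rough sketch of that tip, the following code reads name, quantity, and price triples with a Scanner. The input string and the record layout are invented for this example. The locale is set explicitly because a Scanner parses decimal separators according to its locale.

```scala
import java.util.{Locale, Scanner}
import scala.collection.mutable.ArrayBuffer

// A Scanner can read from a string, which keeps the example self-contained
val in = new Scanner("widgets 12 4.5 gadgets 3 19.99").useLocale(Locale.US)
val items = ArrayBuffer[(String, Int, Double)]()
while (in.hasNext) {
  val name = in.next()         // A whitespace-delimited word
  val quantity = in.nextInt()  // Throws InputMismatchException if not an integer
  val price = in.nextDouble()
  items += ((name, quantity, price))
}
in.close()
```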


Finally, note that you can read numbers from scala.io.StdIn:

print("How old are you? ")
val age = scala.io.StdIn.readInt()
  // Or use readDouble or readLong


Caution

These methods assume that the next input line contains a single number, without leading or trailing whitespace. Otherwise, a NumberFormatException occurs.


9.4 Reading from URLs and Other Sources

The Source object has methods to read from sources other than files:

val source1 = Source.fromURL("http://horstmann.com", "UTF-8")
val source2 = Source.fromString("Hello, World!")
  // Reads from the given string—useful for debugging
val source3 = Source.stdin
  // Reads from standard input


Caution

When you read from a URL, you need to know the character set in advance, perhaps from an HTTP header. See www.w3.org/International/O-charset for more information.


9.5 Reading Binary Files

Scala has no provision for reading binary files. You’ll need to use the Java library. Here is how you can read a file into a byte array:

import java.io.{File, FileInputStream}
val file = new File(filename)
val in = new FileInputStream(file)
val bytes = new Array[Byte](file.length.toInt)
in.read(bytes)
in.close()
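A single call to read is not guaranteed to fill the entire array, so for anything beyond a quick experiment you may prefer Files.readAllBytes from java.nio.file, which reads the whole file in one step. This is an alternative to the stream-based code, shown here as a sketch with a temporary file whose name and contents are made up.

```scala
import java.nio.file.Files

// Write four bytes to a temporary file, then read them back
val path = Files.createTempFile("bytes-demo", ".bin")
Files.write(path, Array[Byte](1, 2, 3, -1))

val bytes = Files.readAllBytes(path) // Opens, reads, and closes the file

Files.delete(path)
```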

9.6 Writing Text Files

Scala has no built-in support for writing files. To write a text file, use a java.io.PrintWriter, for example:

import java.io.PrintWriter
val out = new PrintWriter("numbers.txt")
for (i <- 1 to 100) out.println(i)
out.close()

Everything works as expected, except for the printf method. When you pass a number to printf, the compiler will complain that you need to convert it to an AnyRef:

out.printf("%6d %10.2f",
  quantity.asInstanceOf[AnyRef], price.asInstanceOf[AnyRef]) // Ugh

Instead, use the f interpolator:

out.print(f"$quantity%6d $price%10.2f")
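The following sketch puts the pieces together: it writes two formatted lines to a temporary file and reads them back. The file name prefix and the sample values are made up. The f interpolator formats numbers with the default locale, so the decimal separator in the output may be a comma on some systems.

```scala
import java.io.PrintWriter
import java.nio.file.Files
import scala.io.Source

val path = Files.createTempFile("prices-demo", ".txt")
val out = new PrintWriter(path.toFile, "UTF-8")
try {
  for ((quantity, price) <- Seq((3, 9.5), (12, 100.0)))
    out.println(f"$quantity%6d $price%10.2f") // Right-aligned columns
} finally out.close() // Closing also flushes the buffered output

// Read the file back to see what was written
val source = Source.fromFile(path.toFile, "UTF-8")
val written = try source.getLines().toList finally source.close()
Files.delete(path)
```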

9.7 Visiting Directories

There are no “official” Scala classes for visiting all files in a directory, or for recursively traversing directories.

The simplest approach is to use the Files.list and Files.walk methods of the java.nio.file package. The list method only visits the children of a directory, and the walk method visits all descendants. These methods yield Java streams of Path objects. You can visit them as follows:

import java.nio.file._
val dirname = "/home/cay/scala-impatient/code"
val entries = Files.walk(Paths.get(dirname)) // or Files.list
try {
  entries.forEach(p => println(p)) // Or process the path p in some other way
} finally {
  entries.close()
}
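If you would rather not pass a Scala function to a Java stream, you can ask the stream for its iterator. The sketch below walks a tiny temporary directory tree that it builds itself; the directory and file names are arbitrary.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

// Build a tiny temporary directory tree: root/sub/a.txt
val root = Files.createTempDirectory("walk-demo")
val sub = Files.createDirectory(root.resolve("sub"))
val file = Files.createFile(sub.resolve("a.txt"))

val visited = ArrayBuffer[Path]()
val entries = Files.walk(root)
try {
  val it = entries.iterator() // A java.util.Iterator[Path]
  while (it.hasNext) visited += it.next()
} finally {
  entries.close() // The stream holds open directory handles
}

// Clean up, deleting the deepest paths first
visited.reverse.foreach(p => Files.delete(p))
```

The walk starts with the root directory itself and visits each directory before its contents, which is why deleting in reverse order works.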

9.8 Serialization

In Java, serialization is used to transmit objects to other virtual machines or for short-term storage. (For long-term storage, serialization can be awkward—it is tedious to deal with different object versions as classes evolve over time.)

Here is how you declare a serializable class in Java and Scala.

Java:

public class Person implements java.io.Serializable {
  private static final long serialVersionUID = 42L;
  ...
}

Scala:

@SerialVersionUID(42L) class Person extends Serializable

The Serializable trait is defined in the scala package and does not require an import.


Note

You can omit the @SerialVersionUID annotation if you are OK with the default ID.


Serialize and deserialize objects in the usual way:

val fred = new Person(...)
import java.io._
val out = new ObjectOutputStream(new FileOutputStream("/tmp/test.obj"))
out.writeObject(fred)
out.close()
val in = new ObjectInputStream(new FileInputStream("/tmp/test.obj"))
val savedFred = in.readObject().asInstanceOf[Person]

The Scala collections are serializable, so you can have them as members of your serializable classes:

import scala.collection.mutable.ArrayBuffer
class Person extends Serializable {
  private val friends = new ArrayBuffer[Person] // OK—ArrayBuffer is serializable
  ...
}
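To see that a collection survives the round trip, you can serialize it to a byte array in memory instead of a file. This sketch uses an ArrayBuffer[String] with invented contents so that no custom class is required.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

val friends = ArrayBuffer("Fred", "Wilma", "Barney")

// Serialize into an in-memory buffer
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(friends)
out.close()

// Deserialize a copy from the bytes
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val restored = in.readObject().asInstanceOf[ArrayBuffer[String]]
in.close()
```

The restored buffer is a separate object with the same elements as the original.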

9.9 Process Control

Traditionally, programmers use shell scripts to carry out mundane processing tasks, such as moving files from one place to another, or combining a set of files. The shell language makes it easy to specify subsets of files and to pipe the output of one program into the input of another. However, as programming languages, most shell languages leave much to be desired.

Scala was designed to scale from humble scripting tasks to massive programs. The scala.sys.process package provides utilities to interact with shell programs. You can write your shell scripts in Scala, with all the power that the Scala language puts at your disposal.

Here is a simple example:

import scala.sys.process._
"ls -al ..".!

As a result, the ls -al .. command is executed, showing all files in the parent directory. The result is printed to standard output.

The scala.sys.process package contains an implicit conversion from strings to ProcessBuilder objects. The ! method executes the ProcessBuilder object.

The result of the ! method is the exit code of the executed program: 0 if the program was successful, or a nonzero failure indicator otherwise.

If you use !! instead of !, the output is returned as a string:

val result = "ls -al /".!!
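The difference between the two methods is easy to try out. The sketch below assumes a Unix-like system on which echo is available as a program. It passes the command as a sequence of strings, so that no whitespace splitting takes place.

```scala
import scala.sys.process._

// Assumption: a Unix-like system with an echo program on the path
val exitCode = Seq("echo", "hello").! // Prints hello and returns the exit code
val output = Seq("echo", "hello").!!  // Captures the output as a string instead
```

The captured output includes the trailing newline. Keep in mind that !! throws an exception when the program ends with a nonzero exit code.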


Note

The ! and !! operators were originally intended to be used as postfix operators without the method invocation syntax:

"ls -al /" !!

However, as you will see in Chapter 11, the postfix syntax is being deprecated since it can lead to parsing errors.


You can pipe the output of one program into the input of another, using the #| method:

("ls -al /" #| "grep u").!


Note

As you can see, the process library uses the commands of the underlying operating system. Here, I use bash commands because bash is available on Linux, Mac OS X, and Windows.


To redirect the output to a file, use the #> method:

import java.io.File
("ls -al /" #> new File("filelist.txt")).!

To append to a file, use #>> instead:

("ls -al /etc" #>> new File("filelist.txt")).!

To redirect input from a file, use #<:

("grep u" #< new File("filelist.txt")).!

You can also redirect input from a URL:

import java.net.URL
("grep Scala" #< new URL("http://horstmann.com/index.html")).!

You can combine processes with p #&& q (execute q if p was successful) and p #|| q (execute q if p was unsuccessful). But frankly, Scala is better at control flow than the shell, so why not implement the control flow in Scala?
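For comparison, here is a sketch of both styles. It assumes a Unix-like system with sh on the path, and the two commands are stand-ins that simply exit with a fixed code.

```scala
import scala.sys.process._

// Assumption: sh is available; these commands only set an exit code
val first = Seq("sh", "-c", "exit 3")    // Always fails
val fallback = Seq("sh", "-c", "exit 0") // Always succeeds

// Shell style: run fallback only if first was unsuccessful
val combined = (first #|| fallback).!

// Scala style: ordinary control flow on the exit codes
val outcome =
  if (first.! == 0) "first succeeded"
  else if (fallback.! == 0) "fallback succeeded"
  else "both failed"
```

The Scala version is longer, but it can log, retry, or branch in ways that would be awkward to express with #&& and #||.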


Note

The process library uses the familiar shell operators | > >> < && ||, but it prefixes them with a # so that they all have the same precedence.


If you need to run a process in a different directory, or with different environment variables, construct a ProcessBuilder with the apply method of the Process object. Supply the command, the starting directory, and a sequence of (name, value) pairs for environment settings:

val p = Process(cmd, new File(dirName), ("LANG", "en_US"))

Then execute it with the ! method:

("echo 42" #| p).!
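As an illustration of the environment settings, the following sketch runs a shell command that prints a variable. It assumes a Unix-like system with sh, and GREETING is a made-up variable name.

```scala
import java.io.File
import scala.sys.process._

// Assumption: sh is available; GREETING exists only in the child process
val p = Process(Seq("sh", "-c", "echo $GREETING"),
  new File("."), "GREETING" -> "hello")
val output = p.!! // The child process sees GREETING=hello
```

The extra settings are added to the environment of the child process only. The environment of the Scala program itself is unchanged.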


Note

If you want to use Scala for shell scripts in a UNIX/Linux/MacOS environment, start your script files like this:

#!/bin/sh
exec scala "$0" "$@"
!#
Scala commands



Note

You can also run Scala scripts from Java programs with the scripting integration of the javax.script package. To get a script engine, call

ScriptEngine engine =
  new ScriptEngineManager().getScriptEngineByName("scala")


9.10 Regular Expressions

When you process input, you often want to use regular expressions to analyze it. The scala.util.matching.Regex class makes this simple. To construct a Regex object, use the r method of the String class:

val numPattern = "[0-9]+".r

If the regular expression contains backslashes or quotation marks, then it is a good idea to use the “raw” string syntax, """...""". For example:

val wsnumwsPattern = """\s+[0-9]+\s+""".r
  // A bit easier to read than "\\s+[0-9]+\\s+".r

The findAllIn method returns an Iterator[String] through all matches. You can use it in a for loop:

for (matchString <- numPattern.findAllIn("99 bottles, 98 bottles"))
  println(matchString)

Alternatively, turn the iterator into an array:

val matches = numPattern.findAllIn("99 bottles, 98 bottles").toArray
  // Array("99", "98")

To find the first match in a string, use findFirstIn. You get an Option[String]. (See Chapter 14 for the Option class.)

val firstMatch = wsnumwsPattern.findFirstIn("99 bottles, 98 bottles")
  // Some(" 98 ")


Note

There is no method to test whether a string matches the regex in its entirety, but you can add anchors:

val anchoredPattern = "^[0-9]+$".r
if (anchoredPattern.findFirstIn(str) != None) ...

Alternatively, use the String.matches method:

if (str.matches("[0-9]+")) ...
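A quick way to convince yourself that the two approaches agree is to run them side by side. The helper name isNumber and the sample strings in this sketch are invented for the comparison.

```scala
val anchoredPattern = "^[0-9]+$".r

// True if the anchored pattern finds a match, that is, if str is all digits
def isNumber(str: String): Boolean = anchoredPattern.findFirstIn(str) != None

// Each entry: (input, result with anchors, result with String.matches)
val checks = Seq("42", "42 bottles", "").map(s =>
  (s, isNumber(s), s.matches("[0-9]+")))
```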


You can replace the first match, all matches, or some matches. In the latter case, supply a function Match => Option[String]. The Match class has information about the match (see the next section for details). If the function returns Some(str), the match is replaced with str.

numPattern.replaceFirstIn("99 bottles, 98 bottles", "XX")
  // "XX bottles, 98 bottles"
numPattern.replaceAllIn("99 bottles, 98 bottles", "XX")
  // "XX bottles, XX bottles"
numPattern.replaceSomeIn("99 bottles, 98 bottles",
  m => if (m.matched.toInt % 2 == 0) Some("XX") else None)
  // "99 bottles, XX bottles"

Here is a more useful application of the replaceSomeIn method. We want to replace placeholders $0, $1, and so on, in a message string with values from an argument sequence. Make a pattern for the variable with a group for the index, and then map the group to the sequence element.

val varPattern = """\$[0-9]+""".r
def format(message: String, vars: String*) =
  varPattern.replaceSomeIn(message, m => vars.lift(
    m.matched.tail.toInt))
format("At $1, there was $2 on $0.",
  "planet 7", "12:30 pm", "a disturbance of the force")
  // At 12:30 pm, there was a disturbance of the force on planet 7.

The lift method turns a Seq[String] into a function. The expression vars.lift(i) is Some(vars(i)) if i is a valid index or None if it is not.
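The None case is what keeps the function from failing on a placeholder that has no matching argument. In the sketch below, the messages and arguments are made up. The second call refers to $7 even though only one argument is supplied.

```scala
val varPattern = """\$[0-9]+""".r

def format(message: String, vars: String*) =
  varPattern.replaceSomeIn(message, m => vars.lift(m.matched.tail.toInt))

val filled = format("Hello, $0! You have $1 new messages.", "Fred", "3")
val partial = format("Hello, $0! See $7.", "Fred") // $7 has no argument
```

Since vars.lift(7) is None, that placeholder is left in the result unchanged. Be aware that a replacement string is itself scanned for $ and \ characters. If the arguments might contain them, consider passing each one through Regex.quoteReplacement first.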

9.11 Regular Expression Groups

Groups are useful to get subexpressions of regular expressions. Add parentheses around the subexpressions that you want to extract, for example:

val numitemPattern = "([0-9]+) ([a-z]+)".r

You can get the group contents from a Match object. The methods findAllMatchIn and findFirstMatchIn are analogs of the findAllIn and findFirstIn methods that return an Iterator[Match] or Option[Match].

If m is a Match object, then m.matched is the entire match string and m.group(i) is the ith group. The start and end indices of these substrings in the original string are m.start, m.end, m.start(i), and m.end(i).

for (m <- numitemPattern.findAllMatchIn("99 bottles, 98 bottles"))
  println(m.group(1)) // Prints 99 and 98
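The index methods are easiest to understand from a concrete run. This sketch collects, for each match, the matched string, its start and end index, the second group, and the start index of that group.

```scala
val numitemPattern = "([0-9]+) ([a-z]+)".r

val positions =
  for (m <- numitemPattern.findAllMatchIn("99 bottles, 98 bottles").toList)
    yield (m.matched, m.start, m.end, m.group(2), m.start(2))
  // List(("99 bottles", 0, 10, "bottles", 3), ("98 bottles", 12, 22, "bottles", 15))
```

As with the substring method, the end index is the position just past the last character of the match.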


Caution

The Match class has methods for retrieving groups by name. However, this does not work with group names inside regular expressions, such as "(?<num>[0-9]+) (?<item>[a-z]+)".r. Instead, one needs to supply names to the r method: "([0-9]+) ([a-z]+)".r("num", "item")


There is another convenient way of extracting matches. Use a regular expression variable as an “extractor” (see Chapter 14), like this:

val numitemPattern(num, item) = "99 bottles"
  // Sets num to "99", item to "bottles"

When you use a pattern as an extractor, it must match the string from which you extract the matches, and there must be a group for each variable.

To extract groups from multiple matches, you can use a for statement like this:

for (numitemPattern(num, item) <- numitemPattern.findAllIn("99 bottles, 98 bottles"))
  println(s"$num: $item") // Or process num and item in some other way
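A val declaration with an extractor throws a MatchError if the string does not match. When you are not sure about the input, a match expression with a fallback case is safer. The following sketch shows both that and the for form with yield. The function name describe and the sample strings are made up, and the for form is written for Scala 2.

```scala
val numitemPattern = "([0-9]+) ([a-z]+)".r

// The extractor succeeds only if the entire string matches the pattern
def describe(s: String): String = s match {
  case numitemPattern(num, item) => s"${num.toInt} x $item"
  case _ => "no match"
}

val single = describe("99 bottles")
val whole = describe("99 bottles, 98 cans") // The whole string does not match

// Extract the groups from each match in a longer string
val pairs =
  for (numitemPattern(num, item) <- numitemPattern.findAllIn("99 bottles, 98 cans").toList)
    yield (num.toInt, item)
```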

Exercises

1. Write a Scala code snippet that reverses the lines in a file (making the last line the first one, and so on).

2. Write a Scala program that reads a file with tabs, replaces each tab with spaces so that tab stops are at n-column boundaries, and writes the result to the same file.

3. Write a Scala code snippet that reads a file and prints all words with more than 12 characters to the console. Extra credit if you can do this in a single line.

4. Write a Scala program that reads a text file containing only floating-point numbers. Print the sum, average, maximum, and minimum of the numbers in the file.

5. Write a Scala program that writes the powers of 2 and their reciprocals to a file, with the exponent ranging from 0 to 20. Line up the columns:

          1               1
          2               0.5
          4               0.25
        ...               ...

6. Make a regular expression searching for quoted strings "like this, maybe with " or \" in a Java or C++ program. Write a Scala program that prints out all such strings in a source file.

7. Write a Scala program that reads a text file and prints all tokens in the file that are not floating-point numbers. Use a regular expression.

8. Write a Scala program that prints the src attributes of all img tags of a web page. Use regular expressions and groups.

9. Write a Scala program that counts how many files with .class extension are in a given directory and its subdirectories.

10. Expand the example in Section 9.8, “Serialization.” Construct a few Person objects, make some of them friends of others, and save an Array[Person] to a file. Read the array back in and verify that the friend relations are intact.
