Topics in This Chapter
9.3 Reading Tokens and Numbers
9.4 Reading from URLs and Other Sources
9.11 Regular Expression Groups
In this chapter, you will learn how to carry out common file processing tasks, such as reading all lines or words from a file or reading a file containing numbers.
Chapter highlights:
• Source.fromFile(...).getLines.toArray yields all lines of a file.
• Source.fromFile(...).mkString yields the file contents as a string.
• To convert a string into a number, use the toInt or toDouble method.
• Use the Java PrintWriter to write text files.
• "regex".r is a Regex object.
• Use """...""" if your regular expression contains backslashes or quotes.
• If a regex pattern has groups, you can extract their contents using the syntax for (regex(var1, ..., varn) <- string).
To read all lines from a file, call the getLines method on a scala.io.Source object:
import scala.io.Source
val source = Source.fromFile("myfile.txt", "UTF-8")
// The first argument can be a string or a java.io.File
// You can omit the encoding if you know that the file uses
// the default platform encoding
val lineIterator = source.getLines
The result is an iterator (see Chapter 13). You can use it to process the lines one at a time:
for (l <- lineIterator) process l
Or you can put the lines into an array or array buffer by applying the toArray or toBuffer method to the iterator:
val lines = source.getLines.toArray
Sometimes, you just want to read an entire file into a string. That’s even simpler:
val contents = source.mkString
Caution
Call close when you are done using the Source object.
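One way to make sure that close is always called is a try/finally block. The following is a minimal sketch, not a library facility: the helper name readLines and the temporary sample file are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source

// Hypothetical helper (not part of the Scala library): reads every line
// and closes the source even if reading throws an exception.
def readLines(path: String): Array[String] = {
  val source = Source.fromFile(path, "UTF-8")
  try source.getLines().toArray
  finally source.close()
}

// Create a small sample file so that the sketch is self-contained.
val sample = Files.createTempFile("sample", ".txt")
val written = Files.write(sample, "first\nsecond\nthird".getBytes(StandardCharsets.UTF_8))
val lines = readLines(sample.toString) // Array("first", "second", "third")
```

Converting the iterator to an array inside the try block matters: the iterator is lazy, so the lines must be consumed before the source is closed.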
To read individual characters from a file, you can use a Source object directly as an iterator since the Source class extends Iterator[Char]:
for (c <- source) process c
If you want to be able to peek at a character without consuming it (like istream::peek in C++ or a PushbackReader in Java), call the buffered method on the source object. The head method then yields the next input character without consuming it.
val source = Source.fromFile("myfile.txt", "UTF-8")
val iter = source.buffered
while (iter.hasNext) {
if (iter.head is nice)
process iter.next
else
...
}
source.close()
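As a concrete instance of the peeking loop, the sketch below consumes a leading run of digits and leaves the rest of the input unread. The helper name readLeadingDigits and the sample input are assumptions; a string-backed source stands in for a file.

```scala
import scala.io.Source

// Hypothetical helper: consumes digits from the front of the input
// and stops at the first non-digit without consuming it.
def readLeadingDigits(iter: BufferedIterator[Char]): String = {
  val sb = new StringBuilder
  while (iter.hasNext && iter.head.isDigit) sb += iter.next()
  sb.toString
}

val source = Source.fromString("123abc") // stands in for Source.fromFile(...)
val iter = source.buffered
val digits = readLeadingDigits(iter) // "123"
val rest = iter.mkString             // "abc", the peeked 'a' was not lost
```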
Alternatively, if your file isn’t large, you can just read it into a string and process that:
val contents = source.mkString
Here is a quick-and-dirty way of reading all whitespace-separated tokens in a source:
val tokens = source.mkString.split("\\s+")
To convert a string into a number, use the toInt or toDouble method. For example, if you have a file containing floating-point numbers, you can read them all into an array by
val numbers = for (w <- tokens) yield w.toDouble
or
val numbers = tokens.map(_.toDouble)
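Put together, the tokenizing and conversion steps might look as follows. This is a small sketch with made-up sample data; Source.fromString stands in for a file, and trim is added as a precaution so that leading whitespace does not produce an empty first token.

```scala
import scala.io.Source

val source = Source.fromString("3.14 2.71\n1.41\t1.73")
// Split on any run of spaces, tabs, or newlines.
val tokens = source.mkString.trim.split("\\s+")
val numbers = tokens.map(_.toDouble)
val total = numbers.sum
```

Note that toDouble throws a NumberFormatException if a token is not a valid number.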
Tip
Remember that you can always use the java.util.Scanner class to process a file that contains a mixture of text and numbers.
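For instance, a line that mixes a name with numbers could be processed roughly like this. The input string and the variable names are invented for illustration; the locale is set explicitly because Scanner otherwise parses numbers according to the platform default.

```scala
import java.util.{Locale, Scanner}

// A Scanner can read from a string, a java.io.File, or an input stream.
val in = new Scanner("widgets 12 3.5").useLocale(Locale.US)
val name = in.next()         // next whitespace-delimited token
val quantity = in.nextInt()
val price = in.nextDouble()
```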
Finally, note that you can read numbers from scala.io.StdIn:
print("How old are you? ")
val age = scala.io.StdIn.readInt()
// Or use readDouble or readLong
Caution
These methods assume that the next input line contains a single number, without leading or trailing whitespace. Otherwise, a NumberFormatException occurs.
The Source object has methods to read from sources other than files:
val source1 = Source.fromURL("http://horstmann.com", "UTF-8")
val source2 = Source.fromString("Hello, World!")
// Reads from the given string—useful for debugging
val source3 = Source.stdin
// Reads from standard input
Caution
When you read from a URL, you need to know the character set in advance, perhaps from an HTTP header. See www.w3.org/International/O-charset for more information.
Scala has no provision for reading binary files. You’ll need to use the Java library. Here is how you can read a file into a byte array:
import java.io.{File, FileInputStream}
val file = new File(filename)
val in = new FileInputStream(file)
val bytes = new Array[Byte](file.length.toInt)
in.read(bytes)
in.close()
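If the file is small enough to fit in memory, the java.nio.file.Files class offers a shorter alternative to the stream-based code. The sketch below is not the approach shown above but a swapped-in variant; the temporary file and its three sample bytes are assumptions for illustration.

```scala
import java.nio.file.Files

// Create a small binary file so that the sketch is self-contained.
val sample = Files.createTempFile("sample", ".bin")
val written = Files.write(sample, Array[Byte](1, 2, 3))

// Reads the whole file and closes it in one call.
val bytes = Files.readAllBytes(sample)
```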
Scala has no built-in support for writing files. To write a text file, use a java.io.PrintWriter, for example:
import java.io.PrintWriter
val out = new PrintWriter("numbers.txt")
for (i <- 1 to 100) out.println(i)
out.close()
Everything works as expected, except for the printf method. When you pass a number to printf, the compiler will complain that you need to convert it to an AnyRef:
out.printf("%6d %10.2f",
quantity.asInstanceOf[AnyRef], price.asInstanceOf[AnyRef]) // Ugh
Instead, use the f interpolator:
out.print(f"$quantity%6d $price%10.2f")
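To see the interpolator and the writer working together, here is a small sketch that formats one line into an in-memory StringWriter instead of a file. The helper name formatLine is an assumption, not part of any library.

```scala
import java.io.{PrintWriter, StringWriter}

// Hypothetical helper: formats one record the same way as the line above,
// but writes into a string so that the result is easy to inspect.
def formatLine(quantity: Int, price: Double): String = {
  val buffer = new StringWriter
  val out = new PrintWriter(buffer)
  try out.print(f"$quantity%6d $price%10.2f")
  finally out.close()
  buffer.toString
}

val line = formatLine(42, 9.5) // 6 columns, a space, then 10 columns
```

The f interpolator checks the format specifiers against the argument types at compile time, so passing a string where %6d expects an integer is a compile-time error.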
There are no “official” Scala classes for visiting all files in a directory, or for recursively traversing directories.
The simplest approach is to use the Files.list and Files.walk methods of the java.nio.file package. The list method only visits the children of a directory, and the walk method visits all descendants. These methods yield Java streams of Path objects. You can visit them as follows:
import java.nio.file._
val dirname = "/home/cay/scala-impatient/code"
val entries = Files.walk(Paths.get(dirname)) // or Files.list
try {
entries.forEach(p => Process the path p)
} finally {
entries.close()
}
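As a runnable variant of the traversal, the following sketch collects the names of all entries below a directory. The helper name listDescendants and the temporary directory layout are assumptions made so that the example is self-contained.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: returns the file names of dir and all its descendants.
def listDescendants(dir: Path): Seq[String] = {
  val names = ArrayBuffer[String]()
  val entries = Files.walk(dir) // the stream holds open directories
  try entries.forEach(p => names += p.getFileName.toString)
  finally entries.close()
  names.toList
}

// Build a tiny directory tree: root/a.txt and root/sub/b.txt
val root = Files.createTempDirectory("walkdemo")
val sub = Files.createDirectory(root.resolve("sub"))
val a = Files.createFile(root.resolve("a.txt"))
val b = Files.createFile(sub.resolve("b.txt"))
val found = listDescendants(root)
```

Note that Files.walk also visits the starting directory itself, so found has four entries here.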
In Java, serialization is used to transmit objects to other virtual machines or for short-term storage. (For long-term storage, serialization can be awkward—it is tedious to deal with different object versions as classes evolve over time.)
Here is how you declare a serializable class in Java and Scala.
Java:
public class Person implements java.io.Serializable {
private static final long serialVersionUID = 42L;
...
}
Scala:
@SerialVersionUID(42L) class Person extends Serializable
The Serializable trait is defined in the scala package and does not require an import.
Note
You can omit the @SerialVersionUID annotation if you are OK with the default ID.
Serialize and deserialize objects in the usual way:
val fred = new Person(...)
import java.io._
val out = new ObjectOutputStream(new FileOutputStream("/tmp/test.obj"))
out.writeObject(fred)
out.close()
val in = new ObjectInputStream(new FileInputStream("/tmp/test.obj"))
val savedFred = in.readObject().asInstanceOf[Person]
The Scala collections are serializable, so you can have them as members of your serializable classes:
class Person extends Serializable {
private val friends = new ArrayBuffer[Person] // OK—ArrayBuffer is serializable
...
}
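A quick way to convince yourself that a collection survives serialization is an in-memory round trip. The sketch below uses byte array streams instead of a file; the helper name roundTrip is an assumption, and the cast is unchecked, as with any use of readObject.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: serializes a value to bytes and reads it back.
def roundTrip[A <: AnyRef](value: A): A = {
  val bytes = new ByteArrayOutputStream
  val out = new ObjectOutputStream(bytes)
  try out.writeObject(value) finally out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[A] finally in.close()
}

val friends = ArrayBuffer("Fred", "Wilma")
val copy = roundTrip(friends) // equal contents, but a distinct object
```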
Traditionally, programmers use shell scripts to carry out mundane processing tasks, such as moving files from one place to another, or combining a set of files. The shell language makes it easy to specify subsets of files and to pipe the output of one program into the input of another. However, as programming languages, most shell languages leave much to be desired.
Scala was designed to scale from humble scripting tasks to massive programs. The scala.sys.process package provides utilities to interact with shell programs. You can write your shell scripts in Scala, with all the power that the Scala language puts at your disposal.
Here is a simple example:
import scala.sys.process._
"ls -al ..".!
As a result, the ls -al .. command is executed, showing all files in the parent directory. The result is printed to standard output.
The scala.sys.process package contains an implicit conversion from strings to ProcessBuilder objects. The ! method executes the ProcessBuilder object.
The result of the ! method is the exit code of the executed program: 0 if the program was successful, or a nonzero failure indicator otherwise.
If you use !! instead of !, the output is returned as a string:
val result = "ls -al /".!!
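The difference between the two methods can be seen in a short sketch. It assumes a Unix-like system where the echo and sh commands are available; the commands are given as sequences so that each argument is passed as is.

```scala
import scala.sys.process._

// !! runs the command and returns everything it wrote to standard output.
val output = Seq("echo", "hello").!!

// ! runs the command and returns only its exit code.
val exitCode = Seq("sh", "-c", "exit 3").!
```

The captured output includes the trailing newline, so call trim before comparing it. Note also that !! throws an exception if the command exits with a nonzero code.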
Note
The ! and !! operators were originally intended to be used as postfix operators without the method invocation syntax:
"ls -al /" !!
However, as you will see in Chapter 11, the postfix syntax is being deprecated since it can lead to parsing errors.
You can pipe the output of one program into the input of another, using the #| method:
("ls -al /" #| "grep u").!
Note
As you can see, the process library uses the commands of the underlying operating system. Here, I use bash commands because bash is available on Linux, Mac OS X, and Windows.
To redirect the output to a file, use the #> method:
("ls -al /" #> new File("filelist.txt")).!
To append to a file, use #>> instead:
("ls -al /etc" #>> new File("filelist.txt")).!
To redirect input from a file, use #<:
("grep u" #< new File("filelist.txt")).!
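The piping and redirection operators can be tried out without touching real files. The following sketch assumes a Unix-like system with echo, cat, and grep, and uses a temporary file whose name is invented for illustration.

```scala
import java.io.File
import scala.sys.process._

val listing = File.createTempFile("filelist", ".txt")

// #> overwrites the file, #>> appends to it; ! waits for completion.
val writeCode = ("echo first" #> listing).!
val appendCode = ("echo second" #>> listing).!

// #< feeds the file to the command's standard input.
val contents = ("cat" #< listing).!!

// #| connects the output of one command to the input of the next.
val piped = ("echo unix" #| "grep u").!!
```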
You can also redirect input from a URL:
("grep Scala" #< new URL("http://horstmann.com/index.html")).!
You can combine processes with p #&& q (execute q if p was successful) and p #|| q (execute q if p was unsuccessful). But frankly, Scala is better at control flow than the shell, so why not implement the control flow in Scala?
Note
The process library uses the familiar shell operators | > >> < && ||, but it prefixes them with a # so that they all have the same precedence.
If you need to run a process in a different directory, or with different environment variables, construct a ProcessBuilder with the apply method of the Process object. Supply the command, the starting directory, and a sequence of (name, value) pairs for environment settings:
val p = Process(cmd, new File(dirName), ("LANG", "en_US"))
Then execute it with the ! method:
("echo 42" #| p).!
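Here is a small sketch that shows the environment setting taking effect. It assumes a Unix-like system with sh; the variable name GREETING and the choice of the temporary directory as the starting directory are made up for illustration.

```scala
import java.io.File
import scala.sys.process._

val tmpDir = new File(System.getProperty("java.io.tmpdir"))

// The shell expands $GREETING from the extra environment passed to Process.
val p = Process(Seq("sh", "-c", "echo $GREETING"), tmpDir, "GREETING" -> "hello")
val output = p.!!
```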
Note
If you want to use Scala for shell scripts in a UNIX/Linux/MacOS environment, start your script files like this:
#!/bin/sh
exec scala "$0" "$@"
!#
Scala commands
Note
You can also run Scala scripts from Java programs with the scripting integration of the javax.script package. To get a script engine, call
ScriptEngine engine =
new ScriptEngineManager().getScriptEngineByName("scala")
When you process input, you often want to use regular expressions to analyze it. The scala.util.matching.Regex class makes this simple. To construct a Regex object, use the r method of the String class:
val numPattern = "[0-9]+".r
If the regular expression contains backslashes or quotation marks, then it is a good idea to use the “raw” string syntax, """...""". For example:
val wsnumwsPattern = """\s+[0-9]+\s+""".r
// A bit easier to read than "\\s+[0-9]+\\s+".r
The findAllIn method returns an Iterator[String] through all matches. You can use it in a for loop:
for (matchString <- numPattern.findAllIn("99 bottles, 98 bottles"))
println(matchString)
Alternatively, turn the iterator into an array:
val matches = numPattern.findAllIn("99 bottles, 98 bottles").toArray
// Array("99", "98")
To find the first match in a string, use findFirstIn. You get an Option[String]. (See Chapter 14 for the Option class.)
val firstMatch = wsnumwsPattern.findFirstIn("99 bottles, 98 bottles")
// Some(" 98 ")
Note
There is no method to test whether a string matches the regex in its entirety, but you can add anchors:
val anchoredPattern = "^[0-9]+$".r
if (anchoredPattern.findFirstIn(str) != None) ...
Alternatively, use the String.matches method:
if (str.matches("[0-9]+")) ...
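If you already have a Regex object and want to reuse it for a whole-string test, one option is to go through the underlying java.util.regex.Pattern, which the pattern method of the Regex class exposes. This is a sketch of that alternative; the helper name isAllDigits is an assumption.

```scala
val numPattern = "[0-9]+".r

// Hypothetical helper: true only if the entire string matches the regex.
def isAllDigits(str: String): Boolean =
  numPattern.pattern.matcher(str).matches()

val yes = isAllDigits("12345") // the whole string consists of digits
val no = isAllDigits("12a")    // a partial match is not enough
```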
You can replace the first match, all matches, or some matches. In the last case, supply a function Match => Option[String]. The Match class has information about the match (see the next section for details). If the function returns Some(str), the match is replaced with str. If it returns None, the match is left unchanged.
numPattern.replaceFirstIn("99 bottles, 98 bottles", "XX")
// "XX bottles, 98 bottles"
numPattern.replaceAllIn("99 bottles, 98 bottles", "XX")
// "XX bottles, XX bottles"
numPattern.replaceSomeIn("99 bottles, 98 bottles",
m => if (m.matched.toInt % 2 == 0) Some("XX") else None)
// "99 bottles, XX bottles"
Here is a more useful application of the replaceSomeIn method. We want to replace placeholders $0, $1, and so on, in a message string with values from an argument sequence. Make a pattern for the variable with a group for the index, and then map the group to the sequence element.
val varPattern = """\$[0-9]+""".r
def format(message: String, vars: String*) =
varPattern.replaceSomeIn(message, m => vars.lift(
m.matched.tail.toInt))
format("At $1, there was $2 on $0.",
"planet 7", "12:30 pm", "a disturbance of the force")
// At 12:30 pm, there was a disturbance of the force on planet 7.
The lift method turns a Seq[String] into a function. The expression vars.lift(i) is Some(vars(i)) if i is a valid index or None if it is not.
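The following sketch is a variant of the format function that uses an explicit group for the index and shows what happens with an index that is out of range. It also passes each value through Regex.quoteReplacement, because replacement strings give special meaning to $ and \. That extra step is an addition for robustness, not part of the example above.

```scala
import scala.util.matching.Regex

// The group captures the digits after the dollar sign.
val varPattern = """\$([0-9]+)""".r

def format(message: String, vars: String*): String =
  varPattern.replaceSomeIn(message, m =>
    // None for an invalid index leaves the placeholder as it is.
    vars.lift(m.group(1).toInt).map(Regex.quoteReplacement))

val filled = format("At $1, there was $2 on $0.",
  "planet 7", "12:30 pm", "a disturbance of the force")
val partial = format("Cost: $0, missing: $5", "$10")
```

Here, filled is the sentence shown earlier, and partial is "Cost: $10, missing: $5" since there is no argument with index 5.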
Groups are useful to get subexpressions of regular expressions. Add parentheses around the subexpressions that you want to extract, for example:
val numitemPattern = "([0-9]+) ([a-z]+)".r
You can get the group contents from a Match object. The methods findAllMatchIn and findFirstMatchIn are analogs of the findAllIn and findFirstIn methods that return an Iterator[Match] or Option[Match].
If m is a Match object, then m.matched is the entire match string and m.group(i) is the ith group. The start and end indices of these substrings in the original string are m.start, m.end, m.start(i), and m.end(i).
for (m <- numitemPattern.findAllMatchIn("99 bottles, 98 bottles"))
println(m.group(1)) // Prints 99 and 98
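To make the group and index methods concrete, this sketch collects both groups together with the position of each match. The variable names input and details are chosen for illustration.

```scala
val numitemPattern = "([0-9]+) ([a-z]+)".r
val input = "99 bottles, 98 bottles"

// For each match: first group as a number, second group, start and end index.
val details = numitemPattern.findAllMatchIn(input)
  .map(m => (m.group(1).toInt, m.group(2), m.start, m.end))
  .toList
// List((99, "bottles", 0, 10), (98, "bottles", 12, 22))
```

As with substring, the end index is exclusive.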
Caution
The Match class has methods for retrieving groups by name. However, this does not work with group names inside regular expressions, such as "(?<num>[0-9]+) (?<item>[a-z]+)".r. Instead, one needs to supply names to the r method: "([0-9]+) ([a-z]+)".r("num", "item")
There is another convenient way of extracting matches. Use a regular expression variable as an “extractor” (see Chapter 14), like this:
val numitemPattern(num, item) = "99 bottles"
// Sets num to "99", item to "bottles"
When you use a pattern as an extractor, it must match the string from which you extract the matches, and there must be a group for each variable.
To extract groups from multiple matches, you can use a for statement like this:
for (numitemPattern(num, item) <- numitemPattern.findAllIn("99 bottles, 98 bottles"))
process num and item
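Since an extractor in a val declaration throws a MatchError when the string does not match, it can be safer to use the pattern in a match expression or with collect. The sketch below shows both; the helper name describe and the output format are assumptions.

```scala
val numitemPattern = "([0-9]+) ([a-z]+)".r

// Hypothetical helper: falls back to a default when the string doesn't match.
def describe(s: String): String = s match {
  case numitemPattern(num, item) => s"$num x $item"
  case _ => "no match"
}

val one = describe("99 bottles")              // "99 x bottles"
val two = describe("99 bottles, 98 bottles")  // "no match"

// collect keeps only the strings that the extractor accepts.
val items = numitemPattern.findAllIn("99 bottles, 98 bottles").toList.collect {
  case numitemPattern(num, item) => (num.toInt, item)
}
```

The second call yields "no match" because the extractor must match the entire string, not just a part of it.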
1. Write a Scala code snippet that reverses the lines in a file (making the last line the first one, and so on).
2. Write a Scala program that reads a file with tabs, replaces each tab with spaces so that tab stops are at n-column boundaries, and writes the result to the same file.
3. Write a Scala code snippet that reads a file and prints all words with more than 12 characters to the console. Extra credit if you can do this in a single line.
4. Write a Scala program that reads a text file containing only floating-point numbers. Print the sum, average, maximum, and minimum of the numbers in the file.
5. Write a Scala program that writes the powers of 2 and their reciprocals to a file, with the exponent ranging from 0 to 20. Line up the columns:
1 1
2 0.5
4 0.25
... ...
6. Make a regular expression searching for quoted strings "like this, maybe with \" or \\" in a Java or C++ program. Write a Scala program that prints out all such strings in a source file.
7. Write a Scala program that reads a text file and prints all tokens in the file that are not floating-point numbers. Use a regular expression.
8. Write a Scala program that prints the src attributes of all img tags of a web page. Use regular expressions and groups.
9. Write a Scala program that counts how many files with .class extension are in a given directory and its subdirectories.
10. Expand the example in Section 9.8, “Serialization.” Construct a few Person objects, make some of them friends of others, and save an Array[Person] to a file. Read the array back in and verify that the friend relations are intact.