Command-Line Building Blocks

Learning Objectives

By the end of the chapter, you will be able to:

  • Use redirection to control command input and output
  • Construct pipelines between commands
  • Use commands for text filtering
  • Use text transformation commands
  • Analyze tabular data using data-processing commands

This chapter introduces you to the two main composing mechanisms of command lines: redirection and piping. You will also expand your vocabulary of commands to be able to perform a wide variety of data-processing tasks.

Introduction

So far, we have learned the basics of how to work with the filesystem with the shell. We also looked at some shell mechanisms such as wildcards and completion that simplify life in the command line. In this chapter, we will examine the building blocks that are used to perform data-processing tasks on the shell.

The Unix approach is to favor small, single-purpose utilities with very well-defined interfaces. Redirection and pipes let us connect these small commands and files together so that we can compose them like the elements of an electronic circuit to perform complex tasks. This concept of joining together small units into a more complex mechanism is a very powerful technique.

Most data that we typically work with is textual in nature, so we will study the most useful text-oriented commands in this chapter, along with various practical examples of their usage.

Redirection

Redirection is a method of connecting files to a command. This mechanism is used to capture the output of a command or to feed input to it.

Note

During this section, we will introduce a few commands briefly, in order to illustrate some concepts. The commands are only used as examples, and their usage does not have any connection to the main topics being covered here. The detailed descriptions of all the features and uses of those commands will be covered in the topic on text-processing commands.

Input and Output Streams

Every command that is run has three channels: one for data input, termed standard input (stdin); one for data output, termed standard output (stdout); and one for error output, termed standard error (stderr). A command reads data from stdin and writes its results to stdout. If any error occurs, the error messages are written to stderr. These channels can also be thought of as streams through which data flows.

By convention, stdin, stdout, and stderr are assigned the numbers 0, 1, and 2, which are called file descriptors (FDs). We will not go into the technical details of these, but remember the association between these streams and their FD numbers.

When a command is run interactively, the shell attaches the input and output streams to the console input and output, respectively. Note that, by default, both stdout and stderr go to the console display.

Note

The terms console, terminal, and TTY are often used interchangeably. In essence, they refer to the interface where commands are typed, and output is produced as text. Console or terminal output refers to what the command prints out. Console or terminal input refers to what the user types in.

For instance, when we use ls to list an existing and non-existing folder from the dataset we used in the previous chapter, we get the following output:

robin ~/Lesson1/data $ ls podocarpaceae/ nonexistent

ls: cannot access 'nonexistent': No such file or directory

podocarpaceae/:

acmopyle dacrydium lagarostrobos margbensonia parasitaxus podocarpus saxegothaea

afrocarpus falcatifolium lepidothamnus microcachrys pherosphaera prumnopitys stachycarpus

dacrycarpus halocarpus manoao nageia phyllocladus retrophyllum sundacarpus

Note the error message that appears on the first line of the output. This is due to the stderr stream reaching the console, whereas the remaining output is from stdout. The outputs from both channels are combined.

Use of Operators for Redirection

We can tell the shell to connect any of the aforementioned streams to a file using the following operators:

  • The > or greater-than symbol is used to specify output redirection to a file, and is used as follows:

    command >file.txt

    This instructs the shell to redirect the stdout of a command into a file. If the file already exists, its content is overwritten.

  • The >> symbol is used to specify output redirection with append, appending data to a file, as follows:

    command >>file.txt

    This instructs the shell to redirect the stdout of the command into a file. If the file does not exist, it is created; if it already exists, it is appended to rather than overwritten, unlike the previous case.

  • The < or less-than symbol is used to specify input redirection from a file. The syntax is as follows:

    command <file.txt

    This instructs the shell to redirect a file to the stdin of the command.

In the preceding syntax, you can prefix the number of an FD before <, >, or >>, which lets us redirect the stream corresponding to that FD. If no FD is specified, the defaults are stdin (FD 0) and stdout (FD 1) for input and output, respectively. Typically, the FD prefix 2 is used to redirect stderr.

Note

To redirect both stdout and stderr to the same file, the special operator &> is used.
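For example, here is a minimal sketch using placeholder filenames: the first command captures only the error messages, while the second captures both the regular output and the errors in one file:

command 2>errors.txt

command &>all_output.txt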

When the shell runs one of the preceding commands, it first opens the files to/from which redirection is requested, attaches the streams to the command, and then runs it. Even if a command produces no output, if output redirection is requested, then the file is created (or truncated).

Using Multiple Redirections

Both input and output redirection can be specified for the same command, so a command can be considered as consisting of the following parts (of which the redirections are optional):

  • stdin redirection
  • stdout redirection
  • stderr redirection
  • The command itself, along with its options and other arguments

A key insight is that the order in which these parts appear in the command line does not matter. For example, consider the following sort commands (the sort command reads lines from stdin, sorts them, and writes the sorted lines to stdout):

sort >sorted.txt <data.txt

sort <data.txt >sorted.txt

<data.txt >sorted.txt sort

All three commands are syntactically valid and perform the exact same action: the content of data.txt is redirected into the sort command, and its output in turn is written into the sorted.txt file.

It is a matter of individual style or preference as to how these redirection elements are ordered. The conventional way of writing it, which is the style encountered most frequently and considered most readable, is in this order:

command <input_redirect_file >output_redirect_file

A space is sometimes added between the redirection symbol and the filename. For example, we could write the following:

sort < data.txt > sorted.txt

This is perfectly valid. However, at a conceptual level, it is convenient to think of the symbol and the file as a single command-line element, and to write it as follows:

sort <data.txt >sorted.txt

This is to emphasize that the filenames data.txt and sorted.txt are attached to the respective streams, that is, stdin and stdout. Remember that the symbol is always written first, followed by the filename. The symbol points to the direction of the data flow, which is either from or into the file.

Heredocs and Herestrings

A useful convention that most commands follow is that they accept a command-line argument for the input file, but if the argument is omitted, the input data is read from stdin instead. This lets commands easily be used both with redirection (and piping) as well as in a standalone manner. For example, consider the following:

less file.txt

This less command gets the filename file.txt passed as an argument, which it then opens and displays. Now, consider the following:

less <file.txt

For this second case, the shell does not treat <file.txt as an argument to be passed to less—instead, it is an instruction to the shell to redirect file.txt to the stdin of less. When less runs, it sees no arguments passed to it, and therefore, it defaults to reading data from stdin to display, rather than opening any file. Since stdin was connected to file.txt, it achieves the same function as the first command.

When a command reads input from stdin and is not redirected, it accepts lines of text typed by the user until the user presses Ctrl + D. The terminal interprets that keystroke and signals an end-of-file (EOF) to the program; the input stream ends, and the program exits.

Here documents (also called heredocs) are a special form of redirection that lets you feed multiple lines of text to a command in a similar way, but rather than requiring Ctrl + D to signal EOF, you can specify an arbitrary string instead. This feature is especially useful in shell scripts, which we will cover in later chapters. A traditional example of using here documents is to type an email directly into the sendmail program, composing and sending it from the command line itself, without using an editor.

The syntax for heredocs is as follows:

command <<LIMITSTRING

Here, LIMITSTRING can be any arbitrary string. Upon typing this, the shell prompts for multiple lines of text, until the limit string is typed on its own line, after which the shell runs the command, passing in all the lines that were typed into the command.
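As a quick sketch, the following feeds three lines into sort using a heredoc with the limit string END (any other limit string would work just as well):

robin ~ $ sort <<END

> cherry

> apple

> banana

> END

apple

banana

cherry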

Here strings (also called herestrings) are yet another form of input redirection. They allow a single string to be passed as input to a command, as if it were redirected from a file. The syntax for herestrings is as follows:

command <<< INPUT

Here, INPUT is the string to be passed into the program's stdin. If the string is quoted, it can extend over multiple lines.
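As a minimal sketch, the following passes a literal string to cat through a herestring, and cat simply prints it back:

robin ~ $ cat <<< 'a single line of input'

a single line of input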

Buffering

Now, let's talk about a concept called buffering, which applies to both redirection and piping. A buffer can be considered analogous to a flush tank for data. Water flows in until the tank is full, after which the valve closes, and the flow of water is blocked. When water drains out of the tank, the valve opens to let the water fill the tank again. The following are the buffers connected to the input and output streams.

stdout buffer

Every program has a buffer connected to its stdout, and the program writes its output data into this buffer. When the buffer gets full, the program cannot write any more output, and is put to sleep. We say that this program has blocked on a write.

The other end of a command's stdout buffer could be connected to the following:

  • A file, when redirecting output
  • The console, when running directly
  • Another command, when pipes are used (we will cover pipes in the next section)

In each case, an independent process deals with taking the data out of this buffer and moving it to the other end. When the buffer has enough space for the blocked write to succeed, the program is woken up or unblocked.

stdin buffer

In a symmetric fashion, every program has a buffer connected to its stdin, and the program reads data from this buffer. When the buffer is empty, the program cannot read any more input, and is put to sleep. We say that this program has blocked on a read.

The other end of a command's stdin buffer could be connected to the following:

  • A file, when redirecting input
  • The input typed by the user, when running directly
  • Another command, when pipes are used

Once again, an independent process deals with filling this buffer with data. When the buffer has enough data for the blocked read to succeed, the program is woken up or unblocked.

The main reason to use buffering is for efficiency—moving data in larger chunks across an input or output channel is efficient. A process can quickly read or write an entire chunk of data from/to a buffer, and continue working, rather than getting blocked. This ensures maximum throughput and parallelism.

Flushing

A program can request a flush on its output buffer, which makes it sleep until the output buffer gets emptied. Flushing an input buffer works in a different way. It simply causes all the data in the input buffer to get discarded.

A command has a buffer for each of its input as well as output streams. These buffers can operate in three modes:

  • Unbuffered: A process immediately blocks when reading or writing if the other end of the stream is not writing or reading, respectively. This is equivalent to flushing immediately after every write.
  • Line buffered: The buffer is flushed whenever a newline character is encountered.
  • Fully buffered: The buffer is not flushed. The program tries to write as much as possible and blocks when it can't. The size of the buffer can be set to any arbitrary value by the program.

The buffering mode a command uses depends on the task it performs. For example, text files are almost always processed line by line, so text-based commands tend to use line buffering, ensuring that each processed line of text is displayed immediately.

Consider a program that reads the error logs of a web server and filters out the lines that refer to a certain error (for example, a failure of authentication). It would typically read an entire line, check if it met the criteria, and if so print the whole line at once. The output would never contain partial lines.

On the other hand, there are some commands that deal with binary data that is not divisible into lines—these use full buffering, so that transfer speed is maximized.

A few applications use completely unbuffered I/O, which is useful in some narrow situations. For example, some programs draw user interfaces on text screens using ANSI codes. In such cases, the display needs to be updated instantly in order to provide a usable interface, which unbuffered output allows. Another example is the SSH (secure shell) program, which lets a user access a command-line shell on another computer across the internet. Every keystroke the user types is instantaneously sent to the remote end, and the resultant output is sent back. Here, unbuffered I/O is essential for SSH to provide the feeling of interactivity.

Note

By default, the shell sets up the stderr of a command to be unbuffered, since error messages need to be displayed or logged immediately. stdout is set up to be line buffered when writing to the console, but fully buffered when being redirected to a file.
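If you ever need to override these defaults, GNU coreutils provides the stdbuf utility (assuming it is available on your system). The following sketch, using placeholder names, forces the stdout of a command to be line buffered even though it is being redirected to a file:

stdbuf -oL command >output.txt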

In the next section, which covers pipes, we will learn how shell pipelines connect multiple processes as a chain, each of which has its own buffers for stdin and stdout.

Exercise 9: Working with Command Redirection

We will now use input and output redirection with basic commands. After this exercise, you should be able to capture the output of any command to a file, or conversely feed a file into a command that requires input:

  1. Open the command-line shell, navigate to the data folder from Chapter 1, Introduction to the Command Line, and get a listing:

    robin ~ $ cd Lesson1/data

    robin ~/Lesson1/data $ ls -l

    total 16

    drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae

    drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae

    drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae

    drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae

  2. Redirect the standard output of ls into a file called dir.txt and use cat to view its contents:

    robin ~/Lesson1/data $ ls -l >dir.txt

    robin ~/Lesson1/data $ cat dir.txt

    total 16

    drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae

    -rw-r--r-- 1 robin robin 0 Aug 27 17:13 dir.txt

    drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae

    drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae

    drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae

    Note that when we print the contents of dir.txt, we can see an entry for dir.txt itself, with a size of zero. Yet obviously dir.txt is not empty, since we just printed it. This is a little confusing, but the explanation is simple. The shell first creates an empty dir.txt file, and then runs ls, redirecting its stdout to that file. ls gets the list of this directory, which at this point includes an empty dir.txt, and writes the list into its stdout, which in turn ends up as the content of dir.txt. Hence the contents of dir.txt reflect the state of the directory at the instant when ls ran.

  3. Next, run ls with a bogus directory, as shown in the following code, and observe what happens:

    robin ~/Lesson1/data $ ls -l nonexistent >dir.txt

    ls: cannot access 'nonexistent': No such file or directory

    robin ~/Lesson1/data $ ls -l

    total 16

    drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae

    -rw-r--r-- 1 robin robin 0 Aug 27 17:19 dir.txt

    drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae

    drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae

    drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae

    From the preceding output, we can observe that the error message arrived on the console, but did not get redirected into the file. This is because we only redirected stdout to dir.txt. Note that dir.txt is empty, because there was no data written to stdout by ls.

  4. Next, use > to create a file with the listing of the pinaceae folder, and then use >> to append the listing of the taxaceae folder to it:

    robin ~/Lesson1/data $ ls -l pinaceae/ >dir.txt

    robin ~/Lesson1/data $ ls -l taxaceae/ >>dir.txt

    robin ~/Lesson1/data $ cat dir.txt

    You will see the following output displayed on the console:

    Figure 2.1: Contents of dir.txt after append redirection
  5. Now, use 2> to redirect stderr to a file. Try the following command:

    robin ~/Lesson1/data $ ls -l nonexistent taxaceae 2>dir.txt

    taxaceae/:

    total 24

    drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 amentotaxus

    drwxr-xr-x 3 robin robin 4096 Aug 20 14:01 austrotaxus

    drwxr-xr-x 12 robin robin 4096 Aug 20 14:01 cephalotaxus

    drwxr-xr-x 3 robin robin 4096 Aug 20 14:01 pseudotaxus

    drwxr-xr-x 14 robin robin 4096 Aug 20 14:01 taxus

    drwxr-xr-x 9 robin robin 4096 Aug 20 14:01 torreya

    robin ~/Lesson1/data $ cat dir.txt

    ls: cannot access 'nonexistent': No such file or directory

    Note that only the error message on stderr got redirected into dir.txt.

  6. You can also redirect stderr and stdout to separate files, as shown in the following code:

    robin ~/Lesson1/data $ ls pinaceae nosuchthing >out.txt 2>err.txt

    Use the cat command to view the output of the two files:

    Figure 2.2: Contents of out.txt and err.txt showing independent redirection of stdout and stderr
  7. Alternatively, you can redirect both stdout and stderr to the same file. Try the following command:

    robin ~/Lesson1/data $ ls pinaceae nothing &>dir.txt

    You will see the following output if you view the contents of dir.txt with the cat command:

    Figure 2.3: Contents of the file with both stdout and stderr redirected

    Note

    The error message precedes the listing because ls reports any arguments it cannot access before printing the listings of the directories that do exist. Here, the error for nothing therefore appears before the listing of pinaceae.

  8. Now, let's use input redirection with the cat command. When passed the -n flag, it adds line numbers to each line. Type the following commands:

    robin ~/Lesson1/data $ cat -n <pinaceae/pinus/sylvestris/data.txt >numbered.txt

    robin ~/Lesson1/data $ less numbered.txt

  9. Run the cat command without any arguments, type out the lines of text, and finally use Ctrl + D to end the process:

    robin ~/Lesson1/data $ cat

    Hello

    Hello

    Bye

    Bye

    ^D

  10. Run cat in a similar fashion, but use a here document syntax, with the limit string DONE, as follows:

    robin ~/Lesson1/data $ cat <<DONE

    > This is some text

    > Some more text

    > OK, enough

    > DONE

    This is some text

    Some more text

    OK, enough

    Note

    Observe the difference between steps 9 and 10. In step 9, cat processes each line that is typed and prints it back immediately. This is because the TTY (which is connected to the stdin of cat) waits for the Enter key to be pressed before it writes the complete line into the stdin of cat. Thereupon, cat outputs that line, emptying its input buffer, and goes to sleep until the next line arrives. In step 10, the TTY is connected to the shell itself, rather than to the cat process. The shell is, in turn, connected to cat and does not send any data to it until the limit string is encountered, after which the entire text that was typed goes into cat at once.

  11. The bc command is an interactive calculator. We can use a herestring to make it do a simple calculation. Type the following to get the seventh power of 1234:

    robin ~/Lesson1/data $ bc <<< 1234^7

    4357186184021382204544

    When run directly, bc accepts multiple expressions and prints the result. In this case, the herestring is treated as a file's content by the shell and is passed into bc via stdin redirection.

  12. Finally, delete the temporary files that we created:

    robin ~/Lesson1/data $ rm *.txt

In this exercise, we learned how to redirect the input and output of shell commands to files. Using files as the input and output of commands is essential to performing more complex shell tasks.

Pipes

A shell pipeline or simply a pipeline refers to a construct where data is pushed from one command to another in an assembly line fashion. It is expressed as a series of commands separated by a pipe symbol |. These pipes connect the stdout of each command to the stdin of the subsequent command. Internally, a pipe is a special memory FIFO (first in, first out) buffer provided by the OS.

The basic syntax of a pipeline is as follows:

command1 | command2

Any number of commands can be linked:

command1 | command2 | command3 | command4

Pipelines are analogous to assembly lines in a factory. Just as an assembly line lets multiple workers simultaneously perform their designated jobs repeatedly, ending up with a finished product, a pipeline lets a series of commands work on a stream of data, each doing one task, eventually producing the desired output.

Pipelines ensure maximum throughput and optimal usage of computing power. The time taken for a pipeline task in most cases will be close to the time taken by the slowest command in it, and not the sum of the times for all the commands.

Note

While it is not a very common use case, you can pipe both stdout and stderr of one command into another command's stdin using the |& operator.

Exercise 10: Working with Pipes

In this exercise, we will explore the use of pipes to pass data between commands. Some new commands will be introduced briefly, which will be explained in more detail later:

  1. Open the command-line shell and navigate to the data folder from the first exercise:

    robin ~ $ cd Lesson1/data

    robin ~/Lesson1/data $

  2. The simplest use of pipes is to pipe the output of a command to less, as follows:

    robin ~/Lesson1/data $ tree | less

    You will get the following output:

    Figure 2.4: Partial output of the less command
  3. The tr command (which stands for translate) can change the contents of each line based on a rule. Here, we will be converting lowercase text to uppercase text. Type the following command to view the output of ls in uppercase:

    robin ~/Lesson1/data $ ls | tr '[:lower:]' '[:upper:]'

    CUPRESSACEAE

    PINACEAE

    PODOCARPACEAE

    TAXACEAE

    As a general guideline, we should use single quotes for string arguments, unless we have a reason to use other kinds.

  4. You can use multiple commands in the same pipeline. Pipe tree into tr (to convert to uppercase) and then into less for viewing, as shown here:

    robin ~/Lesson1/data $ tree | tr '[:lower:]' '[:upper:]' | less

    You will get the following output:

    Figure 2.5: Partial output of the less command
  5. You can also combine pipes and redirection. Redirect a file into sort and pipe its output to uniq, and then into less for viewing. The uniq command removes repeated lines (we will study it in more detail in the next topic):

    robin ~/Lesson1/data $ cd pinaceae/nothotsuga/longibracteata/

    robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ sort <data.txt | uniq | less

    The output is as follows:

    Figure 2.6: Partial output of the less command
  6. Create a similar pipeline by first using tr to convert text from the data.txt file to uppercase before sorting, and then redirect the output to the test.txt file. View the file with less and delete it afterward. Type the following commands:

    robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ tr '[:lower:]' '[:upper:]' <data.txt | sort | uniq >test.txt

    robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ less test.txt

    robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ rm test.txt

    You will get the following output:

    Figure 2.7: Partial output of the less command
  7. Navigate back to the data folder, then run ls with the -R flag (recursive list) and provide the current directory, as well as a non-existent directory, as arguments to make it generate a listing as well as an error message. The |& operator pipes both stdout and stderr into less:

    robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ cd -

    /home/robin/Lesson1/data

    robin ~/Lesson1/data $ ls -R . nosuchthing |& less

    Both the listing and the error message are visible in less. This would not be the case when using | rather than |&.

In this exercise, we have looked at how to build shell pipelines to stream data through multiple commands. Pipes are probably the single most important feature of shells and must be understood thoroughly. So far, we have studied the following:

  • Input and output streams
  • Redirecting them to and from files
  • Buffering
  • Use of pipes for connecting a series of commands
  • Combining both redirection and pipes

Redirection and piping are essentially like performing plumbing with data instead of water. One major difference is that this command pipelining does not have multiple branches in the flow, unlike plumbing. After a little practice, using pipes becomes second nature to a shell user. The fact that shell commands often work in parallel means that some data-processing tasks can be done faster with the shell than with specialized GUI tools. This topic prepares us for the next one, where we will learn a number of text-processing commands, and how to combine them for various tasks.

Text-Processing Commands

In the previous sections, we learned about the two main composing mechanisms of command lines: redirection and piping. In this topic, we will expand our vocabulary of commands that, when combined, let us do a wide variety of data-processing tasks.

We focus on text-processing because it applies to a wide range of real-life data, and will be useful to professionals in any field. A huge amount of data on the internet is in textual form, and text happens to be the easiest way to share data in a portable way as simple columnar CSV (comma-separated values) or TSV (tab-separated values) files. Once you learn these commands, you do not have to rely on the knowledge of any specific software GUI tool, and you can run complex tasks on the shell itself. In many cases, running a quick shell pipeline is much faster than setting up the data in a more complex GUI tool. In a later chapter, you will learn how to save your commands as a sort of automatic recipe, with shell scripts, which can be reused whenever you need, and shared with others too.

Note

Some of the commands we will learn now are quite complex and versatile. In fact, there are entire books devoted to them. In this book, we will only look at some of the most commonly used commands' features.

Shell Input Concepts

Before moving on to the commands, let's get familiar with some nuances of how the shell command line works when we need to input text.

Escaping

Often, we need to enter strings that contain special characters in the command line:

  • Characters that mean something to the shell, such as * ? & < > | " ' ( ) { } [ ] ! $
  • Characters that are not printable, such as spaces, tabs, newlines, and so on

If we were to use these characters directly, the shell would interpret them, causing undesirable behavior since each of these has a special meaning to the shell. To handle this, we use a mechanism called escaping (which is also used in most programming languages). Escaping involves prefixing with an escape character, which allows us to escape to a different context, where the meaning of a symbol is no longer interpreted but treated as a literal character.

On the command line, the backslash serves as the escape character. You need to simply prefix a backslash to enter a special character literally as part of a string. For example, consider the following command:

echo * *

Since the asterisk is a special character (a wildcard symbol), instead of printing two asterisks, the command will print the names of all the files and directories in the current directory, twice. To print the asterisk character literally, you need to write the following:

echo \* \*

The two asterisks have now been escaped, and will not be interpreted by the shell. They are printed literally, but the backslashes are not printed. This escaping with a backslash works for any character.

Escaping works with spaces, too. For example, the following code will pass the entire filename My little pony.txt to cat:

cat My\ little\ pony.txt

It should now be obvious that you can create filenames and paths that contain special characters, but in general, this is best avoided. The unintended consequences of special characters in filenames can be quite perplexing.

Quoting

The shell provides another means of typing a string with special characters. Simply enclose it with single quotes (strong quoting) or double quotes (weak quoting). For example, look at the following commands:

ls "one two"

ls 'one two'

ls one two

In the preceding commands, the first two pass a single argument to ls, whereas the third passes two arguments. The basic use of single and double quotes is as follows:

  • Single quotes tell the shell to consider everything within them literally, and not interpret anything, not even escaping. This means that a single quote itself cannot be embedded within a single quoted string. The string starts at the opening quote character and goes on until the closing quote. Any backslashes within are printed literally. Look at the example below:

    echo 'This is a backslash \'

    This is a backslash \

  • Double quotes, on the other hand, allow escape sequences within, so that you can embed a double quote within a double-quoted string. For example, look at the following command and its output:

    echo "This \" was escaped"

    This " was escaped

    Within double quotes, the shell treats everything as literal, except backslashes (escaping), exclamation marks (history expansion), dollar signs (shell expansions), and the backtick (command substitution). We will deal with the latter three special symbols and their effects in later chapters. A short example of quoting follows this list.
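As a short illustration (the strings here are arbitrary), a double quote can be embedded literally inside a single-quoted string, and a single quote inside a double-quoted string:

echo 'It said "hello" to us'

It said "hello" to us

echo "It's a nice day"

It's a nice day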

Yet another quoting mechanism provided by Bash is called dollar-single-quotes, which are described in the following section.

Escape Sequences

Escape sequences are a generalized way of inserting any character into a string. Normally, we can only type the symbols available on the keyboard, but escape sequences allow us to insert any possible character. These escape sequences only work in strings with dollar-single-quotes. Such strings are expressed by prefixing a dollar symbol to a single-quoted string, for example:

$'Hello'

Within such strings, the same rules as those for single quotes apply, except that escape sequences are expanded by the shell.

Note

When using escape sequences in arguments, if we use strings with dollar-single-quotes, the shell inserts the corresponding symbols before passing the argument to the command. However, some commands can understand escape sequences themselves. In such cases, we may notice that escape sequences are used within single-quoted strings. In these cases, the command itself expands them internally.

Escape sequences originated in the C programming language, and the Bash shell uses the same convention. An escape sequence consists of a backslash followed by a character or a special code. Now, let's look at the list of escape sequences and their meaning, classified into two categories:

The first category is literal characters:

  • \" produces a double-quote
  • \' produces a single-quote
  • \\ produces a backslash
  • \NNN produces the ASCII character whose value is the octal value NNN (one to three octal digits)
  • \xHH produces the ASCII character whose value is the hexadecimal value HH (one or two hex digits)
  • \uXXXX interprets XXXX as a hexadecimal number and prints the corresponding Unicode character in the range U+0000 to U+FFFF
  • \UXXXXXXXX interprets XXXXXXXX as a hexadecimal number and prints the corresponding Unicode character in the range U+10000 onward

When the ASCII code was initially standardized, it consisted of seven-bit numbers representing 128 symbols: a set of control codes, the uppercase and lowercase English letters, 10 digits, and assorted punctuation symbols. Later, ASCII was extended to eight bits (one byte), which added another 128 characters, and this remained the common standard for decades until the need for non-English text ushered in Unicode. Unicode defines over a hundred thousand characters, each identified by a numeric code point (commonly encoded as 16-bit or 32-bit values), covering the characters of almost all written languages of the world.

The second category of escape sequences are ASCII control characters:

In the original ASCII code, the symbols 0 to 31 (which are not printable characters) were control codes that were sent to a mechanical TTY. Traditionally, TTY terminals were like electric typewriters. To wrap around at the end of a line, the carriage had to return to its home position at the start of the line, and the platen had to rotate to advance the paper to the next line. These operations are called carriage return and line feed, with the mnemonics CR and LF, respectively. Similarly, tab, form feed, vertical tab, and so on referred to instructions to move the TTY's carriage and platen.

The following escape codes produce these control characters:

  • \a produces a terminal alert or bell. The system makes an audible beep when this character is printed to the console.
  • \b produces a backspace.
  • \e produces an ASCII 27 character. This is called an ESCAPE character, but it has no relation to the term escape that we are discussing here in the context of Bash.
  • \f produces a form feed.
  • \n produces a newline or linefeed.
  • \r produces a carriage return.
  • \t produces a horizontal tab.
  • \v produces a vertical tab.

The effect of printing these control characters to a console has a similar effect to what it would have on a mechanical TTY with a roll of paper—even the bell character, which activated an actual physical bell or beeper, is still emulated today on a modern shell console window.
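To see these sequences in action, we can combine them with the dollar-single-quotes syntax described earlier (the text here is arbitrary). The shell expands the \t and \n before echo prints the result:

echo $'Name\tValue\nfoo\t42'

Name    Value

foo     42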

An important practical detail to learn in this context is the difference between the DOS/Windows family of operating systems and UNIX-based operating systems in how they handle line endings in text files. Traditionally, UNIX always used only the LF character (ASCII 10) to represent the end of a line. If an LF is printed to a console on UNIX-like operating systems, the cursor moves to the start of the next line. In the DOS/Windows world, however, the end of a line is represented by two characters, CR LF. This means that when a text file is created on one OS family and opened or processed on the other, it may fail to display or process correctly, since the definition of what represents a line ending differs. These conventions came into being due to complex historical events, but we are forever stuck with them.

Interestingly, classic Apple Mac OS used the convention of only a CR, but thankfully, modern macOS and iOS are derived from FreeBSD Unix, so we don't have to deal with this third variety of line ending. The consequence of this is that our commands and scripts based on UNIX lineage may go haywire if they encounter text files created on the Windows OS family. If we need our scripts to work with data from all sources, we must take care of that explicitly.

For a file that originated on a Windows system, we must replace each \r\n (CR LF) sequence with a lone \n (LF) before processing it with commands that work on a line-by-line basis. We may also have to do the inverse before we transfer that file back to a Windows system. Most editors, web browsers, and other tools that deal with text files are smart enough to allow for the display or editing of both kinds of files properly.
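As a minimal sketch of this conversion, the tr command (covered later in this chapter) can delete the CR characters from a hypothetical file named windows.txt, leaving UNIX-style line endings in unix.txt; dedicated utilities such as dos2unix also exist on many systems:

tr -d '\r' <windows.txt >unix.txt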

Another application of escape sequences is to send special characters to the console, called ANSI codes, which can change the text color, move the cursor to arbitrary locations, and so on. ANSI codes are an extension of the TTY-based control codes for early video-based displays called virtual terminals. They were typically CRT screens that showed 80 columns and 25 lines of text. They are expressed as an ASCII 27 (ESCAPE) or \e, followed by several characters describing the action to be performed.

These ANSI codes work just the same even today and are useful for producing colorized text or crude graphical elements on the console, such as progress bars. Commands such as ls produce their colored output by the same mechanism by simply printing out the right ANSI codes.

We will not cover the details of ANSI codes and their usage, as it is beyond the scope of this book.

Multiline Input

The final feature of shell input we will discuss is multiline commands and multiline strings. When entering an extremely long command line, for readability's sake, we might like to split it into multiple lines. We can achieve this by typing a single backslash at any time and pressing the Enter key. Instead of executing the command, the shell prompts for the continuation of the command on the next line. Look at the example below:

robin ~ $ echo this is a very long command, let me extend \

> to the next line and then \

> once again

this is a very long command, let me extend to the next line and then once again

The backslash must be the last character of the line for this to work, and a command can be divided into as many lines as desired, with the line breaks having no effect on the command.

In a similar fashion, we can enter a literal multiline string containing newlines, simply by using quotes. Although they appear like multiline commands, multiline strings do not ignore the newlines that are typed. The rules for single and double quoted strings described earlier apply for multiline strings as well. For example:

robin ~ $ echo 'First line

> Second line

> Last line'

First line

Second line

Last line

Filtering Commands

Commands of this category operate by reading the input line by line, transforming it, and (optionally) producing an output line for each input line. They can be considered analogous to a filtering process.

Concatenate Files: cat

The cat command is primarily meant for concatenating files and for viewing small files, but it can also perform some useful line-oriented transformations on the input data. We have used cat before, but there are some options it provides that are quite useful. The long and short versions of some of these options are as follows:

  • -n, --number: Numbers output lines. The numbers are padded with spaces and followed by a tab.
  • -b, --number-nonblank: Numbers nonempty output lines, and overrides -n.
  • -s, --squeeze-blank: Removes repeated empty output lines.
  • -E, --show-ends: Displays $ at the end of each line.
  • -T, --show-tabs: Displays tab characters as ^I.

Among the preceding options, the numbering options are particularly useful.
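For instance, the following sketch (notes.txt is a placeholder filename) numbers every line of a file while squeezing runs of blank lines down to a single blank line:

cat -s -n notes.txt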

Translate: tr

The tr command works like a translator, reading the input stream and producing a translated output according to the rules specified in the arguments. The basic syntax is as follows:

tr SET1 SET2

This translates characters from SET1 into corresponding ones from SET2.

Note

The tr command always works only on its standard input and does not take an argument for an input file.

The basic uses of tr are selected with the following command-line flags (examples follow the list of character classes below):

  • No flag: Replaces each character that belongs to the first set with the corresponding character from the second set.
  • -d or --delete: Removes all characters that belong to a given set.
  • -s or --squeeze-repeats: Elides repeated occurrences of any character that belongs to a given set, leaving only one occurrence.
  • -c: A modifier that uses the complement of the first set.

The character sets passed to tr can be passed in various ways:

  • With a list of characters written as a string, such as abcde
  • As a range such as a-z or 0-9
  • As multiple ranges such as a-zA-Z
  • As one of the following special character classes that consist of an expression in square brackets (only the most common are listed here):

    (a) [:alnum:] for all letters and digits

    (b) [:alpha:] for all letters

    (c) [:blank:] for all horizontal whitespaces

    (d) [:cntrl:] for all control characters

    (e) [:digit:] for all digits

    (f) [:graph:] for all printable characters, not including space

    (g) [:lower:] for all lowercase letters

    (h) [:print:] for all printable characters, including space

    (i) [:punct:] for all punctuation characters

    (j) [:space:] for all horizontal or vertical whitespaces

    (k) [:upper:] for all uppercase letters

    (l) [:xdigit:] for all hexadecimal digits

    Note

    Character classes are used in many commands, so it's useful to remember the common ones.
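As a quick sketch of the -d and -s modes together with character classes (the input strings are arbitrary), the first command deletes every digit, and the second squeezes runs of whitespace into a single space:

echo 'a1b2c3' | tr -d '[:digit:]'

abc

echo 'too   many   spaces' | tr -s '[:blank:]'

too many spaces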

Stream Editor: sed

The sed command is a very comprehensive tool that can transform text in various ways. It could be considered a mini programming language in itself. However, we will restrict ourselves to using it for the most common function: search and replace.

sed reads from stdin and writes transformed output to stdout based on the rules passed to it as an argument. In its basic form for replacing text in the stream, the syntax that's used is shown here:

sed 'pattern'

Here, pattern is a string such as s/day/night/FLAGS, which consists of several parts. In this example:

  • s is the operation that sed is to perform. s stands for substitute.
  • / is the delimiter which indicates that everything after this until the next delimiter is to be treated as one string.
  • day is the string that sed searches for.
  • / again is a delimiter, indicating the end of the search string.
  • night is the string that sed should replace the search string with.
  • / is again a delimiter, indicating the end of the replacement string.
  • FLAGS is an optional list of characters that modify how the search and replace is done. The most common characters are as follows:

    (a) g stands for global, which tells sed to replace all matches of the search string (the default behavior is to replace only the first).

    (b) i stands for case-insensitive, which tells sed to ignore case when matching.

    (c) A number, N, specifies that the Nth match alone should be replaced. Combining the g flag with this specifies that all matches including and after the Nth one are to be replaced.

The delimiter is not mandated to be the / character. Any character can be used, as long as the same one is used at all three locations. Thus, all the following patterns are equivalent:

's#day#night#'

's1day1night1'

's:day:night:'

's day night '

'sAdayAnightA'

Multiple patterns can be combined into one by separating them with a semicolon. For instance, the following pattern tells sed to replace day with night and long with short:

's/day/night/ ; s/long/short/'

Character classes can be used for the search string, but they need to be enclosed in an extra pair of square brackets. The reason for this will be apparent when we learn regular expressions in a later chapter.

The following pattern tells sed to replace all alphanumeric characters with an asterisk symbol:

's/[[:alnum:]]/*/g'
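Putting this together, here is a minimal sketch (the input sentence is arbitrary) that applies two substitutions in one invocation:

echo 'a long day at work' | sed 's/day/night/ ; s/long/short/'

a short night at work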

Cut Columns: cut

The cut command interprets each line of its input as a series of fields and prints out a subset of those fields based on the specified flags. The effect of this is to select a certain set of columns from a file containing columnar data.

The following is a partial list of the flags that can be used with cut:

  • -d DELIM, --delimiter=DELIM: Uses DELIM as the field delimiter (the default is the TAB character).
  • -b LIST, --bytes=LIST: Selects only specified bytes.
  • -f LIST, --fields=LIST: Selects only the fields specified by LIST and prints any line that contains no delimiter character, unless the -s option is specified.
  • -s, --only-delimited: Does not print lines not containing delimiters.
  • --complement: Complements the set of selected bytes, characters, or fields.
  • --output-delimiter=DELIM: When printing the output, DELIM is used as the field delimiter. By default, it uses the input delimiter.

Here, the syntax of LIST is a comma-separated list of one or more of the following expressions (M and N are numbers):

  • N: The Nth element is selected
  • M-N: Elements starting from the Mth up to Nth inclusive are selected
  • M-: Elements starting from the Mth up to the last element are selected
  • -N: Elements from the beginning up to the Nth inclusive are selected

Let's look at an example of using cut. The sample data for this chapter includes a file called pinaceae.csv, which contains a list of tree species as comma-separated fields, with some values empty. It looks like this (only a few lines are shown):

Figure 2.8: View of the first few lines of the data from the pinaceae.csv file

Here, cut is used to extract data from the third column onward, using the comma character as the delimiter, and display the output with tabs as a delimiter (only a few lines are shown):

robin ~/Lesson2 $ cut -s -d',' -f 3- --output-delimiter=$'\t' pinaceae.csv | less

The output is as follows:

Figure 2.9: Partial output of the cut command

Note the usage of dollar-single-quotes to pass in the tab character to cut as a delimiter.

Paste Columns from Files Together: paste

paste works like the opposite of cut. While cut can extract one or more columns from a file, paste combines files that have columnar data. It does the equivalent of pasting a set of columns of data side by side in the output. The basic syntax of paste is as follows:

paste filenames

This instructs paste to read a line from each of the specified files and produce a line of output that combines those lines, delimited by tab characters. Think of it as pasting the files side by side in columns.

The paste command has one option that is commonly used:

  • -d DELIMS, --delimiters=DELIMS: Uses DELIMS as field delimiters (the default is the tab character)

    DELIMS specifies individual delimiters for each field. For example, if it is set to XYZ, then X, Y, and Z are used as the delimiters after the first, second, and third columns, respectively.

Since paste works with multiple input files, typically it is used on its own without pipes, because we can only pipe one stream of data into a command.

A combination of cut and paste can be used to reorder the columns of a file by first extracting the columns to separate files with cut, and then using paste to recombine them.
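For instance, here is a sketch of that technique, assuming a hypothetical two-column CSV file named data.csv whose columns we want to swap:

cut -d, -f2 data.csv >col2.txt

cut -d, -f1 data.csv >col1.txt

paste -d, col2.txt col1.txt >swapped.csv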

Globally Search a Regular Expression and Print: grep

grep is one of the most useful and versatile tools on UNIX-like systems. The basic purpose of grep is to search for a pattern within a file. This command is so widely used that grep is listed in the Oxford English Dictionary as a verb meaning to search.

Note

A complete description of grep would be quite overwhelming. In this book, we will instead focus on the smallest useful subset of its features.

The basic syntax of grep is as follows:

grep pattern filenames

The preceding command instructs grep to search for the specified pattern within the files listed as arguments. This pattern can be any string or a regular expression. Also, multiple files can be specified as arguments. Omitting the filename argument(s) makes grep read from stdin, as with most commands.

The default action of grep is to print out the lines that contain the pattern. Here is a list of the most commonly used flags for grep:

  • -i, --ignore-case: Matches lines case-insensitively
  • -v, --invert-match: Selects non-matching lines
  • -n, --line-number: For every match, shows the line number in the file as a prefix
  • -c, --count: Only prints the number of matches per file
  • -w, --word-regexp: Only matches a pattern if it appears as a complete word
  • -x, --line-regexp: Only matches a pattern if it appears as a complete line
  • --color, --colour: Displays results in color on the terminal (no effect will be observed if the output is not to a TTY console)
  • -L, --files-without-match: Only shows the names of files that do not have a match
  • -l, --files-with-matches: Only shows the names of files that have a match
  • -m NUM, --max-count=NUM: Stops after NUM matching lines
  • -A NUM, --after-context=NUM: Prints NUM lines that succeed each matching line
  • -B NUM, --before-context=NUM: Prints NUM lines that precede each matching line
  • -C NUM, --context=NUM: Prints NUM lines that precede as well as NUM lines that succeed each matching line
  • --group-separator=STRING: When -A, -B, or -C are used, print the string instead of --- between groups of lines
  • --no-group-separator: When -A, -B, or -C are in use, do not print a separator between groups of lines
  • -R: Search all files within a folder recursively

For an example of how grep works, we will use the man command (which stands for manual), since it's a handy place to get a bunch of English text as test data. The man command outputs the built-in documentation for any command or common terminology. Try the following command:

man ascii | grep -n --color 'the'

Here, we ask man to show the manual page for ascii, which includes the ASCII code and some supplementary information. The output of that is piped to grep, which searches for the string "the" and prints the matching lines as numbered and colorized:

Figure 2.10: A screenshot displaying the output of the grep command

man uses the system pager (which is less) to display the manual, so the keyboard shortcuts are the same as less. The output that man provides for a command is called a man page.

Note

Students are encouraged to read man pages to learn more about any command; however, the material is written in a style more suited for people who are already quite used to the command line, so watch out for unfamiliar or complex material.

Print Unique Lines: uniq

The basic function of the uniq command is to remove duplicate lines in a file. In other words, all the lines in the output are unique. The commonly used options of uniq are as follows:

  • -d, --repeated: Prints the lines that occur more than once, but only prints those lines once.
  • -D: Prints every occurrence of a line that occurs more than once.
  • -u, --unique: Only prints unique lines; does not print lines that have any duplicates.
  • -c, --count: Shows the number of occurrences for each line at the start of the line.
  • -i, --ignore-case: Compares lines case-insensitively.
  • -f N, --skip-fields=N: Avoids comparing the first N fields.
  • -s N, --skip-chars=N: Avoids comparing the first N characters.
  • -w N, --check-chars=N: Compares only N characters in lines.

As you can see, uniq has several modes of operation, apart from the default, and can be used in many ways to analyze data.

Note

Note that uniq only detects duplicate lines when they are adjacent, so the input typically needs to be sorted first for it to work correctly.
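For example, assuming a hypothetical file names.txt containing repeated names, sorting it first and piping it through uniq -c shows how many times each name occurs:

sort names.txt | uniq -c

      2 alice

      1 bob

      3 carol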

Exercise 11: Working with Filtering Commands

In this exercise, we will walk through some text-processing tasks using the commands we learned previously. The test data for this chapter contains three main datasets (available publicly on the internet):

  • Records of the percentage of land area that was agricultural, in every country (and region) for 1961-2015, with about 12,000 rows
  • Records of the population of every country (and region) for 1961-2015, with about 14,000 rows
  • Payroll records of public workers in NYC for the year 2017, with about 560,000 rows

These datasets are large enough to demonstrate how well the shell can deal with big data. It is possible to efficiently process files of many gigabytes on the shell, even on limited hardware such as a small laptop. We will first do some simple tasks with the data from earlier chapters and then try some more complex commands to filter the aforementioned data.

Note

Many commands in this exercise and the ones to follow print many lines of data, but we will only show a few lines here for brevity's sake.

  1. Open the command-line shell and navigate to the data folder from the first exercise:

    robin ~ $ cd Lesson1/data/

    robin ~/Lesson1/data $

  2. Use cat to number the lines of ls output, as follows:

    robin ~/Lesson1/data $ ls -l | cat -n

      1 total 16

      2 drwxr-xr-x 36 robin robin 4096 Sep 5 15:49 cupressaceae

      3 drwxr-xr-x 15 robin robin 4096 Sep 5 15:49 pinaceae

      4 drwxr-xr-x 23 robin robin 4096 Sep 5 15:49 podocarpaceae

      5 drwxr-xr-x 8 robin robin 4096 Sep 5 15:49 taxaceae

  3. Use tr on the output of ls to transform it into uppercase using the range syntax:

    robin ~/Lesson1/data $ ls | tr 'a-z' 'A-Z'

    CUPRESSACEAE

    PINACEAE

    PODOCARPACEAE

    TAXACEAE

  4. Use tr to convert only vowels to their uppercase form:

    robin ~/Lesson1/data $ ls | tr 'aeiou' 'AEIOU'

    cUprEssAcEAE

    pInAcEAE

    pOdOcArpAcEAE

    tAxAcEAE

  5. Navigate to the folder ~/Lesson2 which contains the test data for this chapter:

    robin ~/Lesson1/data $ cd

    robin ~ $ cd Lesson2

  6. The land.csv file contains the historical records we mentioned previously. View this file with less to understand its format:

    robin ~/Lesson2 $ less land.csv

    The file is in CSV format. The first line describes the field names, and the remaining lines contain data. Here is what the file looks like:

    Country Name,Country Code,Year,Value

    Arab World,ARB,1961,30.9442924784889

    Arab World,ARB,1962,30.9441456790578

    Arab World,ARB,1963,30.967119790024

  7. Use grep to select the data for Austria, as follows:

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv

    Austria,AUT,1961,43.0540082344393

    Austria,AUT,1962,42.7585371760717

    Austria,AUT,1963,42.2596270283362

    As a general rule, we use the -w flag with grep when looking for data that is in columnar form. This ensures that the search term is matched only if it is the entire field, otherwise it may match a substring of a field. It is still possible that this will match something like "Republic Of Austria". In this case, we know that "Austria" is always written as a single word, so it works. To handle such ambiguities, we can use regular expressions (described in the next chapter) to exactly specify the matching logic.

    We have used input redirection to grep instead of just passing the file as an argument. This is only because we are emphasizing the use of redirection and piping. The command would work exactly the same if we passed land.csv as an argument instead of redirecting it.

  8. Select the data that is after the year 2000 by using grep again to look for the lines with "20". Use the following code (the complete output is not shown):

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep '20'

    Austria,AUT,1971,40.2143376120126

    Austria,AUT,1978,38.5178009203197

    Austria,AUT,1979,38.0588520222814

    Austria,AUT,1981,38.2102203923468

    Austria,AUT,1983,36.6420440784694

    Austria,AUT,1984,36.7304432065876

    Austria,AUT,1992,36.4858319205619

    Austria,AUT,2000,35.604262533301

    Austria,AUT,2001,35.312424315815

  9. Note that we still got many lines for years before 2000. This is because some of the values for percentage have the string "20" in them. We can work around this by searching for ",20" instead (only a few lines of output are shown here):

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep ',20'

    Austria,AUT,2000,35.604262533301

    Austria,AUT,2001,35.312424315815

    Austria,AUT,2002,35.1441026883023

    Austria,AUT,2003,34.9386049891015

    We used this slightly hack-ish approach to filter data by searching for ",20". In this case, it worked because none of the percentage values started with "20". However, that is not true for the data in many other countries, and this would have included rows we did not need.

    Ideally, we should use some other method to extract these values. The best general option is to use the awk tool. awk is a general text-processing language, which is too complex to cover in this brief book. The students are encouraged to learn about that tool, as it is extremely powerful.

    We can also use grep with a simple regular expression to match four digits starting with "20". This will not be subject to the issue of matching wrong lines (partial output is shown here):

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]'

    Austria,AUT,2000,35.604262533301

    Austria,AUT,2001,35.312424315815

    Austria,AUT,2002,35.1441026883023

    Austria,AUT,2003,34.9386049891015

    The -w flag matches an entire word, and the expression '20[[:digit:]][[:digit:]]' matches 20, followed by two digits. We use character classes, which were described earlier in this topic. We will learn regular expressions in more detail in the next chapter, but for the remaining examples and activities, use this regular expression to match the years for 2000 onward.

  10. Next, use cut to get rid of the second column, which has the country code, as follows:

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' | cut -d, --complement -f2

    Austria,2000,35.604262533301

    Austria,2001,35.312424315815

    Austria,2002,35.1441026883023

    Austria,2003,34.9386049891015

    We used the --complement flag to specify that we want everything except field #2. The input field delimiter was set to the comma character.

  11. You can pass the --output-delimiter flag to cut to make the output more readable. Set the output delimiter to the tab character using dollar-single-quotes and an escape sequence:

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' | cut -d, --complement -f2 --output-delimiter=$'\t'

    Austria 2000 35.604262533301

    Austria 2001 35.312424315815

    Austria 2002 35.1441026883023

    Austria 2003 34.9386049891015

    Note

    When a tab character is printed on the console, it does not print a fixed number of spaces. Instead, it tries to print as many spaces as required to reach the next column that is a multiple of the tab width (usually 8). This means that when you print a file with tabs as delimiters, it can appear to have an arbitrary number of spaces between fields. Even though the number of spaces printed varies, there is only one tab character between the fields.
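
    If you want to confirm that there really is a single tab between the fields, one way (assuming a GNU cat, whose -T flag displays each tab as ^I) is to append cat -T to the pipeline. Only the first line of output is shown here:

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' | cut -d, --complement -f2 --output-delimiter=$'\t' | cat -T

    Austria^I2000^I35.604262533301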

  12. Use cut to separate out the year and percentage columns into two files. We have to run it twice. We will first redirect the intermediate results from grep to a temporary file, then extract the column with the year, and then with the percentage:

    robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' >austria.txt

    robin ~/Lesson2 $ cut -d, -f3 <austria.txt >year.txt

    robin ~/Lesson2 $ cut -d, -f4 <austria.txt >percent.txt

  13. Now, let's join the columns back using paste, with percentage first and year second, using a space as a delimiter:

    robin ~/Lesson2 $ paste -d' ' percent.txt year.txt

    The output will appear as follows:

    Figure 2.11: A screenshot displaying the data with percentage first and year second
  14. Sort the preceding command's output to see the data ordered by percentage:

    robin ~/Lesson2 $ paste -d' ' percent.txt year.txt | sort -n

    The output will appear as follows:

    Figure 2.12: A screenshot displaying the data ordered by percentage

    We will examine the sort command in detail in the next section.

  15. The payroll.tsv file has data about public workers in NYC in 2017. View the file and observe that the third and fourth columns contain the last and first names of the workers, respectively. Extract those two columns with cut. You can also remove the first line, which has the field descriptions "Last Name" and "First Name", with grep, and finally sort the result:

    robin ~/Lesson2 $ less payroll.tsv

    robin ~/Lesson2 $ cut -f3,4 payroll.tsv | grep -v 'First' | sort >names.tsv

    Note

    grep is not quite the right tool for this job because it will end up processing every line of the file looking for the string "First", which is unnecessary work. What we really need is just to skip the first line. This can be done with the tail command, which we will learn about later.

  16. The wc command can be used to count the lines in a file. Let's use it to see how many workers are listed:

    robin ~/Lesson2 $ wc -l names.tsv

    562266 names.tsv

  17. Let's use uniq to see how many distinct names exist:

    robin ~/Lesson2 $ uniq names.tsv | wc -l

    410141

  18. What about people who do not share their name with anyone? Pass -u to uniq to get only those names that occur once:

    robin ~/Lesson2 $ uniq -u names.tsv | wc -l

    301172
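
    To see the difference between the default behavior of uniq and the -u flag, here is a tiny made-up input (an illustrative sketch only):

    robin ~/Lesson2 $ printf 'ALICE\nALICE\nBOB\n' | uniq

    ALICE

    BOB

    robin ~/Lesson2 $ printf 'ALICE\nALICE\nBOB\n' | uniq -u

    BOB

    By default, uniq collapses repeated adjacent lines into one, whereas -u drops every line that occurs more than once.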

In this exercise, we have learned some commands that work on a line-by-line basis on files. Next, we will examine commands that work on entire files at a time.

Transformation Commands

Commands in this category read more than one line at a time, often the entire input, before producing any output. They can be thought of as transforming the content as a whole.

Sort Lines: sort

The sort command orders the lines of a file (we have seen this in some exercises before). Here is the list of the commonly used options for sort:

  • -o FILE, --output=FILE: Writes output to FILE rather than to stdout.

    Since sort reads its entire input into memory before writing anything, this flag can be used to sort a file in place. Always specify this as the first option if it is being used, since some systems mandate it. This flag should not be used with -m since merging is a different operation altogether and happens line by line.

  • -b, --ignore-leading-blanks: Ignores blanks that occur at the start of a line.
  • -f, --ignore-case: Does a case-insensitive sort.
  • -i, --ignore-nonprinting: Ignores unprintable characters.
  • -n, --numeric-sort: Treats numbers as values rather than as strings.

    When compared as strings (lexicographic ordering), "00" compares by default as greater than "0". With this flag specified, the text is interpreted as a number, so both of the preceding strings are treated as being equal.

  • -r, --reverse: Sorts in descending order.
  • -s, --stable: Uses a stable sort.

    Stable sorts preserve the relative order of lines that compare as equal.

  • -u, --unique: Removes all but the first line of lines that compare equally.

    This flag can perform the same function as the default behavior of uniq, so it makes sense to skip piping into uniq separately in such cases.

  • -m, --merge: Merges sorted files.

    Merging uses an efficient algorithm to take the data from already sorted files and merge it into a combined sorted output. Do not use the -o flag if using this. The merge flag can save a lot of time in certain cases, such as when adding data to an already existing large sorted file and maintaining its sorted property.

  • -t SEP, --field-separator=SEP: Treats SEP as the field delimiter (useful with -k).
  • -k KEYDEF, --key=KEYDEF: Uses the given field as the sort key.

    KEYDEF is a string consisting of two field numbers separated by a comma, for example, 3,5. This instructs sort to use fields 3, 4, and 5 for sorting. The second number is optional; if it is omitted, the key extends from the given field to the end of the line. Either field number can have an optional period followed by a second number, specifying the character position from which to consider the field. For example, -k3.2,3.4 uses the second to fourth characters of the third field. This option can be specified multiple times to specify a secondary key, and so on.

    Note

    Sorting is an operation that requires the entire file content to be processed before it can output anything. This means that a sort command inside a pipeline essentially prevents any further steps in the pipeline from happening until the sort is finished.

    Hence, we should be careful about the performance implications of using sort and other commands that work on complete files rather than line by line. Usually, the rule of thumb is to save the sorted data to a file if the amount of data is large. Repeatedly sorting a large file in each individual command will be slow. Instead, we can reuse the sorted data.

Remember that the sort command uses lexicographic ordering based on the ASCII code (use the man ascii command to view it). This has consequences for how punctuation, whitespace, and other special characters affect the ordering. In most cases, however, the data we process is only text and numbers, so this should not be a frequent problem.
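
Here is a quick sketch of -t, -k, and -n working together on the land.csv dataset from the exercises (the output filename is only an example):

sort -t, -n -k3,3 land.csv >land-by-year.txt

This treats the comma as the field separator and orders the rows numerically by the third field, the year. Likewise, sort -u can replace a separate uniq step; for instance, sort -u names.tsv | wc -l from the previous exercise would report the same count of distinct names.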

Now, let's look at an example where we can use the sort command. Say we receive huge daily event log files from different data centers, say denver.txt, dallas.txt, and chicago.txt. We need to maintain a complete sorted log of the entire year's history too, for example, in a file called 2018-log.txt.

One method would be to concatenate everything and sort it (remember that 2018-log.txt already exists and contains a huge amount of sorted data):

cat denver.txt dallas.txt chicago.txt >>2018-log.txt

sort -o 2018-log.txt 2018-log.txt

This would end up re-sorting most of the data that was already sorted in 2018-log.txt and would be very inefficient.

Instead, we can first sort the three new files and merge them. We need to use a temporary file for this since we cannot use the -o option to merge the file in-place:

sort -o denver.txt denver.txt

sort -o dallas.txt dallas.txt

sort -o chicago.txt chicago.txt

sort -m denver.txt dallas.txt chicago.txt 2018-log.txt >2018-log-tmp.txt

Once we have the results in the temporary file, we can move the new file onto the old one, overwriting it:

mv 2018-log-tmp.txt 2018-log.txt

Unlike sorting, merging sorted files is a very efficient operation, taking very little memory or disk space.

Print Lines at the End of a File: tail

The tail command prints the specified number of lines at the end of a file. In other words, it shows the tail end of a file. The useful options for tail are as follows:

  • -n N, --lines=N: Outputs the last N lines

    If N is prefixed with a + symbol, then the console shows all lines from the Nth line onward. If this argument is not specified, it defaults to 10.

  • -q, --quiet, --silent: Does not show the filenames when multiple input files are specified
  • -f, --follow: After showing the tail of the file, the command repeatedly checks for new data and shows any new lines that have been added to the file

The most common use of tail other than its default invocation is to follow logfiles. For example, an administrator could view a website's error log that is being continuously updated and that has new lines being appended to it intermittently with a command like the following:

tail -f /var/log/httpd/error_log

When run this way, tail shows the last 10 lines, but the program does not exit. Instead, it waits for more lines of data to arrive and prints them too, as they arrive. This lets a system administrator have a quick view of what happened recently in the log.

The other convenient use of tail is to skip over a given number of lines in a file. Recall step 15 of the previous exercise, where we used grep to skip the first line of a file. We could have used tail instead.
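
For instance, the name-extraction command from that step could be written with tail instead of grep, which should produce the same names.tsv here (a sketch of the alternative):

cut -f3,4 payroll.tsv | tail -n+2 | sort >names.tsv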

Print Lines at the Head of a File: head

The head command works like the inverse of tail. It prints a certain number of lines from the head (start) of the file. The useful options for head are as follows:

  • -n N, --lines=N: Outputs the first N lines

    If N is prefixed with a - symbol, then the console shows all lines except the last N lines. If this argument is not specified, it defaults to 10.

  • -q, --quiet, --silent: Does not show the filenames when multiple input files are specified

The head command is commonly used to sample the content of a large file. In the previous exercise, we examined the payroll file to observe how the data was structured with less. Instead, we could have used head to just dump a few lines to the screen.

Combining head and tail in a pipeline can be useful to extract any contiguous range of lines from a file. For example, look at the following code:

tail -n+100 data.txt | head -n45

This command would print the lines 100 to 144 from the file.

Join Columns of Files: join

join performs a complex data-merging operation on two files. The nearest analogy is a database query that joins two tables; if you are familiar with databases, you will recognize the operation. In any case, the best way to describe join is with an example.

Let's assume we have two files specifying the ages of people and the countries to which they belong. The first file is as follows:

Alice 25

Charlie 34

The second file is as follows:

Alice France

Charlie Spain

The result of applying join on these files is as follows:

Alice 25 France

Charlie 34 Spain

join requires the input files to be sorted to work. The command provides a plethora of options of which a small subset is described here:

  • -i, --ignore-case: Compares fields case-insensitively
  • -1 N: Uses the Nth field of file 1 to join
  • -2 N: Uses the Nth field of file 2 to join
  • -j N: Uses the Nth field in both files to join
  • -e EMPTY: When an input field is missing, this option replaces it with the string specified in EMPTY instead
  • -a N: Apart from paired lines, it prints any lines from file N that do not have a pair in the other file
  • -v N: Prints any lines from file N that do not pair but does not print the normal output
  • -t CHAR: Uses CHAR as input and output field separator
  • -o FORMAT: Prints the field's values as per the specified FORMAT

    FORMAT consists of one or more field specifiers separated by commas. Each field specifier can be either 0, or a number M.N where M is the file number and N is the field number. 0 represents the join field. For example, 1.3,0,2.2 means "Third field of file 1, join field, and second field of file 2". FORMAT can also be auto, in which case join prints out all the joined fields.

  • --check-order: Checks if input files are sorted.
  • --nocheck-order: Does not check if input files are sorted.
  • --header: Treats the first line in each file as a header and prints them directly.

join works by reading a line from each file and checking if the values of the join columns match. If they match, it prints the combined fields. By default, it uses the first field to join.
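
For instance, if the two age and country files shown above were named ages.txt and countries.txt (hypothetical names), joining them on the default first field and reordering the output with -o would look something like this:

robin ~ $ join -o 1.2,0,2.2 ages.txt countries.txt

25 Alice France

34 Charlie Spain

The specifier 1.2,0,2.2 prints the age from the first file, then the join field (the name), and then the country from the second file.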

With the -e and -a flags specified, it can perform an outer join in database terminology. For example, look at the following snippet:

robin ~ $ cat a.txt

Name Age

Alice 25

Charlie 34

robin ~ $ cat b.txt

Name Country

Alice France

Bob Spain

robin ~ $ join --header -o auto -e 'N/A' -a2 -a1 a.txt b.txt

Name Age Country

Alice 25 France

Bob N/A Spain

Charlie 34 N/A

There was no data about Charlie's country and Bob's age, but a row was output for everyone containing all the data that was available, and N/A where it was not. The -a1 flag tells join to include rows from file 1 that do not have a row in file 2. This brings in a row for Charlie. The -a2 flag tells join to include rows from file 2 that do not have a row in file 1. This brings in a row for Bob. The -e flag tells join to add a placeholder, N/A, for missing values. The -o auto flag is necessary for this outer join operation to work as shown; it ensures that all the columns from both files are included in every row in the output.

In the default mode of operation (without -a) we get an inner join, which would have skipped any rows for which there is no corresponding match in both files, as shown here:

robin ~ $ join --header a.txt b.txt

Name Age Country

Alice 25 France

In the preceding output, note that only Alice's row was printed, since data about her exists in both files, but not any of the others.

The combination of join, sort, and uniq can be used to perform all the mathematical set operations on two files, such as disjunction, intersection, and so on. join can also be (mis)used to reorder columns of a file by joining a file with itself if it has a column with all distinct values (line numbers, for example).
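
As a sketch of the set-operation idea, suppose list1.txt and list2.txt (hypothetical files) each contain one word per line. Their intersection, that is, the words common to both, could be computed like this:

sort -u list1.txt >s1.txt

sort -u list2.txt >s2.txt

join s1.txt s2.txt

Since each line consists of a single field, join simply prints the lines whose key appears in both sorted files.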

Output Files in Reverse: tac

tac is used to reverse a file line by line. This is especially useful for quickly reversing the order of already sorted files without re-sorting them. Since tac needs to be able to reach the end of the input stream and move backward to print in reverse, tac will stall a pipeline until it gets all the piped data input, just like sort. However, if tac is provided an actual file as input, it can directly seek to the end of the file and start working backward.

The common options for tac are as follows:

  • -s SEP, --separator=SEP: Uses SEP as the separator to define the chunks to be reversed. If this flag is not specified, the newline character is assumed to be a separator.
  • -r, --regex: Treats the separator string as a regular expression.

The most common use of tac is to reverse the lines of a file. Reversing a file's words or characters can also be done by using the -s and -r flags, but this use case is rare.

Get the Word Count: wc

The wc command can count the lines, words, characters, or bytes in a file. It can also report the length of the longest line in a file.

The common flags for wc are as follows:

  • -l, --lines: Shows the newline count (number of lines)
  • -w, --words: Shows the word count
  • -m, --chars: Shows the character count
  • -c, --bytes: Shows the byte count (this may differ from the character count for Unicode input)
  • -L, --max-line-length: Shows the length of the longest line
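
As a quick sketch of these flags on a tiny made-up input:

printf 'hello world\nfoo\n' | wc -l -w -m

This reports 2 lines, 3 words, and 16 characters. Note that wc always prints the counts in a fixed order (lines, words, characters), regardless of the order in which the flags are given.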

Exercise 12: Working with Transformation Commands

In this exercise, we will continue where we left off from the previous exercises. We will do some more data extraction from the geographical datasets, and then work with the payroll data:

  1. View a few lines of the population.csv file with head:

    robin ~/Lesson2 $ head -n5 population.csv

    Country Name,Country Code,Year,Value

    Arab World,ARB,1960,92490932

    Arab World,ARB,1961,95044497

    Arab World,ARB,1962,97682294

    Arab World,ARB,1963,100411076

    This has a similar format to the land.csv file we used earlier.

  2. Let's use join to merge these two datasets together so that we can see the population and land area of each country by year. join can only use one column when matching two lines, but here we must join on two columns. To work around this, we need to transform the data so that the two columns we need are physically conjoined. So, first, let's extract columns 1, 3, and 4 into separate files with cut:

    robin ~/Lesson2 $ cut -f1 -d, population.csv >p1.txt

    robin ~/Lesson2 $ cut -f3 -d, population.csv >p3.txt

    robin ~/Lesson2 $ cut -f4 -d, population.csv >p4.txt

    Note that we type three commands that are almost the same. In a future chapter, we will learn how to do these kinds of repetitive operations with less effort.

  3. Next, let's paste them back together with two different delimiters, the forward slash and tab:

    robin ~/Lesson2 $ paste -d$'/\t' p1.txt p3.txt p4.txt >p134.txt

    robin ~/Lesson2 $ head -n5 p134.txt

    Country Name/Year Value

    Arab World/1960 92490932

    Arab World/1961 95044497

    Arab World/1962 97682294

    Arab World/1963 100411076

  4. Repeat the same steps for the land.csv file:

    robin ~/Lesson2 $ cut -f1 -d, land.csv >l1.txt

    robin ~/Lesson2 $ cut -f3 -d, land.csv >l3.txt

    robin ~/Lesson2 $ cut -f4 -d, land.csv >l4.txt

    robin ~/Lesson2 $ paste -d$'/\t' l1.txt l3.txt l4.txt >l134.txt

    robin ~/Lesson2 $ head -n5 l134.txt

    Country Name/Year Value

    Arab World/1961 30.9442924784889

    Arab World/1962 30.9441456790578

    Arab World/1963 30.967119790024

    Arab World/1964 30.9765883533295

  5. Now we have two files in which the country and year have been combined into a single field that can be used as the join key. Let's sort these files in preparation for join, first cutting off the header line by using tail with +N:

    robin ~/Lesson2 $ tail -n+2 l134.txt | sort >land.txt

    robin ~/Lesson2 $ tail -n+2 p134.txt | sort >pop.txt

  6. Now, let's join these two tables to get the population and agricultural land percentage matched on each country per year. Use -o, -e, and -a to get an outer join on the data since the data is not complete (rows are missing for some combination of countries and years). Also, tell join to ensure that the files are ordered. This helps us catch errors if we forgot to sort:

    robin ~/Lesson2 $ join -t$'\t' --check-order -o auto -e 'UNKNOWN' -a1 -a2 land.txt pop.txt | less

    Values where the data is not present are set to 'UNKNOWN'.

    The output will look as follows:

    Figure 2.13: A screenshot displaying the matched data
  7. Let's move on to the payroll data again. Recall that we had extracted the names of all the workers to names.tsv earlier. Let's find out the most common names in the payroll. Use uniq to count each name, and sort in reverse with numeric sort and view the first 10 lines of the result with head:

    robin ~/Lesson2 $ <names.tsv uniq -c | sort -n -r | head

     253

      69 RODRIGUEZ MARIA

      59 XXXX XXXX

      54 RODRIGUEZ JOSE

      49 RODRIGUEZ CARMEN

      43 RIVERA MARIA

      42 GONZALEZ MARIA

      40 GONZALEZ JOSE

      38 SMITH MICHAEL

      37 RIVERA JOSE

    We can see the names sorted by most frequent to least frequent (note that 253 names are blank, and 59 names in the records are invalid "XXXX XXXX").

  8. Let's save the results of the frequency of names to a file using the following command:

    robin ~/Lesson2 $ <names.tsv uniq -c | sort -n -r >namecounts.txt

  9. Find all people who have the word "SMITH" in their last name with the -w flag to avoid names like "PSMITH". We assume that no other field could contain "SMITH":

    robin ~/Lesson2 $ grep -w 'SMITH' namecounts.txt | less

    We can see the results in decreasing order of frequency. Here are the first few lines you will see:

      38 SMITH MICHAEL

      29 SMITH ROBERT

      23 SMITH JAMES

      22 SMITH MICHELLE

      20 SMITH CHRISTOPHER

      19 SMITH JENNIFER

      18 SMITH WILLIAM

  10. Now, use tac to view the data in the reverse order, and use head to see the first five lines. These are some of the rarest names with "SMITH" in them:

    robin ~/Lesson2 $ grep -w 'SMITH' namecounts.txt | tac | head -n5

      1 ADGERSON-SMITH KAILEN

      1 ALLYN SMITH LISA

      1 ANDERSON-SMITH FELICIA

      1 BAILEY-SMITH LAUREN

      1 BANUCCI-SMITH KATHERINE

With this exercise, we have learned how to use commands that work for transforming column-based text files in various ways. With some ingenuity, we can mold our text data in any way we please.

Activity 6: Processing Tabular Data – Reordering Columns

In the previous exercises, we used cut to extract individual columns and then paste to create a file with a subset of the columns in the original data, which is analogous to the SELECT operation in databases. Using cut and paste for this is quite cumbersome, but there is a way to use join for this purpose, with a little ingenuity.

In this activity, you will be working with the land.csv file, which contains historical data of agricultural land percentage for hundreds of countries. The data is divided into four columns by commas: Country Name, Country Code, Year, and Value. From the high-level instructions provided here, and the concepts learned in this chapter, create two new files that have the data laid out as follows:

  • Year, Value, and Country Code, that is, columns 3, 4, and 2
  • Value, Year, and Country Name, that is, columns 4, 3, and 1

In this activity, you need to convert a high-level description of a task with hints into actual command pipelines. Refer to the options for the commands and test out your commands carefully, viewing the intermediate results to ensure you are on the right track.

Now, perform the following operations (remember that you can use less or head to verify a command's output, before writing to a file):

  1. Use sort to create a sorted version of land.csv called sorted.txt, and use tail to skip the first line, which has the header.
  2. Create a numbered version of this sorted file called numbered.txt using cat.
  3. View this file and verify that it has five columns. The first column should have a tab after it, but the rest should be delimited by commas. Use cat with the right options to let you distinguish between spaces and tabs.
  4. Convert the commas in numbered.txt into tabs and create a file called tabbed.txt. Use tr, and remember the correct use of escape sequences.
  5. View tabbed.txt to make sure you have a file with five columns separated by tabs. Again, use cat with the right options to let you distinguish between spaces and tabs.
  6. Use join to outer join tabbed.txt with itself and extract the columns Year, Value, and Country Code (3, 4, and 2) into a file called 342.txt. Refer to the options that let you perform an outer join, and the one that lets you select the output columns.

    Note

    Remember that the columns 3, 4, and 2 in the original file are not at the same position in the numbered file.

  7. Repeat step 6. This time, extract columns Value, Year, and Country Name (4, 3, and 1) as 431.txt.

    Note

    Again, remember that the columns 4, 3, and 1 in the original file are not at the same position in the numbered file.

Verify that 342.txt has data of the columns Year, Value, and Country Code. You should get the following output:

Figure 2.14: Output of 342.txt

Verify that 431.txt has data of the columns Value, Year, and Country Name.

Figure 2.15: Output of 431.txt

Note

The solution for this activity can be found on page 274.

Activity 7: Data Analysis

In this activity, you will perform data analysis tasks using command-line operations. Use the land.csv and population.csv files which contain historical agricultural land percentage and population data for all countries, respectively. Extract the data for the median population for all countries in 1998 and the median value of agricultural land percentage for the country of Australia to answer these questions:

  1. How much was the median percentage of agricultural land in Australia, and which was the median year?
  2. Which country had the median population in 1998, and how much was it?

    Note

    A statistical median is defined as the middle value in a sorted sequence; half the values are below the median and half are above.

    Assuming a sequence has N values, the index of the median is N/2, rounded to the nearest integer. For example, if N is 10, then the median index is 5. If N is 17, then the median index is 9, because 17/2 is 8.5, which rounds to 9.

Perform the following operations (remember to use temporary files to save intermediate results):

  1. Extract the data for Australia from the land data file.
  2. Sort the data based on the percentage values.
  3. Count the number of lines of data.
  4. Print out the line that is closest to the middle. You can use the bc command with herestrings to do calculations, as sketched briefly after this list (note that bc performs integer division by default with the / symbol).
  5. Extract the data for 1998 from the population data file.
  6. Sort the data based on population values.
  7. Repeat steps 3 and 4.
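
For example, integer and fractional division with bc and a herestring look like this (a standalone sketch, not the activity solution):

bc <<< '17/2'

The preceding command prints 8, while bc -l <<< '17/2' prints 8.50000000000000000000, because the -l flag sets the scale to 20 decimal places.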

The following are the expected answers to the preceding questions:

  1. The median of the agricultural land percentage in Australia for this dataset was 60.72% in 1963.
  2. Azerbaijan had the median population of 7,913,000 in 1998.

    Note

    The solution for this activity can be found on page 274.

In this section, we have explored a fairly large subset of the text-processing commands of the UNIX ecosystem, along with an introductory look at their functionality as exposed through their various options.

There are many more options for these commands and several commands that we have not covered, but what we have learned is enough to do many data-processing tasks without resorting to specialized GUI applications.

What we have learned so far is as follows:

  • How to use filter and transform commands on text files
  • How to construct complex pipelines
  • How to perform database-like operations on columnar text data
  • How to combine small simple commands into complex high-level operations

In future chapters, we will cover more use patterns of these commands and more mechanisms to drive them.

Summary

In this chapter, you have been introduced to several concepts such as input, output, redirection, and pipelines. You have also learned basic text-processing tools, along with both common and uncommon use cases of these tools, to demonstrate their flexibility. At a conceptual level, several techniques related to processing tabular data have been explored.

A large number of details have been covered. If you are being introduced to these for the first time, you should attempt to understand the concepts at an abstract level and not be overwhelmed by details (which you can always refer to when in doubt). To this end, some additional complexities have been avoided in order to focus on the essential concepts. The students can pick up more nuances as they continue to learn and practice in the future, beyond this brief book.

In the next chapter, you will learn about several more concepts related to the shell, including basic regular expressions, shell expansion, and command substitution.
