By the end of the chapter, you will be able to:
- Use redirection to connect files to the input and output of commands
- Use pipes to compose commands into data-processing pipelines
- Apply a range of text-processing commands to practical tasks
This chapter introduces you to the two main composing mechanisms of command lines: redirection and piping. You will also expand your vocabulary of commands to be able to perform a wide variety of data-processing tasks.
So far, we have learned the basics of how to work with the filesystem with the shell. We also looked at some shell mechanisms such as wildcards and completion that simplify life in the command line. In this chapter, we will examine the building blocks that are used to perform data-processing tasks on the shell.
The Unix approach is to favor small, single-purpose utilities with very well-defined interfaces. Redirection and pipes let us connect these small commands and files together so that we can compose them like the elements of an electronic circuit to perform complex tasks. This concept of joining together small units into a more complex mechanism is a very powerful technique.
Most data that we typically work with is textual in nature, so we will study the most useful text-oriented commands in this chapter, along with various practical examples of their usage.
Redirection is a method of connecting files to a command. This mechanism is used to capture the output of a command or to feed input to it.
During this section, we will introduce a few commands briefly, in order to illustrate some concepts. The commands are only used as examples, and their usage does not have any connection to the main topics being covered here. The detailed descriptions of all the features and uses of those commands will be covered in the topic on text-processing commands.
Every command that is run has three standard channels: one for data input, termed standard input (stdin); one for data output, termed standard output (stdout); and one for error output, termed standard error (stderr). A command reads data from stdin and writes its results to stdout. If any error occurs, the error messages are written to stderr. These channels can also be thought of as streams through which data flows.
By convention, stdin, stdout, and stderr are assigned the numbers 0, 1, and 2, which are called file descriptors (FDs). We will not go into the technical details of these, but remember the association between these streams and their FD numbers.
When a command is run interactively, the shell attaches the input and output streams to the console input and output, respectively. Note that, by default, both stdout and stderr go to the console display.
The terms console, terminal, and TTY are often used interchangeably. In essence, they refer to the interface where commands are typed, and output is produced as text. Console or terminal output refers to what the command prints out. Console or terminal input refers to what the user types in.
For instance, when we use ls to list an existing and non-existing folder from the dataset we used in the previous chapter, we get the following output:
robin ~/Lesson1/data $ ls podocarpaceae/ nonexistent
ls: cannot access 'nonexistent': No such file or directory
podocarpaceae/:
acmopyle dacrydium lagarostrobos margbensonia parasitaxus podocarpus saxegothaea
afrocarpus falcatifolium lepidothamnus microcachrys pherosphaera prumnopitys stachycarpus
dacrycarpus halocarpus manoao nageia phyllocladus retrophyllum sundacarpus
Note the error message that appears on the first line of the output. This is due to the stderr stream reaching the console, whereas the remaining output is from stdout. The outputs from both channels are combined.
We can tell the shell to connect any of the aforementioned streams to a file using the following operators:
command >file.txt
This instructs the shell to redirect the stdout of a command into a file. If the file already exists, its content is overwritten.
command >>file.txt
This instructs the shell to redirect the stdout of command into a file. If the file does not exist, it is created, but if the file exists, then it gets appended to, rather than overwritten, unlike the previous case.
command <file.txt
This instructs the shell to redirect a file to the stdin of the command.
In the preceding syntax, you can prefix the number of an FD before <, >, or >>, which lets us redirect the stream corresponding to that FD. If no FD is specified, the defaults are stdin (FD 0) and stdout (FD 1) for input and output, respectively. Typically, the FD prefix 2 is used to redirect stderr.
To redirect both stdout and stderr to the same file, the special operator &> is used.
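To make these operators concrete, here is a short hypothetical session (the filenames log.txt, err.txt, and all.txt are illustrative, not part of the dataset):

```shell
echo one >log.txt             # create (or truncate) log.txt and write "one"
echo two >>log.txt            # append a second line
wc -l <log.txt                # feed log.txt to wc's stdin; prints 2
ls nonexistent 2>err.txt      # only the error message goes to err.txt
ls nonexistent &>all.txt      # stdout and stderr both go to all.txt
```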
When the shell runs one of the preceding commands, it first opens the files to/from which redirection is requested, attaches the streams to the command, and then runs it. Even if a command produces no output, if output redirection is requested, then the file is created (or truncated).
Both input and output redirection can be specified for the same command, so a command can be considered as consisting of the following parts (of which the redirections are optional):
A key insight is that the order in which these parts appear in the command line does not matter. For example, consider the following sort commands (the sort command reads a file via stdin, sorts the lines, and writes it to stdout):
sort >sorted.txt <data.txt
sort <data.txt >sorted.txt
<data.txt >sorted.txt sort
All of these three commands are valid syntax and perform the exact same action, that is, the content of data.txt is redirected into the sort command and the output of that in turn is written into the sorted.txt file.
It is a matter of individual style or preference as to how these redirection elements are ordered. The conventional way of writing it, which is the style encountered most frequently and considered most readable, is in this order:
command <input_redirect_file >output_redirect_file
A space is sometimes added between the redirection symbol and the filename. For example, we could write the following:
sort < data.txt > sorted.txt
This is perfectly valid. However, from a conceptual level, it is convenient to think of the symbol and the file as a single command-line element, and to write it as follows:
sort <data.txt >sorted.txt
This is to emphasize that the filenames data.txt and sorted.txt are attached to the respective streams, that is, stdin and stdout. Remember that the symbol is always written first, followed by the filename. The symbol points to the direction of the data flow, which is either from or into the file.
A useful convention that most commands follow is that they accept a command-line argument for the input file, but if the argument is omitted, the input data is read from stdin instead. This lets commands easily be used both with redirection (and piping) as well as in a standalone manner. For example, consider the following:
less file.txt
This less command gets the filename file.txt passed as an argument, which it then opens and displays. Now, consider the following:
less <file.txt
For this second case, the shell does not treat <file.txt as an argument to be passed to less—instead, it is an instruction to the shell to redirect file.txt to the stdin of less. When less runs, it sees no arguments passed to it, and therefore, it defaults to reading data from stdin to display, rather than opening any file. Since stdin was connected to file.txt, it achieves the same function as the first command.
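The same duality can be seen with the wc (word count) command; the file f.txt here is hypothetical:

```shell
printf 'a\nb\n' >f.txt
wc -l f.txt     # filename passed as an argument: prints "2 f.txt"
wc -l <f.txt    # no argument; wc reads stdin and prints just "2"
```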
When a command reads input from stdin and is not redirected, it accepts lines of text typed by the user until the user presses Ctrl + D. The terminal driver interprets that keystroke and signals an end-of-file (EOF) to the program, causing the input stream to stop, after which the program exits.
Here documents (also called heredocs) are a special form of redirection that lets you feed multiple lines of text to a command in a similar way, but rather than requiring Ctrl + D to signal EOF, you can specify an arbitrary string instead. This feature is especially useful in shell scripts, which we will cover in later chapters. A traditional example of using here documents is to type an email directly into the sendmail program to compose and send mail from the command line itself, without using an editor.
The syntax for heredocs is as follows:
command <<LIMITSTRING
Here, LIMITSTRING can be any arbitrary string. Upon typing this, the shell prompts for multiple lines of text, until the limit string is typed on its own line, after which the shell runs the command, passing in all the lines that were typed into the command.
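For instance, a heredoc feeding two lines into cat (the limit string DONE is arbitrary):

```shell
cat <<DONE
first line
second line
DONE
```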
Here strings (also called herestrings) are yet another form of input redirection. They allow a single string to be passed as input to a command, as if it were redirected from a file. The syntax for herestrings is as follows:
command <<< INPUT
Here, INPUT is the string to be passed into the program's stdin. If the string is quoted, it can extend over multiple lines.
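As a quick sketch, a herestring feeds one string to a command's stdin (note that the shell appends a trailing newline to the string):

```shell
wc -c <<< hello                     # prints 6: five letters plus the newline
tr '[:lower:]' '[:upper:]' <<< 'hello world'   # prints HELLO WORLD
```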
Now, let's talk about a concept called buffering, which applies to both redirection and piping. A buffer can be considered analogous to a flush tank for data. Water flows in until the tank is full, after which the valve closes, and the flow of water is blocked. When water drains out of the tank, the valve opens to let the water fill the tank again. The following are the buffers connected to the input and output streams.
stdout buffer
Every program has a buffer connected to its stdout, and the program writes its output data into this buffer. When the buffer gets full, the program cannot write any more output, and is put to sleep. We say that this program has blocked on a write.
The other end of a command's stdout buffer could be connected to the following:
In each case, an independent process deals with taking the data out of this buffer and moving it to the other end. When the buffer has enough space for the blocked write to succeed, the program is woken up or unblocked.
stdin buffer
In a symmetric fashion, every program has a buffer connected to its stdin, and the program reads data from this buffer. When the buffer is empty, the program cannot read any more input, and is put to sleep. We say that this program has blocked on a read.
The other end of a command's stdin buffer could be connected to the following:
Once again, an independent process deals with filling this buffer with data. When the buffer has enough data for the blocked read to succeed, the program is woken up or unblocked.
The main reason to use buffering is for efficiency—moving data in larger chunks across an input or output channel is efficient. A process can quickly read or write an entire chunk of data from/to a buffer, and continue working, rather than getting blocked. This ensures maximum throughput and parallelism.
Flushing
A program can request a flush on its output buffer, which makes it sleep until the output buffer gets emptied. Flushing an input buffer works in a different way. It simply causes all the data in the input buffer to get discarded.
A command has a buffer for each of its input as well as output streams. These buffers can operate in three modes:
The choice of buffering mode depends on the task a program performs; for example, almost all text files are typically processed line by line. Text-based commands tend to use line buffering so that each processed line of text is displayed immediately.
Consider a program that reads the error logs of a web server and filters out the lines that refer to a certain error (for example, a failure of authentication). It would typically read an entire line, check if it met the criteria, and if so print the whole line at once. The output would never contain partial lines.
On the other hand, there are some commands that deal with binary data that is not divisible into lines—these use full buffering, so that transfer speed is maximized.
There are a few applications where completely unbuffered I/O is used; it is useful in some very narrow situations. For example, some programs draw user interfaces on text screens using ANSI codes. In such cases, the display needs to be updated instantly in order to provide a usable interface, which unbuffered output allows. Another example is the SSH (secure shell) program, which lets a user access a command-line shell on another computer across the internet. Every keystroke the user types is instantaneously sent to the remote end, and the resultant output is sent back. Here, unbuffered I/O is essential for SSH to provide the feeling of interactivity.
By default, the shell sets up the stderr of a command to be unbuffered, since error messages need to be displayed or logged immediately. stdout is set up to be line buffered when writing to the console, but fully buffered when being redirected to a file.
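If GNU coreutils is available, the stdbuf command can override these defaults. For example, forcing line buffering on a command whose stdout is going into a pipe (a sketch; stdbuf and its -oL flag are GNU-specific):

```shell
# Into a pipe, tr's stdout would normally be fully buffered;
# stdbuf -oL forces line buffering so each line appears as soon as it is ready.
printf 'alpha\nbeta\n' | stdbuf -oL tr '[:lower:]' '[:upper:]'
```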
In the next section, which covers pipes, we will learn how shell pipelines connect multiple processes as a chain, each of which has its own buffers for stdin and stdout.
We will now use input and output redirection with basic commands. After this exercise, you should be able to capture the output of any command to a file, or conversely feed a file into a command that requires input:
robin ~ $ cd Lesson1/data
robin ~/Lesson1/data $ ls -l
total 16
drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae
drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae
drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae
drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae
robin ~/Lesson1/data $ ls -l >dir.txt
robin ~/Lesson1/data $ cat dir.txt
total 16
drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae
-rw-r--r-- 1 robin robin 0 Aug 27 17:13 dir.txt
drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae
drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae
drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae
Note that when we print the contents of dir.txt, we can see an entry for dir.txt itself, with a size of zero. Yet obviously dir.txt is not empty, since we just printed it. This is a little confusing, but the explanation is simple. The shell first creates an empty dir.txt file, and then runs ls, redirecting its stdout to that file. ls gets the list of this directory, which at this point includes an empty dir.txt, and writes the list into its stdout, which in turn ends up as the content of dir.txt. Hence the contents of dir.txt reflect the state of the directory at the instant when ls ran.
robin ~/Lesson1/data $ ls -l nonexistent >dir.txt
ls: cannot access 'nonexistent': No such file or directory
robin ~/Lesson1/data $ ls -l
total 16
drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae
-rw-r--r-- 1 robin robin 0 Aug 27 17:19 dir.txt
drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae
drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae
drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 taxaceae
From the preceding output, we can observe that the error message arrived on the console, but did not get redirected into the file. This is because we only redirected stdout to dir.txt. Note that dir.txt is empty, because there was no data written to stdout by ls.
robin ~/Lesson1/data $ ls -l pinaceae/ >dir.txt
robin ~/Lesson1/data $ ls -l taxaceae/ >>dir.txt
robin ~/Lesson1/data $ cat dir.txt
You will see the following output displayed on the console:
robin ~/Lesson1/data $ ls -l nonexistent taxaceae 2>dir.txt
taxaceae/:
total 24
drwxr-xr-x 8 robin robin 4096 Aug 20 14:01 amentotaxus
drwxr-xr-x 3 robin robin 4096 Aug 20 14:01 austrotaxus
drwxr-xr-x 12 robin robin 4096 Aug 20 14:01 cephalotaxus
drwxr-xr-x 3 robin robin 4096 Aug 20 14:01 pseudotaxus
drwxr-xr-x 14 robin robin 4096 Aug 20 14:01 taxus
drwxr-xr-x 9 robin robin 4096 Aug 20 14:01 torreya
robin ~/Lesson1/data $ cat dir.txt
ls: cannot access 'nonexistent': No such file or directory
Note that only the error message on stderr got redirected into dir.txt.
robin ~/Lesson1/data $ ls pinaceae nosuchthing >out.txt 2>err.txt
Use the cat command to view the output of the two files:
robin ~/Lesson1/data $ ls pinaceae nothing &>dir.txt
You will see the following output if you view the contents of dir.txt with the cat command:
The error message precedes the listing because ls processes its arguments in lexicographical order: the nonexistent nothing folder was attempted first, producing the error message, and then pinaceae was listed.
robin ~/Lesson1/data $ cat -n <pinaceae/pinus/sylvestris/data.txt >numbered.txt
robin ~/Lesson1/data $ less numbered.txt
robin ~/Lesson1/data $ cat
Hello
Hello
Bye
Bye
^D
robin ~/Lesson1/data $ cat <<DONE
> This is some text
> Some more text
> OK, enough
> DONE
This is some text
Some more text
OK, enough
Observe the difference between steps 9 and 10. In step 9, cat processes each line that is typed and prints it back immediately. This is because the TTY (which is connected to the stdin of cat) waits for the Enter key to be pressed before it writes the complete line into the stdin of cat. Thereupon, cat outputs that line, emptying its input buffer, and goes to sleep until the next line arrives. In step 10, the TTY is connected to the shell itself, rather than to the cat process. The shell is, in turn, connected to cat and does not send any data to it until the limit string is encountered, after which the entire text that was typed goes into cat at once.
robin ~/Lesson1/data $ bc <<< 1234^7
4357186184021382204544
When run directly, bc accepts multiple expressions and prints the result. In this case, the herestring is treated as a file's content by the shell and is passed into bc via stdin redirection.
robin ~/Lesson1/data $ rm *.txt
In this exercise, we learned how to redirect the input and output of shell commands to files. Using files as the input and output of commands is essential to performing more complex shell tasks.
A shell pipeline or simply a pipeline refers to a construct where data is pushed from one command to another in an assembly line fashion. It is expressed as a series of commands separated by a pipe symbol |. These pipes connect the stdout of each command to the stdin of the subsequent command. Internally, a pipe is a special memory FIFO (first in, first out) buffer provided by the OS.
The basic syntax of a pipeline is as follows:
command1 | command2
Any number of commands can be linked:
command1 | command2 | command3 | command4
Pipelines are analogous to assembly lines in a factory. Just as an assembly line lets multiple workers simultaneously perform one designated job each, ending up with a finished product, a pipeline lets a series of commands work on a stream of data, each doing one task, eventually producing the desired output.
Pipelines ensure maximum throughput and optimal usage of computing power. The time taken for a pipeline task in most cases will be close to the time taken by the slowest command in it, and not the sum of the times for all the commands.
While it is not a very common use case, you can pipe both stdout and stderr of one command into another command's stdin using the |& operator.
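For example, a simple pipeline, followed by a sketch of |& (the input strings here are made up for illustration):

```shell
# Sort three names and keep the first two alphabetically:
printf 'pine\ncedar\nfir\n' | sort | head -n 2
# cedar
# fir

# |& sends stderr along with stdout into the pipe, so the single
# error line from ls reaches wc and gets counted:
ls nonexistent |& wc -l
```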
In this exercise, we will explore the use of pipes to pass data between commands. Some new commands will be introduced briefly, which will be explained in more detail later:
robin ~ $ cd Lesson1/data
robin ~/Lesson1/data $
robin ~/Lesson1/data $ tree | less
You will get the following output:
robin ~/Lesson1/data $ ls | tr '[:lower:]' '[:upper:]'
CUPRESSACEAE
PINACEAE
PODOCARPACEAE
TAXACEAE
As a general guideline, we should use single quotes for string arguments, unless we have a reason to use other kinds.
robin ~/Lesson1/data $ tree | tr '[:lower:]' '[:upper:]' | less
You will get the following output:
robin ~/Lesson1/data $ cd pinaceae/nothotsuga/longibracteata/
robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ sort <data.txt | uniq | less
The output is as follows:
robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ tr '[:lower:]' '[:upper:]' <data.txt | sort | uniq >test.txt
robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ less test.txt
robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ rm test.txt
You will get the following output:
robin ~/Lesson1/data/pinaceae/nothotsuga/longibracteata $ cd -
/home/robin/Lesson1/data
robin ~/Lesson1/data $ ls -R . nosuchthing |& less
Both the listing and the error message are visible in less. This would not be the case when using | rather than |&.
In this exercise, we have looked at how to build shell pipelines to stream data through multiple commands. Pipes are probably the single most important feature of shells and must be understood thoroughly. So far, we have studied the following:
Redirection and piping are essentially like performing plumbing with data instead of water. One major difference is that this command pipelining does not have multiple branches in the flow, unlike plumbing. After a little practice, using pipes becomes second nature to a shell user. The fact that shell commands often work in parallel means that some data-processing tasks can be done faster with the shell than with specialized GUI tools. This topic prepares us for the next one, where we will learn a number of text-processing commands, and how to combine them for various tasks.
In the previous sections, we learned about the two main composing mechanisms of command lines: redirection and piping. In this topic, we will expand our vocabulary of commands that, when combined, let us do a wide variety of data-processing tasks.
We focus on text-processing because it applies to a wide range of real-life data, and will be useful to professionals in any field. A huge amount of data on the internet is in textual form, and text happens to be the easiest way to share data in a portable way as simple columnar CSV (comma-separated values) or TSV (tab-separated values) files. Once you learn these commands, you do not have to rely on the knowledge of any specific software GUI tool, and you can run complex tasks on the shell itself. In many cases, running a quick shell pipeline is much faster than setting up the data in a more complex GUI tool. In a later chapter, you will learn how to save your commands as a sort of automatic recipe, with shell scripts, which can be reused whenever you need, and shared with others too.
Some of the commands we will learn now are quite complex and versatile. In fact, there are entire books devoted to them. In this book, we will only look at some of the most commonly used features of these commands.
Before moving on to the commands, let's get familiar with some nuances of how the shell command line works when we need to input text.
Escaping
Often, we need to enter strings that contain special characters in the command line:
If we were to use these characters directly, the shell would interpret them, causing undesirable behavior since each of these has a special meaning to the shell. To handle this, we use a mechanism called escaping (which is also used in most programming languages). Escaping involves prefixing with an escape character, which allows us to escape to a different context, where the meaning of a symbol is no longer interpreted but treated as a literal character.
On the command line, the backslash serves as the escape character. You need to simply prefix a backslash to enter a special character literally as part of a string. For example, consider the following command:
echo * *
Since the asterisk is a special character (a wildcard symbol), instead of printing two asterisks, the command will print the names of all the files and directories in the current directory twice. To print the asterisk character literally, you need to write the following:
echo \* \*
The two asterisks have now been escaped, and will not be interpreted by the shell. They are printed literally, but the backslashes are not printed. This escaping with a backslash works for any character.
Escaping works with spaces, too. For example, the following code will pass the entire filename My little pony.txt to cat:
cat My\ little\ pony.txt
It should now be obvious that you can create filenames and paths that contain special characters, but in general, this is best avoided. The unintended consequences of special characters in filenames can be quite perplexing.
Quoting
The shell provides another means of typing a string with special characters. Simply enclose it with single quotes (strong quoting) or double quotes (weak quoting). For example, look at the following commands:
ls "one two"
ls 'one two'
ls one two
In the preceding commands, the first two pass a single argument to ls, whereas the third passes two arguments. The basic use of single and double quotes is as follows:
echo 'This is a backslash \'
This is a backslash \
echo "This \" was escaped"
This " was escaped
Within double quotes, the shell treats everything as literal, except backslashes (escaping), exclamation marks (history expansion), dollar signs (shell expansion), and backticks (command substitution). We will deal with the latter three special symbols and their effects in later chapters.
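The difference is easy to see with the HOME variable (variable expansion is covered in detail in a later chapter):

```shell
echo "Home is $HOME"   # double quotes: the shell expands $HOME
echo 'Home is $HOME'   # single quotes: everything is literal
```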
Yet another quoting mechanism provided by Bash is called dollar-single-quotes, which are described in the following section.
Escape Sequences
Escape sequences are a generalized way of inserting any character into a string. Normally, we can only type the symbols available on the keyboard, but escape sequences allow us to insert any possible character. These escape sequences only work in strings with dollar-single-quotes. These strings are expressed by prefixing a dollar symbol before a single-quoted string, for example:
$'Hello'
Within such strings, the same rules as those for single quotes apply, except that escape sequences are expanded by the shell.
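For example, tabs and newlines can be embedded directly into a dollar-single-quoted string:

```shell
echo $'one\ttwo'        # a real tab character between the words
echo $'first\nsecond'   # an embedded newline: two lines of output
```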
When using escape sequences in arguments, if we use strings with dollar-single-quotes, the shell inserts the corresponding symbols before passing the argument to the command. However, some commands can understand escape sequences themselves. In such cases, we may notice that escape sequences are used within single-quoted strings. In these cases, the command itself expands them internally.
Escape sequences originated in the C programming language, and the Bash shell uses the same convention. An escape sequence consists of a backslash followed by a character or a special code. Now, let's look at the list of escape sequences and their meaning, classified into two categories:
The first category is literal characters:
When the ASCII code was initially standardized, it consisted of seven-bit numbers, which represent 128 symbols: control codes, 52 English letters (26 uppercase and 26 lowercase), 10 digits, and several punctuation symbols. Later, ASCII was extended to eight bits (one byte), which added another 128 characters and was the common standard for decades until the necessity for non-English text ushered in Unicode. Unicode defines numeric code points for well over a hundred thousand symbols, representing the characters of almost all written languages of the world.
The second category of escape sequences are ASCII control characters:
In the original ASCII code, the symbols 0 to 31 (which are not printable characters) were control codes that were sent to a mechanical TTY. Traditionally, TTY terminals were like electric typewriters. To wrap around at the end of a line, the carriage had to return to the home position at the start of the line, and the platen had to rotate to advance the paper to the next line. These operations are called carriage return and line feed, with the mnemonics CR and LF, respectively. Similarly, tab, form feed, vertical tab, and so on referred to instructions to move the TTY's carriage and platen.
The following escape codes produce these control characters:
The effect of printing these control characters to a console has a similar effect to what it would have on a mechanical TTY with a roll of paper—even the bell character, which activated an actual physical bell or beeper, is still emulated today on a modern shell console window.
An important practical detail to learn in this context is the difference between the DOS/Windows family operating systems and UNIX-based operating systems when they deal with line endings in text files. Traditionally, UNIX always used only the LF character (ASCII 10) to represent the end of a line. If an LF is printed to a console on UNIX-like operating systems, the cursor moves to the start of the next line. In the DOS/Windows world, however, the end of a line is represented by two characters, CR LF. This means that when a text file is created on one OS family and opened or processed on the other, they may fail to display or process correctly, since the definition of what represents a line ending differs. These conventions came into being due to complex historical events, but we are forever stuck with them.
Interestingly, the classic Mac OS used the convention of using only CR, but thankfully, modern macOS and iOS are derived from FreeBSD Unix, so we don't have to deal with this third variety of line ending. The consequence of this is that our commands and scripts based on UNIX lineage may go haywire if they encounter text files created on the Windows OS family. If we need our scripts to work with data from all sources, we must take care of that explicitly.
For a file that originated on a Windows system, we must replace the sequence (CR LF) with the sequence (LF) before processing it with commands that work on a line-by-line basis. We may also have to do the inverse before we transfer that file back to a Windows system. Most editors, web browsers, and other tools that deal with text files are smart enough to allow for the display or editing of both kinds of files properly.
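A minimal way to do this conversion is with tr (covered later in this chapter); dos.txt here is a hypothetical Windows-created file:

```shell
# dos.txt stands in for a file with CR LF line endings:
printf 'first line\r\nsecond line\r\n' >dos.txt
tr -d '\r' <dos.txt >unix.txt   # delete every CR, leaving LF-only lines
```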
Another application of escape sequences is to send special characters to the console, called ANSI codes, which can change the text color, move the cursor to arbitrary locations, and so on. ANSI codes are an extension of the TTY-based control codes for early video-based displays called video terminals. They were typically CRT screens that showed 80 columns and 25 lines of text. They are expressed as an ASCII 27 (ESCAPE) character, written \e, followed by several characters describing the action to be performed.
These ANSI codes work just the same even today and are useful for producing colorized text or crude graphical elements on the console, such as progress bars. Commands such as ls produce their colored output by the same mechanism by simply printing out the right ANSI codes.
We will not cover the details of ANSI codes and their usage, as it is beyond the scope of this book.
Multiline Input
The final feature of shell input we will discuss is multiline commands and multiline strings. When entering an extremely long command line, for readability's sake, we might like to split it into multiple lines. We can achieve this by typing a single backslash at any time and pressing the Enter key. Instead of executing the command, the shell prompts for the continuation of the command on the next line. Look at the example below:
robin ~ $ echo this is a very long command, let me extend \
> to the next line and then \
> once again
this is a very long command, let me extend to the next line and then once again
The backslash must be the last character of the line for this to work, and a command can be divided into as many lines as desired, with the line breaks having no effect on the command.
In a similar fashion, we can enter a literal multiline string containing newlines, simply by using quotes. Although they appear like multiline commands, multiline strings do not ignore the newlines that are typed. The rules for single and double quoted strings described earlier apply for multiline strings as well. For example:
robin ~ $ echo 'First line
> Second line
> Last line'
First line
Second line
Last line
Commands of this category operate by reading the input line by line, transforming it, and (optionally) producing an output line for each input line. They can be considered analogous to a filtering process.
Concatenate Files: cat
The cat command is primarily meant for concatenating files and for viewing small files, but it can also perform some useful line-oriented transformations on the input data. We have used cat before, but there are some options it provides that are quite useful. The long and short versions of some of these options are as follows:
Among the preceding options, the numbering options are particularly useful.
Translate: tr
The tr command works like a translator, reading the input stream and producing a translated output according to the rules specified in the arguments. The basic syntax is as follows:
tr SET1 SET2
This translates characters from SET1 into corresponding ones from SET2.
The tr command works only on its standard input and does not accept an input file as an argument.
There are three basic uses of tr that can be selected with a command-line flag:
The character sets given to tr can be specified in various ways:
(a) [:alnum:] for all letters and digits
(b) [:alpha:] for all letters
(c) [:blank:] for all horizontal whitespaces
(d) [:cntrl:] for all control characters
(e) [:digit:] for all digits
(f) [:graph:] for all printable characters, not including space
(g) [:lower:] for all lowercase letters
(h) [:print:] for all printable characters, including space
(i) [:punct:] for all punctuation characters
(j) [:space:] for all horizontal or vertical whitespaces
(k) [:upper:] for all uppercase letters
(l) [:xdigit:] for all hexadecimal digits
Character classes are used in many commands, so it's useful to remember the common ones.
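As a brief, hedged illustration, here are the three basic modes of tr on inline data, using the character classes above (the -d delete and -s squeeze flags are standard in GNU and POSIX tr):

```shell
# Translate: map every lowercase letter to uppercase (default mode)
echo 'gnu/linux' | tr '[:lower:]' '[:upper:]'    # GNU/LINUX

# Delete: remove every digit with -d
echo 'log 2018 entry' | tr -d '[:digit:]'

# Squeeze: collapse each run of repeated spaces into one with -s
echo 'too    many   spaces' | tr -s ' '          # too many spaces
```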
Stream Editor: sed
The sed command is a very comprehensive tool that can transform text in various ways. It could be considered a mini programming language in itself. However, we will restrict ourselves to using it for the most common function: search and replace.
sed reads from stdin and writes transformed output to stdout based on the rules passed to it as an argument. In its basic form for replacing text in the stream, the syntax that's used is shown here:
sed 'pattern'
Here, pattern is a string such as s/day/night/FLAGS, which consists of several parts: s selects the substitute operation, day is the search string, night is the replacement, and FLAGS is zero or more of the following:
(a) g stands for global, which tells sed to replace all matches of the search string (the default behavior is to replace only the first).
(b) i stands for case-insensitive, which tells sed to ignore case when matching.
(c) A number, N, specifies that the Nth match alone should be replaced. Combining the g flag with this specifies that all matches including and after the Nth one are to be replaced.
The delimiter is not mandated to be the / character. Any character can be used, as long as the same one is used at all three locations. Thus, all the following patterns are equivalent:
's#day#night#'
's1day1night1'
's:day:night:'
's day night '
'sAdayAnightA'
Multiple patterns can be combined by separating them with semicolons. For instance, the following pattern tells sed to replace day with night and long with short:
's/day/night/ ; s/long/short/'
Character classes can be used for the search string, but they need to be enclosed in an extra pair of square brackets. The reason for this will be apparent when we learn regular expressions in a later chapter.
The following pattern tells sed to replace all alphanumeric characters with an asterisk symbol:
's/[[:alnum:]]/*/g'
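A few hedged one-liners on inline data tie the preceding flag descriptions together (the examples assume GNU sed; the case-insensitive i flag in particular is a GNU extension):

```shell
# Default: replace only the first match on each line
echo 'day after day' | sed 's/day/night/'        # night after day

# g: replace every match
echo 'day after day' | sed 's/day/night/g'       # night after night

# A number N: replace only the Nth match
echo 'day day day' | sed 's/day/night/2'         # day night day

# Two patterns joined by a semicolon
echo 'a long day' | sed 's/day/night/ ; s/long/short/'   # a short night
```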
Cut Columns: cut
The cut command interprets each line of its input as a series of fields and prints out a subset of those fields based on the specified flags. The effect of this is to select a certain set of columns from a file containing columnar data.
The following is a partial list of the flags that can be used with cut:
Here, the syntax of LIST is a comma-separated list of one or more of the following expressions (M and N are numbers):
Let's look at an example of using cut. The sample data for this chapter has a file called pinaceae.csv, which contains a list of tree species as comma-separated fields, some of which are empty. The file looks like this (only a few lines are shown):
Here, cut is used to extract data from the third column onward, using the comma character as the delimiter, and display the output with tabs as a delimiter (only a few lines are shown):
robin ~/Lesson2 $ cut -s -d',' -f 3- --output-delimiter=$'\t' pinaceae.csv | less
The output is as follows:
Note the usage of dollar-single-quotes to pass in the tab character to cut as a delimiter.
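The LIST expressions can also be sketched on a small piece of inline data, which keeps each form easy to verify:

```shell
# Individual fields, comma-separated in LIST:
echo 'alpha,beta,gamma,delta' | cut -d, -f2,4    # beta,delta

# Closed range M-N:
echo 'alpha,beta,gamma,delta' | cut -d, -f2-3    # beta,gamma

# Open-ended range N-:
echo 'alpha,beta,gamma,delta' | cut -d, -f3-     # gamma,delta

# Character positions with -c instead of delimited fields:
echo 'abcdef' | cut -c2-4                        # bcd
```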
Paste Columns from Files Together: paste
paste works like the opposite of cut. While cut can extract one or more columns from a file, paste combines files that have columnar data. It does the equivalent of pasting a set of columns of data side by side in the output. The basic syntax of paste is as follows:
paste filenames
The preceding command instructs paste to read a line from each file specified and produce a line of output that combines those lines, delimited by a tab character. Think of it as pasting the files side by side in columns.
The paste command has one option that is commonly used:
The -d DELIMS option specifies individual delimiters for each field. For example, if DELIMS is set to XYZ, then X, Y, and Z are used as the delimiters after the first, second, and third columns, respectively.
Since paste works with multiple input files, typically it is used on its own without pipes, because we can only pipe one stream of data into a command.
A combination of cut and paste can be used to reorder the columns of a file by first extracting the columns to separate files with cut, and then using paste to recombine them.
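Here is a minimal sketch of that reordering trick; cols.tsv and the temporary c1.txt and c2.txt files are hypothetical names chosen for this example:

```shell
# A two-column, tab-delimited file:
printf 'Alice\t25\nBob\t31\n' >cols.tsv

# Extract each column to its own file with cut...
cut -f1 cols.tsv >c1.txt
cut -f2 cols.tsv >c2.txt

# ...then paste them back in the opposite order:
paste c2.txt c1.txt
```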
Globally Search a Regular Expression and Print: grep
grep is one of the most useful and versatile tools on UNIX-like systems. Its basic purpose is to search for a pattern within a file. The command is so widely used that grep appears in the Oxford English Dictionary as a verb meaning "to search".
A complete description of grep would be quite overwhelming. In this book, we will instead focus on the smallest useful subset of its features.
The basic syntax of grep is as follows:
grep pattern filenames
The preceding command instructs grep to search for the specified pattern within the files listed as arguments. The pattern can be a plain string or a regular expression, and multiple files can be specified. Omitting the filename argument(s) makes grep read from stdin, as with most commands.
The default action of grep is to print out the lines that contain the pattern. Here is a list of the most commonly used flags for grep:
For an example of how grep works, we will use the man command (which stands for manual), since it's a handy place to get a bunch of English text as test data. The man command outputs the built-in documentation for any command or common terminology. Try the following command:
man ascii | grep -n --color 'the'
Here, we ask man to show the manual page for ascii, which includes the ASCII code and some supplementary information. The output of that is piped to grep, which searches for the string "the" and prints the matching lines as numbered and colorized:
man uses the system pager (which is less) to display the manual, so the keyboard shortcuts are the same as less. The output that man provides for a command is called a man page.
Students are encouraged to read man pages to learn more about any command; however, the material is written in a style more suited for people who are already quite used to the command line, so watch out for unfamiliar or complex material.
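A few of the most common flags can be sketched on inline data (the behavior of -c, -v, -w, and -i shown here is standard across grep implementations):

```shell
# -c: count matching lines instead of printing them
printf 'cat\ndog\ncatalog\n' | grep -c 'cat'     # 2

# -v: invert the match, printing non-matching lines
printf 'cat\ndog\ncatalog\n' | grep -v 'cat'     # dog

# -w: match whole words only
printf 'cat\ndog\ncatalog\n' | grep -w 'cat'     # cat

# -i: ignore case when matching
printf 'Cat\n' | grep -i 'cat'                   # Cat
```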
Print Unique Lines: uniq
The basic function of the uniq command is to remove duplicate lines in a file. In other words, all the lines in the output are unique. The commonly used options of uniq are as follows:
As you can see, uniq has several modes of operation, apart from the default, and can be used in many ways to analyze data.
Note that uniq requires that the input file be sorted for it to work correctly.
In this exercise, we will walk through some text-processing tasks using the commands we learned previously. The test data for this chapter contains three main datasets (available publicly on the internet):
These datasets are large enough to demonstrate how well the shell can deal with big data. It is possible to efficiently process files of many gigabytes on the shell, even on limited hardware such as a small laptop. We will first do some simple tasks with the data from earlier chapters and then try some more complex commands to filter the aforementioned data.
Many commands in this exercise and the ones to follow print many lines of data, but we will only show a few lines here for brevity's sake.
robin ~ $ cd Lesson1/data/
robin ~/Lesson1/data $
robin ~/Lesson1/data $ ls -l | cat -n
1 total 16
2 drwxr-xr-x 36 robin robin 4096 Sep 5 15:49 cupressaceae
3 drwxr-xr-x 15 robin robin 4096 Sep 5 15:49 pinaceae
4 drwxr-xr-x 23 robin robin 4096 Sep 5 15:49 podocarpaceae
5 drwxr-xr-x 8 robin robin 4096 Sep 5 15:49 taxaceae
robin ~/Lesson1/data $ ls | tr 'a-z' 'A-Z'
CUPRESSACEAE
PINACEAE
PODOCARPACEAE
TAXACEAE
robin ~/Lesson1/data $ ls | tr 'aeiou' 'AEIOU'
cUprEssAcEAE
pInAcEAE
pOdOcArpAcEAE
tAxAcEAE
robin ~/Lesson1/data $ cd
robin ~ $ cd Lesson2
robin ~/Lesson2 $ less land.csv
The file is in CSV format. The first line describes the field names, and the remaining lines contain data. Here is what the file looks like:
Country Name,Country Code,Year,Value
Arab World,ARB,1961,30.9442924784889
Arab World,ARB,1962,30.9441456790578
Arab World,ARB,1963,30.967119790024
robin ~/Lesson2 $ grep -w 'Austria' <land.csv
Austria,AUT,1961,43.0540082344393
Austria,AUT,1962,42.7585371760717
Austria,AUT,1963,42.2596270283362
As a general rule, we use the -w flag with grep when looking for data that is in columnar form. This ensures that the search term is matched only if it is the entire field, otherwise it may match a substring of a field. It is still possible that this will match something like "Republic Of Austria". In this case, we know that "Austria" is always written as a single word, so it works. To handle such ambiguities, we can use regular expressions (described in the next chapter) to exactly specify the matching logic.
We have used input redirection to grep instead of just passing the file as an argument. This is only because we are emphasizing the use of redirection and piping. The command would work exactly the same if we passed land.csv as an argument instead of redirecting it.
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep '20'
Austria,AUT,1971,40.2143376120126
Austria,AUT,1978,38.5178009203197
Austria,AUT,1979,38.0588520222814
Austria,AUT,1981,38.2102203923468
Austria,AUT,1983,36.6420440784694
Austria,AUT,1984,36.7304432065876
Austria,AUT,1992,36.4858319205619
Austria,AUT,2000,35.604262533301
Austria,AUT,2001,35.312424315815
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep ',20'
Austria,AUT,2000,35.604262533301
Austria,AUT,2001,35.312424315815
Austria,AUT,2002,35.1441026883023
Austria,AUT,2003,34.9386049891015
We used this slightly hack-ish approach to filter data by searching for ",20". In this case, it worked because none of the percentage values started with "20". However, that is not true for the data in many other countries, and this would have included rows we did not need.
Ideally, we should use some other method to extract these values. The best general option is to use the awk tool. awk is a general text-processing language, which is too complex to cover in this brief book. The students are encouraged to learn about that tool, as it is extremely powerful.
We can also use grep with a simple regular expression to match four digits starting with "20". This will not be subject to the issue of matching wrong lines (partial output is shown here):
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]'
Austria,AUT,2000,35.604262533301
Austria,AUT,2001,35.312424315815
Austria,AUT,2002,35.1441026883023
Austria,AUT,2003,34.9386049891015
The -w flag matches an entire word, and the expression '20[[:digit:]][[:digit:]]' matches 20, followed by two digits. We use character classes, which were described earlier in this topic. We will learn regular expressions in more detail in the next chapter, but for the remaining examples and activities, use this regular expression to match the years for 2000 onward.
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' | cut -d, --complement -f2
Austria,2000,35.604262533301
Austria,2001,35.312424315815
Austria,2002,35.1441026883023
Austria,2003,34.9386049891015
We used the --complement flag to specify that we want everything except field #2. The input field delimiter was set to the comma character.
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' | cut -d, --complement -f2 --output-delimiter=$'\t'
Austria 2000 35.604262533301
Austria 2001 35.312424315815
Austria 2002 35.1441026883023
Austria 2003 34.9386049891015
When a tab character is printed on the console, it does not print a fixed number of spaces. Instead, it tries to print as many spaces as required to reach the next column that is a multiple of the tab width (usually 8). This means that when you print a file with tabs as delimiters, it can appear to have an arbitrary number of spaces between fields. Even though the number of spaces printed varies, there is only one tab character between the fields.
robin ~/Lesson2 $ grep -w 'Austria' <land.csv | grep -w '20[[:digit:]][[:digit:]]' >austria.txt
robin ~/Lesson2 $ cut -d, -f3 <austria.txt >year.txt
robin ~/Lesson2 $ cut -d, -f4 <austria.txt >percent.txt
robin ~/Lesson2 $ paste -d' ' percent.txt year.txt
The output will appear as follows:
robin ~/Lesson2 $ paste -d' ' percent.txt year.txt | sort -n
The output will appear as follows:
We will examine the sort command in detail in the next section.
robin ~/Lesson2 $ less payroll.tsv
robin ~/Lesson2 $ cut -f3,4 payroll.tsv | grep -v 'First' | sort >names.tsv
grep is not quite the right tool for this job because it will end up processing every line of the file looking for the string "First", which is unnecessary work. What we really need is just to skip the first line. This can be done with the tail command, which we will learn about later.
robin ~/Lesson2 $ wc -l names.tsv
562266 names.tsv
robin ~/Lesson2 $ uniq names.tsv | wc -l
410141
robin ~/Lesson2 $ uniq -u names.tsv | wc -l
301172
In this exercise, we have learned some commands that work on a line-by-line basis on files. Next, we will examine commands that work on entire files at a time.
Commands of this category operate by reading more than one line at a time and producing output. These can be thought of as a complete transformation of contents.
Sort Lines: sort
The sort command orders the lines of a file (we have seen this in some exercises before). Here is the list of the commonly used options for sort:
Since sort reads its entire input into memory before writing anything, the -o FILE flag can be used to sort a file in place. Always specify it as the first option if it is used, since some systems mandate this. It should not be combined with -m, since merging is a different operation altogether and happens line by line.
When compared as strings (lexicographic ordering), "00" compares as greater than "0" by default. With the -n flag specified, the text is interpreted as a number, so both of the preceding strings are treated as equal.
The -s flag requests a stable sort, which preserves the relative order of lines that compare as equal.
The -u flag performs the same function as the default behavior of uniq, so piping into uniq separately can be skipped in such cases.
Merging uses an efficient algorithm to take the data from already sorted files and merge it into a combined sorted output. Do not use the -o flag if using this. The merge flag can save a lot of time in certain cases, such as when adding data to an already existing large sorted file and maintaining its sorted property.
KEYDEF, passed via the -k flag, is a string consisting of two column numbers separated by a comma, for example, 3,5, which instructs sort to use columns 3, 4, and 5 for sorting. The second number is optional; if omitted, the key extends to the end of the line. Either field number can have an optional period followed by a second number, specifying the character position from where to consider the field. For example, -k3.2,3.4 uses the second to fourth characters of the third field. This option can be specified multiple times to specify a secondary key, and so on.
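These options can be illustrated with small hedged examples on inline data (-t, which sets the field delimiter, is a standard sort flag):

```shell
# Lexicographic (default) vs numeric (-n) ordering:
printf '10\n9\n100\n' | sort       # 10, 100, 9
printf '10\n9\n100\n' | sort -n    # 9, 10, 100

# -k with a field range: sort CSV rows numerically on field 3 only:
printf 'a,x,20\nb,y,3\n' | sort -t, -k3,3n    # b,y,3 comes first
```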
Sorting is an operation that requires the entire file content to be processed before it can output anything. This means that a sort command inside a pipeline essentially prevents any further steps in the pipeline from happening until the sort is finished.
Hence, we should be careful about the performance implications of using sort and other commands that work on complete files rather than line by line. Usually, the rule of thumb is to save the sorted data to a file if the amount of data is large. Repeatedly sorting a large file in each individual command will be slow. Instead, we can reuse the sorted data.
Remember that the sort command works on lexicographic ordering, according to the ASCII code (use the man ascii command to view it). This has consequences about how punctuation, whitespace, and other special characters affect the ordering. In most cases, however, the data we process is only text and numbers, so this should not be a frequent problem.
Now, let's look at an example where we can use the sort command. Say we receive huge daily event log files from different data centers, say denver.txt, dallas.txt, and chicago.txt. We need to maintain a complete sorted log of the entire year's history too, for example, in a file called 2018-log.txt.
One method would be to concatenate everything and sort it (remember that 2018-log.txt already exists and contains a huge amount of sorted data):
cat denver.txt dallas.txt chicago.txt >>2018-log.txt
sort -o 2018-log.txt 2018-log.txt
This would end up re-sorting most of the data that was already sorted in 2018-log.txt and would be very inefficient.
Instead, we can first sort the three new files and merge them. We need to use a temporary file for this since we cannot use the -o option to merge the file in-place:
sort -o denver.txt denver.txt
sort -o dallas.txt dallas.txt
sort -o chicago.txt chicago.txt
sort -m denver.txt dallas.txt chicago.txt 2018-log.txt >>2018-log-tmp.txt
Once we have the results in the temporary file, we can move the new file onto the old one, overwriting it:
mv 2018-log-tmp.txt 2018-log.txt
Unlike sorting, merging sorted files is a very efficient operation, taking very little memory or disk space.
Print Lines at the End of a File: tail
The tail command prints the specified number of lines at the end of a file. In other words, it shows the tail end of a file. The useful options for tail are as follows:
The -n N flag specifies the number of lines to print, defaulting to 10 if not specified. If N is prefixed with a + symbol, tail instead shows all lines from the Nth line onward.
The most common use of tail other than its default invocation is to follow logfiles. For example, an administrator could view a website's error log that is being continuously updated and that has new lines being appended to it intermittently with a command like the following:
tail -f /var/log/httpd/error_log
When run this way, tail shows the last 10 lines, but the program does not exit. Instead, it waits for more lines of data to arrive and prints them too, as they arrive. This lets a system administrator have a quick view of what happened recently in the log.
The other convenient use of tail is to skip over a given number of lines in a file. Recall step 16 of the previous exercise, where we used grep to skip the first line of a file. We could have used tail instead.
Print Lines at the Head of a File: head
The head command works like the inverse of tail. It prints a certain number of lines from the head (start) of the file. The useful options for head are as follows:
The -n N flag specifies the number of lines to print, defaulting to 10 if not specified. If N is prefixed with a - symbol, head instead shows all except the last N lines.
The head command is commonly used to sample the content of a large file. In the previous exercise, we examined the payroll file to observe how the data was structured with less. Instead, we could have used head to just dump a few lines to the screen.
Combining head and tail in a pipeline can be useful to extract any contiguous range of lines from a file. For example, look at the following code:
tail -n+100 data.txt | head -n45
This command would print the lines 100 to 144 from the file.
Join Columns of Files: join
join performs a complex data-merging operation on two files. The closest analogy is a database query that joins two tables; if you are familiar with databases, you already know how a join works. However, the best way to describe join is with an example.
Let's assume we have two files specifying the ages of people and the countries to which they belong. The first file is as follows:
Alice 25
Charlie 34
The second file is as follows:
Alice France
Charlie Spain
The result of applying join on these files is as follows:
Alice 25 France
Charlie 34 Spain
join requires the input files to be sorted to work. The command provides a plethora of options of which a small subset is described here:
FORMAT, passed via the -o flag, consists of one or more field specifiers separated by commas. Each field specifier is either 0, representing the join field, or a number M.N, where M is the file number and N is the field number. For example, 1.3,0,2.2 means "third field of file 1, join field, second field of file 2". FORMAT can also be auto, in which case join prints all the joined fields.
join works by reading a line from each file and checking if the values of the join columns match. If they match, it prints the combined fields. By default, it uses the first field to join.
With the -e and -a flags specified, it can perform an outer join in database terminology. For example, look at the following snippet:
robin ~ $ cat a.txt
Name Age
Alice 25
Charlie 34
robin ~ $ cat b.txt
Name Country
Alice France
Bob Spain
robin ~ $ join --header -o auto -e 'N/A' -a2 -a1 a.txt b.txt
Name Age Country
Alice 25 France
Bob N/A Spain
Charlie 34 N/A
There was no data about Charlie's country and Bob's age, but a row was output for everyone containing all the data that was available, and N/A where it was not. The -a1 flag tells join to include rows from file 1 that do not have a row in file 2. This brings in a row for Charlie. The -a2 flag tells join to include rows from file 2 that do not have a row in file 1. This brings in a row for Bob. The -e flag tells join to add a placeholder, N/A, for missing values. The -o auto flag is necessary for this outer join operation to work as shown; it ensures that all the columns from both files are included in every row in the output.
In the default mode of operation (without -a) we get an inner join, which would have skipped any rows for which there is no corresponding match in both files, as shown here:
robin ~ $ join --header a.txt b.txt
Name Age Country
Alice 25 France
In the preceding output, note that only Alice's row was printed, since data about her exists in both files, but not any of the others.
The combination of join, sort, and uniq can be used to perform all the mathematical set operations on two files, such as disjunction, intersection, and so on. join can also be (mis)used to reorder columns of a file by joining a file with itself if it has a column with all distinct values (line numbers, for example).
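As a hedged sketch of those set operations (the file names are hypothetical; join's standard -v1 flag prints lines of file 1 that have no match in file 2):

```shell
printf 'apple\nbanana\ncherry\n' >s1.txt    # already sorted
printf 'banana\ncherry\ndate\n'  >s2.txt    # already sorted

# Intersection: lines present in both files
join s1.txt s2.txt                  # banana, cherry

# Union: all distinct lines from both files
sort s1.txt s2.txt | uniq           # apple ... date

# Difference (s1 minus s2):
join -v1 s1.txt s2.txt              # apple
```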
Output Files in Reverse: tac
tac is used to reverse a file line by line. This is especially useful for quickly reversing the order of already sorted files without re-sorting them. Since tac needs to be able to reach the end of the input stream and move backward to print in reverse, tac will stall a pipeline until it gets all the piped data input, just like sort. However, if tac is provided an actual file as input, it can directly seek to the end of the file and start working backward.
The common options for tac are as follows:
The most common use of tac is to reverse the lines of a file. Reversing a file's words or characters can also be done by using the -s and -r flags, but this use case is rare.
Get the Word Count: wc
The wc command can count the lines, words, characters, or bytes in a file. It can also report the number of characters in the longest line of a file.
The common flags for wc are as follows:
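As a quick hedged illustration of the counting modes (-l, -w, and -c, which count lines, words, and bytes respectively, are standard wc flags):

```shell
# Two lines, five words, 24 bytes of sample input:
printf 'one two three\nfour five\n' | wc -l    # 2
printf 'one two three\nfour five\n' | wc -w    # 5
printf 'one two three\nfour five\n' | wc -c    # 24
```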
In this exercise, we will continue where we left off from the previous exercises. We will do some more data extraction from the geographical datasets, and then work with the payroll data:
robin ~/Lesson2 $ head -n5 population.csv
Country Name,Country Code,Year,Value
Arab World,ARB,1960,92490932
Arab World,ARB,1961,95044497
Arab World,ARB,1962,97682294
Arab World,ARB,1963,100411076
This has a similar format to the land.csv file we used earlier.
robin ~/Lesson2 $ cut -f1 -d, population.csv >p1.txt
robin ~/Lesson2 $ cut -f3 -d, population.csv >p3.txt
robin ~/Lesson2 $ cut -f4 -d, population.csv >p4.txt
Note that we type three commands that are almost the same. In a future chapter, we will learn how to do these kinds of repetitive operations with less effort.
robin ~/Lesson2 $ paste -d$'/\t' p1.txt p3.txt p4.txt >p134.txt
robin ~/Lesson2 $ head -n5 p134.txt
Country Name/Year Value
Arab World/1960 92490932
Arab World/1961 95044497
Arab World/1962 97682294
Arab World/1963 100411076
robin ~/Lesson2 $ cut -f1 -d, land.csv >l1.txt
robin ~/Lesson2 $ cut -f3 -d, land.csv >l3.txt
robin ~/Lesson2 $ cut -f4 -d, land.csv >l4.txt
robin ~/Lesson2 $ paste -d$'/\t' l1.txt l3.txt l4.txt >l134.txt
robin ~/Lesson2 $ head -n5 l134.txt
Country Name/Year Value
Arab World/1961 30.9442924784889
Arab World/1962 30.9441456790578
Arab World/1963 30.967119790024
Arab World/1964 30.9765883533295
robin ~/Lesson2 $ tail -n+2 l134.txt | sort >land.txt
robin ~/Lesson2 $ tail -n+2 p134.txt | sort >pop.txt
robin ~/Lesson2 $ join -t$'\t' --check-order -o auto -e 'UNKNOWN' -a1 -a2 land.txt pop.txt | less
Values where the data is not present are set to 'UNKNOWN'.
The output will look as follows:
robin ~/Lesson2 $ <names.tsv uniq -c | sort -n -r | head
253
69 RODRIGUEZ MARIA
59 XXXX XXXX
54 RODRIGUEZ JOSE
49 RODRIGUEZ CARMEN
43 RIVERA MARIA
42 GONZALEZ MARIA
40 GONZALEZ JOSE
38 SMITH MICHAEL
37 RIVERA JOSE
We can see the names sorted by most frequent to least frequent (note that 253 names are blank, and 59 names in the records are invalid "XXXX XXXX").
robin ~/Lesson2 $ <names.tsv uniq -c | sort -n -r >namecounts.txt
robin ~/Lesson2 $ grep -w 'SMITH' namecounts.txt | less
We can see the results in decreasing order of frequency. Here are the first few lines you will see:
38 SMITH MICHAEL
29 SMITH ROBERT
23 SMITH JAMES
22 SMITH MICHELLE
20 SMITH CHRISTOPHER
19 SMITH JENNIFER
18 SMITH WILLIAM
robin ~/Lesson2 $ grep -w 'SMITH' namecounts.txt | tac | head -n5
1 ADGERSON-SMITH KAILEN
1 ALLYN SMITH LISA
1 ANDERSON-SMITH FELICIA
1 BAILEY-SMITH LAUREN
1 BANUCCI-SMITH KATHERINE
With this exercise, we have learned how to use commands that work for transforming column-based text files in various ways. With some ingenuity, we can mold our text data in any way we please.
In the previous exercises, we used cut to extract individual columns and then paste to create a file with a subset of the columns in the original data, which is analogous to the SELECT operation in databases. Using cut and paste for this is quite cumbersome, but there is a way to use join for this purpose, with a little ingenuity.
In this activity, you will be working with the land.csv file, which contains historical data of agricultural land percentage for hundreds of countries. The data is divided into four columns by commas: Country Name, Country Code, Year, and Value. From the high-level instructions provided here, and the concepts learned in this chapter, create two new files that have the data laid out as follows:
In this activity, you need to convert a high-level description of a task with hints into actual command pipelines. Refer to the options for the commands and test out your commands carefully, viewing the intermediate results to ensure you are on the right track.
Now, perform the following operations (remember that you can use less or head to verify a command's output, before writing to a file):
Remember that the columns 3, 4, and 2 in the original file are not at the same position in the numbered file.
Again, remember that the columns 4, 3, and 1 in the original file are not at the same position in the numbered file.
Verify that 342.txt has data of the columns Year, Value, and Country Code. You should get the following output:
Verify that 431.txt has data of the columns Value, Year, and Country Name.
The solution for this activity can be found on page 274.
In this activity, you will perform data analysis tasks using command-line operations. Use the land.csv and population.csv files which contain historical agricultural land percentage and population data for all countries, respectively. Extract the data for the median population for all countries in 1998 and the median value of agricultural land percentage for the country of Australia to answer these questions:
A statistical median is defined as the middle value in a sorted sequence; half the values are below the median and half are above.
Assuming a sequence has N values, the index of the median is N/2 rounded to the nearest integer, with halves rounding up. For example, if N is 10, the median index is 5; if N is 17, it is 9, because 17/2 is 8.5, which rounds up to 9.
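In shell arithmetic, (N + 1) / 2 with integer division implements exactly this half-up rounding, so the median line of a sorted sequence can be picked with sed; this is only a sketch, and values.txt is a hypothetical data file:

```shell
# Median index with half-up rounding via integer arithmetic:
N=10; echo $(( (N + 1) / 2 ))    # 5
N=17; echo $(( (N + 1) / 2 ))    # 9

# Extract the median line from a sorted list of values:
printf '7\n3\n9\n1\n5\n' >values.txt
N=$(wc -l <values.txt)
sort -n values.txt | sed -n "$(( (N + 1) / 2 ))p"    # 5
```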
Perform the following operations (remember to use temporary files to save intermediate results):
The following are the expected answers to the preceding questions:
The solution for this activity can be found on page 274.
In this section, we have explored a fairly large subset of the text-processing commands of the UNIX ecosystem. We also looked at an introductory level of their functionalities, which have been exposed through various options.
There are many more options for these commands and several commands that we have not covered, but what we have learned is enough to do many data-processing tasks without resorting to specialized GUI applications.
What we have learned so far is as follows:
In future chapters, we will cover more use patterns of these commands and more mechanisms to drive them.
In this chapter, you have been introduced to several concepts such as input, output, redirection, and pipelines. You have also learned basic text-processing tools, along with both common and uncommon use cases of these tools, to demonstrate their flexibility. At a conceptual level, several techniques related to processing tabular data have been explored.
A large number of details have been covered. If you are being introduced to these for the first time, you should attempt to understand the concepts at an abstract level and not be overwhelmed by details (which you can always refer to when in doubt). To this end, some additional complexities have been avoided in order to focus on the essential concepts. The students can pick up more nuances as they continue to learn and practice in the future, beyond this brief book.
In the next chapter, you will learn about several more concepts related to the shell, including basic regular expressions, shell expansion, and command substitution.