While we won’t assume that you’re a relational database expert, we do assume that you’ve confronted a database or two in the past. If so, odds are good that the database was relational. We’ll start with creating our own schemas and populating them. Then we’ll take a look at querying for values and finally explore what makes relational databases so special: the table join.
Like most databases you’ll read about, Postgres provides a back-end server that does all of the work and a command-line shell to connect to the running server. The server communicates through port 5432 by default, which you can connect to the shell using the psql command. Let’s connect to our 7dbs schema now:
| $ psql 7dbs |
PostgreSQL prompts with the name of the database followed by a hash mark (#) if you run as an administrator and by a dollar sign ($) as a regular user. The shell also comes equipped with perhaps the best built-in documentation that you will find in any console. Typing h lists information about SQL commands and ? helps with psql-specific commands, namely those that begin with a backslash. You can find usage details about each SQL command in the following way (the output that follows is truncated):
| 7dbs=# h CREATE INDEX |
| Command: CREATE INDEX |
| Description: define a new index |
| Syntax: |
| CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ] |
| ( { column | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS ... |
| [ WITH ( storage_parameter = value [, ... ] ) ] |
| [ TABLESPACE tablespace ] |
| [ WHERE predicate ] |
Before we dig too deeply into Postgres, it would be good to familiarize yourself with this useful tool. It’s worth looking over (or brushing up on) a few common commands, such as SELECT and CREATE TABLE.
PostgreSQL follows the SQL convention of calling relations TABLEs, attributes COLUMNs, and tuples ROWs. For consistency, we will use this terminology, though you may occasionally encounter the mathematical terms relations, attributes, and tuples. For more on these concepts, see "Mathematical Relations" in the text.
PostgreSQL, being of the relational style, is a design-first database. First you design the schema; then you enter data that conforms to the definition of that schema.
Creating a table involves giving the table a name and a list of columns with types and (optional) constraint information. Each table should also nominate a unique identifier column to pinpoint specific rows. That identifier is called a PRIMARY KEY. The SQL to create a countries table looks like this:
| CREATE TABLE countries ( |
| country_code char(2) PRIMARY KEY, |
| country_name text UNIQUE |
| ); |
This new table will store a set of rows, where each is identified by a two-character code and the name is unique. These columns both have constraints. The PRIMARY KEY constrains the country_code column to disallow duplicate country codes. Only one us and one gb may exist. We explicitly gave country_name a similar unique constraint, although it is not a primary key. We can populate the countries table by inserting a few rows.
| INSERT INTO countries (country_code, country_name) |
| VALUES ('us','United States'), ('mx','Mexico'), ('au','Australia'), |
| ('gb','United Kingdom'), ('de','Germany'), ('ll','Loompaland'); |
Let’s test our unique constraint. Attempting to add a duplicate country_name will cause our unique constraint to fail, thus disallowing insertion. Constraints are how relational databases such as PostgreSQL ensure kosher data.
| INSERT INTO countries |
| VALUES ('uk','United Kingdom'); |
This will return an error indicating that the Key (country_name)=(United Kingdom) already exists.
We can validate that the proper rows were inserted by reading them using the SELECT...FROM table command.
| SELECT * |
| FROM countries; |
| |
| country_code | country_name |
| --------------+--------------- |
| us | United States |
| mx | Mexico |
| au | Australia |
| gb | United Kingdom |
| de | Germany |
| ll | Loompaland |
| (6 rows) |
According to any respectable map, Loompaland isn’t a real place, so let’s remove it from the table. You specify which row to remove using the WHERE clause. The row whose country_code equals ll will be removed.
| DELETE FROM countries |
| WHERE country_code = 'll'; |
With only real countries left in the countries table, let’s add a cities table. To ensure any inserted country_code also exists in our countries table, we add the REFERENCES keyword. Because the country_code column references another table’s key, it’s known as the foreign key constraint.
| CREATE TABLE cities ( |
| name text NOT NULL, |
| postal_code varchar(9) CHECK (postal_code <> ''), |
| country_code char(2) REFERENCES countries, |
| PRIMARY KEY (country_code, postal_code) |
| ); |
This time, we constrained the name in cities by disallowing NULL values. We constrained postal_code by checking that no values are empty strings (<> means not equal). Furthermore, because a PRIMARY KEY uniquely identifies a row, we created a compound key: country_code + postal_code. Together, they uniquely define a row.
Postgres also has a rich set of datatypes, by far the richest amongst the databases in this book. You’ve just seen three different string representations: text (a string of any length), varchar(9) (a string of variable length up to nine characters), and char(2) (a string of exactly two characters). With our schema in place, let’s insert Toronto, CA.
| INSERT INTO cities |
| VALUES ('Toronto','M4C1B5','ca'); |
| |
| ERROR: insert or update on table "cities" violates foreign key constraint |
| "cities_country_code_fkey" |
| DETAIL: Key (country_code)=(ca) is not present in table "countries". |
This failure is good! Because country_code REFERENCES countries, the country_code must exist in the countries table. As shown in the figure, the REFERENCES keyword constrains fields to another table’s primary key. This is called maintaining referential integrity, and it ensures our data is always correct.
It’s worth noting that NULL is valid for cities.country_code because NULL represents the lack of a value. If you want to disallow a NULL country_code reference, you would define the table cities column like this: country_code char(2) REFERENCES countries NOT NULL.
Now let’s try another insert, this time with a U.S. city (quite possibly the greatest of U.S. cities).
| INSERT INTO cities |
| VALUES ('Portland','87200','us'); |
| |
| INSERT 0 1 |
This is a successful insert, to be sure. But we mistakenly entered a postal_code that doesn’t actually exist in Portland. One postal code that does exist and may just belong to one of the authors is 97206. Rather than delete and reinsert the value, we can update it inline.
| UPDATE cities |
| SET postal_code = '97206' |
| WHERE name = 'Portland'; |
We have now Created, Read, Updated, and Deleted table rows.
All of the other databases you’ll read about in this book perform CRUD operations as well. What sets relational databases like PostgreSQL apart is their ability to join tables together when reading them. Joining, in essence, is an operation taking two separate tables and combining them in some way to return a single table. It’s somewhat like putting together puzzle pieces to create a bigger, more complete picture.
The basic form of a join is the inner join. In the simplest form, you specify two columns (one from each table) to match by, using the ON keyword.
| SELECT cities.*, country_name |
| FROM cities INNER JOIN countries /* or just FROM cities JOIN countries */ |
| ON cities.country_code = countries.country_code; |
| |
| country_code | name | postal_code | country_name |
| --------------+----------+-------------+--------------- |
| us | Portland | 97206 | United States |
The join returns a single table, sharing all columns’ values of the cities table plus the matching country_name value from the countries table.
You can also join a table like cities that has a compound primary key. To test a compound join, let’s create a new table that stores a list of venues.
A venue exists in both a postal code and a specific country. The foreign key must be two columns that reference both cities primary key columns. (MATCH FULL is a constraint that ensures either both values exist or both are NULL.)
| CREATE TABLE venues ( |
| venue_id SERIAL PRIMARY KEY, |
| name varchar(255), |
| street_address text, |
| type char(7) CHECK ( type in ('public','private') ) DEFAULT 'public', |
| postal_code varchar(9), |
| country_code char(2), |
| FOREIGN KEY (country_code, postal_code) |
| REFERENCES cities (country_code, postal_code) MATCH FULL |
| ); |
This venue_id column is a common primary key setup: automatically incremented integers (1, 2, 3, 4, and so on). You make this identifier using the SERIAL keyword. (MySQL has a similar construct called AUTO_INCREMENT.)
| INSERT INTO venues (name, postal_code, country_code) |
| VALUES ('Crystal Ballroom', '97206', 'us'); |
Although we did not set a venue_id value, creating the row populated it.
Back to our compound join. Joining the venues table with the cities table requires both foreign key columns. To save on typing, you can alias the table names by following the real table name directly with an alias, with an optional AS between (for example, venues v or venues AS v).
| SELECT v.venue_id, v.name, c.name |
| FROM venues v INNER JOIN cities c |
| ON v.postal_code=c.postal_code AND v.country_code=c.country_code; |
| |
| venue_id | name | name |
| ----------+------------------+---------- |
| 1 | Crystal Ballroom | Portland |
You can optionally request that PostgreSQL return columns after insertion by ending the query with a RETURNING statement.
| INSERT INTO venues (name, postal_code, country_code) |
| VALUES ('Voodoo Doughnut', '97206', 'us') RETURNING venue_id; |
| |
| id |
| - - - - |
| 2 |
This provides the new venue_id without issuing another query.
In addition to inner joins, PostgreSQL can also perform outer joins. Outer joins are a way of merging two tables when the results of one table must always be returned, whether or not any matching column values exist on the other table.
It’s easiest to give an example, but to do that, we’ll create a new table named events. This one is up to you. Your events table should have these columns: a SERIAL integer event_id, a title, starts and ends (of type timestamp), and a venue_id (foreign key that references venues). A schema definition diagram covering all the tables we’ve made so far is shown in the following figure.
After creating the events table, INSERT the following values (timestamps are inserted as a string like 2018-02-15 17:30) for two holidays and a club we do not talk about:
title | starts | ends | venue_id | event_id |
---|---|---|---|---|
Fight Club | 2018-02-15 17:30:00 | 2018-02-15 19:30:00 | 2 | 1 |
April Fools Day | 2018-04-01 00:00:00 | 2018-04-01 23:59:00 | 2 | |
Christmas Day | 2018-02-15 19:30:00 | 2018-12-25 23:59:00 | 3 |
Let’s first craft a query that returns an event title and venue name as an inner join (the word INNER from INNER JOIN is not required, so leave it off here).
| SELECT e.title, v.name |
| FROM events e JOIN venues v |
| ON e.venue_id = v.venue_id; |
| |
| title | name |
| --------------+------------------ |
| Fight Club | Voodoo Doughnut |
INNER JOIN will return a row only if the column values match. Because we can’t have NULL venues.venue_id, the two NULL events.venue_ids refer to nothing. Retrieving all of the events, whether or not they have a venue, requires a LEFT OUTER JOIN (shortened to LEFT JOIN).
| SELECT e.title, v.name |
| FROM events e LEFT JOIN venues v |
| ON e.venue_id = v.venue_id; |
| |
| title | name |
| ------------------+---------------- |
| Fight Club | Voodoo Doughnut |
| April Fools Day | |
| Christmas Day | |
If you require the inverse, all venues and only matching events, use a RIGHT JOIN. Finally, there’s the FULL JOIN, which is the union of LEFT and RIGHT; you’re guaranteed all values from each table, joined wherever columns match.
The speed of PostgreSQL (and any other RDBMS) lies in its efficient management of blocks of data, reduced disk reads, query optimization, and other techniques. But those only go so far in fetching results quickly. If we select the title of Christmas Day from the events table, the algorithm must scan every row for a match to return. Without an index, each row must be read from disk to know whether a query should return it. See the following.
An index is a special data structure built to avoid a full table scan when performing a query. When running CREATE TABLE commands, you may have noticed a message like this:
| CREATE TABLE / PRIMARY KEY will create implicit index "events_pkey" |
| for table "events" |
PostgreSQL automatically creates an index on the primary key—in particular a B-tree index—where the key is the primary key value and where the value points to a row on disk, as shown in the top figure. Using the UNIQUE keyword is another way to force an index on a table column.
You can explicitly add a hash index using the CREATE INDEX command, where each value must be unique (like a hashtable or a map).
| CREATE INDEX events_title |
| ON events USING hash (title); |
For less-than/greater-than/equals-to matches, we want an index more flexible than a simple hash, like a B-tree, which can match on ranged queries (see the following figure).
Consider a query to find all events that are on or after April 1.
| SELECT * |
| FROM events |
| WHERE starts >= '2018-04-01'; |
| |
| event_id | title | starts | ... |
| ----------+------------------+---------------------+----- |
| 2 | April Fools Day | 2018-04-01 00:00:00 | ... |
| 3 | Christmas Day | 2018-12-25 00:00:00 | ... |
For this, a tree is the perfect data structure. To index the starts column with a B-tree, use this:
| CREATE INDEX events_starts |
| ON events USING btree (starts); |
Now our query over a range of dates will avoid a full table scan. It makes a huge difference when scanning millions or billions of rows.
We can inspect our work with this command to list all indexes in the schema:
| 7dbs=# di |
It’s worth noting that when you set a FOREIGN KEY constraint, PostgreSQL will not automatically create an index on the targeted column(s). You’ll need to create an index on the targeted column(s) yourself. Even if you don’t like using database constraints (that’s right, we’re looking at you, Ruby on Rails developers), you will often find yourself creating indexes on columns you plan to join against in order to help speed up foreign key joins.
We sped through a lot today and covered many terms. Here’s a recap:
Term | Definition |
---|---|
Column | A domain of values of a certain type, sometimes called an attribute |
Row | An object comprised of a set of column values, sometimes called a tuple |
Table | A set of rows with the same columns, sometimes called a relation |
Primary key | The unique value that pinpoints a specific row |
Foreign key | A data constraint that ensures that each entry in a column in one table uniquely corresponds to a row in another table (or even the same table) |
CRUD | Create, Read, Update, Delete |
SQL | Structured Query Language, the lingua franca of a relational database |
Join | Combining two tables into one by some matching columns |
Left join | Combining two tables into one by some matching columns or NULL if nothing matches the left table |
Index | A data structure to optimize selection of a specific set of columns |
B-tree index | A good standard index; values are stored as a balanced tree data structure; very flexible; B-tree indexes are the default in Postgres |
Hash index | Another good standard index in which each index value is unique; hash indexes tend to offer better performance for comparison operations than B-tree indexes but are less flexible and don’t allow for things like range queries |
Relational databases have been the de facto data management strategy for forty years—many of us began our careers in the midst of their evolution. Others may disagree, but we think that understanding “NoSQL” databases is a non-starter without rooting ourselves in this paradigm, even if for just a brief sojourn. So we looked at some of the core concepts of the relational model via basic SQL queries and undertook a light foray into some mathematical foundations. We will expound on these root concepts tomorrow.
Find
Do