Day 1: Relations, CRUD, and Joins

While we won’t assume that you’re a relational database expert, we do assume that you’ve confronted a database or two in the past. If so, odds are good that the database was relational. We’ll start with creating our own schemas and populating them. Then we’ll take a look at querying for values and finally explore what makes relational databases so special: the table join.

Like most databases you’ll read about, Postgres provides a back-end server that does all of the work and a command-line shell to connect to the running server. The server communicates through port 5432 by default, which you can connect to the shell using the psql command. Let’s connect to our 7dbs schema now:

 $ ​​psql​​ ​​7dbs

PostgreSQL prompts with the name of the database followed by a hash mark (#) if you run as an administrator and by a dollar sign ($) as a regular user. The shell also comes equipped with perhaps the best built-in documentation that you will find in any console. Typing h lists information about SQL commands and ? helps with psql-specific commands, namely those that begin with a backslash. You can find usage details about each SQL command in the following way (the output that follows is truncated):

 7dbs=# ​​h ​CREATE​ ​INDEX
 Command: ​CREATE​ ​INDEX
 Description: define a ​new​ ​index
 Syntax:
 CREATE​ [ ​UNIQUE​ ] ​INDEX​ [ CONCURRENTLY ] [ ​name​ ] ​ON​ ​table​ [ ​USING​ method ]
  ( ​{​ ​column​ | ( expression ) ​}​ [ opclass ] [ ​ASC​ | ​DESC​ ] [ NULLS ...
  [ ​WITH​ ( storage_parameter = ​value​ [, ... ] ) ]
  [ ​TABLESPACE​ ​tablespace​ ]
  [ ​WHERE​ predicate ]

Before we dig too deeply into Postgres, it would be good to familiarize yourself with this useful tool. It’s worth looking over (or brushing up on) a few common commands, such as SELECT and CREATE TABLE.

Starting with SQL

PostgreSQL follows the SQL convention of calling relations TABLEs, attributes COLUMNs, and tuples ROWs. For consistency, we will use this terminology, though you may occasionally encounter the mathematical terms relations, attributes, and tuples. For more on these concepts, see "Mathematical Relations" in the text.

Working with Tables

PostgreSQL, being of the relational style, is a design-first database. First you design the schema; then you enter data that conforms to the definition of that schema.

Creating a table involves giving the table a name and a list of columns with types and (optional) constraint information. Each table should also nominate a unique identifier column to pinpoint specific rows. That identifier is called a PRIMARY KEY. The SQL to create a countries table looks like this:

 CREATE​ ​TABLE​ countries (
  country_code ​char​(2) ​PRIMARY​ ​KEY​,
  country_name ​text​ ​UNIQUE
 );

This new table will store a set of rows, where each is identified by a two-character code and the name is unique. These columns both have constraints. The PRIMARY KEY constrains the country_code column to disallow duplicate country codes. Only one us and one gb may exist. We explicitly gave country_name a similar unique constraint, although it is not a primary key. We can populate the countries table by inserting a few rows.

 INSERT​ ​INTO​ countries (country_code, country_name)
 VALUES​ (​'us'​,​'United States'​), (​'mx'​,​'Mexico'​), (​'au'​,​'Australia'​),
  (​'gb'​,​'United Kingdom'​), (​'de'​,​'Germany'​), (​'ll'​,​'Loompaland'​);

Let’s test our unique constraint. Attempting to add a duplicate country_name will cause our unique constraint to fail, thus disallowing insertion. Constraints are how relational databases such as PostgreSQL ensure kosher data.

 INSERT​ ​INTO​ countries
 VALUES​ (​'uk'​,​'United Kingdom'​);

This will return an error indicating that the Key (country_name)=(United Kingdom) already exists.

We can validate that the proper rows were inserted by reading them using the SELECT...FROM table command.

 SELECT​ *
 FROM​ countries;
 
  country_code | country_name
 --------------+---------------
  us | United States
  mx | Mexico
  au | Australia
  gb | United Kingdom
  de | Germany
  ll | Loompaland
 (6 ​rows​)

According to any respectable map, Loompaland isn’t a real place, so let’s remove it from the table. You specify which row to remove using the WHERE clause. The row whose country_code equals ll will be removed.

 DELETE​ ​FROM​ countries
 WHERE​ country_code = ​'ll'​;

With only real countries left in the countries table, let’s add a cities table. To ensure any inserted country_code also exists in our countries table, we add the REFERENCES keyword. Because the country_code column references another table’s key, it’s known as the foreign key constraint.

 CREATE​ ​TABLE​ cities (
 name​ ​text​ ​NOT​ ​NULL​,
  postal_code ​varchar​(9) ​CHECK​ (postal_code <> ​''​),
  country_code ​char​(2) ​REFERENCES​ countries,
 PRIMARY​ ​KEY​ (country_code, postal_code)
 );

This time, we constrained the name in cities by disallowing NULL values. We constrained postal_code by checking that no values are empty strings (<> means not equal). Furthermore, because a PRIMARY KEY uniquely identifies a row, we created a compound key: country_code + postal_code. Together, they uniquely define a row.

Postgres also has a rich set of datatypes, by far the richest amongst the databases in this book. You’ve just seen three different string representations: text (a string of any length), varchar(9) (a string of variable length up to nine characters), and char(2) (a string of exactly two characters). With our schema in place, let’s insert Toronto, CA.

 INSERT​ ​INTO​ cities
 VALUES​ (​'Toronto'​,​'M4C1B5'​,​'ca'​);
 
 ERROR​: ​insert​ ​or​ ​update​ ​on​ ​table​ "cities" violates ​foreign​ ​key​ ​constraint
  "cities_country_code_fkey"
 DETAIL: ​Key​ (country_code)=(ca) ​is​ ​not​ present ​in​ ​table​ "countries".

This failure is good! Because country_code REFERENCES countries, the country_code must exist in the countries table. As shown in the figure, the REFERENCES keyword constrains fields to another table’s primary key. This is called maintaining referential integrity, and it ensures our data is always correct.

images/postgres-join-text.png

It’s worth noting that NULL is valid for cities.country_code because NULL represents the lack of a value. If you want to disallow a NULL country_code reference, you would define the table cities column like this: country_code char(2) REFERENCES countries NOT NULL.

Now let’s try another insert, this time with a U.S. city (quite possibly the greatest of U.S. cities).

 INSERT​ ​INTO​ cities
 VALUES​ (​'Portland'​,​'87200'​,​'us'​);
 
 INSERT​ 0 1

This is a successful insert, to be sure. But we mistakenly entered a postal_code that doesn’t actually exist in Portland. One postal code that does exist and may just belong to one of the authors is 97206. Rather than delete and reinsert the value, we can update it inline.

 UPDATE​ cities
 SET​ postal_code = ​'97206'
 WHERE​ ​name​ = ​'Portland'​;

We have now Created, Read, Updated, and Deleted table rows.

Join Reads

All of the other databases you’ll read about in this book perform CRUD operations as well. What sets relational databases like PostgreSQL apart is their ability to join tables together when reading them. Joining, in essence, is an operation taking two separate tables and combining them in some way to return a single table. It’s somewhat like putting together puzzle pieces to create a bigger, more complete picture.

The basic form of a join is the inner join. In the simplest form, you specify two columns (one from each table) to match by, using the ON keyword.

 SELECT​ cities.*, country_name
 FROM​ cities ​INNER​ ​JOIN​ countries ​/* or just FROM cities JOIN countries */
 ON​ cities.country_code = countries.country_code;
 
  country_code | ​name​ | postal_code | country_name
 --------------+----------+-------------+---------------
  us | Portland | 97206 | United States

The join returns a single table, sharing all columns’ values of the cities table plus the matching country_name value from the countries table.

You can also join a table like cities that has a compound primary key. To test a compound join, let’s create a new table that stores a list of venues.

A venue exists in both a postal code and a specific country. The foreign key must be two columns that reference both cities primary key columns. (MATCH FULL is a constraint that ensures either both values exist or both are NULL.)

 CREATE​ ​TABLE​ venues (
  venue_id ​SERIAL​ ​PRIMARY​ ​KEY​,
 name​ ​varchar​(255),
  street_address ​text​,
 type​ ​char​(7) ​CHECK​ ( ​type​ ​in​ (​'public'​,​'private'​) ) ​DEFAULT​ ​'public'​,
  postal_code ​varchar​(9),
  country_code ​char​(2),
 FOREIGN​ ​KEY​ (country_code, postal_code)
 REFERENCES​ cities (country_code, postal_code) ​MATCH​ ​FULL
 );

This venue_id column is a common primary key setup: automatically incremented integers (1, 2, 3, 4, and so on). You make this identifier using the SERIAL keyword. (MySQL has a similar construct called AUTO_INCREMENT.)

 INSERT​ ​INTO​ venues (​name​, postal_code, country_code)
 VALUES​ (​'Crystal Ballroom'​, ​'97206'​, ​'us'​);

Although we did not set a venue_id value, creating the row populated it.

Back to our compound join. Joining the venues table with the cities table requires both foreign key columns. To save on typing, you can alias the table names by following the real table name directly with an alias, with an optional AS between (for example, venues v or venues AS v).

 SELECT​ v.venue_id, v.​name​, c.​name
 FROM​ venues v ​INNER​ ​JOIN​ cities c
 ON​ v.postal_code=c.postal_code ​AND​ v.country_code=c.country_code;
 
  venue_id | ​name​ | ​name
 ----------+------------------+----------
  1 | Crystal Ballroom | Portland

You can optionally request that PostgreSQL return columns after insertion by ending the query with a RETURNING statement.

 INSERT​ ​INTO​ venues (​name​, postal_code, country_code)
 VALUES​ (​'Voodoo Doughnut'​, ​'97206'​, ​'us'​) RETURNING venue_id;
 
  id
 - - - -
  2

This provides the new venue_id without issuing another query.

The Outer Limits

In addition to inner joins, PostgreSQL can also perform outer joins. Outer joins are a way of merging two tables when the results of one table must always be returned, whether or not any matching column values exist on the other table.

It’s easiest to give an example, but to do that, we’ll create a new table named events. This one is up to you. Your events table should have these columns: a SERIAL integer event_id, a title, starts and ends (of type timestamp), and a venue_id (foreign key that references venues). A schema definition diagram covering all the tables we’ve made so far is shown in the following figure.

images/postgres-erd.png

After creating the events table, INSERT the following values (timestamps are inserted as a string like 2018-02-15 17:30) for two holidays and a club we do not talk about:

title     

starts     

ends     

venue_id     

event_id

Fight Club

2018-02-15 17:30:00

2018-02-15 19:30:00

2     

1

April Fools Day     

2018-04-01 00:00:00     

2018-04-01 23:59:00     

2

Christmas Day

2018-02-15 19:30:00

2018-12-25 23:59:00

3

Let’s first craft a query that returns an event title and venue name as an inner join (the word INNER from INNER JOIN is not required, so leave it off here).

 SELECT​ e.title, v.​name
 FROM​ ​events​ e ​JOIN​ venues v
 ON​ e.venue_id = v.venue_id;
 
  title | ​name
 --------------+------------------
  Fight Club | Voodoo Doughnut

INNER JOIN will return a row only if the column values match. Because we can’t have NULL venues.venue_id, the two NULL events.venue_ids refer to nothing. Retrieving all of the events, whether or not they have a venue, requires a LEFT OUTER JOIN (shortened to LEFT JOIN).

 SELECT​ e.title, v.​name
 FROM​ ​events​ e ​LEFT​ ​JOIN​ venues v
 ON​ e.venue_id = v.venue_id;
 
  title | ​name
 ------------------+----------------
  Fight Club | Voodoo Doughnut
  April Fools ​Day​ |
  Christmas ​Day​ |

If you require the inverse, all venues and only matching events, use a RIGHT JOIN. Finally, there’s the FULL JOIN, which is the union of LEFT and RIGHT; you’re guaranteed all values from each table, joined wherever columns match.

Fast Lookups with Indexing

The speed of PostgreSQL (and any other RDBMS) lies in its efficient management of blocks of data, reduced disk reads, query optimization, and other techniques. But those only go so far in fetching results quickly. If we select the title of Christmas Day from the events table, the algorithm must scan every row for a match to return. Without an index, each row must be read from disk to know whether a query should return it. See the following.

images/postgres-scan.png

An index is a special data structure built to avoid a full table scan when performing a query. When running CREATE TABLE commands, you may have noticed a message like this:

 CREATE​ ​TABLE​ / ​PRIMARY​ ​KEY​ will ​create​ implicit ​index​ "events_pkey" ​
 for​ ​table​ "events"

PostgreSQL automatically creates an index on the primary key—in particular a B-tree index—where the key is the primary key value and where the value points to a row on disk, as shown in the top figure. Using the UNIQUE keyword is another way to force an index on a table column.

images/postgres-hash-index.png

You can explicitly add a hash index using the CREATE INDEX command, where each value must be unique (like a hashtable or a map).

 CREATE​ ​INDEX​ events_title
 ON​ ​events​ ​USING​ ​hash​ (title);

For less-than/greater-than/equals-to matches, we want an index more flexible than a simple hash, like a B-tree, which can match on ranged queries (see the following figure).

images/postgres-big-btree.png

Consider a query to find all events that are on or after April 1.

 SELECT​ *
 FROM​ ​events
 WHERE​ ​starts​ >= ​'2018-04-01'​;
 
  event_id | title | ​starts​ | ...
 ----------+------------------+---------------------+-----
  2 | April Fools ​Day​ | 2018-04-01 00:00:00 | ...
  3 | Christmas ​Day​ | 2018-12-25 00:00:00 | ...

For this, a tree is the perfect data structure. To index the starts column with a B-tree, use this:

 CREATE​ ​INDEX​ events_starts
 ON​ ​events​ ​USING​ ​btree​ (​starts​);

Now our query over a range of dates will avoid a full table scan. It makes a huge difference when scanning millions or billions of rows.

We can inspect our work with this command to list all indexes in the schema:

 7dbs=# ​​di

It’s worth noting that when you set a FOREIGN KEY constraint, PostgreSQL will not automatically create an index on the targeted column(s). You’ll need to create an index on the targeted column(s) yourself. Even if you don’t like using database constraints (that’s right, we’re looking at you, Ruby on Rails developers), you will often find yourself creating indexes on columns you plan to join against in order to help speed up foreign key joins.

Day 1 Wrap-Up

We sped through a lot today and covered many terms. Here’s a recap:

Term

Definition

Column

A domain of values of a certain type, sometimes called an attribute

Row

An object comprised of a set of column values, sometimes called a tuple

Table

A set of rows with the same columns, sometimes called a relation

Primary key

The unique value that pinpoints a specific row

Foreign key

A data constraint that ensures that each entry in a column in one table uniquely corresponds to a row in another table (or even the same table)

CRUD

Create, Read, Update, Delete

SQL

Structured Query Language, the lingua franca of a relational database

Join

Combining two tables into one by some matching columns

Left join

Combining two tables into one by some matching columns or NULL if nothing matches the left table

Index

A data structure to optimize selection of a specific set of columns

B-tree index

A good standard index; values are stored as a balanced tree data structure; very flexible; B-tree indexes are the default in Postgres

Hash index

Another good standard index in which each index value is unique; hash indexes tend to offer better performance for comparison operations than B-tree indexes but are less flexible and don’t allow for things like range queries

Relational databases have been the de facto data management strategy for forty years—many of us began our careers in the midst of their evolution. Others may disagree, but we think that understanding “NoSQL” databases is a non-starter without rooting ourselves in this paradigm, even if for just a brief sojourn. So we looked at some of the core concepts of the relational model via basic SQL queries and undertook a light foray into some mathematical foundations. We will expound on these root concepts tomorrow.

Day 1 Homework

Find

  1. Find the PostgreSQL documentation online and bookmark it.
  2. Acquaint yourself with the command-line ? and h output.
  3. We briefly mentioned the MATCH FULL constraint. Find some information on the other available types of MATCH constraints.

Do

  1. Select all the tables we created (and only those) from pg_class and examine the table to get a sense of what kinds of metadata Postgres stores about tables.
  2. Write a query that finds the country name of the Fight Club event.
  3. Alter the venues table such that it contains a Boolean column called active with a default value of TRUE.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset