Day 1: Relations, CRUD, and Joins

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Day 1: Relations, CRUD, and Joins

While we won’t assume that you’re a relational database expert, we do assume that you’ve confronted a database or two in the past. If so, odds are good that the database was relational. We’ll start with creating our own schemas and populating them. Then we’ll take a look at querying for values and finally explore what makes relational databases so special: the table join.

Like most databases you’ll read about, Postgres provides a back-end server that does all of the work and a command-line shell to connect to the running server. The server communicates through port 5432 by default, which you can connect to the shell using the psql command. Let’s connect to our 7dbs schema now:

$ psql 7dbs

PostgreSQL prompts with the name of the database followed by a hash mark (#) if you run as an administrator and by a dollar sign ($) as a regular user. The shell also comes equipped with perhaps the best built-in documentation that you will find in any console. Typing h lists information about SQL commands and ? helps with psql-specific commands, namely those that begin with a backslash. You can find usage details about each SQL command in the following way (the output that follows is truncated):

	7dbs=# h CREATE INDEX
	Command: CREATE INDEX
	Description: define a new index
	Syntax:
	CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
	( { column \| ( expression ) } [ opclass ] [ ASC \| DESC ] [ NULLS ...
	[ WITH ( storage_parameter = value [, ... ] ) ]
	[ TABLESPACE tablespace ]
	[ WHERE predicate ]

Before we dig too deeply into Postgres, it would be good to familiarize yourself with this useful tool. It’s worth looking over (or brushing up on) a few common commands, such as SELECT and CREATE TABLE.

Starting with SQL

PostgreSQL follows the SQL convention of calling relations TABLEs, attributes COLUMNs, and tuples ROWs. For consistency, we will use this terminology, though you may occasionally encounter the mathematical terms relations, attributes, and tuples. For more on these concepts, see "Mathematical Relations" in the text.

Relational databases are so named because they contain relations (tables). Tables are sets of tuples (rows) that map attributes to atomic values—for example, {name: ’Genghis Khan’, died_at_age: 65}. The available attributes are defined by a header, which is a tuple of attributes mapped to some domain or constraining type (columns, for example {name: string, age: int}). That’s the gist of the relational structure.

Implementations are much more practically minded than the names imply, despite sounding so mathematical. So why bring them up? We’re trying to make the point that relational databases are relational based on mathematics. They aren’t relational because tables “relate” to each other via foreign keys. Whether any such constraints exist is beside the point.

The power of the model is certainly in the math, even though the math is largely hidden from you. This magic allows users to express powerful queries while the system optimizes based on predefined patterns. RDBMSs are built atop a set theory branch called relational algebra—a combination of selections (WHERE ...), projections (SELECT ...), Cartesian products (JOIN ...), and more, as shown in the figure that follows.

You can imagine relations as arrays of arrays, where a table is an array of rows, each of which contains an array of attribute/value pairs. One way of working with tables at the code level would be to iterate across all rows in a table and then iterate across each attribute/value pair within the row. But let’s be real: that sounds like a real chore. Fortunately, relational queries are much more declarative—and fun to work with—than that. They’re derived from a branch of mathematics known as tuple relational calculus, which can be converted to relational algebra. PostgreSQL and other RDBMSs optimize queries by performing this conversion and simplifying the algebra. You can see that the SQL in the diagram that follows is the same as in the previous diagram.

Working with Tables

PostgreSQL, being of the relational style, is a design-first database. First you design the schema; then you enter data that conforms to the definition of that schema.

Creating a table involves giving the table a name and a list of columns with types and (optional) constraint information. Each table should also nominate a unique identifier column to pinpoint specific rows. That identifier is called a PRIMARY KEY. The SQL to create a countries table looks like this:

	CREATE TABLE countries (
	country_code char(2) PRIMARY KEY,
	country_name text UNIQUE
	);

This new table will store a set of rows, where each is identified by a two-character code and the name is unique. These columns both have constraints. The PRIMARY KEY constrains the country_code column to disallow duplicate country codes. Only one us and one gb may exist. We explicitly gave country_name a similar unique constraint, although it is not a primary key. We can populate the countries table by inserting a few rows.

	INSERT INTO countries (country_code, country_name)
	VALUES ('us','United States'), ('mx','Mexico'), ('au','Australia'),
	('gb','United Kingdom'), ('de','Germany'), ('ll','Loompaland');

Let’s test our unique constraint. Attempting to add a duplicate country_name will cause our unique constraint to fail, thus disallowing insertion. Constraints are how relational databases such as PostgreSQL ensure kosher data.

	INSERT INTO countries
	VALUES ('uk','United Kingdom');

This will return an error indicating that the Key (country_name)=(United Kingdom) already exists.

We can validate that the proper rows were inserted by reading them using the SELECT...FROM table command.

	SELECT *
	FROM countries;

	country_code \| country_name
	--------------+---------------
	us \| United States
	mx \| Mexico
	au \| Australia
	gb \| United Kingdom
	de \| Germany
	ll \| Loompaland
	(6 rows)

According to any respectable map, Loompaland isn’t a real place, so let’s remove it from the table. You specify which row to remove using the WHERE clause. The row whose country_code equals ll will be removed.

	DELETE FROM countries
	WHERE country_code = 'll';

With only real countries left in the countries table, let’s add a cities table. To ensure any inserted country_code also exists in our countries table, we add the REFERENCES keyword. Because the country_code column references another table’s key, it’s known as the foreign key constraint.

	CREATE TABLE cities (
	name text NOT NULL,
	postal_code varchar(9) CHECK (postal_code <> ''),
	country_code char(2) REFERENCES countries,
	PRIMARY KEY (country_code, postal_code)
	);

This time, we constrained the name in cities by disallowing NULL values. We constrained postal_code by checking that no values are empty strings (<> means not equal). Furthermore, because a PRIMARY KEY uniquely identifies a row, we created a compound key: country_code + postal_code. Together, they uniquely define a row.

Postgres also has a rich set of datatypes, by far the richest amongst the databases in this book. You’ve just seen three different string representations: text (a string of any length), varchar(9) (a string of variable length up to nine characters), and char(2) (a string of exactly two characters). With our schema in place, let’s insert Toronto, CA.

	INSERT INTO cities
	VALUES ('Toronto','M4C1B5','ca');

	ERROR: insert or update on table "cities" violates foreign key constraint
	"cities_country_code_fkey"
	DETAIL: Key (country_code)=(ca) is not present in table "countries".

This failure is good! Because country_code REFERENCES countries, the country_code must exist in the countries table. As shown in the figure, the REFERENCES keyword constrains fields to another table’s primary key. This is called maintaining referential integrity, and it ensures our data is always correct.

It’s worth noting that NULL is valid for cities.country_code because NULL represents the lack of a value. If you want to disallow a NULL country_code reference, you would define the table cities column like this: country_code char(2) REFERENCES countries NOT NULL.

Now let’s try another insert, this time with a U.S. city (quite possibly the greatest of U.S. cities).

	INSERT INTO cities
	VALUES ('Portland','87200','us');

	INSERT 0 1

This is a successful insert, to be sure. But we mistakenly entered a postal_code that doesn’t actually exist in Portland. One postal code that does exist and may just belong to one of the authors is 97206. Rather than delete and reinsert the value, we can update it inline.

	UPDATE cities
	SET postal_code = '97206'
	WHERE name = 'Portland';

We have now Created, Read, Updated, and Deleted table rows.

Join Reads

All of the other databases you’ll read about in this book perform CRUD operations as well. What sets relational databases like PostgreSQL apart is their ability to join tables together when reading them. Joining, in essence, is an operation taking two separate tables and combining them in some way to return a single table. It’s somewhat like putting together puzzle pieces to create a bigger, more complete picture.

The basic form of a join is the inner join. In the simplest form, you specify two columns (one from each table) to match by, using the ON keyword.

	SELECT cities.*, country_name
	FROM cities INNER JOIN countries / or just FROM cities JOIN countries /
	ON cities.country_code = countries.country_code;

	country_code \| name \| postal_code \| country_name
	--------------+----------+-------------+---------------
	us \| Portland \| 97206 \| United States

The join returns a single table, sharing all columns’ values of the cities table plus the matching country_name value from the countries table.

You can also join a table like cities that has a compound primary key. To test a compound join, let’s create a new table that stores a list of venues.

A venue exists in both a postal code and a specific country. The foreign key must be two columns that reference both cities primary key columns. (MATCH FULL is a constraint that ensures either both values exist or both are NULL.)

	CREATE TABLE venues (
	venue_id SERIAL PRIMARY KEY,
	name varchar(255),
	street_address text,
	type char(7) CHECK ( type in ('public','private') ) DEFAULT 'public',
	postal_code varchar(9),
	country_code char(2),
	FOREIGN KEY (country_code, postal_code)
	REFERENCES cities (country_code, postal_code) MATCH FULL
	);

This venue_id column is a common primary key setup: automatically incremented integers (1, 2, 3, 4, and so on). You make this identifier using the SERIAL keyword. (MySQL has a similar construct called AUTO_INCREMENT.)

	INSERT INTO venues (name, postal_code, country_code)
	VALUES ('Crystal Ballroom', '97206', 'us');

Although we did not set a venue_id value, creating the row populated it.

Back to our compound join. Joining the venues table with the cities table requires both foreign key columns. To save on typing, you can alias the table names by following the real table name directly with an alias, with an optional AS between (for example, venues v or venues AS v).

	SELECT v.venue_id, v.name, c.name
	FROM venues v INNER JOIN cities c
	ON v.postal_code=c.postal_code AND v.country_code=c.country_code;

	venue_id \| name \| name
	----------+------------------+----------
	1 \| Crystal Ballroom \| Portland

You can optionally request that PostgreSQL return columns after insertion by ending the query with a RETURNING statement.

	INSERT INTO venues (name, postal_code, country_code)
	VALUES ('Voodoo Doughnut', '97206', 'us') RETURNING venue_id;

	id
	- - - -
	2

This provides the new venue_id without issuing another query.

The Outer Limits

In addition to inner joins, PostgreSQL can also perform outer joins. Outer joins are a way of merging two tables when the results of one table must always be returned, whether or not any matching column values exist on the other table.

It’s easiest to give an example, but to do that, we’ll create a new table named events. This one is up to you. Your events table should have these columns: a SERIAL integer event_id, a title, starts and ends (of type timestamp), and a venue_id (foreign key that references venues). A schema definition diagram covering all the tables we’ve made so far is shown in the following figure.

After creating the events table, INSERT the following values (timestamps are inserted as a string like 2018-02-15 17:30) for two holidays and a club we do not talk about:

title	starts	ends	venue_id	event_id
Fight Club	2018-02-15 17:30:00	2018-02-15 19:30:00	2	1
April Fools Day	2018-04-01 00:00:00	2018-04-01 23:59:00		2
Christmas Day	2018-02-15 19:30:00	2018-12-25 23:59:00		3

Let’s first craft a query that returns an event title and venue name as an inner join (the word INNER from INNER JOIN is not required, so leave it off here).

	SELECT e.title, v.name
	FROM events e JOIN venues v
	ON e.venue_id = v.venue_id;

	title \| name
	--------------+------------------
	Fight Club \| Voodoo Doughnut

INNER JOIN will return a row only if the column values match. Because we can’t have NULL venues.venue_id, the two NULL events.venue_ids refer to nothing. Retrieving all of the events, whether or not they have a venue, requires a LEFT OUTER JOIN (shortened to LEFT JOIN).

	SELECT e.title, v.name
	FROM events e LEFT JOIN venues v
	ON e.venue_id = v.venue_id;

	title \| name
	------------------+----------------
	Fight Club \| Voodoo Doughnut
	April Fools Day \|
	Christmas Day \|

If you require the inverse, all venues and only matching events, use a RIGHT JOIN. Finally, there’s the FULL JOIN, which is the union of LEFT and RIGHT; you’re guaranteed all values from each table, joined wherever columns match.

Fast Lookups with Indexing

The speed of PostgreSQL (and any other RDBMS) lies in its efficient management of blocks of data, reduced disk reads, query optimization, and other techniques. But those only go so far in fetching results quickly. If we select the title of Christmas Day from the events table, the algorithm must scan every row for a match to return. Without an index, each row must be read from disk to know whether a query should return it. See the following.

An index is a special data structure built to avoid a full table scan when performing a query. When running CREATE TABLE commands, you may have noticed a message like this:

	CREATE TABLE / PRIMARY KEY will create implicit index "events_pkey"
	for table "events"

PostgreSQL automatically creates an index on the primary key—in particular a B-tree index—where the key is the primary key value and where the value points to a row on disk, as shown in the top figure. Using the UNIQUE keyword is another way to force an index on a table column.

You can explicitly add a hash index using the CREATE INDEX command, where each value must be unique (like a hashtable or a map).

	CREATE INDEX events_title
	ON events USING hash (title);

For less-than/greater-than/equals-to matches, we want an index more flexible than a simple hash, like a B-tree, which can match on ranged queries (see the following figure).

Consider a query to find all events that are on or after April 1.

	SELECT *
	FROM events
	WHERE starts >= '2018-04-01';

	event_id \| title \| starts \| ...
	----------+------------------+---------------------+-----
	2 \| April Fools Day \| 2018-04-01 00:00:00 \| ...
	3 \| Christmas Day \| 2018-12-25 00:00:00 \| ...

For this, a tree is the perfect data structure. To index the starts column with a B-tree, use this:

	CREATE INDEX events_starts
	ON events USING btree (starts);

Now our query over a range of dates will avoid a full table scan. It makes a huge difference when scanning millions or billions of rows.

We can inspect our work with this command to list all indexes in the schema:

7dbs=# di

It’s worth noting that when you set a FOREIGN KEY constraint, PostgreSQL will not automatically create an index on the targeted column(s). You’ll need to create an index on the targeted column(s) yourself. Even if you don’t like using database constraints (that’s right, we’re looking at you, Ruby on Rails developers), you will often find yourself creating indexes on columns you plan to join against in order to help speed up foreign key joins.

Day 1 Wrap-Up

We sped through a lot today and covered many terms. Here’s a recap:

Term	Definition
Column	A domain of values of a certain type, sometimes called an attribute
Row	An object comprised of a set of column values, sometimes called a tuple
Table	A set of rows with the same columns, sometimes called a relation
Primary key	The unique value that pinpoints a specific row
Foreign key	A data constraint that ensures that each entry in a column in one table uniquely corresponds to a row in another table (or even the same table)
CRUD	Create, Read, Update, Delete
SQL	Structured Query Language, the lingua franca of a relational database
Join	Combining two tables into one by some matching columns
Left join	Combining two tables into one by some matching columns or NULL if nothing matches the left table
Index	A data structure to optimize selection of a specific set of columns
B-tree index	A good standard index; values are stored as a balanced tree data structure; very flexible; B-tree indexes are the default in Postgres
Hash index	Another good standard index in which each index value is unique; hash indexes tend to offer better performance for comparison operations than B-tree indexes but are less flexible and don’t allow for things like range queries

Relational databases have been the de facto data management strategy for forty years—many of us began our careers in the midst of their evolution. Others may disagree, but we think that understanding “NoSQL” databases is a non-starter without rooting ourselves in this paradigm, even if for just a brief sojourn. So we looked at some of the core concepts of the relational model via basic SQL queries and undertook a light foray into some mathematical foundations. We will expound on these root concepts tomorrow.

Day 1 Homework

Find

Find the PostgreSQL documentation online and bookmark it.
Acquaint yourself with the command-line ? and h output.
We briefly mentioned the MATCH FULL constraint. Find some information on the other available types of MATCH constraints.

Select all the tables we created (and only those) from pg_class and examine the table to get a sense of what kinds of metadata Postgres stores about tables.
Write a query that finds the country name of the Fight Club event.
Alter the venues table such that it contains a Boolean column called active with a default value of TRUE.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

	SELECT *
	FROM countries;

	country_code \| country_name
	--------------+---------------
	us \| United States
	mx \| Mexico
	au \| Australia
	gb \| United Kingdom
	de \| Germany
	ll \| Loompaland
	(6 rows)

Table of Contents for Day 1: Relations, CRUD, and Joins

Create new playlist

Sign In

Sign Up

Day 1: Relations, CRUD, and Joins

Starting with SQL

Working with Tables

Join Reads

The Outer Limits

Fast Lookups with Indexing

Day 1 Wrap-Up

Day 1 Homework

Table of Contents for
Day 1: Relations, CRUD, and Joins