© Daniel Bartholomew 2017
Daniel BartholomewMariaDB and MySQL Common Table Expressions and Window Functions Revealedhttps://doi.org/10.1007/978-1-4842-3120-3_3

3. Recursive Common Table Expressions

Daniel Bartholomew
(1)
Raleigh, North Carolina, USA
 
Recursion is a very useful technique in computer science. Recursive algorithms are well suited for navigating data structures such as trees , where items contain other items that may also contain items, and graphs, which track connections or routes between items. SQL has historically done a poor job with these.
Oracle attempted to add recursive support to SQL in the 1980s with their non-standard CONNECT BY syntax, but this has now been superseded, improved upon, and standardized in the official SQL standard, version SQL99, with recursive CTEs. Implementations of this standard started appearing in various databases, such as Oracle and SQL Server, starting around 2007, with MariaDB and MySQL finally catching up and getting them about ten years after that.
In simple terms, a recursive CTE is a CTE that refers to itself in its <cte_body>. Having a CTE refer to itself might seem complicated, but once you get the hang of it, it isn’t bad, and by using recursion there are a lot of cool things you can do. This chapter will provide examples showing some of the things you can do with recursive CTEs.

Before We Begin

As with the other chapters, the examples in this chapter utilize sample data. For this chapter, we’ll be using two tables, one called tudors and another called routes . The tudors table can be created with the following query:
CREATE TABLE tudors (
  id serial primary key,
  name VARCHAR(100) NOT NULL,
  father BIGINT(20),
  mother BIGINT(20)
);
The data is in a CSV file called bartholomew-ch03-tudors.csv . It can be loaded with a query similar to the following (assuming the file is on the computer running MariaDB or MySQL server in the /tmp/ folder):
LOAD DATA INFILE '/tmp/bartholomew-ch03-tudors.csv'
  INTO TABLE tudors
  FIELDS TERMINATED BY ','
  OPTIONALLY ENCLOSED BY '"';
The routes table can be created with the following query:
CREATE TABLE routes (
  id serial primary key,
  departing VARCHAR(100) NOT NULL,
  arriving VARCHAR(100) NOT NULL
);
The data is in a CSV file called bartholomew-ch03-routes.csv . It can be loaded with a query similar to the following (assuming the file is on the computer running MariaDB or MySQL server in the /tmp/ folder):
LOAD DATA INFILE '/tmp/bartholomew-ch03-routes.csv'
  INTO TABLE routes
  FIELDS TERMINATED BY ','
  OPTIONALLY ENCLOSED BY '"';
Note
See the “Before We Begin” section of Chapter 1 for extra information about loading the files on Windows and working around issues with secure_file_priv.
We’re now ready to begin.

Recursive CTE Syntax

The syntax for recursive CTE queries is similar to that for non-recursive CTEs, with a couple differences. Here’s the basic syntax :
WITH RECURSIVE <cte_name> AS (
  <anchor>
  UNION [ALL]
  <recursive>
)
<cte_query>
Right away, you’ll notice the introduction of the RECURSIVE keyword. This must be included for recursive CTEs in MariaDB and MySQL. This keyword is not required for recursive CTEs in other databases, such as Oracle and SQL Server.
The other difference from non-recursive CTEs is that the <cte_body> is split into two parts, with a UNION or UNION ALL separating the two. The first part is the <anchor>. It is a non-recursive query similar to the <cte_body> of a non-recursive CTE. Then, after the UNION or UNION ALL, there is the recursive part. This part will contain references to the <cte_name>, which is what makes the CTE recursive, so we’ll call this part <recursive> for simplicity.

Adding Numbers

The biggest thing to keep in mind when getting started with recursive CTEs is that the way recursion actually works might not be exactly how we would expect it to work.
We can illustrate this by working through a good, if artificial, example of how recursive CTEs work: adding numbers together. Suppose we want to add all of the numbers from 1 through 100 together. Figuring it out by hand with 1 + 2 + 3 + 4 + ... + 100 would be very tedious. It would be much better to have a recursive loop that does the repetitive parts for me. This example has the benefit of not requiring us to create any tables. Don’t worry, we’ll get to our tudors and routes tables later in the chapter.
To get us started adding our numbers together, we want a loop that does the following:
  1. 1.
    Takes the current number and adds it to the current total
     
  2. 2.
    Adds 1 to the current number
     
  3. 3.
    Repeats until the current number equals 100
     
To start things out, we’ll want to have two columns—one for the counter keeping track of the current number and another for the total. We’ll call these Count and Total, respectively, and we’ll start from zero. The simplest way to express this in SQL is like so:
SELECT
  0 AS Count,
  0 AS Total
This will be our <anchor>, the point we start from.
Now, we need to add 1 to Count and add the current Count to the Total. This will be our <recursive> part. Here it is in SQL:
SELECT
  Count + 1,
  Total + Count
By adding a WITH RECURSIVE TotalSum AS part to the front, a UNION ALL between the two sample queries, a FROM to refer to our <cte_name>, a WHERE clause so we know when the CTE will finish, and lastly a simple SELECT * FROM <cte_name> as our output, we get a CTE query that looks like the following:
WITH RECURSIVE TotalSum AS (
  SELECT
    0 AS Count,
    0 AS Total
  UNION ALL
  SELECT
    Count + 1,
    Total + Count
  FROM TotalSum
  WHERE Count <= 100
)
SELECT * FROM TotalSum;
This looks reasonable , but when we run this query, things don’t look exactly right:
+-------+-------+
| Count | Total |
+-------+-------+
|     0 |     0 |
|     1 |     0 |
|     2 |     1 |
|     3 |     3 |
|     4 |     6 |
|     5 |    10 |
...
|    96 |  4560 |
|    97 |  4656 |
|    98 |  4753 |
|    99 |  4851 |
|   100 |  4950 |
|   101 |  5050 |
+-------+-------+
The Count column looks fine, until we get to the end when it stops at 101 instead of 100 like we wanted. Also, while the Total column ends up with the correct answer, 5050, it’s confusing because at the beginning when Count is 1, Total is still equal to 0, and at the end when Count is 101, the total of 5050 is after adding the final 100, not after adding 101.
This behavior can be explained with an understanding of how the database is performing the UNION ALL between the <anchor> and <recursive> parts of our TotalSum CTE.
First, when we begin, Count and Total are both set to 0, and the first line of our output reflects that:
+-------+-------+
| Count | Total |
+-------+-------+
|     0 |     0 |
+-------+-------+
We then do a UNION ALL against this table with our Count + 1 and Total + Count expressions. Count + 1 = 0 + 1 = 1, but Total + Count = 0 + 0 = 0, because when the value of Count is calculated, the value of Count is being pulled from the previous output, not the expression one line above that adds 1 to Count. The UNION ALL is only looking at the previous row of output, and in that row, Count = 0. So, our second line of output may look wrong, but from the database’s perspective, it is completely accurate:
+-------+-------+
| Count | Total |
+-------+-------+
|     1 |     0 |
+-------+-------+
The fix then, is pretty simple. When calculating the Total column, we add one to it just like we do to the Count column. With this change, we should also change the WHERE Count <= 100 to WHERE Count < 100 because by the time Count actually reaches 100 we’re already done. With those modifications, our recursive CTE now looks like the following:
WITH RECURSIVE TotalSum AS (
  SELECT
    0 AS Count,
    0 AS Total
  UNION ALL
  SELECT
    Count + 1,
    Total + Count + 1
  FROM TotalSum
  WHERE Count < 100
)
SELECT * FROM TotalSum;
The output of this version looks much better, exactly what we would expect :
+-------+-------+
| Count | Total |
+-------+-------+
|     0 |     0 |
|     1 |     1 |
|     2 |     3 |
|     3 |     6 |
|     4 |    10 |
|     5 |    15 |
...
|    96 |  4656 |
|    97 |  4753 |
|    98 |  4851 |
|    99 |  4950 |
|   100 |  5050 |
+-------+-------+
The key takeaway from this exercise is to remember that the <recursive> part after a UNION or UNION ALL is looking at the previously retrieved or calculated row, not the current row, when making its calculations.

Calculating Fibonacci Numbers

Another interesting application of recursion is to calculate the Fibonacci sequence. In this series of numbers, every new number in the sequence is calculated as the sum of the previous two. Because the sequence relies on two numbers, we must define two; we can choose either 0 and 1, or 1 and 1. For this example, we’ll go with the former and call them Current and Next. A simple bit of SQL that does this for our <anchor> part is:
SELECT
  0 AS Current,
  1 AS Next
For our <recursive> part, our loop needs to do the following:
  1. 1.
    Move Current to Next.
     
  2. 2.
    Calculate the new Next by adding Current + Next.
     
  3. 3.
    Repeat until we say stop.
     
The math part is straightforward :
SELECT
  Next AS Current,
  Current + Next AS Next
Putting both together, with an upper limit set at 1000 and a simple SELECT * FROM <cte_name> as our output, we get the following:
WITH RECURSIVE fibonacci AS (
  SELECT
    0 AS Current,
    1 AS Next
  UNION ALL
  SELECT
    Next AS Current,
    Current + Next AS Next
  FROM fibonacci
  WHERE Next < 1000
)
SELECT * FROM fibonacci;
The output of this recursive CTE looks like this:
+---------+------+
| Current | Next |
+---------+------+
|       0 |    1 |
|       1 |    1 |
|       1 |    2 |
|       2 |    3 |
|       3 |    5 |
|       5 |    8 |
|       8 |   13 |
|      13 |   21 |
|      21 |   34 |
|      34 |   55 |
|      55 |   89 |
|      89 |  144 |
|     144 |  233 |
|     233 |  377 |
|     377 |  610 |
|     610 |  987 |
|     987 | 1597 |
+---------+------+
As with the previous example, this result is probably not exactly what we want. Instead of a simple Fibonacci sequence, we have parallel series, with the Current and Next columns off by one in sequence order . The reason for this, again, relates to how the <recursive> part is calculated.
When we start, Current = 0 and Next = 1. That is the first row in our output:
+---------+------+
| Current | Next |
+---------+------+
|       0 |    1 |
+---------+------+
In our first run through the <recursive> part, we first move the value of Next (1) to Current so that for the following row, Current will be equal to 1. We then set Next to Current + Next of the initial row, or 0 + 1, or 1. So, for the second row, both Current and Next are equal to 1:
+---------+------+
| Current | Next |
+---------+------+
|       1 |    1 |
+---------+------+
The loop now repeats, and in the <recursive> part we move the value of Next from the second row to Current. So, for the third row it will still be 1. Then, we set the value of Next to the second-row values of Current + Next, or 1 + 1, or 2. So, for the third row the values are:
+---------+------+
| Current | Next |
+---------+------+
|       1 |    2 |
+---------+------+
This process repeats until our WHERE condition is met, which happens when the loop looks at the 17th row, with the side effect being that we get output beyond our 1000 limit because until that point the value of Next was always less than that.
To get the output we want—a single column containing a Fibonacci sequence where the highest number is less than 1000—you can probably guess what we have to do: we simply SELECT the Current column from our CTE instead of selecting all columns. We can rename it to further improve the output:
WITH RECURSIVE fibonacci AS (
  SELECT
    0 AS Current,
    1 AS Next
  UNION ALL
  SELECT
    Next AS Current,
    Current + Next AS Next
  FROM fibonacci
  WHERE Next < 1000
)
SELECT Current AS fibonacci_series FROM fibonacci;
Now, the output of our fibonacci CTE looks like this:
+------------------+
| fibonacci_series |
+------------------+
|                0 |
|                1 |
|                1 |
|                2 |
|                3 |
|                5 |
|                8 |
|               13 |
|               21 |
|               34 |
|               55 |
|               89 |
|              144 |
|              233 |
|              377 |
|              610 |
|              987 |
+------------------+
Some additional things we could do here include setting up a counter to track which position of the Fibonacci sequence we are at, and maybe using that as our limiter instead of the actual Fibonacci value we are at. For example, we could modify our CTE and calculate the Fibonacci sequence to 100 places.

Looking Up Ancestors in a Tree

Using recursive CTEs to solve math problems, as in the previous two examples, or even creating a recursive CTE Sieve of Eratosthenes, can be fun little diversions, but they aren’t often practical in the real world. So, let’s move away from those and tackle some examples that you might actually run into. We’ll start with using the tudors table.
This table contains data on the Tudor monarchs of England— you know, Henry VIII, Elizabeth I, Bloody Mary, those guys. There are four columns: id, name, father, and mother. The father and mother columns, if populated, point to the records for the father and mother of the person in question, like you would expect.
The data starts with Elizabeth I and then contains several generations back, as well as some of her cousins, aunts, and uncles. Conveniently, her id is 1. Here’s the SQL to pull up her record:
SELECT * FROM tudors
WHERE id = 1;
The result looks like this:
+----+------------------------+--------+--------+
| id | name                   | father | mother |
+----+------------------------+--------+--------+
|  1 | Elizabeth I of England |      2 |      3 |
+----+------------------------+--------+--------+
To find her parents without using a CTE , there are many things we could do; for example, here is one way using a simple JOIN:
SELECT
  elizabeth.id, elizabeth.name, tudors.id, tudors.name
FROM
  tudors AS elizabeth
    JOIN tudors ON
        tudors.id = elizabeth.father
      OR
        tudors.id = elizabeth.mother
WHERE elizabeth.id=1;
This query isn’t the easiest to read, but it’s not too bad. The result looks like this:
+----+------------------------+----+-----------------------+
| id | name                   | id | name                  |
+----+------------------------+----+-----------------------+
|  1 | Elizabeth I of England |  2 | Henry VIII of England |
|  1 | Elizabeth I of England |  3 | Anne Boleyn           |
+----+------------------------+----+-----------------------+
This result gives us Elizabeth’s parents, but what if we want to pull up all of Elizabeth’s ancestors: parents, grandparents, great-grandparents, and so on? This is the type of query that recursive CTEs were made for.
For the <anchor> part of our query, we can use the query that just retrieves Elizabeth’s record , and for the <recursive> part, we can use something that resembles our JOIN-based query but makes more sense, syntactically:
WITH RECURSIVE elizabeth AS (
  SELECT * FROM tudors
  WHERE id = 1
UNION
  SELECT tudors.*
  FROM tudors, elizabeth
  WHERE
    tudors.id = elizabeth.father OR
    tudors.id = elizabeth.mother
)
SELECT * FROM elizabeth;
Using a <cte_name> of elizabeth both does and doesn’t make sense. It makes sense because for the first run through the loop we actually are looking for Elizabeth’s father and mother. It doesn’t make sense for future runs of the loop, because on the second pass through the loop we are looking for the parents of Henry VIII and Anne Boleyn, Elizabeth’s grandparents, and for the third loop her great-grandparents, and so on.
However, for me at least, the name helped when writing the <recursive> part. Thinking recursively is hard enough, so any advantage that can be found in naming CTEs is a good thing.
The (truncated) result of this query looks like this:
+----+---------------------------------------+--------+--------+
| id | name                                  | father | mother |
+----+---------------------------------------+--------+--------+
|  1 | Elizabeth I of England                |      2 |      3 |
|  2 | Henry VIII of England                 |      4 |      5 |
|  3 | Anne Boleyn                           |      6 |      7 |
|  4 | Henry VII of England                  |      8 |      9 |
|  5 | Elizabeth of York                     |     10 |     11 |
|  6 | Thomas Boleyn, 1st Earl of Wiltshire  |     12 |     13 |
|  7 | Elizabeth Howard                      |     14 |     15 |
|  8 | Edmund Tudor, 1st Earl of Richmond    |     16 |     17 |
|  9 | Margaret Beaufort                     |     18 |     19 |
| 10 | Edward IV of England                  |     20 |     21 |
...
+----+---------------------------------------+--------+--------+
You’ll notice that for this recursive CTE there is a WHERE clause, like our previous ones, but it doesn’t have a set stopping point like WHERE tudors.id < 100. So, how does the CTE know when it is done? To find out, let’s walk through what the query is doing step by step.
First, there is our <anchor>, and during the first pass its result is output. Then, the <recursive> part looks for records where the father or mother fields match the id. Those that it finds are joined to the result table.
The CTE then loops back and does the same search again, this time incorporating the previous results in the UNION and ignoring records it finds that are already in the result table. This process repeats until no new results are returned. That is the trigger for the CTE to stop looping.
For our sample data, looping until nothing new is returned is no problem, as it only loops a handful of times. But what if we are navigating an enormous tree of data? What’s to prevent our query from looping endlessly?
The answer depends on whether you are using MariaDB or MySQL.
In MariaDB, as a final safety measure , there is a @@max_recursive_iterations variable that governs the maximum number of loops the server will make before stopping. You can show its current value with:
SHOW VARIABLES LIKE '%recursive%';
The default setting is very high, 4294967295, which should be fine for almost all queries, but it can be changed, like any other variable, if needed. Setting it to 0 disables it, which should be done cautiously.
As of right now, there is no corresponding variable in MySQL. There, the only current protection is to set @@max_statement_time to the maximum amount of time you will allow a query to run until it should be killed.

Finding All Possible Destinations

Our last two examples in this chapter use the routes table . This table contains a list of hypothetical train routes between various cities in North America. Each route has a departing city and an arriving city. Between some cities there are two routes—one in each direction. In other cities, the route only goes in one direction. Figure 3-1 shows all of the routes and cities.
A454556_1_En_3_Fig1_HTML.jpg
Figure 3-1.
All routes between all cities
As you can see from looking at the routes , there are some loops in the paths. For example, Raleigh to Atlanta to Miami to Raleigh.
Let’s say we want to find out all of the destinations we can get to from Raleigh. How would we do that using a CTE? Here’s a proposed set of steps:
  1. 1.
    Look up all destinations from Raleigh.
     
  2. 2.
    Take those results and look up all their destinations.
     
  3. 3.
    Repeat until all destinations are found.
     
Step one looks to be perfect to use as our <anchor> part, with the other two steps being in the <recursive> part. The obvious <anchor> is to SELECT every route departing from Raleigh:
SELECT arriving FROM routes
WHERE departing='Raleigh';
This query gives us the following output:
+------------+
| arriving   |
+------------+
| Washington |
| Atlanta    |
| Miami      |
+------------+
For the <recursive> part of our CTE, we need to SELECT records with routes that have those cities as departing cities. We can do this by looking for arrivals from cities returned by our <anchor> as the initial departing cities and repeating the process until we have a list of all possible destinations.
Here’s everything as a recursive CTE named destinations :
WITH RECURSIVE destinations AS (
    SELECT arriving
    FROM routes
    WHERE departing='Raleigh'
  UNION
    SELECT routes.arriving
    FROM destinations, routes
    WHERE
      destinations.arriving=routes.departing
)
SELECT * FROM destinations;
The results returned by this CTE are:
+------------+
| arriving   |
+------------+
| Washington |
| Atlanta    |
| Miami      |
| Chicago    |
| Raleigh    |
| New York   |
| Toronto    |
+------------+
As expected, the only city we can’t get to from Raleigh is Houston. In fact, nobody can get to Houston by train because there’s no path to Houston, only a single path out of Houston. We should lay some track and fix that.
Besides the absence of Houston, there are a couple things to note about this result. First is that Raleigh itself appears in the output, and the second is how the CTE was smart enough to not loop endlessly.
What provides closure to our CTE and prevents these loops from running until we hit @@max_recursive_iterations or @@max_statement_time is our use of UNION instead of UNION ALL. When UNION sees a duplicated result it ignores it, so once all possible cities have been located, the only cities being returned will be ones it has already seen, and so the CTE terminates.
What about the inclusion of Raleigh? Well, if you refer back to Figure 3-1, you’ll see that from Raleigh there are several paths that lead back to Raleigh. All of the paths leaving Raleigh have paths that return to Raleigh; for example, Raleigh to Miami to Raleigh. There is also a big circle of Raleigh to Atlanta to Chicago to New York to Washington back to Raleigh. Because Raleigh wasn’t in our list to begin with, it is included in the result like any other valid destination, but only once. Any additional times Raleigh appears in new results it will be ignored.
If we want to remove Raleigh from our result, we can simply change the final line to:
SELECT * FROM destinations WHERE arriving!='Raleigh';
We could, alternatively, move Raleigh to the first position of our results, which makes a bit more sense logically. After all, the first location we can get to is where we are right now. To do this, we need to cheat a little and tell the parser that we’re selecting Raleigh as an arrival even though we’re actually selecting it as departing city. The SQL looks like this:
SELECT departing AS arriving
  FROM routes
  WHERE departing='Raleigh';
Running this query by itself gives us the following:
+----------+
| arriving |
+----------+
| Raleigh  |
| Raleigh  |
| Raleigh  |
+----------+
This result is expected because there are three routes from Raleigh to other cities. We can now plug this into our CTE to get the following:
WITH RECURSIVE destinations AS (
    SELECT departing AS arriving
    FROM routes
    WHERE departing='Raleigh'
  UNION
    SELECT routes.arriving
    FROM destinations, routes
    WHERE
      destinations.arriving=routes.departing
)
SELECT * FROM destinations;
And the result is:
+------------+
| arriving   |
+------------+
| Raleigh    |
| Washington |
| Atlanta    |
| Miami      |
| Chicago    |
| New York   |
| Toronto    |
+------------+
This still gives us Raleigh in the output, but at least it is the first result instead of it confusingly showing up in the middle of the results.

Finding All Possible Paths

Finding all of the possible destinations we can get to from Raleigh is nice, but what about finding all of the possible paths we could take to get from Raleigh to every city we can get to from Raleigh?
Here are the steps to do this:
  1. 1.
    Look up destinations from our starting point.
     
  2. 2.
    Find destinations from that point and add them; UNION will prevent duplicates.
     
  3. 3.
    Repeat until all possible paths are found.
     
Because we want to start from Raleigh, for our <anchor> we need to do something similar to what we did in the previous section and issue our SELECT in such a way that it starts from Raleigh. Here’s a possible <anchor> candidate:
SELECT departing, arriving
FROM routes
  WHERE departing='Raleigh';
This gives us what we would expect:
+-----------+------------+
| departing | arriving   |
+-----------+------------+
| Raleigh   | Washington |
| Raleigh   | Atlanta    |
| Raleigh   | Miami      |
+-----------+------------+
Our <recursive> part is going to be trickier. We want to show the complete set of every possible path, not just a list of end points. So, we therefore want to add any additions, if any, to a given path to the end of an existing path with a separator in between. The CONCAT() function was made for this sort of thing, and the departing column looks like the column we will want to concat on, because that’s where our starting point, Raleigh, is.
After the first run -through of our <recursive> part, we should concat the departing and arriving columns together as our new departing column, and then also include our arriving column to use for the next run through the loop. We should end up with a result that outputs something that looks like this:
+----------------------+------------+
| departing            | arriving   |
+----------------------+------------+
| Raleigh > Washington | Washington |
| Raleigh > Atlanta    | Atlanta    |
| Raleigh > Miami      | Miami      |
+----------------------+------------+
Actually, the departing column name doesn’t make sense, because it is holding our path, not the initial departure city, so let’s call it path in our actual CTE.
Are we now ready to actually write our CTE? Actually, not quite! There’s one other issue we should solve first. Look again at Figure 3-1; what can we do to prevent silly results like the following?
Raleigh > Washington > New York > Washington > Raleigh > Miami
This is a perfectly valid path, but it isn’t one that any sane person would take. If we want to go from Raleigh to Miami, we would take that route; we would never go to New York first then back through Raleigh to Miami. What can we do to prevent this? The LOCATE() function provides an easy way. It searches a string for a given substring, returning 0 if the substring is not found. So, all we need to do is add something like the following to the WHERE clause of our <recursive> part:
LOCATE(routes.arriving, <cte_name>.paths)=0
We will, of course, replace <cte_name> with the actual name of our CTE .
Let’s try putting everything together, finally, into a CTE named full_routes:
WITH RECURSIVE full_routes AS (
    SELECT departing AS path, arriving
    FROM routes
    WHERE departing='Raleigh'
  UNION
    SELECT
      CONCAT(full_routes.path, ' > ',
             routes.arriving),
      routes.arriving
    FROM full_routes, routes
    WHERE
      full_routes.arriving=routes.departing
      AND
      LOCATE(routes.arriving, full_routes.path)=0
) SELECT * FROM full_routes;
This CTE looks reasonable, but when we run it, the result is obviously wrong:
+-------------------------------------------+------------+
| path                                      | arriving   |
+-------------------------------------------+------------+
| Raleigh                                   | Washington |
| Raleigh                                   | Atlanta    |
| Raleigh                                   | Miami      |
| Raleigh > Chicago                         | Chicago    |
| Raleigh > New York                        | New York   |
| Raleigh > Miami                           | Miami      |
| Raleigh > Chicago > New York              | New York   |
| Raleigh > New York > Washington           | Washington |
| Raleigh > New York > Toronto              | Toronto    |
| Raleigh > Chicago > New York > Washington | Washington |
| Raleigh > Chicago > New York > Toronto    | Toronto    |
+-------------------------------------------+------------+
What is going on here? Where are our expected Raleigh > Washington, Raleigh > Atlanta, and Raleigh > Miami paths? And there’s no way to get directly from Raleigh to Chicago, as you need to go through Washington first. Actually, the same is true with all of the paths; they’re all missing one stop.
If we look closer at the logic of our recursive CTE, the reason becomes clear.
During the first run-through of our <recursive> part, our WHERE clause is looking for cities in the arriving column that also appear in the departing column of the routes table. During the initial loop through the result set, our recursive CTE has access to only what was returned by our <anchor> part, the three cities Washington, Atlanta, and Miami. As we asked it to, our recursive CTE first looks for a row in the routes table that has Washington in the departing column, and the first result it finds is the Washington to Chicago entry. It then dutifully concats Chicago to the value in the path column, which is Raleigh, as a new result row. That’s why the fourth row of our output is:
+-------------------+----------+
| path              | arriving |
+-------------------+----------+
| Raleigh > Chicago | Chicago  |
+-------------------+----------+
To fix this, we need to somehow set our <anchor> so that our starting point is just Raleigh. Then, during the first run-through it will correctly concat Washington, Atlanta, and Miami to the path. What if instead of our <anchor> returning Raleigh’s actual connections, we just select the departing column again and tell the CTE that it is the arriving column? It seems a bit cheaty, but the SQL for it is perfectly valid:
SELECT departing AS path, departing AS arriving
  FROM routes
  WHERE departing='Raleigh';
Importantly, the result of this query contains a result that should work perfectly fine as our <anchor>:
+---------+----------+
| path    | arriving |
+---------+----------+
| Raleigh | Raleigh  |
| Raleigh | Raleigh  |
| Raleigh | Raleigh  |
+---------+----------+
After putting our new <anchor> into our full_routes CTE, it now looks like this:
WITH RECURSIVE full_routes AS (
    SELECT departing AS path, departing AS arriving
    FROM routes
    WHERE departing='Raleigh'
  UNION
    SELECT
      CONCAT(full_routes.path, ' > ',
             routes.arriving),
      routes.arriving
    FROM full_routes, routes
    WHERE
      full_routes.arriving=routes.departing
      AND
      LOCATE(routes.arriving, full_routes.path)=0
) SELECT * FROM full_routes;
And the result is what we would expect:
+-----------------------------------------------------+------------+
| path                                                | arriving   |
+-----------------------------------------------------+------------+
| Raleigh                                             | Raleigh    |
| Raleigh > Washington                                | Washington |
| Raleigh > Atlanta                                   | Atlanta    |
| Raleigh > Miami                                     | Miami      |
| Raleigh > Atlanta > Chicago                         | Chicago    |
| Raleigh > Washington > New York                     | New York   |
| Raleigh > Atlanta > Miami                           | Miami      |
| Raleigh > Atlanta > Chicago > New York              | New York   |
| Raleigh > Washington > New York > Toronto           | Toronto    |
| Raleigh > Atlanta > Chicago > New York > Washington | Washington |
| Raleigh > Atlanta > Chicago > New York > Toronto    | Toronto    |
+-----------------------------------------------------+------------+
The first result is a little silly, with both the path and arriving columns as Raleigh, but the rest of the results are exactly what we would expect. And, actually, the next time I need to take the train from Raleigh to Washington, I should take the scenic route and get there via Atlanta, Chicago, and New York.
Just for fun , I removed the LOCATE part of the WHERE clause to see how many possible non-duplicated combinations there were. On MySQL 8.0.2 DMR it returned an error:
ERROR 1406 (22001): Data too long for column 'path' at row 231
However, on MariaDB 10.2 it returned all of the possible combinations . There are a lot. Appropriately, for Dragon Ball Z fans, it’s over 9000!
9117 to be exact.

Summary

In this chapter, we explored how recursive CTEs differ from non-recursive CTEs. We explored a few of their many uses:
  • Solving recursive math problems
  • Walking a genealogical tree to match children with their ancestors
  • Finding all possible routes between two points
There are many more uses, but we’re going to switch gears for the next three chapters so that we can explore the second major topic of this book, Window Functions. We’ll then return to CTEs and combine them with Window Functions in Chapter 7, because, why not?
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset