Very often, our work on data analysis and visualization sits at the consumer end of the data pipeline: we use already-produced data rather than producing the data ourselves. A modern application, for example, holds different datasets inside relational databases (or other databases like MongoDB), and we use these databases to produce beautiful graphs.
This recipe will show you how to use SQL drivers from Python to access data.
We will demonstrate this recipe using a SQLite database because it requires the least effort to set up, but the interface is similar to most other SQL-based database engines, such as MySQL and PostgreSQL. There are, however, differences in the SQL dialect that those database engines support. This example uses simple SQL and should be reproducible on most common SQL database engines.
To be able to execute this recipe, we need to install the SQLite library as shown here:
$ sudo apt-get install sqlite3
Python support for SQLite is available by default, so we don't need to install anything Python-related. Just fire the following code snippet in IPython to verify that everything is present:
import sqlite3
sqlite3.version
sqlite3.sqlite_version
We get output similar to the following:
In [1]: import sqlite3
In [2]: sqlite3.version
Out[2]: '2.6.0'
In [3]: sqlite3.sqlite_version
Out[3]: '3.8.4.3'
Here, sqlite3.version gets us the version of the Python sqlite3 module, and sqlite_version returns the version of the system SQLite library.
To be able to read from the database, we need to connect to it, obtain a cursor, execute a query over that cursor, and fetch the results.
We will not try to teach SQL here, as there are many books on that particular topic, but for the purpose of clarity, we will explain the SQL query used in this code sample:
SELECT ID, Name, Population FROM City ORDER BY Population DESC LIMIT 1000
ID, Name, and Population are columns (fields) of the City table from which we select data. ORDER BY tells the database engine to sort our data by the Population column, and DESC means descending order. LIMIT allows us to get just the first 1,000 records found.
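The same kind of query can also be issued from Python with placeholder parameters, which lets the driver quote values safely instead of building the SQL string by hand. The following is a minimal sketch using an in-memory stand-in for the City table (the table contents here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# minimal stand-in for the City table used in this recipe
cur.execute("CREATE TABLE City (ID INTEGER PRIMARY KEY, "
            "Name TEXT, Population INTEGER)")
cur.executemany("INSERT INTO City (Name, Population) VALUES (?, ?)",
                [("Seoul", 9981619), ("Shanghai", 9696300),
                 ("Karachi", 9339023)])

# "?" placeholders let the driver quote values safely,
# avoiding manual string concatenation in the query
cur.execute("SELECT Name, Population FROM City "
            "ORDER BY Population DESC LIMIT ?", (2,))
rows = cur.fetchall()
for name, population in rows:
    print(name, population)
con.close()
```

This prints the two most populous cities of the sample data, largest first.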
For this example, we will use the world.sql example table, which holds the world's city names and populations. This table has more than 5,000 entries.
First, we need to import this SQL file into the SQLite database. Here's the code to do it:
import sqlite3
import sys

if len(sys.argv) < 2:
    print("Error: You must supply at least the SQL script.")
    print("Usage: %s sql-dump.sql [table.db]" % (sys.argv[0]))
    sys.exit(1)

script_path = sys.argv[1]

if len(sys.argv) == 3:
    db = sys.argv[2]
else:
    # if no DB file is given, create an in-memory database
    db = ":memory:"

try:
    con = sqlite3.connect(db)
    with con:
        cur = con.cursor()
        with open(script_path, 'r') as f:
            cur.executescript(f.read())
except sqlite3.Error as err:
    print("Error occurred: %s" % err)
This reads the SQL file and executes its statements against the opened SQLite db file. If we don't specify a filename, SQLite creates the database in memory. The statements in the script are then executed in order. If we encounter any errors, we catch the exception and print the error message to the user.
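The executescript() step can be seen in isolation with a tiny inline dump. The schema below is a hypothetical stand-in for world.sql, just to show that one call runs every statement in the script:

```python
import sqlite3

# a tiny SQL dump standing in for world.sql (hypothetical schema)
SQL_SCRIPT = """
CREATE TABLE City (ID INTEGER PRIMARY KEY, Name TEXT, Population INTEGER);
INSERT INTO City (Name, Population) VALUES ('Mumbai (Bombay)', 10500000);
INSERT INTO City (Name, Population) VALUES ('Seoul', 9981619);
"""

# ":memory:" keeps the database in RAM; nothing is written to disk
con = sqlite3.connect(":memory:")
with con:
    cur = con.cursor()
    cur.executescript(SQL_SCRIPT)  # runs every statement in the dump
    cur.execute("SELECT COUNT(*) FROM City")
    count = cur.fetchone()[0]
print(count)
con.close()
```

Both inserts from the script land in the table, so the count is 2.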
After we have imported data into the database, we are able to query the data and do some processing. Here is the code to read the data from the database file:
import sqlite3
import sys

if len(sys.argv) != 2:
    print("Please specify database file.")
    sys.exit(1)

db = sys.argv[1]

try:
    con = sqlite3.connect(db)
    with con:
        cur = con.cursor()
        query = ('SELECT ID, Name, Population FROM City '
                 'ORDER BY Population DESC LIMIT 1000')
        con.text_factory = str
        cur.execute(query)
        resultset = cur.fetchall()

        # extract column names
        col_names = [cn[0] for cn in cur.description]
        print("%10s %30s %10s" % tuple(col_names))
        print("=" * (10 + 1 + 30 + 1 + 10))

        for row in resultset:
            print("%10s %30s %10s" % row)
except sqlite3.Error as err:
    print("[ERROR]:", err)
Here's an example of how to use the two preceding scripts:
$ python ch02-sqlite-import.py world.sql world.db
$ python ch02-sqlite-read.py world.db
        ID                           Name Population
====================================================
      1024                Mumbai (Bombay)   10500000
      2331                          Seoul    9981619
       206                      São Paulo    9968485
      1890                       Shanghai    9696300
First, we verify that the user has provided the database file path. This is just a quick check to ensure that we can proceed with the rest of the code.
Then, we try to connect to the database; if that fails, we catch sqlite3.Error and print it to the user.
If the connection is successful, we obtain a cursor using con.cursor(). A cursor is an iterator-like structure that enables us to traverse the records of the result set returned from the database.
We define a query that we execute over the connection, and we fetch the result set using cur.fetchall(). Had we expected just one result, we would have used fetchone().
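The difference between fetching styles can be sketched quickly on an in-memory database (the table here is a made-up stand-in). fetchone() returns one row at a time, and the cursor itself is iterable, which is handy when the result set is too large to hold in memory at once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE City (Name TEXT, Population INTEGER)")
cur.executemany("INSERT INTO City VALUES (?, ?)",
                [("Seoul", 9981619), ("Shanghai", 9696300)])

# fetchone() returns a single row, or None if the result set is empty
cur.execute("SELECT COUNT(*) FROM City")
count = cur.fetchone()[0]
print(count)

# the cursor is itself iterable, so large result sets can be
# streamed row by row instead of materialized with fetchall()
cur.execute("SELECT Name FROM City ORDER BY Population DESC")
names = [name for (name,) in cur]
print(names)
con.close()
```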
A list comprehension over cur.description allows us to obtain the column names. description is a read-only attribute that returns more than we need for just column names, so we fetch only the first item from each column's 7-item tuple.
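As a small aside, sqlite3 can also hand back rows that are addressable by column name via sqlite3.Row, which often removes the need to consult description at all. A minimal sketch, again on a hypothetical in-memory table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # rows become addressable by column name
cur = con.cursor()
cur.execute("CREATE TABLE City (ID INTEGER, Name TEXT, Population INTEGER)")
cur.execute("INSERT INTO City VALUES (1024, 'Mumbai (Bombay)', 10500000)")

cur.execute("SELECT ID, Name, Population FROM City")
row = cur.fetchone()
# each entry in cur.description is a 7-item tuple; item 0 is the name
col_names = [d[0] for d in cur.description]
print(col_names)
print(row["Name"])
con.close()
```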
We then use simple string formatting to print the header of our table with column names. After that, we iterate over resultset and print every row in a similar manner.
Databases are the most common sources of data today. We could not present everything in this short recipe, but we can suggest where to look for more information.
The official Python documentation is the first place to look for an explanation of how to work with databases. The most common databases are open source ones, such as MySQL, PostgreSQL, and SQLite; at the other end of the spectrum, there are enterprise database systems such as MS SQL, Oracle, and Sybase. Python has support for most of them, and the interface is abstracted in every case, so you don't have to change your program if your underlying database changes, but some tweaks may be required depending on whether you have used the specifics of a particular database system. For example, Oracle supports PL/SQL, a language that is not standard SQL, and some things will not work if your database changes from Oracle to MS SQL. Similarly, SQLite does not support MySQL-specific data types or storage engine options (MyISAM and InnoDB). These things can be annoying, but having your code rely on standard SQL (available at http://en.wikipedia.org/wiki/SQL:2011) will keep it portable from one database system to another.