There are many ways to create a DataFrame. This chapter will cover some of the most common ones. It will also show how to persist them.
Usually, we create a DataFrame from an existing file or a database, but we can also create one from scratch. We can create a DataFrame from parallel lists of data.
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
>>> people = {"first": fname, "last": lname, "birth": birth}
>>> beatles = pd.DataFrame(people)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
By default, pandas will create a RangeIndex
for our DataFrame when we call the constructor:
>>> beatles.index
RangeIndex(start=0, stop=4, step=1)
We can specify another index for the DataFrame if we desire:
>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
first last birth
a Paul McCartney 1942
b John Lennon 1940
c Richard Starkey 1940
d George Harrison 1943
You can also create a DataFrame from a list of dictionaries:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ]
... )
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
Note that the columns are ordered by the alphabetic ordering of the keys when you use rows of dictionaries. You can use the columns
parameter to specify the column order if that is important to you:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ],
... columns=["last", "first", "birth"],
... )
last first birth
0 McCartney Paul 1942
1 Lennon John 1940
2 Starkey Richard 1940
3 Harrison George 1943
For better or worse, there are a lot of CSV files in the world. Like most technologies, there are good and bad parts to CSV files. On the plus side, they are human-readable, can be opened in any text editor, and most spreadsheet software can load them. On the downside, there is no standard for CSV files, so encoding may be weird, there is no way to enforce types, and they can be large because they are text-based (though they can be compressed).
In this recipe, we will show how to create a CSV file from a pandas DataFrame.
There are a few methods on the DataFrame that start with to_
. These are methods that export DataFrames. We are going to use the .to_csv
method. We will write out to a string buffer in the examples, but you will usually use a filename instead.
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout) # use a filename instead of fout
>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
The .to_csv
method has a few options. You will notice that it included the index in the output but did not give the index a column name. If you were to read this CSV file into a DataFrame using the read_csv
function, it would not use this as the index by default. Instead, you will get a column named Unnamed: 0 in addition to an index. These columns are redundant:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
The read_csv
function has an index_col
parameter that you can use to specify the location of the index:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
Alternatively, if we didn't want to include the index when writing the CSV file, we can set the index
parameter to False
:
>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
The pandas library is an in-memory tool. You need to be able to fit your data in memory to use pandas with it. If you come across a large CSV file that you want to process, you have a few options. If you can process portions of it at a time, you can read it into chunks and process each chunk. Alternatively, if you know that you should have enough memory to load the file, there are a few hints to help pare down the file size.
Note that in general, you should have three to ten times the amount of memory as the size of the DataFrame that you want to manipulate. Extra memory should give you enough extra space to perform many of the common operations.
In this section, we will look at the diamonds dataset. This dataset easily fits into the memory of my 2015 MacBook, but let's pretend that the file is a lot bigger than it is, or that the memory of my machine is limited such that when pandas tries to load it with the read_csv
function, I get a memory error.
nrows
parameter of read_csv
to limit how much data we load to a small sample:
>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
carat cut color clarity ... price x y z
0 0.23 Ideal E SI2 ... 326 3.95 3.98 2.43
1 0.21 Premium E SI1 ... 326 3.89 3.84 2.31
2 0.23 Good E VS1 ... 327 4.05 4.07 2.31
3 0.29 Premium I VS2 ... 334 4.20 4.23 2.63
4 0.31 Good J SI2 ... 335 4.34 4.35 2.75
.. ... ... ... ... ... ... ... ... ...
995 0.54 Ideal D VVS2 ... 2897 5.30 5.34 3.26
996 0.72 Ideal E SI1 ... 2897 5.69 5.74 3.57
997 0.72 Good F VS1 ... 2897 5.82 5.89 3.48
998 0.74 Premium D VS2 ... 2897 5.81 5.77 3.58
999 1.12 Premium J SI2 ... 2898 6.68 6.61 4.03
.info
method to see how much memory the sample of data uses:
>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float64
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float64
table 1000 non-null float64
price 1000 non-null int64
x 1000 non-null float64
y 1000 non-null float64
z 1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB
We can see that 1,000 rows use about 78.2 KB of memory. If we had 1 billion rows, that would take about 78 GB of memory. It turns out that it is possible to rent machines in the cloud that have that much memory but let's see if we can take it down a little.
dtype
parameter to read_csv
to tell it to use the correct (or smaller) numeric types:
>>> diamonds2 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
Make sure that summary statistics are similar with our new dataset to the original:
>>> diamonds.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689280 61.722800 ... 5.599180 3.457530
std 0.195291 1.758879 ... 0.611974 0.389819
min 0.200000 53.000000 ... 3.750000 2.270000
25% 0.700000 60.900000 ... 5.630000 3.450000
50% 0.710000 61.800000 ... 5.760000 3.550000
75% 0.790000 62.600000 ... 5.910000 3.640000
max 1.270000 69.500000 ... 7.050000 4.330000
>>> diamonds2.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689453 61.718750 ... 5.601562 3.457031
std 0.195312 1.759766 ... 0.611816 0.389648
min 0.199951 53.000000 ... 3.750000 2.269531
25% 0.700195 60.906250 ... 5.628906 3.449219
50% 0.709961 61.812500 ... 5.761719 3.550781
75% 0.790039 62.593750 ... 5.910156 3.640625
max 1.269531 69.500000 ... 7.050781 4.328125
By changing the numeric types, we use about 62% of the memory. Note that we lose some precision, which may or may not be acceptable.
dtype
parameter to use change object types to categoricals. First, inspect the .value_counts
method of the object columns. If they are low cardinality, you can convert them to categorical columns to save even more memory:
>>> diamonds2.cut.value_counts(
Ideal 333
Premium 290
Very Good 226
Good 89
Fair 62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E 240
F 226
G 139
D 129
H 125
I 95
J 46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1 306
VS2 218
VS1 159
SI2 154
VVS2 62
VVS1 58
I1 29
IF 14
Name: clarity, dtype: int64
Because these are of low cardinality, we can convert them to categoricals and use around 37% of the original size:
>>> diamonds3 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
usecols
parameter to specify the columns we want to load. Here, we will ignore columns x, y, and z:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds4 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB
chunksize
parameter:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds_iter = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... chunksize=200,
... )
>>> def process(df):
... return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
... process(chunk)
Because CSV files contain no information about type, pandas tries to infer the types of the columns. If all of the values of a column are whole numbers and none of them are missing, then it uses the int64
type. If the column is numeric but not whole numbers, or if there are missing values, it uses float64
. These data types may store more information that you need. For example, if your numbers are all below 200, you could use a smaller type, like np.int16
(or np.int8
if they are all positive).
As of pandas 0.24, there is a new type 'Int64'
(note the capitalization) that supports integer types with missing numbers. You will need to specify it with the dtype
parameter if you want to use this type, as pandas will convert integers that have missing numbers to float64
.
If the column turns out to be non-numeric, pandas will convert it to an object
column, and treat the values as strings. String values in pandas take up a bunch of memory as each value is stored as a Python string. If we convert these to categoricals, pandas will use much less memory as it only stores the string once, rather than creating new strings (even if they repeat) for every row.
The pandas library can also read CSV files found on the internet. You can point the read_csv
function to the URL directly.
If we use int8
for the price, we will lose information. You can use the NumPy iinfo
function to list limits for NumPy integer types:
>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
You can use the finfo
function for information about floating-point numbers:
>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
max=6.55040e+04, dtype=float16)
You can also ask a DataFrame or Series how many bytes it is using with the .memory_usage
method. Note that this also includes the memory requirements of the index. Also, you need to pass deep=True
to get the usage of Series with object types:
>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413
Once you have your data in a format you like, you can save it in a binary format that tracks types, such as the Feather format (pandas leverages the pyarrow
library to do this). This format is meant to enable in-memory transfer of structured data between languages and optimized so that data can be used as is without internal conversion. Reading from this format is much quicker and easy once you have the types defined:
>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")
Another binary option is the Parquet format. Whereas Feather optimizes the binary data for the in-memory structure, Parquet optimizes for the on-disk format. Parquet is used by many big data products. The pandas library has support for Parquet as well.
>>> diamonds4.to_parquet("/tmp/d.pqt")
Right now there is some conversion required for pandas to load data from both Parquet and Feather. But both are quicker than CSV and persist types.
While CSV files are common, it seems that the world is ruled by Excel. I've been surprised in my consulting work to see how many companies are using Excel as a critical if not the critical tool for making decisions.
In this recipe, we will show how to create and read Excel files. You may need to install xlwt
or openpyxl
to write XLS or XLSX files, respectively.
.to_excel
method. You can write either xls
files or xlsx
files:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Excel file
read_excel
function:
>>> beat2 = pd.read_excel("/tmp/beat.xls")
>>> beat2
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
index_col
parameter:
>>> beat2 = pd.read_excel("/tmp/beat.xls", index_col=0)
>>> beat2
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> beat2.dtypes
first object
last object
birth int64
dtype: object
The Python ecosystem has many packages, which include the ability to read and write to Excel. This functionality has been integrated into pandas, you just need to make sure that you have the appropriate libraries for reading and writing to Excel.
We can use pandas to write to a sheet of a spreadsheet. You can pass a sheet_name
parameter to the .to_excel
method to tell it the name of the sheet to create:
>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
... xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()
This file will have two sheets, one labeled All
that has the whole DataFrame, and another labeled 1940
that is filtered to births before 1941.
As was mentioned previously, CSV files are very common for sharing data. Because they are plain text files, they can get big. One solution for managing the size of CSV files is to compress them. In this recipe, we will look at loading files from ZIP files.
We will load a CSV file that is compressed as the only thing in the ZIP file. This is the behavior that you get if you were to right-click on a file in the Finder on Mac and click Compress beatles.csv. We will also look at reading a CSV file from a ZIP file with multiple files in it.
The first file is from the fueleconomy.gov website. It is a list of all car makes that have been available in the US market from 1984-2018.
The second file is a survey of users of the Kaggle website. It was intended to get information about the users, their background, and the tools that they prefer.
read_csv
function on it:
>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
barrels08 barrelsA08 ... phevHwy phevComb
0 15.695714 0.0 ... 0 0
1 29.964545 0.0 ... 0 0
2 12.207778 0.0 ... 0 0
3 29.964545 0.0 ... 0 0
4 17.347895 0.0 ... 0 0
... ... ... ... ... ...
41139 14.982273 0.0 ... 0 0
41140 14.330870 0.0 ... 0 0
41141 15.695714 0.0 ... 0 0
41142 15.695714 0.0 ... 0 0
41143 18.311667 0.0 ... 0 0
>>> autos.modifiedOn.dtype
dtype('O')
parse_dates
parameter from read_csv
and convert them when loading the file. Alternatively, you can use the more powerful to_datetime
function after loading:
>>> autos.modifiedOn
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
...
39096 Tue Jan 01 00:00:00 EST 2013
39097 Tue Jan 01 00:00:00 EST 2013
39098 Tue Jan 01 00:00:00 EST 2013
39099 Tue Jan 01 00:00:00 EST 2013
39100 Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
39096 2013-01-01
39097 2013-01-01
39098 2013-01-01
39099 2013-01-01
39100 2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
Here's the code to convert during load time:
>>> autos= pd.read_csv(
... "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0 2013-01-0...
1 2013-01-0...
2 2013-01-0...
3 2013-01-0...
4 2013-01-0...
...
41139 2013-01-0...
41140 2013-01-0...
41141 2013-01-0...
41142 2013-01-0...
41143 2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
read_csv
function does not have the ability to specify a file inside a ZIP file. Instead, we will use the zipfile
module from the Python standard library.I like to print out the names of the files in the zip file; that makes it easy to see what filename to choose. Note that this file has a long question in the second row (this first row is a question identifier, which I'm keeping for the column names). I'm pulling out the second row as kag_questions
. The responses are stored in the survey
variable:
>>> import zipfile
>>> with zipfile.ZipFile(
... "data/kaggle-survey-2018.zip"
... ) as z:
... print("
".join(z.namelist()))
... kag = pd.read_csv(
... z.open("multipleChoiceResponses.csv")
... )
... kag_questions = kag.iloc[0]
... survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1 2
Time from... 710 434
Q1 Female Male
Q1_OTHER_... -1 -1
Q2 45-49 30-34
Q3 United S... Indonesia
... ... ...
Q50_Part_5 NaN NaN
Q50_Part_6 NaN NaN
Q50_Part_7 NaN NaN
Q50_Part_8 NaN NaN
Q50_OTHER... -1 -1
ZIP files with only a single file can be read directly with the read_csv
function. If the ZIP file contains multiple files, you will need to resort to another mechanism to read the data. The standard library includes the zipfile
module that can pull a file out of a ZIP file.
Sadly, the zipfile
module will not work with URLs (unlike the read_csv
function). So, if your ZIP file is in a URL, you will need to download it first.
The read_csv
function will work with other compression types as well. If you have GZIP
, BZ2
, or XZ
files, pandas can handle those as long as they are just compressing a CSV file and not a directory.
We mentioned that pandas is useful for tabular or structured data. Many organizations use databases to store tabular data. In this recipe, we will work with databases to insert and read data.
Note that this example uses the SQLite database, which is included with Python. However, Python has the ability to connect with most SQL databases and pandas, in turn, can leverage that.
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
... cur = con.cursor()
... cur.execute("""DROP TABLE Band""")
... cur.execute(
... """CREATE TABLE Band(id INTEGER PRIMARY KEY,
... fname TEXT, lname TEXT, birthyear INT)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 0, 'Paul', 'McCartney', 1942)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 1, 'John', 'Lennon', 1940)"""
... )
... _ = con.commit()
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
... "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
... "Band", sa_connection, index_col="id"
... )
>>> beat
fname lname birthyear
id
0 Paul McCartney 1942
1 John Lennon 1940
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
fname birthyear
0 Paul 1942
1 John 1940
The pandas library leverages the SQLAlchemy library, which can talk to most SQL databases. This lets you create DataFrames from tables, or you can run a SQL select query and create the DataFrame from the query.
JavaScript Object Notation (JSON) is a common format used for transferring data over the internet. Contrary to the name, it does not require JavaScript to read or create. The Python standard library ships with the json
library that will encode and decode from JSON:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paull', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
read_json
function. If your JSON is of the form where it is a dictionary mapping to lists of columns, you can ingest it without much fanfare. This orientation is called columns in pandas:
>>> beatles = pd.read_json(encoded)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
columns
– (default) A mapping of column names to a list of values in the columns.records
– A list of rows. Each row is a dictionary mapping a column to a value.split
– A mapping of columns
to column names, index
to index values, and data
to a list of each row of data (each row is a list as well).index
– A mapping of index value to a row. A row is a dictionary mapping a column to a value.values
– A list of each row of data (each row is a list as well). This does not include column or index values.table
– A mapping of schema
to the DataFrame schema, and data
to a list of dictionaries.Following are examples of these styles. The columns
style was the example shown previously:
>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
0 1 2
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> (
... pd.read_json(values, orient="values").rename(
... columns=dict(
... enumerate(["first", "last", "birth"])
... )
... )
... )
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
JSON can be formatted in many ways. Preferably, the JSON you need to consume comes in a supported orientation. If it does not, I find it easier to use standard Python to create data in a dictionary that maps column names to values and pass this into the DataFrame
constructor.
If you need to generate JSON (say you are creating a web service), I would suggest the columns or records orientation.
If you are working on a web service and need to add additional data to the JSON, just use the .to_dict
method to generate dictionaries. You can add your new data to the dictionary, and then convert that dictionary to JSON:
>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia or other websites.
In this recipe, we will scrape tables from the Wikipedia entry for The Beatles Discography. In particular, we want to scrape the table in the image that was in Wikipedia during 2019:
Wikipedia table for studio albums
read_html
function to load all of the tables from https://en.wikipedia.org/wiki/The_Beatles_discography:
>>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... 15
5 Music vi... 64
6 EPs 21
7 Singles 63
8 Mash-ups 2
9 Box sets 15
read_html
created, or we could give it a hint to find a specific table.The function has the match
parameter, which can be a string or a regular expression. It also has an attrs
parameter, that allows you to pass in an HTML tag attribute key and value (in a dictionary) and will use that to identify the table.
I used the Chrome browser to inspect the HTML to see if there is an attribute on the table
element or a unique string in the table to use.
Here is a portion of the HTML:
<table class="wikitable plainrowheaders" style="text-align:center;">
<caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
</caption>
<tbody>
<tr>
<th scope="col" rowspan="2" style="width:20em;">Title
</th>
<th scope="col" rowspan="2" style="width:20em;">Release
...
There are no attributes on the table, but we can use the string, List of studio albums
, to match the table. I'm also going to stick in a value for na_values
that I copied from the Wikipedia page:
>>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
>>> dfs = pd.read_html(
... url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
>>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
>>> dfs = pd.read_html(
... url,
... match="List of studio albums",
... na_values="—",
... header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
Title Release ... Peak chart positions Certifications
Title Release ... US[8][9] Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
This is not something that is easy to fix programmatically. In this case, the easiest solution is to update the columns manually:
>>> df = dfs[0]
>>> df.columns = [
... "Title",
... "Release",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "Certifications",
... ]
>>> df
Title Release ... US Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
Released
is another release of the previous row. pandas does not have the ability to parse rows that have a rowspan more than 1 (which the "release" rows have). In the Wikipedia page, these rows look like this:
<th scope="row" rowspan="2">
<i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
<img alt="double-dagger" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>
We will skip these rows. They confuse pandas, and the data pandas puts in these rows is not correct. We will split the release column into two columns, release_date
and label
:
>>> res = (
... df.pipe(
... lambda df_: df_[
... ~df_.Title.str.startswith("Released")
... ]
... )
... .assign(
... release_date=lambda df_: pd.to_datetime(
... df_.Release.str.extract(
... r"Released: (.*) Label"
... )[0].str.replace(r"[E]", "")
... ),
... label=lambda df_: df_.Release.str.extract(
... r"Label: (.*)"
... ),
... )
... .loc[
... :,
... [
... "Title",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "release_date",
... "label",
... ],
... ]
... )
>>> res
Title UK ... release_date label
0 Please P... 1 ... 1963-03-22 Parlopho...
1 With the... 1 ... 1963-11-22 Parlopho...
2 Introduc... NaN ... 1964-01-10 Vee-Jay ...
3 Meet the... NaN ... 1964-01-20 Capitol ...
4 Twist an... NaN ... 1964-02-03 Capitol ...
.. ... ... ... ... ...
21 Magical ... 31 ... 1967-11-27 Parlopho...
22 The Beat... 1 ... 1968-11-22 Apple
23 Yellow S... 3 ... 1969-01-13 Apple (U...
24 Abbey Road 1 ... 1969-09-26 Apple
25 Let It Be 1 ... 1970-05-08 Apple
The read_html
function looks through the HTML for table
tags and parses the contents into DataFrames. This can ease the scraping of websites. Unfortunately, as the example shows, sometimes data in HTML tables may be hard to parse. Rowspans and multiline headers may confuse pandas. You will want to make sure that you perform a sanity check on the result.
Sometimes, the table in HTML is simple such that pandas can ingest it with no problems. For the table we looked at, we needed to chain a few operations onto the output to clean it up.
You can also use the attrs
parameter to select a table from the page. Next, I select read data from GitHub's view of a CSV file. Note that I am not reading this from the raw CSV data but from GitHub's online file viewer. I have inspected the table and noticed that it has a class attribute with the value csv-data
. We will use that to limit the table selected:
>>> url = https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
Unnamed: 0 quadrant x y
0 NaN I 10.0 8.04
1 NaN I 14.0 9.96
2 NaN I 6.0 7.24
3 NaN I 9.0 8.81
4 NaN I 4.0 4.26
.. ... ... ... ...
39 NaN IV 8.0 6.58
40 NaN IV 8.0 7.91
41 NaN IV 8.0 8.47
42 NaN IV 8.0 5.25
43 NaN IV 8.0 6.89
Note that GitHub hijacks a td
element to show the line number, hence the Unnamed: 0
column. It appears to be using JavaScript to dynamically add line numbers to the web page, so while the web page shows line numbers, the source code has empty cells, hence the NaN
values in that column. You would want to drop that column as it is useless.
One thing to be aware of is that websites can change. Do not count on your data being there (or being the same) next week. My recommendation is to save the data after retrieving it.
Sometimes you need to use a different tool. If the read_html
function is not able to get your data from a website, you may need to resort to screen scraping. Luckily, Python has tools for that too. Simple scraping can be done with the requests library. The Beautiful Soup library is another tool that makes going through the HTML content easier.