© A.J. Henley and Dave Wolf 2018
A.J. Henley and Dave WolfLearn Data Analysis with Pythonhttps://doi.org/10.1007/978-1-4842-3486-0_2

2. Getting Data Into and Out of Python

A. J. Henley1  and Dave Wolf2
(1)
Washington, D.C., District of Columbia, USA
(2)
Sterling Business Advantage, LLC, Adamstown, Maryland, USA
 

The first stage of data analysis is getting the data. Moving your data from where you have it stored into your analytical tools and back out again can be a difficult task if you don't know what you are doing. Python and its libraries try to make it as easy as possible.

With just a few lines of code, you will be able to import and export data in the following formats:
  • CSV

  • Excel

  • SQL

Loading Data from CSV Files

Normally, data will come to us as files or database links. See Listing 2-1 to learn how to load data from a CSV file.

import pandas as pd
Location = "datasets/smallgradesh.csv"
df = pd.read_csv(Location, header=None)
Listing 2-1

Loading Data from CSV File

Now, let's take a look at what our data looks like (Listing 2-2):

df.head()
Listing 2-2

Display First Five Lines of Data

As you can see, our dataframe lacks column headers. Or, rather, there are headers, but they weren't loaded as headers; they were loaded as row one of your data. To load data that includes headers, you can use the code shown in Listing 2-3.

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
Listing 2-3

Loading Data from CSV File with Headers

Then, as before, we take a look at what the data looks like by running the code shown in Listing 2-4.

df.head()
Listing 2-4

Display First Five Lines of Data

If you have a dataset that doesn't include headers, you can add them afterward. To add them, we can use one of the options shown in Listing 2-5.

import pandas as pd
Location = "datasets/smallgrades.csv"
# To add headers as we load the data...
df = pd.read_csv(Location, names=['Names','Grades'])
# To add headers to a dataframe
df.columns = ['Names','Grades']
Listing 2-5

Loading Data from CSV File and Adding Headers

Your Turn

Can you make a dataframe from a file you have uploaded and imported on your own? Let's find out. Go to the following website, which contains U.S. Census data ( http://census.ire.org/data/bulkdata.html ), and download the CSV datafile for a state. Now, try to import that data into Python.

Saving Data to CSV

Maybe you want to save your progress when analyzing data. Maybe you are just using Python to massage some data for later analysis in another tool. Or maybe you have some other reason to export your dataframe to a CSV file. The code shown in Listing 2-6 is an example of how to do this.

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList, columns=['Names','Grades'])
df.to_csv('studentgrades.csv',index=False,header=False)
Listing 2-6

Exporting a Dataset to CSV

Lines 1 to 6 are the lines that create the dataframe. Line 7 is the code to export the dataframe df to a CSV file called studentgrades.csv.

The only parameters we use are index and header. Setting these parameters to false will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.

If you want in-depth information about the to_csv method, you can, of course, use the code shown in Listing 2-7.

df.to_csv?
Listing 2-7

Getting Help on to_csv

Your Turn

Can you export the dataframe created by the code in Listing 2-8 to CSV?

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
Degrees = zip(names,grades,bsdegrees,msdegrees,phddegrees)
columns = ['Names','Grades','BS','MS','PhD']
df = pd.DataFrame(data = Degrees, columns=column)
df
Listing 2-8

Creating a Dataset for the Exercise

Loading Data from Excel Files

Normally, data will come to us as files or database links. Let's see how to load data from an Excel file (Listing 2-9).

import pandas as pd
Location = "datasets/gradedata.xlsx"
df = pd.read_excel(Location)
Listing 2-9

Loading Data from Excel File

Now, let's take a look at what our data looks like (Listing 2-10).

df.head()
Listing 2-10

Display First Five Lines of Data

If you wish to change or simplify your column names, you can run the code shown in Listing 2-11.

df.columns = ['first','last','sex','age','exer','hrs','grd','addr']
df.head()
Listing 2-11

Changing Column Names

Your Turn

Can you make a dataframe from a file you have uploaded and imported on your own? Let's find out. Go to https://www.census.gov/support/USACdataDownloads.html and download one of the Excel datafiles at the bottom of the page. Now, try to import that data into Python.

Saving Data to Excel Files

The code shown in Listing 2-12 is an example of how to do this.

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList,
        columns=['Names','Grades'])
writer = pd.ExcelWriter('dataframe.xlsx', engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1")
writer.save()
Listing 2-12

Exporting a Dataframe to Excel

If you wish, you can save different dataframes to different sheets, and with one .save() you will create an Excel file with multiple worksheets (see Listing 2-13).

writer = pd.ExcelWriter('dataframe.xlsx',engine='xlsxwriter')
df.to_excel(writer, sheet_name="Sheet1")
df2.to_excel(writer, sheet_name="Sheet2")
writer.save()
Listing 2-13

Exporting Multiple Dataframes to Excel

Note

This assumes that you have another dataframe already loaded into the df2 variable.

Your Turn

Can you export the dataframe created by the code shown in Listing 2-14 to Excel?

import pandas as pd
names = ['Nike','Adidas','New Balance','Puma',’Reebok’]
grades = [176,59,47,38,99]
PriceList = zip(names,prices)
df = pd.DataFrame(data = PriceList, columns=['Names',’Prices’])
Listing 2-14

Creating a Dataset for the Exercise

Combining Data from Multiple Excel Files

In earlier lessons, we opened single files and put their data into individual dataframes. Sometimes we will need to combine the data from several Excel files into the same dataframe.

We can do this either the long way or the short way. First, let's see the long way (Listing 2-15).

import pandas as pd
import numpy as np
all_data = pd.DataFrame()
df = pd.read_excel("datasets/data1.xlsx")
all_data = all_data.append(df,ignore_index=True)
df = pd.read_excel("datasets/data2.xlsx")
all_data = all_data.append(df,ignore_index=True)
df = pd.read_excel("datasets/data3.xlsx")
all_data = all_data.append(df,ignore_index=True)
all_data.describe()
Listing 2-15

Long Way

  • Line 4: First, let's set all_data to an empty dataframe.

  • Line 6: Load the first Excel file into the dataframe df.

  • Line 7: Append the contents of df to the dataframe all_data.

  • Lines 9 & 10: Basically the same as lines 6 & 7, but for the next Excel file.

Why do we call this the long way? Because if we were loading a hundred files instead of three, it would take hundreds of lines of code to do it this way. In the words of my friends in the startup community, it doesn't scale well. The short way, however, does scale.

Now, let's see the short way (Listing 2-16).

import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob("datasets/data*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)
all_data.describe()
Listing 2-16

Short Way

  • Line 3: Import the glob library.

  • Line 5: Let's set all_data to an empty dataframe.

  • Line 6: This line will loop through all files that match the pattern.

  • Line 7: Load the Excel file in f into the dataframe df.

  • Line 8: Append the contents of df to the dataframe all_data.

Since we only have three datafiles, the difference in code isn't that noticeable. However, if we were loading a hundred files, the difference in the amount of code would be huge. This code will load all the Excel files whose names begin with data that are in the datasets directory no matter how many there are.

Your Turn

In the datasets/weekly_call_data folder, there are 104 files of weekly call data for two years. Your task is to try to load all of that data into one dataframe.

Loading Data from SQL

Normally, our data will come to us as files or database links. Let's learn how to load our data from a sqlite database file (Listing 2-17).

import pandas as pd
from sqlalchemy import create_engine
# Connect to sqlite db
db_file = r'datasets/gradedata.db'
engine = create_engine(r"sqlite:///{}"
        .format(db_file))
sql = 'SELECT * from test'
        'where Grades in (76,77,78)'
sales_data_df = pd.read_sql(sql, engine)
sales_data_df
Listing 2-17

Load Data from sqlite

This code creates a link to the database file called gradedata.db and runs a query against it. It then loads the data resulting from that query into the dataframe called sales_data_df. If you don't know the names of the tables in a sqlite database, you can find out by changing the SQL statement to that shown in Listing 2-18.

sql = "select name from sqlite_master"
    "where type = 'table';"
Listing 2-18

Finding the Table Names

Once you know the name of a table you wish to view (let's say it was test), if you want to know the names of the fields in that table, you can change your SQL statement to that shown in Listing 2-19.

sql = "select * from test;"
Listing 2-19

A Basic Query

Then, once you run sales_data_df.head() on the dataframe, you will be able to see the fields as headers at the top of each column.

As always, if you need more information about the command, you can run the code shown in Listing 2-20.

sales_data_df.read_sql?
Listing 2-20

Get Help on read_sql

Your Turn

Can you load data from the datasets/salesdata.db database?

Saving Data to SQL

See Listing 2-21 for an example of how to do this.

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList,
    columns=['Names', 'Grades'])
df
Listing 2-21

Create Dataset to Save

To export it to SQL, we can use the code shown in Listing 2-22.

import os
import sqlite3 as lite
db_filename = r'mydb.db'
con = lite.connect(db_filename)
df.to_sql('mytable',
    con,
    flavor='sqlite',
    schema=None,
    if_exists='replace',
index=True,
index_label=None,
chunksize=None,
dtype=None)
con.close()
Listing 2-22

Export Dataframe to sqlite

  • Line 14: mydb.db is the path and name of the sqlite database you wish to use.

  • Line 18: mytable is the name of the table in the database.

As always, if you need more information about the command, you can run the code shown in Listing 2-23.

df.to_sql?
Listing 2-23

Get Help on to_sql

Your Turn

This might be a little tricky, but can you create a sqlite table that contains the data found in datasets/gradedata.csv?

Random Numbers and Creating Random Data

Normally, you will use the techniques in this guide with datasets of real data. However, sometimes you will need to create random values.

Say we wanted to make a random list of baby names. We could get started as shown in Listing 2-24.

import pandas as pd
from numpy import random
from numpy.random import randint
names = ['Bob','Jessica','Mary','John','Mel']
Listing 2-24

Getting Started

First, we import our libraries as usual. In the last line, we create a list of the names we will randomly select from.

Next, we add the code shown in Listing 2-25.

random.seed(500)
Listing 2-25

Seeding Random Generator

This seeds the random number generator. If you use the same seed, you will get the same "random” numbers.

What we will try to do is this:
  1. 1.

    randint(low=0,high=len(names))

    Generates a random integer between zero and the length of the list names.

     
  2. 2.

    names[n]

    Selects the name where its index is equal to n.

     
  3. 3.

    for i in range(n)

    Loops until i is equal to n, i.e., 1,2,3,….n.

     
  4. 4.

    random_names =

    Selects a random name from the name list and does this n times.

     

We will do all of this in the code shown in Listing 2-26.

randnames = []
for i in range(1000):
    name = names[randint(low=0,high=len(names))]
    randnames.append(name)
Listing 2-26

Selecting 1000 Random Names

Now we have a list of 1000 random names saved in our random_names variable. Let's create a list of 1000 random numbers from 0 to 1000 (Listing 2-27).

births = []
for i in range(1000):
    births.append(randint(low=0, high=1000))    
Listing 2-27

Selecting 1000 Random Numbers

And, finally, zip the two lists together and create the dataframe (Listing 2-28).

BabyDataSet2 = list(zip(randnames,births))
df = pd.DataFrame(data = BabyDataSet2,
        columns=['Names', 'Births'])
df
Listing 2-28

Creating Dataset from the Lists of Random Names and Numbers

Your Turn

Create a dataframe called parkingtickets with 250 rows containing a name and a number between 1 and 25.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset