The first stage of data analysis is getting the data. Moving your data from where you have it stored into your analytical tools and back out again can be a difficult task if you don't know what you are doing. Python and its libraries try to make it as easy as possible.
CSV
Excel
SQL
Loading Data from CSV Files
Normally, data will come to us as files or database links. See Listing 2-1 to learn how to load data from a CSV file.
Loading Data from CSV File
Now, let's take a look at what our data looks like (Listing 2-2):
Display First Five Lines of Data
As you can see, our dataframe lacks column headers. Or, rather, there are headers, but they weren't loaded as headers; they were loaded as row one of your data. To load data that includes headers, you can use the code shown in Listing 2-3.
Loading Data from CSV File with Headers
Then, as before, we take a look at what the data looks like by running the code shown in Listing 2-4.
Display First Five Lines of Data
If you have a dataset that doesn't include headers, you can add them afterward. To add them, we can use one of the options shown in Listing 2-5.
Loading Data from CSV File and Adding Headers
Your Turn
Can you make a dataframe from a file you have uploaded and imported on your own? Let's find out. Go to the following website, which contains U.S. Census data ( http://census.ire.org/data/bulkdata.html ), and download the CSV datafile for a state. Now, try to import that data into Python.
Saving Data to CSV
Maybe you want to save your progress when analyzing data. Maybe you are just using Python to massage some data for later analysis in another tool. Or maybe you have some other reason to export your dataframe to a CSV file. The code shown in Listing 2-6 is an example of how to do this.
Exporting a Dataset to CSV
Lines 1 to 6 are the lines that create the dataframe. Line 7 is the code to export the dataframe df to a CSV file called studentgrades.csv.
The only parameters we use are index and header. Setting these parameters to false will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.
If you want in-depth information about the to_csv method, you can, of course, use the code shown in Listing 2-7.
Getting Help on to_csv
Your Turn
Can you export the dataframe created by the code in Listing 2-8 to CSV?
Creating a Dataset for the Exercise
Loading Data from Excel Files
Normally, data will come to us as files or database links. Let's see how to load data from an Excel file (Listing 2-9).
Loading Data from Excel File
Now, let's take a look at what our data looks like (Listing 2-10).
Display First Five Lines of Data
If you wish to change or simplify your column names, you can run the code shown in Listing 2-11.
Changing Column Names
Your Turn
Can you make a dataframe from a file you have uploaded and imported on your own? Let's find out. Go to https://www.census.gov/support/USACdataDownloads.html and download one of the Excel datafiles at the bottom of the page. Now, try to import that data into Python.
Saving Data to Excel Files
The code shown in Listing 2-12 is an example of how to do this.
Exporting a Dataframe to Excel
If you wish, you can save different dataframes to different sheets, and with one .save() you will create an Excel file with multiple worksheets (see Listing 2-13).
Exporting Multiple Dataframes to Excel
Note
This assumes that you have another dataframe already loaded into the df2 variable.
Your Turn
Can you export the dataframe created by the code shown in Listing 2-14 to Excel?
Creating a Dataset for the Exercise
Combining Data from Multiple Excel Files
In earlier lessons, we opened single files and put their data into individual dataframes. Sometimes we will need to combine the data from several Excel files into the same dataframe.
We can do this either the long way or the short way. First, let's see the long way (Listing 2-15).
Long Way
Line 4: First, let's set all_data to an empty dataframe.
Line 6: Load the first Excel file into the dataframe df.
Line 7: Append the contents of df to the dataframe all_data.
Lines 9 & 10: Basically the same as lines 6 & 7, but for the next Excel file.
Why do we call this the long way? Because if we were loading a hundred files instead of three, it would take hundreds of lines of code to do it this way. In the words of my friends in the startup community, it doesn't scale well. The short way, however, does scale.
Now, let's see the short way (Listing 2-16).
Short Way
Line 3: Import the glob library.
Line 5: Let's set all_data to an empty dataframe.
Line 6: This line will loop through all files that match the pattern.
Line 7: Load the Excel file in f into the dataframe df.
Line 8: Append the contents of df to the dataframe all_data.
Since we only have three datafiles, the difference in code isn't that noticeable. However, if we were loading a hundred files, the difference in the amount of code would be huge. This code will load all the Excel files whose names begin with data that are in the datasets directory no matter how many there are.
Your Turn
In the datasets/weekly_call_data folder, there are 104 files of weekly call data for two years. Your task is to try to load all of that data into one dataframe.
Loading Data from SQL
Normally, our data will come to us as files or database links. Let's learn how to load our data from a sqlite database file (Listing 2-17).
Load Data from sqlite
This code creates a link to the database file called gradedata.db and runs a query against it. It then loads the data resulting from that query into the dataframe called sales_data_df. If you don't know the names of the tables in a sqlite database, you can find out by changing the SQL statement to that shown in Listing 2-18.
Finding the Table Names
Once you know the name of a table you wish to view (let's say it was test), if you want to know the names of the fields in that table, you can change your SQL statement to that shown in Listing 2-19.
A Basic Query
Then, once you run sales_data_df.head() on the dataframe, you will be able to see the fields as headers at the top of each column.
As always, if you need more information about the command, you can run the code shown in Listing 2-20.
Get Help on read_sql
Your Turn
Can you load data from the datasets/salesdata.db database?
Saving Data to SQL
See Listing 2-21 for an example of how to do this.
Create Dataset to Save
To export it to SQL, we can use the code shown in Listing 2-22.
Export Dataframe to sqlite
Line 14: mydb.db is the path and name of the sqlite database you wish to use.
Line 18: mytable is the name of the table in the database.
As always, if you need more information about the command, you can run the code shown in Listing 2-23.
Get Help on to_sql
Your Turn
This might be a little tricky, but can you create a sqlite table that contains the data found in datasets/gradedata.csv?
Random Numbers and Creating Random Data
Normally, you will use the techniques in this guide with datasets of real data. However, sometimes you will need to create random values.
Say we wanted to make a random list of baby names. We could get started as shown in Listing 2-24.
Getting Started
First, we import our libraries as usual. In the last line, we create a list of the names we will randomly select from.
Next, we add the code shown in Listing 2-25.
Seeding Random Generator
This seeds the random number generator. If you use the same seed, you will get the same "random” numbers.
- 1.
randint(low=0,high=len(names))
Generates a random integer between zero and the length of the list names.
- 2.
names[n]
Selects the name where its index is equal to n.
- 3.
for i in range(n)
Loops until i is equal to n, i.e., 1,2,3,….n.
- 4.
random_names =
Selects a random name from the name list and does this n times.
We will do all of this in the code shown in Listing 2-26.
Selecting 1000 Random Names
Now we have a list of 1000 random names saved in our random_names variable. Let's create a list of 1000 random numbers from 0 to 1000 (Listing 2-27).
Selecting 1000 Random Numbers
And, finally, zip the two lists together and create the dataframe (Listing 2-28).
Creating Dataset from the Lists of Random Names and Numbers
Your Turn
Create a dataframe called parkingtickets with 250 rows containing a name and a number between 1 and 25.