In this section, we're going to download the dataset we'll be using throughout this book, the UK Road Safety Data, and take a bird's-eye view of it. In total, this dataset provides more than 15 million rows across three CSV files.
The download contains the following files:
Accidents7904.csv
Casualty7904.csv
Road-Accident-Safety-Data-Guide-1979-2004.xls
Vehicles7904.csv
The CSV files contain the data that we are going to use in the recipes throughout this book. The Excel file is pure magic, though. It is a reference for all the data: it lists the fields in each dataset and explains the coding used for their values.
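One quick way to get that bird's-eye view is to read each CSV file's header and count its rows. The following is a minimal sketch rather than a recipe from this book: it assumes the files have been unzipped into a folder named `data` (a hypothetical location), that each file starts with a header row, and that the text can be read as UTF-8. The helper name `summarize_csv` is made up for illustration.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV file with a header row."""
    # errors="replace" keeps the scan going if the file is not valid UTF-8
    # (the real encoding of the files is an assumption here).
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first line holds the column names
        row_count = sum(1 for _ in reader)  # stream the rest without loading it all
    return header, row_count


if __name__ == "__main__":
    data_dir = Path("data")  # assumption: where the download was unzipped
    for name in ("Accidents7904.csv", "Casualty7904.csv", "Vehicles7904.csv"):
        path = data_dir / name
        if path.exists():
            header, rows = summarize_csv(path)
            print(f"{name}: {rows:,} rows, {len(header)} columns")
        else:
            print(f"{name}: not found in {data_dir}")
```

Because the reader streams one row at a time, this works even on files with millions of rows, although it may take a little while to finish.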
Coding data is a very important preprocessing step. Most analysis tools that you will use expect to see numbers rather than text labels, such as the name of a city or a road type. The reason for this is that computers don't understand context like we humans do. Is Paris a city or a person? It depends. Computers can't make that judgment call. To get around this, we assign a number to each text value. That has already been done for this dataset.
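To make the idea of coding concrete, here is a small sketch that assigns a number to each distinct text value and keeps a lookup so the numbers can be turned back into labels. The road type labels and the resulting codes are invented for illustration; they are not the codes used in the UK Road Safety Data, and `build_codebook` is a hypothetical helper name.

```python
def build_codebook(labels):
    """Assign each distinct label an integer code, in order of first appearance."""
    codebook = {}
    for label in labels:
        if label not in codebook:
            codebook[label] = len(codebook) + 1  # codes start at 1
    return codebook


# Hypothetical labels, for illustration only.
road_types = ["Roundabout", "Single carriageway", "Roundabout", "Dual carriageway"]

codebook = build_codebook(road_types)
encoded = [codebook[label] for label in road_types]

# Reverse lookup: this is the role a data guide plays for an already-coded dataset.
decode = {code: label for label, code in codebook.items()}

print(codebook)
print(encoded)  # [1, 2, 1, 3]
print([decode[code] for code in encoded])
```

The analysis tools work with the numbers, while the lookup lets us translate results back into something a person can read.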
It is often said that up to 90 percent of the time spent on most data projects goes to preparing the data for analysis. Anecdotal evidence from my own work, and from the people I speak with, bears this out. While you will learn a number of techniques for cleaning and standardizing data, known as preprocessing in the data world, the UK Road Safety Data is already analysis-ready. In addition, it provides a large amount of data, millions of rows, for us to work with.
This dataset contains detailed road safety data about the circumstances of personal injury road accidents in Great Britain from 1979 onward, the types of vehicles involved (including make and model), and the resulting casualties.