The pandas forms a core component of the Python data analysis corpus. The distinguishing feature of pandas is the suite of data structures that it provides, which is naturally suited to data analysis, primarily the DataFrame and to a lesser extent Series (1-D vectors) and Panel (3D tables).
Simply put, pandas and statstools can be described as Python's answer to R, the data analysis and statistical programming language that provides both the data structures, such as R-data frames, and a rich statistical library for data analysis.
The benefits of pandas over using a language such as Java, C, or C++ for data analysis are manifold:
Country |
Year |
CO2 Emissions |
Power Consumption |
Fertility Rate |
Internet Usage Per 1000 People |
Life Expectancy |
Population |
---|---|---|---|---|---|---|---|
Belarus |
2000 |
5.91 |
2988.71 |
1.29 |
18.69 |
68.01 |
1.00E+07 |
Belarus |
2001 |
5.87 |
2996.81 |
43.15 |
9970260 | ||
Belarus |
2002 |
6.03 |
2982.77 |
1.25 |
89.8 |
68.21 |
9925000 |
Belarus |
2003 |
6.33 |
3039.1 |
1.25 |
162.76 |
9873968 | |
Belarus |
2004 |
3143.58 |
1.24 |
250.51 |
68.39 |
9824469 | |
Belarus |
2005 |
1.24 |
347.23 |
68.48 |
9775591 |
In a CSV file, this data that we wish to read would look like the following:
Country,Year,CO2Emissions,PowerConsumption,FertilityRate,InternetUsagePer1000, LifeExpectancy, PopulationBelarus,2000,5.91,2988.71,1.29,18.69,68.01,1.00E+07Belarus,2001,5.87,2996.81,,43.15,,9970260Belarus,2002,6.03,2982.77,1.25,89.8,68.21,9925000...Philippines,2000,1.03,514.02,,20.33,69.53,7.58E+07Philippines,2001,0.99,535.18,,25.89,,7.72E+07Philippines,2002,0.99,539.74,3.5,44.47,70.19,7.87E+07...Morocco,2000,1.2,489.04,2.62,7.03,68.81,2.85E+07Morocco,2001,1.32,508.1,2.5,13.87,,2.88E+07Morocco,2002,1.32,526.4,2.5,23.99,69.48,2.92E+07..
The data here is taken from World Bank Economic data available at: http://data.worldbank.org.
In Java, we would have to write the following code:
public class CSVReader { public static void main(String[] args) { String[] csvFile=args[1];CSVReader csvReader = new csvReader();List<Map>dataTable=csvReader.readCSV(csvFile);}public void readCSV(String[] csvFile){BufferedReader bReader=null;String line="";String delim=","; //Initialize List of maps, each representing a line of the csv fileList<Map> data=new ArrayList<Map>(); try {bufferedReader = new BufferedReader(new FileReader(csvFile));// Read the csv file, line by linewhile ((line = br.readLine()) != null){String[] row = line.split(delim);Map<String,String> csvRow=new HashMap<String,String>(); csvRow.put('Country')=row[0]; csvRow.put('Year')=row[1];csvRow.put('CO2Emissions')=row[2]; csvRow.put('PowerConsumption')=row[3];csvRow.put('FertilityRate')=row[4];csvRow.put('InternetUsage')=row[1];csvRow.put('LifeExpectancy')=row[6];csvRow.put('Population')=row[7];data.add(csvRow); } } catch (FileNotFoundException e) {e.printStackTrace(); } catch (IOException e) {e.printStackTrace(); } return data;}
But, using pandas, it would take just two lines of code:
import pandas as pdworldBankDF=pd.read_csv('worldbank.csv')
In addition, pandas is built upon the NumPy libraries and hence, inherits many of the performance benefits of this package, especially when it comes to numerical and scientific computing. One oft-touted drawback of using Python is that as a scripting language, its performance relative to languages like Java/C/C++ has been rather slow. However, this is not really the case for pandas.