This chapter compares pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate in pandas functionality that they have seen in R code. It focuses on some key features available to R users and uses illustrative examples to show how to achieve similar functionality in pandas. This chapter assumes that you have the R statistical package installed. If not, it can be downloaded and installed from http://www.r-project.org/.
By the end of the chapter, data analysis users should have a good grasp of how the capabilities of R compare with those of pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this chapter include the following:
R has five primitive or atomic types: character, numeric, integer, complex, and logical.
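It also has the following, more complex, container types:
Vector: This is similar to numpy.array. It can only contain objects of the same type.
List: This is a heterogeneous container. Its pandas equivalent is a Series.
DataFrame: This is a heterogeneous 2D container. Its pandas equivalent is a DataFrame.
Matrix: This is a 2D container of objects of the same type. It is similar to numpy.matrix.
For this chapter, we will focus on list and DataFrame, whose pandas equivalents are Series and DataFrame.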
For more information on R data types, refer to the following document: http://www.statmethods.net/input/datatypes.html.
For NumPy data types, refer to the following documents: http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html.
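The correspondence between R containers and their Python counterparts can be checked with a short sketch. The variable names and sample values below are illustrative assumptions and are not part of the original text; the sketch assumes that NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# Like an R vector, a NumPy array holds a single element type:
# mixing an int and a float upcasts every element to float64.
vec = np.array([1, 2.5, 3])

# Like an R list, a Series can hold mixed types; pandas then
# falls back to the generic 'object' dtype.
mixed = pd.Series([23, 'donkey', 5.6])

# Like an R data.frame, a DataFrame is a 2D container in which
# each column carries its own type.
frame = pd.DataFrame({'name': ['a', 'b'], 'value': [1.5, 2.5]})

print(vec.dtype)             # float64
print(mixed.dtype)           # object
print(frame['value'].dtype)  # float64
```

The object dtype is the price of heterogeneity: pandas stores such a Series as generic Python objects rather than as a typed numeric array.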
An R list can be created explicitly with a list declaration, as shown here:
> h_lst <- list(23,'donkey',5.6,1+4i,TRUE)
> h_lst
[[1]]
[1] 23

[[2]]
[1] "donkey"

[[3]]
[1] 5.6

[[4]]
[1] 1+4i

[[5]]
[1] TRUE

> typeof(h_lst)
[1] "list"
Here is the pandas equivalent, in which we create a Python list and then construct a Series from it:
In [8]: h_list=[23, 'donkey', 5.6,1+4j, True]
In [9]: import pandas as pd
        h_ser=pd.Series(h_list)
In [10]: h_ser
Out[10]: 0        23
         1    donkey
         2       5.6
         3    (1+4j)
         4      True
         dtype: object
Indexing starts at 0 in pandas, as the index column of the preceding output shows, whereas in R it starts at 1. We can also confirm that h_ser is a Series:
In [11]: type(h_ser)
Out[11]: pandas.core.series.Series
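The difference in index origin can be made concrete with a minimal sketch. It rebuilds the same Series so that it runs on its own; the names first and last are illustrative and not from the original text.

```python
import pandas as pd

h_ser = pd.Series([23, 'donkey', 5.6, 1+4j, True])

# Position 0 is the first element in pandas;
# in R, h_lst[[1]] returns the same value.
first = h_ser.iloc[0]

# The last of the five elements sits at position 4 (R: h_lst[[5]]).
last = h_ser.iloc[4]

print(first)  # 23
print(last)   # True
```

Using iloc makes it explicit that the lookup is by position rather than by index label.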
We can construct an R DataFrame by calling the data.frame() constructor and then display it as follows:
> stocks_table <- data.frame(Symbol=c('GOOG','AMZN','FB','AAPL',
                                      'TWTR','NFLX','LNKD'),
                             Price=c(518.7,307.82,74.9,109.7,37.1,
                                     334.48,219.9),
                             MarketCap=c(352.8,142.29,216.98,643.55,23.54,20.15,27.31))
> stocks_table
  Symbol  Price MarketCap
1   GOOG 518.70    352.80
2   AMZN 307.82    142.29
3     FB  74.90    216.98
4   AAPL 109.70    643.55
5   TWTR  37.10     23.54
6   NFLX 334.48     20.15
7   LNKD 219.90     27.31
Here, we construct a pandas DataFrame and display it:
In [29]: stocks_df=pd.DataFrame({'Symbol':['GOOG','AMZN','FB','AAPL',
                                           'TWTR','NFLX','LNKD'],
                                 'Price':[518.7,307.82,74.9,109.7,37.1,
                                          334.48,219.9],
                                 'MarketCap($B)' : [352.8,142.29,216.98,643.55,
                                                    23.54,20.15,27.31]
                                 })
         # reindex_axis() was removed in pandas 1.0; reindex(columns=...) replaces it
         stocks_df=stocks_df.reindex(columns=sorted(stocks_df.columns,reverse=True))
         stocks_df
Out[29]:
  Symbol   Price  MarketCap($B)
0   GOOG  518.70         352.80
1   AMZN  307.82         142.29
2     FB   74.90         216.98
3   AAPL  109.70         643.55
4   TWTR   37.10          23.54
5   NFLX  334.48          20.15
6   LNKD  219.90          27.31
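R numbers the rows of the DataFrame from 1, while the default pandas index starts at 0. If matching R's row labels is desirable, one possible approach is to assign a 1-based index. This is only a sketch and not part of the original example; note that it changes the labels only, so positional access remains 0-based.

```python
import pandas as pd

stocks_df = pd.DataFrame({
    'Symbol': ['GOOG', 'AMZN', 'FB', 'AAPL', 'TWTR', 'NFLX', 'LNKD'],
    'Price': [518.7, 307.82, 74.9, 109.7, 37.1, 334.48, 219.9],
    'MarketCap($B)': [352.8, 142.29, 216.98, 643.55, 23.54, 20.15, 27.31],
})

# Relabel the rows 1..7 so that the printed frame matches R's row numbers.
stocks_df.index = range(1, len(stocks_df) + 1)

# Label-based lookup now uses the R-style row number ...
print(stocks_df.loc[1, 'Symbol'])      # GOOG
# ... while position-based lookup is still 0-based.
print(stocks_df['Symbol'].iloc[0])     # GOOG
```

Keeping the default 0-based index is usually simpler in pandas code; relabeling is mainly useful when output has to be compared side by side with R.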