Description of pandas' modules and files

In this section, we provide brief descriptions of the various submodules and files that make up pandas' library.

pandas/core

This module contains the core submodules of pandas. They are discussed as follows:

  • api.py: This imports some key modules for later use.
  • array.py: This isolates pandas' exposure to NumPy, that is, all direct NumPy usage.
  • base.py: This defines fundamental classes such as StringMixin and PandasObject. PandasObject is the base class for various pandas objects such as sparse.array.SparseArray/SparseList, internals.Block, internals.BlockManager, generic.NDFrame, groupby.GroupBy, base.FrozenList, base.FrozenNDArray, io.sql.PandasSQL, io.sql.PandasSQLTable, and tseries.period.Period. The module also defines IndexOpsMixin and DatetimeIndexOpsMixin.
  • common.py: This defines common utility methods for handling data structures. For example, the isnull function detects missing values.
  • config.py: This is the module for handling package-wide configurable objects. It defines the following classes: OptionError, DictWrapper, CallableDynamicDoc, option_context, config_init.
  • datetools.py: This is a collection of functions that deal with dates in Python.
  • frame.py: This defines pandas' DataFrame class and its various methods. DataFrame inherits from NDFrame (see below).
  • generic.py: This defines the generic NDFrame base class, which is a base class for pandas' DataFrame, Series, and Panel classes. NDFrame is derived from PandasObject, which is defined in base.py. An NDFrame can be regarded as an N-dimensional version of a pandas DataFrame. For more information on this, go to http://nullege.com/codes/search/pandas.core.generic.NDFrame.
  • categorical.py: This defines Categorical, which is a class that derives from PandasObject and represents categorical variables a la R/S-plus (we will expand on this a bit more later).
  • format.py: This defines a whole host of Formatter classes such as CategoricalFormatter, SeriesFormatter, TableFormatter, DataFrameFormatter, HTMLFormatter, CSVFormatter, ExcelCell, ExcelFormatter, GenericArrayFormatter, FloatArrayFormatter, IntArrayFormatter, Datetime64Formatter, Timedelta64Formatter, and EngFormatter.
  • groupby.py: This defines various classes that enable the groupby functionality. They are discussed as follows:
    • Splitter classes: This includes DataSplitter, ArraySplitter, SeriesSplitter, FrameSplitter, and NDFrameSplitter
    • Grouper/Grouping classes: This includes Grouper, GroupBy, BaseGrouper, BinGrouper, Grouping, SeriesGroupBy, NDFrameGroupBy
  • ops.py: This defines an internal API for arithmetic operations on PandasObjects. It defines functions that add arithmetic methods to objects. It defines a _create_methods meta method, which is used to create other methods using arithmetic, comparison, and Boolean method constructors. The add_methods method takes a list of new methods, adds them to the existing list of methods, and binds them to their appropriate classes. The add_special_arithmetic_methods and add_flex_arithmetic_methods methods call _create_methods and add_methods to add arithmetic methods to a class.

    It also defines the _TimeOp class, which is a wrapper for datetime-related arithmetic operations. In addition, it contains wrapper functions for arithmetic, comparison, and Boolean operations on Series, DataFrame, and Panel objects: _arith_method_SERIES(..), _comp_method_SERIES(..), _bool_method_SERIES(..), _flex_method_SERIES(..), _arith_method_FRAME(..), _comp_method_FRAME(..), _flex_comp_method_FRAME(..), _arith_method_PANEL(..), and _comp_method_PANEL(..).

  • index.py: This defines the Index class and its related functionality. Index is used by all pandas' objects—Series, DataFrame, and Panel—to store axis labels. Underneath it is an immutable array that provides an ordered set that can be sliced.
  • internals.py: This defines multiple object classes. These are listed as follows:
    • Block: This is a homogeneously typed N-dimensional numpy.ndarray object with additional functionality for pandas. For example, it uses __slots__ to restrict the attributes of the object to 'ndim', 'values', and '_mgr_locs'. It acts as the base class for other Block subclasses.
    • NumericBlock: This is the base class for Blocks with the numeric type.
    • FloatOrComplexBlock: This is the base class for FloatBlock and ComplexBlock. It inherits from NumericBlock.
    • ComplexBlock: This is the class that handles the Block objects with the complex type.
    • FloatBlock: This is the class that handles the Block objects with the float type.
    • IntBlock: This is the class that handles the Block objects with the integer type.
    • TimeDeltaBlock, BoolBlock, and DatetimeBlock: These are the Block classes for timedelta, Boolean, and datetime.
    • ObjectBlock: This is the class that handles Block objects for user-defined objects.
    • SparseBlock: This is the class that handles sparse arrays of the same type.
    • BlockManager: This is the class that manages a set of Block objects. It is not a public API class.
    • SingleBlockManager: This is the class that manages a single Block.
    • JoinUnit: This is the utility class for Block objects.
  • matrix.py: This imports DataFrame as DataMatrix.
  • nanops.py: These are the classes and functionality for handling NaN values.
  • ops.py: This defines arithmetic operations for pandas' objects. It is not a public API.
  • panel.py, panel4d.py, and panelnd.py: These provide the functionality for the pandas' Panel object.
  • series.py: This defines the pandas Series class and its various methods. Series inherits from NDFrame and IndexOpsMixin.
  • sparse.py: This defines imports for handling sparse data structures. Sparse data structures are compressed, whereby data points matching NaN or missing values are omitted. For more information on this, go to http://pandas.pydata.org/pandas-docs/stable/sparse.html.
  • strings.py: This has various functions for handling strings.
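A few of these core pieces can be observed through the public API. The following is a minimal sketch, assuming a recent pandas release in which isnull, Index, and Categorical are exposed at the top level and pandas.core.generic.NDFrame is still importable; internal module paths have moved between versions, so the NDFrame import in particular should be treated as an assumption.

```python
import numpy as np
import pandas as pd
from pandas.core.generic import NDFrame  # internal path; assumed importable

# DataFrame and Series both derive from the generic NDFrame base class.
frame = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
series = frame["x"]
both_are_ndframes = isinstance(frame, NDFrame) and isinstance(series, NDFrame)

# isnull flags missing values element by element.
missing_mask = pd.isnull(series)

# An Index stores axis labels and cannot be modified in place.
labels = pd.Index(["a", "b", "c"])
try:
    labels[0] = "z"
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

# A Categorical stores each value as an integer code into its categories.
grades = pd.Categorical(["low", "high", "low"], categories=["low", "high"])

print(both_are_ndframes, missing_mask.tolist(), index_is_immutable)
print(list(grades.categories), grades.codes.tolist())
```

The sample data and variable names are illustrative only.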

pandas/io

This module contains various modules for data I/O. These are discussed as follows:

  • api.py: This defines various imports for the data I/O API.
  • auth.py: This defines various methods dealing with authentication.
  • common.py: This defines the common functionality for I/O API.
  • data.py: This defines classes and methods for handling data. The DataReader method reads data from various online sources such as Yahoo and Google.
  • date_converters.py: This defines date conversion functions.
  • excel.py: This module parses and converts Excel data. This defines ExcelFile and ExcelWriter classes.
  • ga.py: This is the module for the Google Analytics functionality.
  • gbq.py: This is the module for Google's BigQuery.
  • html.py: This is the module for dealing with HTML I/O.
  • json.py: This is the module for dealing with JSON I/O in pandas. This defines the Writer, SeriesWriter, FrameWriter, Parser, SeriesParser, and FrameParser classes.
  • packer.py: This provides msgpack serialization support for reading and writing pandas data structures to disk.
  • parsers.py: This is the module that defines various functions and classes that are used in parsing and processing files to create pandas' DataFrames. All the three read_* functions discussed as follows have multiple configurable options for reading. See this reference for more details: http://bit.ly/1e4Xqo1.
    • read_csv(..): This defines the pandas.read_csv() function that is useful to read the contents of a CSV file into a DataFrame.
    • read_table(..): This reads a tab-separated table file into a DataFrame.
    • read_fwf(..): This reads a fixed-width format file into a DataFrame.
    • TextFileReader: This is the class that is used for reading text files.
    • ParserBase: This is the base class for parser objects.
    • CParserWrapper, PythonParser: These are the parsers for C and Python respectively. They both inherit from ParserBase.
    • FixedWidthReader: This is the class for reading fixed-width data. A fixed-width data file contains fields in specific positions within the file.
    • FixedWidthFieldParser: This is the class for parsing fixed-width fields. It inherits from PythonParser.
  • pickle.py: This provides methods for pickling (serializing) pandas objects. These are discussed as follows:
    • to_pickle(..): This serializes an object to a file.
    • read_pickle(..): This reads a serialized object from a file into a pandas object. It should only be used with trusted sources.
  • pytables.py: This is an interface to PyTables module for reading and writing pandas data structures to files on disk.
  • sql.py: It is a collection of classes and functions used to enable the retrieval of data from relational databases that attempts to be database agnostic. These are discussed as follows:
    • PandasSQL: This is the base class for interfacing pandas with SQL. It provides dummy read_sql and to_sql methods that must be implemented by subclasses.
    • PandasSQLAlchemy: This is the subclass of PandasSQL that enables conversions between DataFrame and SQL databases using SQLAlchemy.
    • PandasSQLTable class: This maps pandas tables (DataFrame) to SQL tables.
    • pandasSQL_builder(..): This returns the correct PandasSQL subclass based on the provided parameters.
    • PandasSQLTableLegacy class: This is the legacy support version of PandasSQLTable.
    • PandasSQLLegacy class: This is the legacy support version of PandasSQL.
    • get_schema(..): This gets the SQL database table schema for a given frame.
    • read_sql_table(..): This reads a SQL database table into a DataFrame.
    • read_sql_query(..): This reads the result of a SQL query into a DataFrame.
    • read_sql(..): This reads a SQL query or table into a DataFrame.
    • to_sql(..): This writes records that are stored in a DataFrame to a SQL database.
  • stata.py: This contains tools for processing Stata files into pandas DataFrames.
  • wb.py: This is the module for downloading data from World Bank's website.
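The parsing and SQL functions described above can be tried without any files on disk. The following is a minimal sketch, assuming a recent pandas release where read_csv, read_fwf, and read_sql are available at the top level and to_sql is a DataFrame method; the sample data, the people table name, and the use of an in-memory SQLite database are illustrative choices, not part of the library.

```python
import io
import sqlite3

import pandas as pd

# read_csv: parse comma-separated text (an in-memory buffer stands in for a file).
csv_text = "name,age\nalice,30\nbob,25\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# read_fwf: each field occupies a fixed range of character positions.
# Here the name takes the first 6 characters and the age the next 2.
fwf_text = "alice 30\nbob   25\n"
fwf_df = pd.read_fwf(io.StringIO(fwf_text), widths=[6, 2], names=["name", "age"])

# to_sql / read_sql: round trip through an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
csv_df.to_sql("people", conn, index=False)
sql_df = pd.read_sql("SELECT name, age FROM people WHERE age > 26", conn)
conn.close()

print(csv_df)
print(fwf_df)
print(sql_df)
```

Both parsers should yield the same two-row table, and the query should return only the row whose age exceeds 26.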

pandas/tools

  • util.py: This has miscellaneous util functions defined such as match(..), cartesian_product(..), and compose(..).
  • tile.py: This has a set of functions that enable quantization of input data and hence tile functionality. Most of the functions are internal, except for cut(..) and qcut(..).
  • rplot.py: This is the module that provides the functionality to generate trellis plots in pandas.
  • plotting.py: This provides a set of plotting functions that take a Series or DataFrame as an argument.
    • scatter_matrix(..): This draws a matrix of scatter plots
    • andrews_curves(..): This plots multivariate data as curves that are created using samples as coefficients for a Fourier series
    • parallel_coordinates(..): This is a plotting technique that allows you to see clusters in data and visually estimate statistics
    • lag_plot(..): This is used to check whether a dataset or a time series is random
    • autocorrelation_plot(..): This is used for checking randomness in a time series
    • bootstrap_plot(..): This plot is used to determine the uncertainty of a statistical measure such as mean or median in a visual manner
    • radviz(..): This plot is used to visualize multivariate data

      Tip

      Reference for the preceding information is from: http://pandas.pydata.org/pandas-docs/stable/visualization.html

  • pivot.py: This module handles pivot tables in pandas. Its main function is pandas.tools.pivot.pivot_table(..), which creates a spreadsheet-like pivot table as a DataFrame.

    Tip

    Reference for the preceding information is from: http://pandas.pydata.org/pandas-docs/stable/reshaping.html

  • merge.py: This provides functions for combining the Series, DataFrame, and Panel objects such as merge(..) and concat(..)
  • describe.py: This provides a single value_range(..) function that returns the maximum and minimum of a DataFrame as a Series.
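The binning, pivoting, and combining functions listed above can be sketched on a small table. This example assumes a recent pandas release in which cut, pivot_table, merge, and concat are exposed at the top level rather than under pandas.tools; the sales and managers tables are made-up data for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "product": ["a", "b", "a", "b"],
        "units": [10, 20, 30, 40],
    }
)

# cut: assign each value to one of a set of explicit, right-closed bins.
size = pd.cut(sales["units"], bins=[0, 15, 35, 50], labels=["small", "medium", "large"])

# pivot_table: spreadsheet-style summary with regions as rows, products as columns.
table = pd.pivot_table(
    sales, values="units", index="region", columns="product", aggfunc="sum"
)

# merge: database-style join on a shared key column.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["kim", "lee"]})
joined = pd.merge(sales, managers, on="region")

# concat: stack objects along an axis, renumbering the rows.
stacked = pd.concat([sales, sales], ignore_index=True)

print(list(size))
print(table)
print(joined)
```

qcut(..) works like cut(..) but chooses the bin edges from quantiles of the data instead of taking them as an argument.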

pandas/sparse

This is the module that provides sparse implementations of Series, DataFrame, and Panel. By sparse, we mean arrays where values such as missing or NA are omitted rather than kept as 0.

For more information on this, go to http://pandas.pydata.org/pandas-docs/version/stable/sparse.html.

  • api.py: It is a set of convenience imports
  • array.py: It is an implementation of the SparseArray data structure
  • frame.py: It is an implementation of the SparseDataFrame data structure
  • list.py: It is an implementation of the SparseList data structure
  • panel.py: It is an implementation of the SparsePanel data structure
  • series.py: It is an implementation of the SparseSeries data structure
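The idea of omitting missing values can be seen with a sparse array. This is a minimal sketch, assuming a recent pandas release where the sparse array type is exposed as pandas.arrays.SparseArray; the other sparse classes listed above belong to the module layout described here and may not exist in newer versions.

```python
import numpy as np
import pandas as pd

# Only the two non-missing values are physically stored; NaN is the fill value.
dense = np.array([1.0, np.nan, np.nan, 4.0])
sparse = pd.arrays.SparseArray(dense)

stored_values = sparse.sp_values.tolist()  # the values actually kept
density = sparse.density                   # fraction of points that are stored

# Converting back recovers the original dense array, gaps included.
round_trip = np.asarray(sparse)

print(stored_values, density)
print(round_trip)
```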

pandas/stats

  • api.py: This is a set of convenience imports.
  • common.py: This defines internal functions called by other functions in a module.
  • fama_macbeth.py: This contains class definitions and functions for the Fama-Macbeth regression. For more information on FM regression, go to http://en.wikipedia.org/wiki/Fama-MacBeth_regression.
  • interface.py: It defines ols(..) which returns an Ordinary Least Squares (OLS) regression object. It imports from pandas.stats.ols module.
  • math.py: This has useful functions defined as follows:
    • rank(..), solve(..), and inv(..): These are used for matrix rank, solution, and inverse respectively
    • is_psd(..): This checks the positive semi-definiteness of a matrix
    • newey_west(..): This is for covariance matrix computation
    • calc_F(..): This computes F-statistic
  • misc.py: This is used for miscellaneous functions.
  • moments.py: This provides rolling and expanding statistical measures including moments that are implemented in Cython. These methods include: rolling_count(..), rolling_cov(..), rolling_corr(..), rolling_corr_pairwise(..), rolling_quantile(..), rolling_apply(..), rolling_window(..), expanding_count(..), expanding_quantile(..), expanding_cov(..), expanding_corr(..), expanding_corr_pairwise(..), expanding_apply(..), ewma(..), ewmvar(..), ewmstd(..), ewmcov(..), and ewmcorr(..).
  • ols.py: This implements OLS and provides the OLS and MovingOLS classes. OLS runs a full sample Ordinary Least-Squares Regression, whereas MovingOLS generates a rolling or an expanding simple OLS.
  • plm.py: This provides linear regression objects for Panel data. These classes are discussed as follows:
    • PanelOLS: This is the OLS for Panel object
    • MovingPanelOLS: This is the rolling/expanded OLS for Panel object
    • NonPooledPanelOLS: This is the nonpooled OLS for Panel object
  • var.py: This provides vector auto-regression classes discussed as follows:
    • VAR: This is the vector auto-regression on multi-variate data in Series and DataFrames
    • PanelVAR: This is the vector auto-regression on multi-variate data in Panel objects

      Tip

      For more information on vector autoregression, go to: http://en.wikipedia.org/wiki/Vector_autoregression
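The rolling, expanding, and exponentially weighted statistics named above can be illustrated as follows. This is a sketch under two stated assumptions: it uses the method-based interface of recent pandas releases (Series.rolling, Series.expanding, and Series.ewm) rather than the rolling_*, expanding_*, and ewm* functions of the module described here, and it uses NumPy's least-squares solver as a stand-in to show what an ordinary least squares fit computes, not the ols(..) interface itself.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling statistic: mean over a sliding window of 3 observations.
rolling_mean = prices.rolling(window=3).mean()

# Expanding statistic: the window grows to include all observations so far.
expanding_sum = prices.expanding().sum()

# Exponentially weighted mean: recent observations receive more weight.
ewm_mean = prices.ewm(span=3).mean()

# Ordinary least squares for y = intercept + slope * x, solved with NumPy.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
design = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)

print(rolling_mean.tolist())
print(expanding_sum.tolist())
print(round(intercept, 6), round(slope, 6))
```

The first two rolling values are NaN because a full window of three observations is not yet available at those positions.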

pandas/util

  • testing.py: This provides the assertion, debug, unit test, and other classes/functions for use in testing. It contains many special assert functions that make it easier to check whether Series, DataFrame, or Panel objects are equivalent. Some of these functions include assert_equal(..), assert_series_equal(..), assert_frame_equal(..), and assert_panelnd_equal(..). The pandas.util.testing module is especially useful to the contributors of the pandas code base. It defines a util.TestCase class. It also provides utilities for handling locales, console debugging, file cleanup, comparators, and so on for testing by potential code base contributors.
  • terminal.py: This module is mostly internal and has to do with obtaining certain specific details about the terminal. The single exposed function is get_terminal_size().
  • print_versions.py: This defines the get_sys_info() function that returns a dictionary of systems information, and the show_versions(..) function that displays the versions of available Python libraries.
  • misc.py: This defines a couple of miscellaneous utilities.
  • decorators.py: This defines some decorator functions and classes.

    Tip

    The Substitution and Appender classes are decorators that perform substitution and appending on function docstrings. For more information on Python decorators, go to http://bit.ly/1zj8U0o.

  • clipboard.py: This contains cross-platform clipboard methods to enable the copy and paste functions from the keyboard. The pandas I/O API includes functions such as pandas.read_clipboard() and DataFrame.to_clipboard(..).
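The special assert functions mentioned for testing.py can be sketched as follows. This example assumes a recent pandas release, where assert_frame_equal and assert_series_equal are importable from pandas.testing rather than pandas.util.testing; the frames are made-up data.

```python
import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal

expected = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
result = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Passes silently when values, dtypes, index, and columns all match.
assert_frame_equal(result, expected)

# Raises AssertionError describing the mismatch when a value differs.
changed = result.assign(b=[0.5, 9.9])
try:
    assert_frame_equal(changed, expected)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

# check_dtype=False compares values while ignoring the int-versus-float dtype.
assert_series_equal(pd.Series([1, 2]), pd.Series([1.0, 2.0]), check_dtype=False)

print(mismatch_detected)
```

Compared with a plain equality check, these functions report which column or element differs, which is what makes them convenient in unit tests.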

pandas/rpy

This module attempts to provide an interface to the R statistical package if it is installed on the machine. It is deprecated as of version 0.16.0. Its functionality is replaced by the rpy2 module, which can be accessed from http://rpy.sourceforge.net.

  • base.py: This defines a class for the well-known lm function in R
  • common.py: This provides many functions to enable the conversion of pandas objects into their equivalent R versions
  • mass.py: This is an unimplemented version of rlm, R's robust linear model function
  • var.py: This contains an unimplemented class VAR

pandas/tests

This is the module that provides many tests for various objects in pandas. The names of the specific library files are fairly self-explanatory, so I will not go into further detail here; the reader is invited to explore them.

pandas/compat

The functionality related to compatibility is explained as follows:

  • chainmap.py, chainmap_impl.py: This provides a ChainMap class that can group multiple dicts or mappings, in order to produce a single view that can be updated
  • pickle_compat.py: This provides functionality for pickling pandas objects in the versions that are earlier than 0.12
  • openpyxl_compat.py: This checks the compatibility of openpyxl
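The behavior of a ChainMap can be sketched with the class of the same name from Python's standard library, which is used here as a stand-in for the compatibility class described above. The option names in the two dictionaries are hypothetical.

```python
from collections import ChainMap

defaults = {"sep": ",", "header": 0}
overrides = {"sep": "\t"}

# Lookups search the mappings left to right; the first hit wins.
options = ChainMap(overrides, defaults)
sep = options["sep"]        # found in overrides
header = options["header"]  # falls through to defaults

# Writes go to the first mapping only; the underlying dicts stay separate.
options["header"] = None

print(repr(sep), header)
print(overrides, defaults)
```

Because the view is built over the original dictionaries rather than a copy, later changes to either dictionary are visible through the ChainMap as well.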

pandas/computation

This is the module that provides functionality for computation and is discussed as follows:

  • api.py: This contains imports for eval and expr.
  • align.py: This implements functions for data alignment.
  • common.py: This contains a couple of internal functions.
  • engines.py: This defines AbstractEngine, NumExprEngine, and PythonEngine. PythonEngine evaluates an expression and is used mainly for testing purposes.
  • eval.py: This defines the all-important eval(..) function and also a few other important functions.
  • expressions.py: This provides fast expression evaluation through numexpr. The numexpr library is used to accelerate certain numerical operations. It uses multiple cores as well as smart chunking and caching speedups. The module defines the evaluate(..) and where(..) methods.
  • ops.py: This defines the operator classes used by eval. These are Term, Constant, Op, BinOp, Div, and UnaryOp.
  • pytables.py: This provides a query interface for the PyTables query.
  • scope.py: This is a module for scope operations. It defines a Scope class, which is an object to hold scope.

Tip

For more information on numexpr, go to https://code.google.com/p/numexpr/. For information of the usage of this module, go to http://pandas.pydata.org/pandas-docs/stable/computation.html.
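String-based expression evaluation can be sketched as follows. This example assumes a recent pandas release in which eval is exposed at the top level and as DataFrame.eval and DataFrame.query; it passes engine="python" so that it does not depend on numexpr being installed. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.eval evaluates a string expression against the frame's columns.
total = df.eval("a + b", engine="python")

# DataFrame.query uses the same machinery to filter rows by a Boolean expression.
subset = df.query("a > 1 and b < 30", engine="python")

# The top-level eval(..) can take its names from an explicit mapping.
scaled = pd.eval("a * 2", local_dict={"a": df["a"]}, engine="python")

print(total.tolist())
print(subset)
print(scaled.tolist())
```

With the numexpr engine, the same expressions are handed to numexpr for evaluation, which mainly pays off on large frames.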

pandas/tseries

  • api.py: This is a set of convenience imports
  • converter.py: This defines a set of classes that are used to format and convert datetime-related objects. Upon import, pandas registers a set of unit converters with matplotlib.
    • This is done via the register() function explained as follows:
      In [1]: import matplotlib.units as munits
      In [2]: munits.registry
      Out[2]: {}
      
      In [3]: import pandas
      In [4]: munits.registry
      Out[4]: 
      {pandas.tslib.Timestamp: <pandas.tseries.converter.DatetimeConverter instance at 0x7fbbc4db17e8>,
       pandas.tseries.period.Period: <pandas.tseries.converter.PeriodConverter instance at 0x7fbbc4dc25f0>,
       datetime.date: <pandas.tseries.converter.DatetimeConverter instance at 0x7fbbc4dc2fc8>,
       datetime.datetime: <pandas.tseries.converter.DatetimeConverter instance at 0x7fbbc4dc2a70>,
       datetime.time: <pandas.tseries.converter.TimeConverter instance at 0x7fbbc4d61e18>}
      
    • Converters: These classes include TimeConverter, PeriodConverter, and DatetimeConverter
    • Formatters: These classes include TimeFormatter, PandasAutoDateFormatter, and TimeSeries_DateFormatter
    • Locators: These classes include PandasAutoDateLocator, MilliSecondLocator, and TimeSeries_DateLocator

    Note

    The Formatter and Locator classes are used for handling ticks in matplotlib plotting.

  • frequencies.py: This defines the code for specifying frequencies—daily, weekly, quarterly, monthly, annual, and so on—of time series objects.
  • holiday.py: This defines functions and classes for handling holidays; Holiday, AbstractHolidayCalendar, and USFederalHolidayCalendar are among the classes defined.
  • index.py: This defines the DatetimeIndex class.
  • interval.py: This defines the Interval, PeriodInterval, and IntervalIndex classes.
  • offsets.py: This defines various classes including Offsets that deal with time-related periods. These are explained as follows:
    • DateOffset: This is an interface for classes that provide the time period functionality such as Week, WeekOfMonth, LastWeekOfMonth, QuarterOffset, YearOffset, Easter, FY5253, and FY5253Quarter.
    • BusinessMixin: This is the mixin class for business objects to provide functionality with time-related classes. This will be inherited by the BusinessDay class. The BusinessDay subclass is derived from BusinessMixin and SingleConstructorOffset and provides an offset in business days.
    • MonthOffset: This is the interface for classes that provide the functionality for month time periods such as MonthEnd, MonthBegin, BusinessMonthEnd, and BusinessMonthBegin.
    • MonthEnd and MonthBegin: This is the date offset of one month at the end or the beginning of a month.
    • BusinessMonthEnd and BusinessMonthBegin: This is the date offset of one month at the end or the beginning of a business day calendar.
    • YearOffset: This offset is subclassed by classes that provide year period functionality—YearEnd, YearBegin, BYearEnd, BYearBegin
    • YearEnd and YearBegin: This is the date offset of one year at the end or the beginning of a year.
    • BYearEnd and BYearBegin: This is the date offset of one year at the end or the beginning of a business day calendar.
    • Week: This provides the offset of 1 week.
    • WeekDay: This provides a mapping from weekday (Tue) to day of week (=1), with Monday as 0.
    • WeekOfMonth and LastWeekOfMonth: This describes dates in a week of a month
    • QuarterOffset: This is subclassed by classes that provide quarterly period functionality: QuarterEnd, QuarterBegin, BQuarterEnd, and BQuarterBegin.
    • QuarterEnd, QuarterBegin, BQuarterEnd, and BQuarterBegin: These are the same as the Year* classes except that the period is a quarter instead of a year.
    • FY5253, FY5253Quarter: These classes describe a 52-53 week fiscal year. This is also known as a 4-4-5 calendar. You can get more information on this at http://en.wikipedia.org/wiki/4–4–5_calendar.
    • Easter: This is the DateOffset for the Easter holiday.
    • Tick: This is the base class for Time unit classes such as Day, Hour, Minute, Second, Milli, Micro, and Nano.
  • period.py: This defines the Period and PeriodIndex classes for pandas TimeSeries.
  • plotting.py: This defines various plotting functions such as tsplot(..), which plots a Series.
  • resample.py: This defines TimeGrouper, a custom groupby class for time-interval grouping.
  • timedeltas.py: This defines the to_timedelta(..) method, which converts its argument into a timedelta object.
  • tools.py: This defines utility functions such as to_datetime(..), parse_time_string(..), dateutil_parse(..), and format(..).
  • util.py: This defines more utility functions as follows:
    • isleapyear(..): This checks whether the year is a leap year
    • pivot_annual(..): This groups a series by years, accounting for leap years

pandas/sandbox

This module handles the integration of pandas DataFrame into the PyQt framework. For more information on PyQt, go to
