Chapter 1. The Rise of Pentaho Analytics along with Big Data

Pentaho, headquartered in Orlando, has a team of BI veterans with an excellent track record. In fact, Pentaho is the first commercial open source BI platform, and it became popular quickly because of its seamless integration with many third-party software products. It can comfortably talk to data sources such as MongoDB, OLAP tools such as Palo, and Big Data frameworks such as Hadoop and Hive.

The Pentaho brand has been built up over the last 9 years to help unify and manage a suite of open source projects that provide alternatives to proprietary BI software vendors. To name a few, these open source projects include Kettle, Mondrian, Weka, and JFreeReport. This unification helped to grow Pentaho's community and gave it a centralized home. Pentaho claims that its community stands somewhere between 8,000 and 10,000 members strong, a fact that aids its ability to stay afloat offering just technical support, management services, and product enhancements for its growing list of enterprise BI users. In fact, this is how Pentaho mainly generates revenue for its growth.

For research and innovation, Pentaho has its own "think tank", named Pentaho Labs, which works on breakthroughs in Big Data-driven technologies in areas such as predictive and real-time analysis.

The core of the business intelligence domain is always the underlying data. The first attempt to quantify the growth rate of the volume of data dates back about 70 years, under the name "information explosion". According to the Oxford English Dictionary, this term was first used in 1941. By 2010, this industrial revolution of data had gained full momentum, fueled by social media sites, and scientists and computer engineers coined a new term for the phenomenon: "Big Data". Big Data is a collection of data sets so large and complex that it becomes difficult to process with conventional database management tools. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. As of 2012, the limits on the size of data sets that were feasible to process in a reasonable amount of time were on the order of exabytes (1 exabyte is 1 billion gigabytes) of data.
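To make the exabyte figure concrete, the conversion can be sketched in a few lines of Python. This is only an illustration; it assumes decimal (SI) prefixes, and the function name is our own choice.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: SI, not binary, prefixes).
GIGABYTE = 10**9
EXABYTE = 10**18


def exabytes_to_gigabytes(exabytes: float) -> float:
    """Convert a size in exabytes to gigabytes using decimal prefixes."""
    return exabytes * EXABYTE / GIGABYTE


if __name__ == "__main__":
    # One exabyte expressed in gigabytes.
    print(exabytes_to_gigabytes(1))
```

With binary prefixes (powers of 1,024) the ratio would be slightly larger, which is why the assumption is stated up front.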

Data sets grow in size partly because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, digital cameras, software logs, microphones, RFID readers, and so on, in addition to scientific research data such as micro-array analysis. One EMC-sponsored IDC study projected that the volume of data produced annually would grow nearly 45-fold by 2020!

So, with the pressing need for software to store this huge variety of data, Hadoop was born. To analyze this data, the industry needed an easily manageable, commercially viable solution that integrates with these Big Data tools. Pentaho has come up with a suite of software that addresses the challenges posed by Big Data.

Pentaho BI Suite – components

Pentaho is a trailblazer when it comes to business intelligence and analysis, offering a full suite of capabilities for the ETL (Extract, Transform, and Load) processes, data discovery, predictive analysis, and powerful visualization. It can be deployed on premise or in the cloud, or embedded in custom applications.

Pentaho is a provider of a Big Data analytics solution that spans data integration, interactive data visualization, and predictive analytics. As depicted in the following diagram, this platform contains multiple components, which are divided into three layers: data, server, and presentation:

Figure: Pentaho BI Suite – components

Let us take a detailed look at each of the components in the previous diagram.

Data

One of the biggest advantages of Pentaho is that it integrates with multiple data sources seamlessly. In fact, Pentaho Data Integration 4.4 Community Edition (referred to as CE hereafter) supports 44 open source and proprietary databases, flat files, spreadsheets, and other third-party software out of the box. Pentaho introduced the Adaptive Big Data Layer as part of the Pentaho Data Integration engine to support the evolution of the Big Data stores. This layer accelerates access to, and integration with, the latest versions and capabilities of the Big Data stores. It natively supports third-party Hadoop distributions from MapR, Cloudera, and Hortonworks, as well as popular NoSQL databases such as Cassandra and MongoDB. These new Pentaho Big Data initiatives bring greater adaptability, abstraction from change, and increased competitive advantage to companies facing the never-ceasing evolution of the Big Data ecosystem. Pentaho also supports analytic databases such as Greenplum and Vertica.

Server applications

The Pentaho Administration Console (PAC) server in CE, or the Pentaho Enterprise Console (PEC) server in EE (Enterprise Edition), is a web interface used to create, view, schedule, and apply permissions to reports and dashboards. It also provides an easy way to manage security, scheduling, and configuration for the Business Analytics Server and Data Integration Server, along with repository management. The server applications are as follows:

  • Business Analytics (BA) Server: This is a Java-based BI platform with a report management system and a lightweight process-flow engine. This platform also provides an HTML5-based web interface for creating, scheduling, and sharing various BI artifacts such as interactive reports, data analyses, and custom dashboards. In CE, the equivalent application is called Business Intelligence (BI) Server.
  • Data Integration (DI) Server: This is a commercially available, enterprise-class server for ETL processes and Data Integration. It helps to execute ETL and Data Integration jobs smoothly. It also provides scheduling to automate jobs, and supports content management with the help of revision history and security integration.

Thin Client Tools

The Thin Client Tools all run inside Pentaho User Console (PUC) in a web browser (such as Internet Explorer, Chrome, or Firefox). Let's have a look at each of the tools:

  • Pentaho Interactive Reporting: This is a "What You See Is What You Get" (WYSIWYG) design interface used to build simple, ad hoc reports on the fly without having to rely on IT support. Any business user can design reports by connecting to the desired data source, using the drag-and-drop feature, and then applying rich formatting or using the existing templates.
  • Pentaho Analyzer: This provides an advanced, web-based OLAP viewer that works across multiple browsers and supports drag-and-drop. It is an intuitive analytical visualization application with the capability to filter and drill down further into business information, which is stored in its own Pentaho Analysis (Mondrian) data source. You can also perform other activities such as sorting, creating derived measures, and chart visualization.
  • Pentaho Dashboard Designer (EE): This is a commercial plugin that allows users to create dashboards with great usability. Dashboards can present a centralized view of key performance indicators (KPIs) and other business data movements, along with dynamic filter controls and customizable layouts and themes.

Design tools

Let's take a quick look at each of these tools:

  • Schema Workbench: This is a Graphical User Interface (GUI) for designing ROLAP cubes for Pentaho Analysis (Mondrian). It also provides data exploration and analysis capabilities for end BI users without requiring them to understand the MultiDimensional eXpressions (MDX) language.
  • Aggregation Designer: This tool uses the Pentaho Analysis (Mondrian) schema XML files, together with the database containing the underlying tables described by the schema, to generate pre-calculated, pre-aggregated answers. These greatly improve the performance of analysis work and of MDX queries executed against Mondrian.
  • Metadata Editor: This is a tool used to create logical business models, and it acts as an abstraction layer over the underlying physical data layer. The resulting metadata mappings are used by Pentaho's Interactive Reporting (the community-based Saiku Reporting) to create reports within the BA Server without any other external desktop application.
  • Report Designer: This is a banded report designing tool with a rich GUI; reports can also contain sub-reports, charts, and graphs. It can query and use data from a range of data sources, from text files to RDBMS to Big Data, which addresses the requirements of financial, operational, and production reporting. Reports can be executed standalone from the user console or used within a dashboard. Pentaho Report Designer has a reporting engine at its core, which accepts a .prpt template to process reports. This file is in ZIP format and contains the XML resources that define the report design.
  • Data Integration: This is also known as "Kettle", and consists of a core integration (ETL) engine and a GUI application that allows the user to design Data Integration jobs and transformations. It supports distributed deployment on a cluster or cloud environment as well as on single-node computers. Its Adaptive Big Data Layer supports different Big Data stores and insulates you from changes in Hadoop, so that you can focus on analysis without worrying much about modifications to the Big Data stores.
  • Design Studio: This is an Eclipse-based application and plugin that helps you create business process flows, using a special XML script to define action sequences (called xactions) and other forms of automation in the platform. Action sequences define a lightweight, result-oriented business flow within the Pentaho BA Server.
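
Since the Report Designer template is described above as a ZIP file holding XML resources, it can be inspected with any ZIP library. The following is a minimal sketch using Python's standard zipfile module. The helper names are our own, and the entry names in the stand-in archive are hypothetical; they are not taken from a real Pentaho report.

```python
import io
import zipfile


def list_xml_resources(archive) -> list[str]:
    """Return the sorted names of the XML entries in a ZIP-format report file.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as bundle:
        return sorted(name for name in bundle.namelist() if name.endswith(".xml"))


def build_demo_archive() -> io.BytesIO:
    """Build an in-memory stand-in archive. All entry names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("layout.xml", "<layout/>")
        bundle.writestr("datasources.xml", "<data/>")
        bundle.writestr("logo.png", b"not a real image")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Only the XML entries are listed; the binary resource is skipped.
    print(list_xml_resources(build_demo_archive()))
```

To try this against a real report, pass the path of a template saved from Report Designer to list_xml_resources instead of the demo archive.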