© Thomas W. Dinsmore 2016

Thomas W. Dinsmore, Disruptive Analytics, 10.1007/978-1-4842-1311-7_3

3. Open Source Analytics

The Disruptive Power of Free

Thomas W. Dinsmore

(1)Newton, Massachusetts, USA

In every software category, free and open source software is a growing presence. In this chapter, we address the following questions:

  • What is free and open source software?

  • How can free and open source software make business sense?

  • What are the leading free and open source software projects for analytics?

The most obvious attribute of free and open source software is something it lacks: software licensing fees. We will show, in this chapter, that free and open source software is viable, and that it is disrupting incumbents in the analytic software industry.

Open Source Fundamentals

The precise definition of open source software is a matter of some debate. We review the competing definitions first, then cover the fundamentals of open source software projects, including governance, licensing, code management and distribution, and the question of donated software.

Definitions

Words like “free” and “open” may seem unambiguous. In respect to software, however, they have multiple meanings.

Under a standard commercial software license model, the developer offers a license to use the software in return for a license fee. The license may be perpetual, or it may be limited to a specific term. There may be other restrictions as well: named users, specific computing devices, or specific applications, for example. The developer distributes the software in compiled form, often with a license key that prevents usage outside the limits of the software license. These measures protect the developer’s economic interest in the software intellectual property.

Free and open source software operates under a completely different model. A developer distributes the software source code itself together with a license granting rights to examine, modify, and redistribute the code. The developer asserts no economic claim to the software intellectual property and foregoes a license fee for use of the software.

Two organizations, the Open Source Initiative (OSI) and the Free Software Foundation (FSF), define “free software” and “open source software” in slightly different ways:

  • The Free Software Foundation publishes the Free Software Definition, which defines software to be free if the user can run, study, redistribute, and distribute modified versions of the software.

  • The Open Source Initiative publishes the Open Source Definition, which details ten criteria for open source software, including access to source code, right to redistribute, and so forth.

The Free Software Definition defines a set of rights; if the software user is able to exercise those rights, the software is “free”. (“Free” in the sense of “liberated” and not simply “at no cost”.) The Open Source Definition defines a set of characteristics; if the software has those characteristics, it is “open source”. Neither the Free Software Definition nor the Open Source Definition explicitly states that developers may not charge license fees; rather, the free redistribution of source code makes charging them impractical.

Suppose that Mary releases source code for some software under an open source license. Joe takes Mary’s source code and redistributes it unmodified at a price of $100. Customers will soon figure out that they do not need to pay Joe for something they can get from Mary for free. If Joe conceals the source of the software by issuing it under a commercial license, Mary can sue Joe for infringing on her open source license.

The differences between the two organizations are primarily philosophical. The Free Software Foundation stems from the Free Software Movement, which argues that proprietary software and intellectual property rights in general “ought” not to exist, a perspective that is inherently political. Founded in 1985, the Free Software Foundation sponsors the GNU Project, a mass collaboration project.

In contrast to the Free Software Foundation, the Open Source Initiative takes the view that open source is a better and more practical way to develop software, and eschews political radicalism. Political differences aside, there are few practical differences between the two, and most open source licenses meet the criteria of both organizations.

While all open source software is “free” in the sense that anyone can acquire the source code without paying a license fee, not all “free” software is open source. Commercial software vendors can and do offer closed source, or proprietary, software at no charge to the user. They may do this to build awareness of a commercial software product, to permit evaluations, or under a dual licensing model, which we discuss later in this chapter.

In a similar vein, not all software labeled as “open” is “open source”. Commercial vendors sometimes label software as “open” because it has a published API, or because it implements an open standard like ANSI SQL. A published API or standards compliance does not make a closed product open; software is free and open if and only if it is distributed under a free and open source license.

While respecting the differences between the Free Software Foundation and the Open Source Initiative, in this chapter and throughout the book we will use the term “open source” to mean software that complies with the definitions of both organizations.

Project Governance

Open source software projects , like any other project, require an organization structure with clear accountabilities. They also need a legal framework to take ownership of software assets and issue licenses to users. Larger projects, such as R and Python, have their own governance framework; we discuss these separately later in this chapter. In this section, we review two entities, the Apache Software Foundation and the Eclipse Foundation, which together account for many widely used open source projects.

The Apache Software Foundation (ASF) is a 501(c)(3) charitable organization funded by individual and corporate donations, whose mission is to provide software for the public good. To support this mission, ASF provides a legal framework for intellectual property, accepting donated and contributed software and distributing software under a free and open source license. ASF currently supports more than 350 open source projects. In 2013, the latest year for which its IRS filings are available, ASF reported donations of $1.1 million and operating expenses of $653,000.

Each ASF project operates under a Project Management Committee (PMC), whose membership is elected from among the committers to the project. The PMC oversees the project, defines release strategy, sets community and technical direction, and manages individual releases. PMCs are responsible for ensuring that the project follows core requirements set by ASF, such as rules governing legal aspects and branding.

Apache projects include contributors and committers. Contributors support the project in various ways, but only committers can create new revisions in the source repository.

Apache projects usually start as Incubator projects. During the Incubator phase, the project establishes its PMC, ensures compliance with Apache legal standards, and begins to build a community. When a project meets a defined set of project goals, it graduates to Apache top-level status. As of April 2016, there are 56 projects in Incubator status; of these, 20 have been in the program for more than a year. Over the life of the Incubator program, 159 projects graduated to top-level status through April 2016; 42 projects retired before graduating, mostly due to inactivity.

The Eclipse Foundation is a 501(c)(6) not-for-profit member supported corporation that supports and maintains Eclipse, an open source software development platform. The foundation also supports BIRT, a business intelligence platform discussed later in this chapter.

Open Source Licenses

Prior to implementation of the Berne Convention in 1988, software distributed without a copyright notice passed into the public domain. Under the Berne Convention, copyright attaches to software automatically when it is created. The Berne Convention also defined long time periods, or terms, for copyrights; since 1988, there are no known examples of software that have reached the end of copyright and passed into the public domain. Software already in the public domain in 1988, such as the BASIC programming language, remains in the public domain.

Since copyright attaches automatically to software, open source licenses are needed to explicitly waive the copyright privilege for the user. The Open Source Initiative (OSI) and the Free Software Foundation (FSF) issue separate guidelines for open source and free software licenses, respectively.

According to Black Duck Software, a privately held company that tracks open source software projects, there are more than 2,400 unique software licenses currently in use. The most popular1 licenses are the MIT license, GNU General Public License (GPL) 2.0 and 3.0, the Apache License 2.0, and the BSD License 2.0. All of these licenses meet both the OSI and FSF criteria.

MIT License: A free and permissive software license developed by the Massachusetts Institute of Technology (MIT). The MIT license explicitly grants the end user rights to use, copy, modify, merge, publish, distribute, sublicense, and sell the software to which it is attached.

A permissive software license grants the licensee the right to redistribute derived software under a different license. In other words, a developer can modify software obtained under a permissive license and redistribute it under a commercial license.

GNU General Public License: A free and non-permissive, or “copyleft” software license originally developed for the Free Software Foundation. GPL grants the licensee rights to run, study, redistribute, and improve the software to which it is attached. Version 2.0, released in 1991, and Version 3.0, released in 2007, differ in respect to detailed aspects of intellectual property law.

Copyleft, or restrictive software, licenses mandate that any software derived from software distributed under the license must be distributed under the same license. In other words, if a developer modifies software distributed under the GPL license, the modified software must also be distributed under the GPL license.

Apache Software License: A free and permissive software license developed by the Apache Software Foundation (ASF). All ASF projects distribute software under this license, and so do many other non-ASF projects. The license grants the user rights to use, distribute, modify, and redistribute derived software. While the user can distribute modified software under a different license, unmodified parts of the code must remain under the Apache license.

BSD License: A family of free and permissive software licenses developed in 1989 for the Berkeley Software Distribution, an operating system. The license exists in several versions (known as the 2-clause, 3-clause, and 4-clause licenses). Advocates for the BSD license argue that it is more compatible with proprietary licenses than copyleft licenses are. The original BSD license does not meet OSI standards, but the modified versions do.

Code Management and Distribution

Every open source project must decide how and where it should store source code, how to maintain versions and revisions, and how to distribute the code to prospective users.

Larger projects operate their own distribution platforms. The R Project, for example, operates the Comprehensive R Archive Network (CRAN), a network of more than 100 FTP and web servers around the world that store identical up-to-date versions of code and documentation for R. Similarly, the Python Software Foundation hosts a repository containing a reference implementation of Python. Commercial open source vendors generally operate their own distribution platforms.

Software projects in the early stages of development cannot afford to develop their own systems for code management. Two open source projects serve as key enablers to the open source community: Subversion and Git.

Apache Subversion is a software versioning and revision control system widely used by the open source software community and corporate users alike. CollabNet, a privately held application lifecycle management company, developed the original version of Subversion in 2000 and donated it to the Apache Software Foundation (ASF) in 2009. ASF distributes the software under an Apache license. While Subversion itself lacks a user interface, numerous commercial and open source clients, or Integrated Development Environments (IDEs), work with Subversion.

Git is a free and open source distributed version control system developed beginning in 2005 by a team working on the Linux kernel; the software is available under a GPL license. Git itself operates from a command-line interface. GitHub, a privately held startup founded in 2008, offers a web-based version of Git, together with other features such as access control, bug tracking, feature requests, and task management. GitHub claims more than 12 million users and 35 million projects.
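The core Git workflow is simple enough to sketch from the command line. The sketch below assumes Git is installed; the repository, file, and branch names are invented for illustration:

```shell
# Minimal Git workflow: create a repository, commit a change, branch.
set -e
repo=$(mktemp -d)               # scratch directory standing in for a project
cd "$repo"
git init -q .                   # create a new local repository
git config user.email "dev@example.com"
git config user.name "Dev"
echo 'print("hello")' > analysis.py
git add analysis.py             # stage the new file
git commit -q -m "Add analysis script"
git checkout -q -b feature-x    # start a topic branch for new work
git log --oneline               # the commit is now recorded on the branch
```

In a hosted setting such as GitHub, the repository would instead be obtained with `git clone`, and the topic branch pushed back for review; the commands above cover only the local core of the workflow.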

Donated Software

Few open source projects begin from scratch. In most cases, projects start with a software core developed either as an academic research project or as a commercial project. The copyright owners donate the source code to an entity committed to open source software, such as the Apache Software Foundation (ASF).

Examples of such donations include:

  • Spark, a distributed in-memory computing framework, donated to the ASF by the University of California in 2013.

  • Storm, a streaming analytics engine, donated2 to the ASF by Twitter in 2011.

  • Hawq, an SQL-on-Hadoop engine, donated3 to ASF by Pivotal Software in 2015.

  • Impala, an analytic database, donated4 to ASF by Cloudera in 2015.

  • SystemML, a machine learning language, donated5 to ASF by IBM in 2015.

The motivations behind such donations vary. In some cases, the original software developers deem the project unlikely to succeed as a commercial venture. Rather than investing further, the original developer donates the project to open source, takes a tax deduction for development costs, and harvests some goodwill.6

Donations may also be motivated by a change in strategic direction. Pivotal, for example, donated most of its software assets to open source when it changed its business model from software development to services delivery. In the two years prior to this change, the company earned substantially more from services revenue than it did from software licensing.

Corporate acquisitions can also trigger software donations when the acquiring company concludes that the software is not part of its core business. Twitter acquired the Storm software assets when it acquired Backtype in 2011; the company donated the software to open source soon thereafter.

In the case of academic software projects, universities rarely have either the desire or the infrastructure in place to manage software projects outside of the university’s core mission. Developers at the University of California, Berkeley’s AMPLab developed Spark as a research project; the University held the software copyright, but donated it to ASF.

A company’s decision to donate a software project may or may not convey information about the project’s potential value. Apache Hive, software for data warehousing on Hadoop, is a highly successful project. Facebook developed the original software and donated7 it to ASF in 2008; since then, many other contributors have enhanced it. Hortonworks has invested in the Stinger project to improve the software, which is widely used and included in every commercial Hadoop distribution.

Hadoop is an open source framework for distributed computing and storage. We discuss Hadoop and its ecosystem in Chapter Four.

On the other hand, the bones of dead donated projects litter the open source world. Apache SAMOA, for example, a stream processing framework donated8 to ASF by Yahoo in 2013, remains in Incubator status today, with just seven contributors over the lifetime of the project.

The Business of Open Source

Under a proprietary licensing model, the software developer invests time and money to develop software with the expectation that future revenue from software licensing fees will recoup the investment and return a profit. Once developed, the marginal cost to deliver a copy of the software to a new customer is minimal; hence successful software products deliver extraordinary returns on investment.

Copyright laws grant the developer exclusive rights to reproduce the software; hence, the developer has an economic monopoly in that product. The developer seeks to maximize the value of these rights by positioning the product to deliver unique benefits to the customer, for which there are no substitutes. This differentiation may be through software features, documentation, technical support, training, or through branding and marketing.

The customer’s initial selection of the software takes place in a competitive market. While the developer tries to position the software uniquely, customers usually have several options from which they can choose and bargain for the best possible price.

Enterprise software, however, is complex, challenging to implement, and may require significant customization; thus, an organization that chooses to standardize on a proprietary software product risks vendor lock-in, a situation where the customer lacks bargaining power due to the high costs of switching.

A perpetual license, where the developer grants the customer a permanent right to use the software, is one response to the lock-in problem. Under a perpetual license, the customer pays the developer a one-time fee during the initial software selection, when the customer has the most bargaining power; in return, the developer grants a license to use the software “forever”.

There are two issues with perpetual licenses. First, the high initial cost leads to very long evaluation and purchase cycles, and correspondingly low rates of adoption. The second issue relates to ongoing software development and maintenance. There are few cases where a software product, once developed, remains state of the art for very long. Customers expect software to improve continuously; this requires an ongoing partnership with the software developer. Rather than aligning the interests of developers and users, perpetual licenses create an incentive for software vendors to ignore customer needs once the product is sold.

Aside from the slower rate of adoption, proprietary software licensing creates three other issues that inhibit innovation.

First, there is the need for license keys and other mechanisms that protect the developer’s economic interest in the software by preventing unauthorized (i.e., unlicensed) use. These mechanisms are not only annoying to the user; they can also make it difficult to integrate software into an enterprise.

A second issue is the need to distribute the software in compiled form, so the user cannot inspect the source code. This is a critical limitation in analytics, where users rely on developers to implement algorithms accurately, and where small coding errors can produce spurious results.

Third, the proprietary licensing relationship places the developer and the user in an arms-length and even adversarial relationship. If the user is locked in to the software and switching is expensive, the developer has little or no incentive to add a requested feature. Prospective customers, who are not locked in yet, have much more leverage than existing customers when the developer sets priorities for enhancements.

The most obvious attribute of open source software is the absence of software licensing fees. While this is a clear benefit to the user, it raises an obvious question: without the possibility of earning revenue from license fees, why would anyone create software in the first place?

The motivations are mixed. In some cases, altruism and politics are clear motivators. Software developers may be inspired by a sense of purpose or a desire to create a better world. At least some contributions to open source projects are so motivated.

Alternatively, the developer may believe that if the software project succeeds and secures a high level of adoption, there will be ancillary opportunities to generate revenue through customization, services, training, and so forth. We discuss these approaches later in this chapter.

Without license fees, there are no barriers to trial, adoption, and use. Prospective users for open source software need not endure an extended and adversarial negotiation where they often lack the information they need to bargain in their own best interest. Instead, prospective users simply inspect the software and conduct a trial. If they are concerned about ensuring that the internal math is correct, they simply inspect the code.

With a higher rate of adoption and use than commercially licensed software, open source developers benefit from rapid and copious user feedback. Moreover, since users can donate enhancements back to the project, open source software has a potentially much larger pool of contributors. Due to the combination of these two factors, open source software projects tend to develop much more rapidly than proprietary software projects.

Open source languages are the best choice for custom development, for two reasons. First, the open and freely available source code enables enterprises to readily integrate analytic applications with other production applications. Second, the business model for open source software ensures that the enterprise fully realizes the value created by its investment in custom development.

Most commercial analytic software packages include proprietary programming languages; the SAS Programming Language (SPL) is an example. While courts have declared that the SAS Programming Language is in the public domain, code written in SPL requires a commercial runtime compiler to execute, which the user must license from SAS or a third party. This makes SPL a poor choice for developers and deprives it of the disruptive potential of open source languages.

Community Open Source

Under a community development model , a project has a broad base of contributors, who work independently with minimal guidance from a central authority. Many users are also contributors. The project itself has standards that govern code submission, as well as test protocols that limit the ability of bad actors to submit malicious code.

Community projects tend to choose one of two organizing models9:

  • Under a top-down model, the project distributes source code with each release, but only a core group of developers can access and modify code between releases.

  • Under a bottom-up model, the source code is accessible to all developers at all times.

Under a compromise model, the project has two tiers: a core platform and packages that run on the core platform. The project’s governing body exercises tight control over submissions to the core platform, but minimal direct control over package submissions. A core team assumes responsibility for enhancements to the core platform, while package developers take responsibility for the packages they publish.

Commercial ventures operate on the periphery of a community open source project, offering support, consulting, education, and training. These ventures may or may not operate under the sanction of the project’s governing body. They tend to have relatively little influence on the overall direction of the project.

Experienced software engineer and open source theorist Eric S. Raymond advocates for the bottom-up model in The Cathedral and the Bazaar, arguing that more eyes on the software speed development, improve quality, and expedite problem resolution. On the other hand, the absence of a strong central authority to establish design standards, for example, leads to complicated and inconsistent approaches to solving similar problems, making the end product difficult to navigate and use.

The best examples of community open source in analytics are the R Project and Python, which we discuss in depth later in this chapter. While Apache Hadoop is a community open source project, most organizations use commercially supported products based on Apache Hadoop, which are quite different from the open source core.

Commercial and Hybrid Open Source

Under purely commercial open source models, a commercial venture seeks to define a sustainable business model while operating within the constraints of an open source software licensing model. The venture controls project governance and most contributors are employees of the same venture. There are two distinct types of commercial open source models: the open core model and the services model.

Firms that operate under an open core model offer multiple software editions with at least one edition available under an open source license and at least one edition available under a commercial license. Typically, the commercial version has additional features that are not present in the open source edition. This model enables prospective customers to evaluate the basic software at no charge and without restriction, and provides a revenue stream from customers who choose the more feature-rich edition.

Examples of this model include:

  • Talend offers Talend Open Studio as open source software and several other commercially licensed software products.

  • Oracle offers the open source Oracle R Distribution, and the commercially licensed Oracle R Enterprise, which includes additional features.

Commercial ventures operating under a services model distribute software exclusively under open source licenses and sell services to users. Services may consist of cloud services, technical support subscriptions, training, and professional consulting services for implementation or custom development.

Examples of this model include:

  • H2O.ai distributes H2O, an open source machine learning software package, and it sells subscription services into the user base10.

  • Google distributes TensorFlow, an open source Deep Learning software package, and it sells a managed service for the software on its Cloud Platform.

Under a hybrid business model, the commercial venture does not control project governance, but exerts strong influence over it through roles on the project’s governing body. Employees of the commercial venture make important contributions to the open source project, but so do others.

The best examples of this are the Apache projects, where the Apache Software Foundation’s governance model prohibits exclusive control by a commercial venture:

  • Databricks leads development of Apache Spark and offers cloud services, training, certification, and conferences.

  • Hortonworks exercises strong influence over the direction of the Apache Hadoop project, and it offers its own open source Hadoop distribution together with services and training.

Overall, commercial ventures play a key role in open source software by promoting interest in the project and providing enterprises with the services necessary to drive value. However, there are numerous examples of widely used open source projects without a commercial ecosystem.

Open Source Analytics

Open source business models are pervasive among emerging analytics technologies. Consequently, we cover specific open source projects in other chapters:

  • Chapter Four: Apache Hadoop and its ecosystem.

  • Chapter Five: Apache Spark and other in-memory platforms.

  • Chapter Six: Streaming analytics, including Apache Flink, Apache Storm, and other packages.

  • Chapter Eight: Machine learning and Deep Learning, including CNTK, DL4J, H2O, TensorFlow, Theano, and other packages.

Among relational databases, open source MySQL and PostgreSQL ranked11 second and fifth, respectively, in the DB-Engines Ranking in April 2016. Open source NoSQL databases MongoDB, Cassandra, and Redis ranked fourth, eighth, and ninth. Overall, open source databases account for five of the top ten most popular databases.

DB-Engines measures12 database popularity by tracking the number of mentions on Google and Bing, search interest in Google Trends, frequency of technical discussions on Stack Overflow and DBA Stack Exchange, job offers on Indeed and Simply Hired, profile mentions in LinkedIn, and Twitter mentions.

In January 2013, open source databases accounted for 36% of the total popularity measured by DB-Engines; in April 2016, they captured 45% of total measured database popularity. In certain database categories, including wide column stores, graph databases, document stores, time series databases, key-value datastores, and search engines, open source dominates.

Three commercially driven open source projects offer integrated platforms for business intelligence: Jaspersoft, Pentaho, and Talend. All three operate under an open core model and offer commercially licensed editions with additional features.

  • Jaspersoft Community includes tools for ETL and reporting (including BI for mobile devices and OLAP).

  • Pentaho Community offers tools for business analytics, data integration, reporting, aggregation, schema definition, and metadata management.

  • Talend Open Studio includes capabilities for scalable ETL, data quality, and master data management.

The Business Intelligence and Reporting Tools project, or BIRT, a community open source project, is a top-level project of the Eclipse Foundation. BIRT’s functionality includes a report designer and report execution engine. Actuate, a subsidiary of OpenText, provides technical support and consulting, but BIRT is independently governed.

There are two commercially driven open source projects for advanced analytics: RapidMiner and KNIME. Both projects started in Europe as academic projects and both offer visual interfaces for the business user. RapidMiner commercially licenses its most current software version and distributes prior releases under a free and open source license. KNIME distributes its base platform under an open source license and offers extensions under a commercial license. We discuss these projects separately in Chapter Nine on self-service analytics.

In this chapter, we cover three analytic programming languages: R, Python, and Scala. R is a tool developed by statisticians and analysts expressly for analysis; Python is a general-purpose programming language with rich analytics capability. Scala is an elegant programming tool most notable for its strong Spark APIs.

While R’s analytic functionality exceeds what is currently available in Python, Python is catching up quickly. At present, more people use R than Python for analytics, but that is also changing rapidly.

Licensing is a key differentiator between R and Python. R’s GPL license is a “poison pill” for commercial developers, as products derived from R can only be redistributed as open source software. This is not an issue for Python, as the Python license is permissive. Python’s governance model is also more open and broad-based, in contrast to R’s closed governance.

For analysts whose primary goal is insight, R’s breadth of analytic tools and visualization capabilities make it the preferred choice. For analytic developers, on the other hand, whose goal is to build applications with embedded analytics, Python’s general-purpose functionality makes it the best choice.
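To make the contrast concrete, the kind of exploratory analysis both languages handle can be sketched in pure Python using only the standard library; the data below is invented for illustration:

```python
# Summary statistics and a simple least-squares fit, standard library only.
from statistics import mean, stdev

ad_spend = [10.0, 12.5, 9.0, 14.0, 16.5, 11.0]   # invented sample data
revenue  = [51.0, 60.2, 46.5, 66.1, 75.0, 54.3]

# Summary statistics
print(f"mean spend: {mean(ad_spend):.2f}, sd: {stdev(ad_spend):.2f}")

# Least-squares slope and intercept for revenue ~ spend
mx, my = mean(ad_spend), mean(revenue)
sxy = sum((x - mx) * (y - my) for x, y in zip(ad_spend, revenue))
sxx = sum((x - mx) ** 2 for x in ad_spend)
slope = sxy / sxx
intercept = my - slope * mx
print(f"revenue ~ {intercept:.2f} + {slope:.2f} * spend")
```

In R, the same fit is a one-liner with `lm()`, and rich diagnostics and plots come built in; in Python, an analyst would typically reach for third-party packages to get comparable depth, which is the trade-off the preceding paragraph describes.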

While there is growing interest in Scala among developers and data scientists with a software engineering background, its native analytics capabilities are still limited. We include Scala in this chapter primarily because it is one of the principal interfaces to the Spark platform for distributed analytics.

The R Project

The R Project (“R”) is a popular free open source language for statistical analysis, graphics, and advanced analytics. It runs on a variety of platforms and supports a wealth of functionality. While R has limits, it is so widely used by researchers and statisticians that many call it the lingua franca of advanced analytics.

Ross Ihaka and Robert Gentleman wrote the original code base for R in 1993, using the syntax of the S programming language. In 1995, they published the source code under the Free Software Foundation’s GNU license13.

User interest grew steadily as contributors ported existing packages to R and developed new features from scratch. Ihaka and Gentleman established the R Core Development Team in 1997 to lead ongoing enhancement to the core software environment. The R Core Development Team develops the R core software, while individual developers contribute packages with specific features. The code base is diverse; as of Release 2.13.1, 22% of the code is in R itself, 52% is in C, and 26% is in Fortran.14

In 2002, Ihaka and Gentleman donated the software to the R Foundation for Statistical Computing, a not-for-profit public interest organization located in Vienna, Austria. The foundation holds and administers the copyright and serves as an official voice for the project. Governance of the foundation rests in a self-selecting body of Ordinary Members, who are selected for their (non-monetary) contributions to the project; as of this writing, there are 29 Ordinary Members from the United States, Canada, United Kingdom, Norway, Denmark, Germany, France, Switzerland, Italy, Austria, India, New Zealand, and Australia.15

The R Foundation distributes R under the GNU license . This license makes it difficult for commercial software developers to add value to R, as any modifications or enhancements become part of the free distribution. Developers who distribute commercial applications built with R must distribute the enhanced source code. This requirement does not apply to enterprises who build applications with R for internal use only.

R supports analytic projects from beginning to end, including:

  • Data import and export, including interfaces with most commercial and open source databases

  • Custom programming, including conditionals, loops, and recursive operations

  • Data management and storage

  • Array and matrix manipulation

  • Exploratory analysis, discovery, and statistics

  • Graphics and visualization

  • Statistical modeling and machine learning

  • Content analytics

The R distribution includes 14 “base” packages that support basic statistics, graphics, and valuable utilities. Users may selectively add packages from CRAN or other archives. Due to the broad developer community and low barriers to contribution, the breadth of functionality available in R far exceeds that of commercial analytic software.

As of April 2016, there are 11,531 R packages in all major repositories worldwide, of which 8,239 are in the Comprehensive R Archive Network (CRAN), the most widely used R archive16. While these statistics demonstrate the astonishing breadth of capability included in R, they are misleading measures of usefulness. Since package developers work independently of one another, there is a great deal of overlap in functionality; a search on one repository for packages that support “linear regression” returns more than 50 packages. Package quality, documentation, and developer support are also uneven; hence, ordinary users tend to rely on a limited number of packages.

Open source R operates exclusively in memory. Lacking a capability for out-of-memory operations, R will fail if the user attempts to work with a data set that is larger than memory.

The plyr package provides the user with a framework to split large data sets, apply a function for each subset of data, and combine the results. The dplyr package extends this framework and provides a set of efficient data-handling tools together with interfaces to popular open source databases (such as PostgreSQL, MySQL, and Google BigQuery).
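The split-apply-combine pattern behind plyr and dplyr is not specific to R. A minimal sketch of the same idea in plain Python (the sales records here are hypothetical) shows the three steps:

```python
from collections import defaultdict

# Hypothetical sales records; the split-apply-combine pattern that
# plyr and dplyr popularized in R, sketched in plain Python.
sales = [("East", 100), ("West", 200), ("East", 150),
         ("West", 250), ("East", 300)]

# Split: group the rows by region.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Apply a summary function (here, sum) to each group, then combine
# the per-group results into a single structure.
totals = {region: sum(amounts) for region, amounts in groups.items()}
print(totals)  # {'East': 550, 'West': 450}
```

The value of the R packages is that they make this pattern declarative and efficient, and (in dplyr's case) push the "apply" step down into a database where possible.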

There are several other open source packages available in R for working with Big Data. These include:

  • Programming with Big Data in R (pbdR), a suite of packages designed to support a variety of methods for high-performance computing

  • Simple Network of Workstations (snow) supports parallel computing on a network of workstations using R

  • Rdsm implements a threads programming environment for R, either across clustered machines or on a single multicore machine

As a rule, these packages work best for embarrassingly parallel tasks. However, many tasks in predictive analytics are not embarrassingly parallel; for these tasks, a distributed platform is the better tool. We cover these later in the chapter.

Most Big Data platforms support an R interface so that an R user can pass commands to the data platform. These interfaces do not “make R run in the database”; they convert R commands to platform-specific commands and then return the result to the user as an R object. The quality, utility, and level of support for these interfaces varies considerably across platforms; most vendors do not support or warrant the R code itself.

Microsoft offers an enhanced open source R distribution and a commercially licensed version with additional features, especially the ability to analyze data that exceeds the size of the computer’s main memory. Microsoft provides technical support, training, and consulting services for organizations implementing R.

Oracle offers a free enhanced R distribution (Oracle R Distribution). It also bundles an enhanced version (Oracle R Enterprise) together with Oracle Data Mining in the Oracle Advanced Analytics Option for the Oracle Big Data Appliance. Oracle offers technical support for Oracle R Enterprise to customers who license the Advanced Analytics Option or the Big Data Appliance.

Tibco offers Tibco Enterprise Runtime for R (TERR), a commercial R implementation written from the ground up by professional programmers. As a result, TERR is generally faster and manages memory better than open source R; in particular, it optimizes loops written in the style of other languages so that they run faster than they do in open source R.

The core R distribution includes a bare-bones interface for interactive use and script development. Most users prefer to use an integrated development environment, or IDE; the most popular of these is RStudio. RStudio offers open source and commercially licensed versions of its software; the commercial version includes additional features for enterprise deployment.

It is difficult to say exactly how many people use R. In 2009, The New York Times reported17 a user base of 250,000; others estimate the user base in the millions.18

Analyst surveys generally show R to be among the most popular analytic tools available today, but the sampling methods for these surveys make it difficult to generalize them to the population at large.

Data mining web site KDnuggets.com regularly polls its readers; each year, it asks readers to identify analytic software tools used in the past 12 months “for actual projects”. In the 2016 poll, 49% of respondents said they had used R, a higher share than for any other tool.

Rexer Analytics conducts an annual survey of working data miners. Among analysts surveyed in the 2013 survey, 70% said they use R, up from 48% in 2011.

O’Reilly Media’s tracking survey of data scientists gathers information about salaries and tool usage among working data scientists. For the 12 months prior to September 2015, 52% of respondents reported using R, which ranked third behind SQL and Microsoft Excel.

Table 3-1 summarizes results from the three surveys.

Table 3-1. Analytic Programming Tool Usage: R

| Survey | Date | Question | Percent Use (R) | Rank |
| --- | --- | --- | --- | --- |
| KDnuggets19 | June 2016 | What analytics, Big Data, data mining, or data science software, have you used (in the) past 12 months for a real project? | 49% | 1 |
| Rexer20 | Q1 2013 | What (analytic) tools did you use in the past year? (Total) | 70% | 1 |
| O’Reilly Media21 | September 2015 | Which programming languages do you use (for data science)? | 52% | 3 |

The TIOBE Programming Community Index is an indicator of the relative popularity of programming languages; it ranks languages aimed at analytics alongside general-purpose languages. It is based on mind share, measured by search results and other indicators. As of May 2015, R ranks 12th out of 100 ranked languages22, up from 33rd in May 2014, and it is the top-ranked application-specific language associated with analytics. (Below the top ten, rankings in this index are volatile.)

R’s key strengths are its functionality, extensibility, and low cost of ownership. The free distribution eliminates barriers to entry and enables users to get started quickly.

R’s key weakness is its bazaar-like nature, which appears to the novice as a plethora of conflicting and redundant functionality, loose standards, and mixed quality. While experienced users tend to revel in R’s diversity and community development, users accustomed to commercial products may find it unattractive.

Python

Python is a scripting language whose syntax enables programmers to write efficient and concise code. While not as feature-rich for analytics as R, Python’s capabilities for scientific computing are expanding rapidly.

In 1989, Guido van Rossum started developing Python as a hobby project while he was working for the Centrum Wiskunde & Informatica (CWI) in the Netherlands.23 After using it internally at CWI, van Rossum published release 0.9.0 to alt.sources in 1991. With a growing base of users and contributors, the code base expanded steadily24, reaching 1.0 status in 1994, 2.0 status in 2000, and 3.0 status in 2008.

In 1995, Jim Hugunin of MIT developed Numeric, a Python extension module based on ideas from the matrix-sig interest group. Over the next several years, a group of Python users from the scientific and engineering communities contributed to Numeric and developed other packages for scientific computing. In 2003, Travis Oliphant, Eric Jones, and Pearu Peterson released the SciPy package, which offered standard numerical operations running on top of Numeric. Around the same time, Fernando Perez released the first version of IPython, an interactive development environment designed to serve the scientific community.25

The Python Software Foundation (PSF) , founded in 2001, owns the intellectual property rights to Python Releases 2.1 and higher and issues open source licenses under the Python Software Foundation License (PSFL) . PSFL is approved by the Open Source Initiative and the Free Software Foundation; it permits developers to modify the code and produce derivative works without publishing the source code. This makes Python an attractive development platform.

PSF produces the core Python distribution , which is written in C (CPython), and promotes development of the Python user and contributor community. The foundation’s board consists of 11 directors, elected annually by the voting members of the community.

As a general-purpose language, Python natively supports core capabilities needed in an analytic language, such as data import and program control. For advanced analytics, two packages (NumPy and SciPy) provide foundation functions. Together, NumPy and SciPy add an interactive shell, data handling tools, multidimensional arrays, sparse matrix handling, statistical functions, linear algebra, optimization, spatial analytics, and an interface to R.
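As a small sketch of what this foundation provides (the matrix and values here are illustrative), NumPy's array type and linear algebra routines solve a linear system in a few lines:

```python
import numpy as np

# Illustrative 2x2 system of linear equations:
#   2x + y = 3
#    x + 3y = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve A @ x = b with NumPy's linear algebra module.
x = np.linalg.solve(A, b)
print(x)  # [0.8 1.4]
```

SciPy builds on these arrays with higher-level routines for optimization, statistics, and signal processing.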

For data manipulation and analysis, many Python users work with pandas, a package designed to handle structured data. Pandas supports SQL-like operations, such as join, merge, group, insert, and delete. It also handles more complex operations, such as missing value treatment and time series functionality.
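A brief sketch of these SQL-like operations, using hypothetical order and customer tables:

```python
import pandas as pd

# Hypothetical tables; pandas supports SQL-like joins and aggregation.
orders = pd.DataFrame({"cust_id": [1, 2, 1],
                       "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"cust_id": [1, 2],
                          "name": ["Ann", "Bob"]})

# SQL-style inner join on the shared key, then group-by aggregation.
joined = orders.merge(customers, on="cust_id", how="inner")
spend = joined.groupby("name")["amount"].sum()
print(spend["Ann"])  # 15.0
```

The same API extends to missing-value handling (`fillna`, `dropna`) and time series resampling.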

The richest and most widely used Python package for advanced analytics is scikit-learn. This package includes algorithms for classification, regression, and clustering, including logistic regression, the naïve Bayes classifier, support vector machines, ensemble models (gradient boosting and random forests), and k-means. The package also includes tools for dimension reduction, model selection, and pre-processing.
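All scikit-learn estimators share a uniform fit/predict/score interface, which is a large part of the package's appeal; a minimal sketch using the iris dataset bundled with the library:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a dataset bundled with scikit-learn and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Every estimator exposes the same fit/predict/score API, so swapping
# in a different algorithm changes only the constructor line.
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)  # classification accuracy on the held-out data
```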

PyBrain is designed for use by entry-level Python users. It supports a variety of techniques for supervised and unsupervised learning, reinforcement learning, and black-box optimization. The package emphasizes neural network architectures.

Pattern is explicitly designed to support web mining. The package includes tools for web services, web crawling, and DOM parsing; natural language processing; machine learning; and network analysis.

Many Python packages support highly specialized and advanced analytics. A few examples include:

  • For anomaly detection and streaming analytics, the NuPic package supports Hierarchical Temporal Memory (HTM) algorithms, an extension of Bayesian techniques.

  • The Nilearn package provides multivariate analytics for Neuroimaging data.

  • Hebel supports GPU-accelerated Deep Learning neural networks with CUDA (through PyCUDA).

The growth rate of Python functionality is stunning. At the end of May, 2014, there were just over 44,000 packages in the Python Package Index (PyPI). A year later, at the end of May, 2015, there were just over 60,700 packages listed. Of these, 4,007 were tagged as Scientific/Engineering.

Like R, Python is memory constrained and cannot work with data sets that are larger than memory. As with R, an expert programmer can work around this constraint, but doing so negates some of the reasons for using a high-level language in the first place.

IPython.parallel provides an architecture for parallel and distributed computing, enabling the user to:

  • Visualize large distributed data sets with IPython

  • Parallelize execution for embarrassingly parallel tasks, such as model scoring

  • Write custom code to parallelize algorithms that are not embarrassingly parallel

Scalable machine learning platforms , such as Apache Spark and H2O, eliminate the need to write custom code in the latter case. These platforms support Python APIs.
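Model scoring is embarrassingly parallel because each record can be scored independently of the others. A rough sketch of the pattern using Python's standard library (with a stand-in scoring function rather than a real fitted model or the ipython.parallel API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in scoring function; in practice this would apply a fitted model
# to one record and return a predicted value.
def score(record):
    return 2 * record + 1

records = list(range(10))

# Because each record is scored independently, the work can be split
# across workers with no coordination between tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score, records))

print(scores[:3])  # [1, 3, 5]
```

Algorithms whose iterations depend on shared state (most model *training*, for example) do not decompose this cleanly, which is why distributed platforms are the better tool there.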

Most massively parallel processing (MPP) databases run Python scripts as table functions. IBM PureData, Pivotal Greenplum, and Teradata Aster support this capability; Teradata Database does not.

Continuum Analytics publishes Anaconda, a free Python distribution that includes a number of enhancements for scientific computing and predictive analytics. These include:

  • Pre-selected Python packages for science, math, engineering, and data analysis

  • An online repository with the most up-to-date version of each package

  • A set of plug-ins for Microsoft Excel

The commercial server version of Anaconda includes technical support, a private and secure package repository with a graphical user interface, customized installers and mirrors, and comprehensive licensing.

There is a wide variety of IDEs available for Python users. The IPython project (ipython.org) offers an architecture for interactive computing, including an interactive shell, browser-based notebook, visualization support, embeddable Python interpreters, and a framework for parallel computing.

Rodeo is a recently introduced IDE designed expressly for data science.

Python consistently ranks as one of the most popular programming languages measured by the TIOBE Programming Community Index . However, while this index tells us something about Python’s overall popularity, it says little about its popularity as an analytics language. Many Python users do not use it for analytics.

In the three analyst surveys (see Table 3-2) described earlier in this chapter, Python ranks below R but above all other languages except SQL. Analysts’ use of Python is growing rapidly, as is shown by the KDnuggets annual survey; in the most recent poll, reported Python use surged from 20% in 2014 to 46% in 2016.

Table 3-2. Analytic Programming Tool Usage: Python

| Survey | Date | Question | Percent Use (Python) | Rank |
| --- | --- | --- | --- | --- |
| KDnuggets | June 2016 | What analytics, Big Data, data mining, or data science software, have you used (in the) past 12 months for a real project? | 46% | 2 |
| Rexer26 | Q1 2013 | What (analytic) tools did you use in the past year? (Total) | 24% | 3 |
| O’Reilly Media27 | September 2015 | Which programming languages do you use (for data science)? | 51% | 4 |

As a production-capable scripting language, Python is an excellent tool for analytic applications. Python supports a strong testing framework, which enables straightforward code transition from development to deployment. Since Python is widely used among developers, its use by data scientists reduces or eliminates the cultural barrier that sometimes impedes predictive model deployment.

Python’s liberal open source license is another key strength. Developers may use, sell, or distribute Python-based applications without permission.

Compared to R and to end user analytic tools, visualization in Python is more difficult and less compelling. While Python’s statistics and machine learning capabilities are growing, it still falls short of R. Due to its history as a general-purpose scripting language, the Python community appears less attractive to business analysts and prospective users whose background is in statistics and data mining rather than software engineering.

Scala

Responding to limitations of the Java programming language, Martin Odersky started work on Scala while working at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland.28 In 2004, the Scala team released the software to the public on Java and .NET. Odersky started Typesafe, Inc. in 2011 to provide commercial support and training; in 2012, the venture raised $14 million from Greylock Partners and other venture capitalists.29

EPFL holds the copyright to all Scala development prior to 2011; Typesafe and EPFL jointly hold the rights to enhancements from 2011 and later. End user licenses are under the open source modified BSD license.

Scala’s native capabilities for analysis are immature. ScalaNLP is the most widely used Scala package for scientific computing and machine learning. The package includes several pertinent libraries:

  • Breeze supports numerical computing, linear algebra, optimization, and signal processing

  • Breeze-viz is used for visualization

  • Epic is a natural language processing component with parsing capabilities for eight languages: English, Basque, French, German, Hungarian, Korean, Polish, and Swedish

  • Junto supports semi-supervised learning, including label propagation, adsorption, and modified adsorption

  • Nak is a machine learning library that supports k-means clustering, logistic regression, support vector machines, naïve Bayes, and neural networks

  • Puck supports natural language processing on a GPU chip

Other packages for Scala include :

  • Bioscala supports bioinformatics, including DNA and RNA sequencing

  • Chalk is a library for natural language processing

  • Figaro is a package for developing probabilistic models

Data scientists using Scala can work with Apache Spark; interest in Spark is driving interest in Scala.

As of May, 2016, Scala ranks 32nd in the TIOBE Programming Community Index, well below R and Python. In the 2016 KDnuggets poll referenced previously in this chapter, 6% of respondents said they used Scala in the past year.

As an analytic programming language, Scala’s primary strength is its Spark API, which is stronger than its Python API. Scala’s main weakness is its lack of mature native analytic capabilities.

The Disruptive Power of Open Source

Open source business models disrupt established software markets in two ways.

First, in the absence of software license fees , open source software offers no initial barriers to trial. While commercial software vendors tend to develop increasingly complex and feature-rich products to justify their software license fees, open source software provides basic value at significantly lower overall cost. This is a classic example of “low-end disruption,” where the functionality of existing products overshoots what many potential customers can actually use.

For example, most commercial statistical packages include a dazzling array of techniques for statistics and machine learning; they cost anywhere from hundreds to thousands of dollars. But if a practitioner simply needs to use linear regression, open source R offers an excellent alternative.
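The economics are the same across open source languages: a basic linear regression takes only a few lines. A sketch in Python with NumPy, on made-up data points lying near the line y = 2x + 1:

```python
import numpy as np

# Made-up data points scattered around the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Fit a first-degree polynomial, i.e., simple linear regression
# by ordinary least squares.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # approximately 2.0 and 1.0
```

For a practitioner with needs this simple, the marginal value of a commercial package's thousands of additional features is close to zero.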

Second, due to open source software projects’ rapid cadence of development and close interaction with users, developers tend to introduce the most innovative techniques in open source software first. Commercial software vendors tend to set priorities for enhancements based on short-term revenue impact; enhancements catering to niche markets or new methods that may take some time to develop tend to take a lower priority. Open source software, by contrast, offers no barriers to entry for innovators.

As an example of this process in action, consider the case of Random Forests, a machine learning technique that is highly popular today and widely used. Leo Breiman’s initial paper detailing the technique first appeared30 in 2001. Soon thereafter, in April, 2002, developers ported Breiman’s Fortran code to R and published the randomForest package. It took another 10 years for SAS, the industry leader in commercial software for statistics and machine learning, to offer the technique in any of its products.
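Today the technique is a staple of open source libraries in every major language. As an illustration of how accessible it has become, a minimal random forest fit in Python with scikit-learn, on a dataset bundled with the library:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Fit Breiman's random forest algorithm on a bundled dataset and
# estimate out-of-sample accuracy with 5-fold cross-validation.
X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())  # mean cross-validated accuracy
```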

The inherently innovative nature of open source software development enables open source projects to provide solutions to problems that are not effectively addressed by industry incumbents. In the next chapter, we discuss open source Hadoop and its ecosystem, and how it has permanently disrupted the data warehousing industry.

Footnotes

6 Corporations may deduct the actual cost of software donated to charitable organizations. However, the Apache Software Foundation reported no non-cash contributions in 2013, the last year for which its IRS return is available.

9 For more details on the two models, see The Cathedral and the Bazaar, available at http://www.catb.org/~esr/writings/cathedral-bazaar/ .

10 As of August 2016, H2O.ai is currently testing a new product (branded as “Steam”) which will be commercially licensed.
