Appendix C. Resources: where to go for more

C.1. Spark

The number of books on Spark finally started growing in 2015—six years after Spark development first began. But Spark development is still moving fast, and the best resources are online.

Apache mailing lists

As with any open source project, especially one from Apache, the mailing lists are the best sources of information. Subscribing to them, and asking questions when you can’t find answers on the web, should be considered the minimum you have to do. Spark has one list for users and one for developers; you can subscribe to both from https://spark.apache.org/community.html.

Databricks forums

Databricks is the company commercializing Spark; its commercial product is a Spark notebook in the cloud. But the forums on www.databricks.com aren’t limited to the commercial product. Because a large percentage of the commits to Apache Spark come from Databricks, its forums also contain a lot of general-purpose information about Spark, including future plans for both the open source Apache Spark and the commercial Databricks product.

Conference and meetup videos

There are four major sources of Spark videos. None should be overlooked; they are all outstanding. Spark is moving fast, and watching these videos on your smartphone while on the treadmill or as a bedtime story is sometimes the only way to keep up:

1.  Spark Summit (West, East, and Europe)

2.  AMPLab AMPCamp

3.  Bay Area Spark Meetup

4.  O’Reilly Strata Conference (West and East)

Jira

If staying current with Spark is important to you, there’s no substitute for following the Spark Jira. Create an Apache Jira account if you don’t already have one, list all the issues every day in reverse chronological order, and click Watch for the issues that are important to you. That way you can know what new features, bug fixes, performance improvements, architectural changes, and support for third-party systems (file systems, cluster managers, database connectors, compression formats, serialization schemes, and so on) are coming down the pike and, more importantly, which versions they’re being targeted for.

There are some long-standing gems of planned features buried within Jira from the early days that are still being worked on or planned for, so, as painful and time-consuming as it may sound, the first time you list Spark Jira tickets, it’s probably worth your while to go through all of those that are still open.
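The daily listing described above can be bookmarked as a saved search. The following is a minimal sketch, not an official recipe: it assumes the Spark project key on the Apache Jira is SPARK and that the issue navigator accepts a JQL query in a `jql` URL parameter. The function name `open_issues_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# Assumption: the Apache Jira issue navigator lives at this address and
# reads its search from a "jql" query parameter. Verify before relying on it.
JIRA_ISSUES_URL = "https://issues.apache.org/jira/issues/"


def open_issues_url(project_key: str = "SPARK") -> str:
    """Return a browser URL listing a project's unresolved issues, newest first."""
    # JQL: restrict to one project, keep only open issues, sort by creation date.
    jql = (
        f"project = {project_key} "
        "AND resolution = Unresolved "
        "ORDER BY created DESC"
    )
    # urlencode percent-escapes "=" and turns spaces into "+".
    return JIRA_ISSUES_URL + "?" + urlencode({"jql": jql})


if __name__ == "__main__":
    print(open_issues_url())
```

Opening the printed URL in a browser should show the open tickets in reverse chronological order, ready for a first pass and for clicking Watch on the ones that matter.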

Twitter

If you think Twitter is just about celebrities and that nothing useful could possibly be expressed in 140 characters, you’re in for a surprise.

There’s a lot on Twitter in terms of Big Data, data science, and machine learning. You can regard Twitter as a link aggregator to hot or important blog posts, news stories, or Git repositories.

spark-packages.org

Because the developers of Apache Spark are reluctant to overload the official distribution with too many features and sub-packages, they set up the website spark-packages.org. Available add-on packages are broken up into categories such as machine learning, graphs, Python, and so on.

AMPLab

Spark came out of AMPLab, and AMPLab continues to develop new modules that work with Spark, as well as some other brand-new technologies unrelated to Spark. Modules that come out of AMPLab tend either to be incorporated directly into the Apache Spark distribution (such as GraphX; Catalyst, which became Spark SQL; and SparkR) or at least to be semi-officially supported, such as Tachyon.

Google Scholar Alerts

You’re likely familiar with Google Alerts, which sends you an email whenever a page is updated. But there’s something completely different called Google Scholar Alerts, part of scholar.google.com, which sends an email whenever a new paper is published that cites a paper you’re tracking.

If you set Google Scholar Alerts on some of the seminal Spark papers, such as Matei Zaharia’s “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” or Gonzalez et al.’s “GraphX: Graph Processing in a Distributed Dataflow Framework,” you can keep track of the latest advances in academia before they become commercialized.

Author blogs

If you do all that we’ve suggested so far, you won’t need to read these blogs. But if you want to save time and read only a distilled version of what’s coming in the future for Spark, Big Data, data science, and machine learning (at least through Michael Malak’s personal crystal ball), then his blogs are good resources.

C.2. Scala

The best Scala resources are books. Some Scala books are quite long. But because Scala has so many tricks, an alternative is to get the ones that are encyclopedias of tricks:

  • Scala Cookbook by Alvin Alexander (O’Reilly, 2013)
  • Scala Puzzlers by Andrew Phillips (Artima, 2014)

C.3. Graphs

There are tons of books on graph theory, many of them highly theoretical, either for use as college textbooks or for use by researchers. Practitioners, however, may find the following useful:

  • Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev (Cambridge University Press, 2011)
  • Graph Databases by Ian Robinson et al. (O’Reilly, 2015)