Chapter 30. Looking Ahead

Apache Spark is a fast-moving project.

We have seen that Spark Streaming is an older, relatively low-level API built on top of Resilient Distributed Datasets (RDDs) and the plain Java, Scala, or Python objects that every programmer is used to. Spark Streaming is battle tested and deployed in many production applications. We can consider it a stable API, where effort now goes mostly into maintenance.

Structured Streaming, being built upon the Dataset and DataFrame APIs of Spark, takes full advantage of the impressive optimization work Apache Spark introduced through Spark SQL, such as the Catalyst engine and the code generation and memory management from Project Tungsten. In this sense, Structured Streaming is the future of streaming in Apache Spark, and it is where the main development efforts will lie for the foreseeable future. As such, Structured Streaming is delivering exciting new developments, such as continuous processing.

We need to mention that Structured Streaming is a newer framework for stream processing and, as such, is less mature, a limitation we have outlined in particular in the machine learning chapters of this book. It is important to keep this in mind, especially when embarking on a project that relies heavily on machine learning. Given the current interest in machine learning, we should expect future versions of Spark to bring improvements in this area and more algorithms supported in streaming mode. We hope to have equipped you with all the elements needed to make an accurate evaluation of the offerings of both APIs.

There is one remaining question that we would like to address: how to keep learning and improving in this space.

Stay Plugged In

One of the strongest aspects of Apache Spark has always been its community.

Apache Spark, as an open source project, has been very successful at harnessing contributions from individuals and companies into a comprehensive and consistent code base, as is demonstrated in Figure 30-1.

Figure 30-1. Spark contribution timeline

The GitHub page of Apache Spark is evidence of its steady development pace: more than 200 developers contribute to each release, and the total number of contributors is in the thousands.

There are several established channels to get in touch with the community.

Seek Help on Stack Overflow

Stack Overflow, the well-known Q&A community, is a very active place to discuss Spark-related questions. It's advisable to search for existing answers on the site before asking a new question, because chances are that someone before you has already asked the same or a similar one.

Start Discussions on the Mailing Lists

The Apache Spark community has always relied heavily on two mailing lists, where the core developers and creators of Apache Spark are committed to helping users and fellow contributors on a regular basis. The user mailing list, at [email protected], is for readers who are trying to find the best ways to use Spark, whereas the developer mailing list, at [email protected], caters to those who are improving the Spark framework itself.

You’ll find up-to-date details on how to subscribe to those mailing lists for free online.

Attend Conferences

The Spark Summit is the twice-yearly conference cycle promoted by Databricks, the company that directly shepherds the Apache Spark project. Beyond a conference program dedicated to Spark, these events offer a watering hole where Spark developers can meet the community and one another. You can find more information online.

Attend Meetups

If you live near a city with a big technology footprint, consider attending user groups or meetups. They are often free and a great opportunity to see early previews of conference talks or more intimate demonstrations and use cases of Apache Spark applications.

Read Books

We have previously mentioned the 2015 book Learning Spark by Matei Zaharia and the other founders of the Spark project as a good entry point for building a foundation in the workings of Apache Spark. O'Reilly publishes numerous other volumes on Spark that we would also recommend; let us mention just the 2017 volume Spark: The Definitive Guide by Matei Zaharia and Bill Chambers as an essential refresher on more recent evolutions of the Spark platform.

On the more theoretical side, you might find yourself looking for more in-depth knowledge of streaming algorithms and machine learning in the abstract before implementing more of those notions using Apache Spark. There is too much material in this domain for us to recommend it exhaustively, but let us just mention that Alex Smola's 2012 course on data streams at Berkeley is a good entry point, with a rich bibliography.

Contributing to the Apache Spark Project

When you want to contribute the results of your own algorithmic adventures back to the open source community, you'll find that Apache Spark development is organized as follows:

The developer workflow for Spark includes design documents for the larger developments, which you'll find listed in the resources mentioned previously and which offer a wonderful window into the development process. Another way to get a feel for developer work on Apache Spark is to watch the videos of Holden Karau, Apache Spark developer and PMC member, who live streams her pull request reviews and even coding sessions. You'll find this unique "day in the life of a Spark developer" experience online.

All of these resources should not only give you the tools to master stream processing with Apache Spark, but also provide you with the means to contribute to the collective endeavor that makes this system better every day.

We hope you have enjoyed this book!
