Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O'Reilly), provides a much more complete introduction to Spark than this chapter can. I thoroughly recommend it.
If you are interested in learning more about information theory, I recommend David MacKay's book Information Theory, Inference, and Learning Algorithms.
Introduction to Information Retrieval, by Manning, Raghavan, and Schütze, describes how to analyze textual data (including lemmatization and stemming). An online