Table of Contents for Data Science on the Google Cloud Platform
by Valliappa Lakshmanan
Preface
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Making Better Decisions Based on Data
Many Similar Decisions
The Role of Data Engineers
The Cloud Makes Data Engineers Possible
The Cloud Turbocharges Data Science
Case Studies Get at the Stubborn Facts
A Probabilistic Decision
Data and Tools
Getting Started with the Code
Summary
2. Ingesting Data into the Cloud
Airline On-Time Performance Data
Knowability
Training–Serving Skew
Download Procedure
Dataset Fields
Why Not Store the Data in Situ?
Scaling Up
Scaling Out
Data in Situ with Colossus and Jupiter
Ingesting Data
Reverse Engineering a Web Form
Dataset Download
Exploration and Cleanup
Uploading Data to Google Cloud Storage
Scheduling Monthly Downloads
Ingesting in Python
Cloud Functions
Securing the URL
Scheduling the Cloud Function
Improving the Cloud Function Design
Summary
Code Break
3. Creating Compelling Dashboards
Explain Your Model with Dashboards
Why Build a Dashboard First?
Accuracy, Honesty, and Good Design
Loading Data into Google Cloud SQL
Create a Google Cloud SQL Instance
Interacting with Google Cloud Platform
Controlling Access to MySQL
Create Tables
Populating Tables
Building Our First Model
Contingency Table
Threshold Optimization
Machine Learning
Building a Dashboard
Getting Started with Data Studio
Creating Charts
Adding End-User Controls
Showing Proportions with a Pie Chart
Explaining a Contingency Table
Summary
4. Streaming Data: Publication and Ingest
Designing the Event Feed
Time Correction
Apache Beam/Cloud Dataflow
Parsing Airports Data
Adding Time Zone Information
Converting Times to UTC
Correcting Dates
Creating Events
Running the Pipeline in the Cloud
Publishing an Event Stream to Cloud Pub/Sub
Get Records to Publish
Paging Through Records
Building a Batch of Events
Publishing a Batch of Events
Real-Time Stream Processing
Streaming in Java Dataflow
Executing the Stream Processing
Analyzing Streaming Data in BigQuery
Real-Time Dashboard
Summary
5. Interactive Data Exploration
Exploratory Data Analysis
Loading Flights Data into BigQuery
Advantages of a Serverless Columnar Database
Staging on Cloud Storage
Access Control
Federated Queries
Ingesting CSV Files
Exploratory Data Analysis in Cloud AI Platform Notebooks
Jupyter Notebooks
Cloud AI Platform Notebooks
Installing Packages in Cloud AI Platform Notebooks
Jupyter Magic for Google Cloud Platform
Quality Control
Oddball Values
Outlier Removal: Big Data Is Different
Filtering Data on Occurrence Frequency
Arrival Delay Conditioned on Departure Delay
Applying Probabilistic Decision Threshold
Empirical Probability Distribution Function
The Answer Is...
Evaluating the Model
Random Shuffling
Splitting by Date
Training and Testing
Summary
6. Bayes Classifier on Cloud Dataproc
MapReduce and the Hadoop Ecosystem
How MapReduce Works
Apache Hadoop
Google Cloud Dataproc
Need for Higher-Level Tools
Jobs, Not Clusters
Initialization Actions
Quantization Using Spark SQL
JupyterLab on Cloud Dataproc
Independence Check Using BigQuery
Spark SQL in JupyterLab
Histogram Equalization
Dynamically Resizing Clusters
Bayes Classification Using Pig
Running a Pig Job on Cloud Dataproc
Automating Cloud Dataproc with Workflow Templates
Limiting to Training Days
The Decision Criteria
Evaluating the Bayesian Model
Summary
7. Machine Learning: Logistic Regression in Spark and BigQuery
Logistic Regression
Spark ML Library
Getting Started with Spark Machine Learning
Spark Logistic Regression
Creating a Training Dataset
Dealing with Corner Cases
Creating Training Examples
Training
Predicting by Using a Model
Evaluating a Model
Feature Engineering
Experimental Framework
Creating the Held-Out Dataset
Feature Selection
Scaling and Clipping Features
Feature Transforms
Categorical Variables
Scalable Machine Learning Models in BigQuery
Repeatable, Real Time
Summary
8. Time-Windowed Aggregate Features
The Need for Time Averages
Dataflow in Java
Setting Up Development Environment
Filtering with Beam
Pipeline Options and Text I/O
Run on Cloud
Parsing into Objects
Computing Time Averages
Grouping and Combining
Parallel Do with Side Input
Debugging
BigQueryIO
Mutating the Flight Object
Sliding Window Computation in Batch Mode
Running in the Cloud
Monitoring, Troubleshooting, and Performance Tuning
Troubleshooting Pipeline
Side Input Limitations
Redesigning the Pipeline
Removing Duplicates
Summary
9. Machine Learning Classifier Using TensorFlow
Toward More Complex Models
Reading Data into TensorFlow
Training and Evaluation in Keras
Model Function
Input and Features
Training and Evaluating Input Functions
Saving and Exporting
Performing a Training Run
Training in the Cloud
Wide-and-Deep Model
Hyperparameter Tuning
Deploying the Model
Predicting with the Model
Explaining the Model
Summary
10. Real-Time Machine Learning
Invoking Prediction Service
Java Classes for Request and Response
Post Request and Parse Response
Client of Prediction Service
Adding Predictions to Flight Information
Batch Input and Output
Data Processing Pipeline
Identifying Inefficiency
Batching Requests
Streaming Pipeline
Flattening PCollections
Executing Streaming Pipeline
Late and Out-of-Order Records
Watermarks and Triggers
Transactions, Throughput, and Latency
Possible Streaming Sinks
Cloud Bigtable
Designing Tables
Designing the Row Key
Streaming into Cloud Bigtable
Querying from Cloud Bigtable
Evaluating Model Performance
The Need for Continuous Training
Evaluation Pipeline
Evaluating Performance
Marginal Distributions
Checking Model Behavior
Identifying Behavioral Change
Summary
Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
Handling Sensitive Information
Identifying Sensitive Data
Protecting Sensitive Data
Removing Sensitive Data
Masking Sensitive Data
Coarsening Sensitive Data
Establishing a Governance Policy
Index