Preface
1. Making Better Decisions Based on Data
Many Similar Decisions
The Role of Data Engineers
The Cloud Makes Data Engineers Possible
The Cloud Turbocharges Data Science
Case Studies Get at the Stubborn Facts
A Probabilistic Decision
Data and Tools
Getting Started with the Code
Summary
2. Ingesting Data into the Cloud
Airline On-Time Performance Data
Knowability
Training-Serving Skew
Download Procedure
Dataset Attributes
Why Not Store the Data in Situ?
Scaling Up
Scaling Out
Data in Situ with Colossus and Jupiter
Ingesting Data
Reverse Engineering a Web Form
Dataset Download
Exploration and Cleanup
Uploading Data to Google Cloud Storage
Scheduling Monthly Downloads
Ingesting in Python
Flask Web App
Running on App Engine
Securing the URL
Scheduling a Cron Task
Code Break
3. Creating Compelling Dashboards
Explain Your Model with Dashboards
Why Build a Dashboard First?
Accuracy, Honesty, and Good Design
Loading Data into Google Cloud SQL
Create a Google Cloud SQL Instance
Interacting with Google Cloud Platform
Controlling Access to MySQL
Create Tables
Populating Tables
Building Our First Model
Contingency Table
Threshold Optimization
Machine Learning
Building a Dashboard
Getting Started with Data Studio
Creating Charts
Adding End-User Controls
Showing Proportions with a Pie Chart
Explaining a Contingency Table
4. Streaming Data: Publication and Ingest
Designing the Event Feed
Time Correction
Apache Beam/Cloud Dataflow
Parsing Airports Data
Adding Time Zone Information
Converting Times to UTC
Correcting Dates
Creating Events
Running the Pipeline in the Cloud
Publishing an Event Stream to Cloud Pub/Sub
Get Records to Publish
Paging Through Records
Building a Batch of Events
Publishing a Batch of Events
Real-Time Stream Processing
Streaming in Java Dataflow
Executing the Stream Processing
Analyzing Streaming Data in BigQuery
Real-Time Dashboard
5. Interactive Data Exploration
Exploratory Data Analysis
Loading Flights Data into BigQuery
Advantages of a Serverless Columnar Database
Staging on Cloud Storage
Access Control
Federated Queries
Ingesting CSV Files
Exploratory Data Analysis in Cloud Datalab
Jupyter Notebooks
Cloud Datalab
Installing Packages in Cloud Datalab
Jupyter Magic for Google Cloud Platform
Quality Control
Oddball Values
Outlier Removal: Big Data Is Different
Filtering Data on Occurrence Frequency
Arrival Delay Conditioned on Departure Delay
Applying Probabilistic Decision Threshold
Empirical Probability Distribution Function
The Answer Is…
Evaluating the Model
Random Shuffling
Splitting by Date
Training and Testing
6. Bayes Classifier on Cloud Dataproc
MapReduce and the Hadoop Ecosystem
How MapReduce Works
Apache Hadoop
Google Cloud Dataproc
Need for Higher-Level Tools
Jobs, Not Clusters
Initialization Actions
Quantization Using Spark SQL
Google Cloud Datalab on Cloud Dataproc
Independence Check Using BigQuery
Spark SQL in Google Cloud Datalab
Histogram Equalization
Dynamically Resizing Clusters
Bayes Classification Using Pig
Running a Pig Job on Cloud Dataproc
Limiting to Training Days
The Decision Criteria
Evaluating the Bayesian Model
7. Machine Learning: Logistic Regression on Spark
Logistic Regression
Spark ML Library
Getting Started with Spark Machine Learning
Spark Logistic Regression
Creating a Training Dataset
Dealing with Corner Cases
Creating Training Examples
Training
Predicting by Using a Model
Evaluating a Model
Feature Engineering
Experimental Framework
Creating the Held-Out Dataset
Feature Selection
Scaling and Clipping Features
Feature Transforms
Categorical Variables
Scalable, Repeatable, Real Time
8. Time-Windowed Aggregate Features
The Need for Time Averages
Dataflow in Java
Setting Up Development Environment
Filtering with Beam
Pipeline Options and Text I/O
Run on Cloud
Parsing into Objects
Computing Time Averages
Grouping and Combining
Parallel Do with Side Input
Debugging
BigQueryIO
Mutating the Flight Object
Sliding Window Computation in Batch Mode
Running in the Cloud
Monitoring, Troubleshooting, and Performance Tuning
Troubleshooting Pipeline
Side Input Limitations
Redesigning the Pipeline
Removing Duplicates
9. Machine Learning Classifier Using TensorFlow
Toward More Complex Models
Reading Data into TensorFlow
Setting Up an Experiment
Linear Classifier
Training and Evaluating Input Functions
Serving Input Function
Creating an Experiment
Performing a Training Run
Distributed Training in the Cloud
Improving the ML Model
Deep Neural Network Model
Embeddings
Wide-and-Deep Model
Hyperparameter Tuning
Deploying the Model
Predicting with the Model
Explaining the Model
10. Real-Time Machine Learning
Invoking Prediction Service
Java Classes for Request and Response
Post Request and Parse Response
Client of Prediction Service
Adding Predictions to Flight Information
Batch Input and Output
Data Processing Pipeline
Identifying Inefficiency
Batching Requests
Streaming Pipeline
Flattening PCollections
Executing Streaming Pipeline
Late and Out-of-Order Records
Watermarks and Triggers
Transactions, Throughput, and Latency
Possible Streaming Sinks
Cloud Bigtable
Designing Tables
Designing the Row Key
Streaming into Cloud Bigtable
Querying from Cloud Bigtable
Evaluating Model Performance
The Need for Continuous Training
Evaluation Pipeline
Evaluating Performance
Marginal Distributions
Checking Model Behavior
Identifying Behavioral Change
Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
Index