Preface |
Introduction / 1: |
Machine Perception / 1.1: |
An Example / 1.2: |
Related Fields / 1.2.1: |
Pattern Recognition Systems / 1.3: |
Sensing / 1.3.1: |
Segmentation and Grouping / 1.3.2: |
Feature Extraction / 1.3.3: |
Classification / 1.3.4: |
Post Processing / 1.3.5: |
The Design Cycle / 1.4: |
Data Collection / 1.4.1: |
Feature Choice / 1.4.2: |
Model Choice / 1.4.3: |
Training / 1.4.4: |
Evaluation / 1.4.5: |
Computational Complexity / 1.4.6: |
Learning and Adaptation / 1.5: |
Supervised Learning / 1.5.1: |
Unsupervised Learning / 1.5.2: |
Reinforcement Learning / 1.5.3: |
Conclusion / 1.6: |
Summary by Chapters |
Bibliographical and Historical Remarks |
Bibliography |
Bayesian Decision Theory / 2: |
Bayesian Decision Theory--Continuous Features / 2.2: |
Two-Category Classification / 2.2.1: |
Minimum-Error-Rate Classification / 2.3: |
Minimax Criterion / 2.3.1: |
Neyman-Pearson Criterion / 2.3.2: |
Classifiers, Discriminant Functions, and Decision Surfaces / 2.4: |
The Multicategory Case / 2.4.1: |
The Two-Category Case / 2.4.2: |
The Normal Density / 2.5: |
Univariate Density / 2.5.1: |
Multivariate Density / 2.5.2: |
Discriminant Functions for the Normal Density / 2.6: |
Case 1: Σᵢ = σ²I / 2.6.1: |
Case 2: Σᵢ = Σ / 2.6.2: |
Case 3: Σᵢ = arbitrary / 2.6.3: |
Decision Regions for Two-Dimensional Gaussian Data / Example 1: |
Error Probabilities and Integrals / 2.7: |
Error Bounds for Normal Densities / 2.8: |
Chernoff Bound / 2.8.1: |
Bhattacharyya Bound / 2.8.2: |
Error Bounds for Gaussian Distributions / Example 2: |
Signal Detection Theory and Operating Characteristics / 2.8.3: |
Bayes Decision Theory--Discrete Features / 2.9: |
Independent Binary Features / 2.9.1: |
Bayesian Decisions for Three-Dimensional Binary Data / Example 3: |
Missing and Noisy Features / 2.10: |
Missing Features / 2.10.1: |
Noisy Features / 2.10.2: |
Bayesian Belief Networks / 2.11: |
Belief Network for Fish / Example 4: |
Compound Bayesian Decision Theory and Context / 2.12: |
Summary |
Problems |
Computer exercises |
Maximum-Likelihood and Bayesian Parameter Estimation / 3: |
Maximum-Likelihood Estimation / 3.2: |
The General Principle / 3.2.1: |
The Gaussian Case: Unknown μ / 3.2.2: |
The Gaussian Case: Unknown μ and Σ / 3.2.3: |
Bias / 3.2.4: |
Bayesian Estimation / 3.3: |
The Class-Conditional Densities / 3.3.1: |
The Parameter Distribution / 3.3.2: |
Bayesian Parameter Estimation: Gaussian Case / 3.4: |
The Univariate Case: p(μ|D) / 3.4.1: |
The Univariate Case: p(x|D) / 3.4.2: |
The Multivariate Case / 3.4.3: |
Bayesian Parameter Estimation: General Theory / 3.5: |
Recursive Bayes Learning |
When Do Maximum-Likelihood and Bayes Methods Differ? / 3.5.1: |
Noninformative Priors and Invariance / 3.5.2: |
Gibbs Algorithm / 3.5.3: |
Sufficient Statistics / 3.6: |
Sufficient Statistics and the Exponential Family / 3.6.1: |
Problems of Dimensionality / 3.7: |
Accuracy, Dimension, and Training Sample Size / 3.7.1: |
Overfitting / 3.7.2: |
Component Analysis and Discriminants / 3.8: |
Principal Component Analysis (PCA) / 3.8.1: |
Fisher Linear Discriminant / 3.8.2: |
Multiple Discriminant Analysis / 3.8.3: |
Expectation-Maximization (EM) / 3.9: |
Expectation-Maximization for a 2D Normal Model |
Hidden Markov Models / 3.10: |
First-Order Markov Models / 3.10.1: |
First-Order Hidden Markov Models / 3.10.2: |
Hidden Markov Model Computation / 3.10.3: |
Evaluation / 3.10.4: |
Hidden Markov Model |
Decoding / 3.10.5: |
HMM Decoding |
Learning / 3.10.6: |
Nonparametric Techniques / 4: |
Density Estimation / 4.2: |
Parzen Windows / 4.3: |
Convergence of the Mean / 4.3.1: |
Convergence of the Variance / 4.3.2: |
Illustrations / 4.3.3: |
Classification Example / 4.3.4: |
Probabilistic Neural Networks (PNNs) / 4.3.5: |
Choosing the Window Function / 4.3.6: |
kₙ-Nearest-Neighbor Estimation / 4.4: |
kₙ-Nearest-Neighbor and Parzen-Window Estimation / 4.4.1: |
Estimation of A Posteriori Probabilities / 4.4.2: |
The Nearest-Neighbor Rule / 4.5: |
Convergence of the Nearest Neighbor / 4.5.1: |
Error Rate for the Nearest-Neighbor Rule / 4.5.2: |
Error Bounds / 4.5.3: |
The k-Nearest-Neighbor Rule / 4.5.4: |
Computational Complexity of the k-Nearest-Neighbor Rule / 4.5.5: |
Metrics and Nearest-Neighbor Classification / 4.6: |
Properties of Metrics / 4.6.1: |
Tangent Distance / 4.6.2: |
Fuzzy Classification / 4.7: |
Reduced Coulomb Energy Networks / 4.8: |
Approximations by Series Expansions / 4.9: |
Linear Discriminant Functions / 5: |
Linear Discriminant Functions and Decision Surfaces / 5.2: |
Generalized Linear Discriminant Functions / 5.3: |
The Two-Category Linearly Separable Case / 5.4: |
Geometry and Terminology / 5.4.1: |
Gradient Descent Procedures / 5.4.2: |
Minimizing the Perceptron Criterion Function / 5.5: |
The Perceptron Criterion Function / 5.5.1: |
Convergence Proof for Single-Sample Correction / 5.5.2: |
Some Direct Generalizations / 5.5.3: |
Relaxation Procedures / 5.6: |
The Descent Algorithm / 5.6.1: |
Convergence Proof / 5.6.2: |
Nonseparable Behavior / 5.7: |
Minimum Squared-Error Procedures / 5.8: |
Minimum Squared-Error and the Pseudoinverse / 5.8.1: |
Constructing a Linear Classifier by Matrix Pseudoinverse |
Relation to Fisher's Linear Discriminant / 5.8.2: |
Asymptotic Approximation to an Optimal Discriminant / 5.8.3: |
The Widrow-Hoff or LMS Procedure / 5.8.4: |
Stochastic Approximation Methods / 5.8.5: |
The Ho-Kashyap Procedures / 5.9: |
The Descent Procedure / 5.9.1: |
Some Related Procedures / 5.9.2: |
Linear Programming Algorithms / 5.10: |
Linear Programming / 5.10.1: |
The Linearly Separable Case / 5.10.2: |
Support Vector Machines / 5.11: |
SVM Training / 5.11.1: |
SVM for the XOR Problem |
Multicategory Generalizations / 5.12: |
Kesler's Construction / 5.12.1: |
Convergence of the Fixed-Increment Rule / 5.12.2: |
Generalizations for MSE Procedures / 5.12.3: |
Multilayer Neural Networks / 6: |
Feedforward Operation and Classification / 6.2: |
General Feedforward Operation / 6.2.1: |
Expressive Power of Multilayer Networks / 6.2.2: |
Backpropagation Algorithm / 6.3: |
Network Learning / 6.3.1: |
Training Protocols / 6.3.2: |
Learning Curves / 6.3.3: |
Error Surfaces / 6.4: |
Some Small Networks / 6.4.1: |
The Exclusive-OR (XOR) / 6.4.2: |
Larger Networks / 6.4.3: |
How Important Are Multiple Minima? / 6.4.4: |
Backpropagation as Feature Mapping / 6.5: |
Representations at the Hidden Layer--Weights / 6.5.1: |
Backpropagation, Bayes Theory and Probability / 6.6: |
Bayes Discriminants and Neural Networks / 6.6.1: |
Outputs as Probabilities / 6.6.2: |
Related Statistical Techniques / 6.7: |
Practical Techniques for Improving Backpropagation / 6.8: |
Activation Function / 6.8.1: |
Parameters for the Sigmoid / 6.8.2: |
Scaling Input / 6.8.3: |
Target Values / 6.8.4: |
Training with Noise / 6.8.5: |
Manufacturing Data / 6.8.6: |
Number of Hidden Units / 6.8.7: |
Initializing Weights / 6.8.8: |
Learning Rates / 6.8.9: |
Momentum / 6.8.10: |
Weight Decay / 6.8.11: |
Hints / 6.8.12: |
On-Line, Stochastic or Batch Training? / 6.8.13: |
Stopped Training / 6.8.14: |
Number of Hidden Layers / 6.8.15: |
Criterion Function / 6.8.16: |
Second-Order Methods / 6.9: |
Hessian Matrix / 6.9.1: |
Newton's Method / 6.9.2: |
Quickprop / 6.9.3: |
Conjugate Gradient Descent / 6.9.4: |
Additional Networks and Training Methods / 6.10: |
Radial Basis Function Networks (RBFs) / 6.10.1: |
Special Bases / 6.10.2: |
Matched Filters / 6.10.3: |
Convolutional Networks / 6.10.4: |
Recurrent Networks / 6.10.5: |
Cascade-Correlation / 6.10.6: |
Regularization, Complexity Adjustment and Pruning / 6.11: |
Stochastic Methods / 7: |
Stochastic Search / 7.2: |
Simulated Annealing / 7.2.1: |
The Boltzmann Factor / 7.2.2: |
Deterministic Simulated Annealing / 7.2.3: |
Boltzmann Learning / 7.3: |
Stochastic Boltzmann Learning of Visible States / 7.3.1: |
Missing Features and Category Constraints / 7.3.2: |
Deterministic Boltzmann Learning / 7.3.3: |
Initialization and Setting Parameters / 7.3.4: |
Boltzmann Networks and Graphical Models / 7.4: |
Other Graphical Models / 7.4.1: |
Evolutionary Methods / 7.5: |
Genetic Algorithms / 7.5.1: |
Further Heuristics / 7.5.2: |
Why Do They Work? / 7.5.3: |
Genetic Programming / 7.6: |
Nonmetric Methods / 8: |
Decision Trees / 8.2: |
CART / 8.3: |
Number of Splits / 8.3.1: |
Query Selection and Node Impurity / 8.3.2: |
When to Stop Splitting / 8.3.3: |
Pruning / 8.3.4: |
Assignment of Leaf Node Labels / 8.3.5: |
A Simple Tree |
Multivariate Decision Trees / 8.3.8: |
Priors and Costs / 8.3.9: |
Missing Attributes / 8.3.10: |
Surrogate Splits and Missing Attributes |
Other Tree Methods / 8.4: |
ID3 / 8.4.1: |
C4.5 / 8.4.2: |
Which Tree Classifier Is Best? / 8.4.3: |
Recognition with Strings / 8.5: |
String Matching / 8.5.1: |
Edit Distance / 8.5.2: |
String Matching with Errors / 8.5.4: |
String Matching with the "Don't-Care" Symbol / 8.5.5: |
Grammatical Methods / 8.6: |
Grammars / 8.6.1: |
Types of String Grammars / 8.6.2: |
A Grammar for Pronouncing Numbers |
Recognition Using Grammars / 8.6.3: |
Grammatical Inference / 8.7: |
Rule-Based Methods / 8.8: |
Learning Rules / 8.8.1: |
Algorithm-Independent Machine Learning / 9: |
Lack of Inherent Superiority of Any Classifier / 9.2: |
No Free Lunch Theorem / 9.2.1: |
No Free Lunch for Binary Data |
Ugly Duckling Theorem / 9.2.2: |
Minimum Description Length (MDL) / 9.2.3: |
Minimum Description Length Principle / 9.2.4: |
Overfitting Avoidance and Occam's Razor / 9.2.5: |
Bias and Variance / 9.3: |
Bias and Variance for Regression / 9.3.1: |
Bias and Variance for Classification / 9.3.2: |
Resampling for Estimating Statistics / 9.4: |
Jackknife / 9.4.1: |
Jackknife Estimate of Bias and Variance of the Mode |
Bootstrap / 9.4.2: |
Resampling for Classifier Design / 9.5: |
Bagging / 9.5.1: |
Boosting / 9.5.2: |
Learning with Queries / 9.5.3: |
Arcing, Learning with Queries, Bias and Variance / 9.5.4: |
Estimating and Comparing Classifiers / 9.6: |
Parametric Models / 9.6.1: |
Cross-Validation / 9.6.2: |
Jackknife and Bootstrap Estimation of Classification Accuracy / 9.6.3: |
Maximum-Likelihood Model Comparison / 9.6.4: |
Bayesian Model Comparison / 9.6.5: |
The Problem-Average Error Rate / 9.6.6: |
Predicting Final Performance from Learning Curves / 9.6.7: |
The Capacity of a Separating Plane / 9.6.8: |
Combining Classifiers / 9.7: |
Component Classifiers with Discriminant Functions / 9.7.1: |
Component Classifiers without Discriminant Functions / 9.7.2: |
Unsupervised Learning and Clustering / 10: |
Mixture Densities and Identifiability / 10.2: |
Maximum-Likelihood Estimates / 10.3: |
Application to Normal Mixtures / 10.4: |
Case 1: Unknown Mean Vectors / 10.4.1: |
Case 2: All Parameters Unknown / 10.4.2: |
k-Means Clustering / 10.4.3: |
Fuzzy k-Means Clustering / 10.4.4: |
Unsupervised Bayesian Learning / 10.5: |
The Bayes Classifier / 10.5.1: |
Learning the Parameter Vector / 10.5.2: |
Unsupervised Learning of Gaussian Data |
Decision-Directed Approximation / 10.5.3: |
Data Description and Clustering / 10.6: |
Similarity Measures / 10.6.1: |
Criterion Functions for Clustering / 10.7: |
The Sum-of-Squared-Error Criterion / 10.7.1: |
Related Minimum Variance Criteria / 10.7.2: |
Scatter Criteria / 10.7.3: |
Clustering Criteria |
Iterative Optimization / 10.8: |
Hierarchical Clustering / 10.9: |
Definitions / 10.9.1: |
Agglomerative Hierarchical Clustering / 10.9.2: |
Stepwise-Optimal Hierarchical Clustering / 10.9.3: |
Hierarchical Clustering and Induced Metrics / 10.9.4: |
The Problem of Validity / 10.10: |
On-line Clustering / 10.11: |
Unknown Number of Clusters / 10.11.1: |
Adaptive Resonance / 10.11.2: |
Learning with a Critic / 10.11.3: |
Graph-Theoretic Methods / 10.12: |
Component Analysis / 10.13: |
Nonlinear Component Analysis (NLCA) / 10.13.2: |
Independent Component Analysis (ICA) / 10.13.3: |
Low-Dimensional Representations and Multidimensional Scaling (MDS) / 10.14: |
Self-Organizing Feature Maps / 10.14.1: |
Clustering and Dimensionality Reduction / 10.14.2: |
Mathematical Foundations / A: |
Notation / A.1: |
Linear Algebra / A.2: |
Notation and Preliminaries / A.2.1: |
Inner Product / A.2.2: |
Outer Product / A.2.3: |
Derivatives of Matrices / A.2.4: |
Determinant and Trace / A.2.5: |
Matrix Inversion / A.2.6: |
Eigenvectors and Eigenvalues / A.2.7: |
Lagrange Optimization / A.3: |
Probability Theory / A.4: |
Discrete Random Variables / A.4.1: |
Expected Values / A.4.2: |
Pairs of Discrete Random Variables / A.4.3: |
Statistical Independence / A.4.4: |
Expected Values of Functions of Two Variables / A.4.5: |
Conditional Probability / A.4.6: |
The Law of Total Probability and Bayes' Rule / A.4.7: |
Vector Random Variables / A.4.8: |
Expectations, Mean Vectors and Covariance Matrices / A.4.9: |
Continuous Random Variables / A.4.10: |
Distributions of Sums of Independent Random Variables / A.4.11: |
Normal Distributions / A.4.12: |
Gaussian Derivatives and Integrals / A.5: |
Multivariate Normal Densities / A.5.1: |
Bivariate Normal Densities / A.5.2: |
Hypothesis Testing / A.6: |
Chi-Squared Test / A.6.1: |
Information Theory / A.7: |
Entropy and Information / A.7.1: |
Relative Entropy / A.7.2: |
Mutual Information / A.7.3: |
Index |