Preface to the First Edition
Preface to the Second Edition
List of Abbreviations

1 Human Speech Communication
  1.1 Value of speech for human-machine communication
  1.2 Ideas and language
  1.3 Relationship between written and spoken language
  1.4 Phonetics and phonology
  1.5 The acoustic signal
  1.6 Phonemes, phones and allophones
  1.7 Vowels, consonants and syllables
  1.8 Phonemes and spelling
  1.9 Prosodic features
  1.10 Language, accent and dialect
  1.11 Supplementing the acoustic signal
  1.12 The complexity of speech processing
  Chapter 1 summary
  Chapter 1 exercises

2 Mechanisms and Models of Human Speech Production
  2.1 Introduction
  2.2 Sound sources
  2.3 The resonant system
  2.4 Interaction of laryngeal and vocal tract functions
  2.5 Radiation
  2.6 Waveforms and spectrograms
  2.7 Speech production models
    2.7.1 Excitation models
    2.7.2 Vocal tract models
  Chapter 2 summary
  Chapter 2 exercises

3 Mechanisms and Models of the Human Auditory System
  3.1 Physiology of the outer and middle ears
  3.3 Structure of the cochlea
  3.4 Neural response
  3.5 Psychophysical measurements
  3.6 Analysis of simple and complex signals
  3.7 Models of the auditory system
    3.7.1 Mechanical filtering
    3.7.2 Models of neural transduction
    3.7.3 Higher-level neural processing
  Chapter 3 summary
  Chapter 3 exercises

4 Digital Coding of Speech
  4.1 Simple waveform coders
    4.2.1 Pulse code modulation
    4.2.2 Delta modulation
  4.3 Analysis/synthesis systems (vocoders)
    4.3.1 Channel vocoders
    4.3.2 Sinusoidal coders
    4.3.3 LPC vocoders
    4.3.4 Formant vocoders
    4.3.5 Efficient parameter coding
    4.3.6 Vocoders based on segmental/phonetic structure
  4.4 Intermediate systems
    4.4.1 Sub-band coding
    4.4.2 Linear prediction with simple coding of the residual
    4.4.3 Adaptive predictive coding
    4.4.4 Multipulse LPC
    4.4.5 Code-excited linear prediction
  4.5 Evaluating speech coding algorithms
    4.5.1 Subjective speech intelligibility measures
    4.5.2 Subjective speech quality measures
    4.5.3 Objective speech quality measures
  4.6 Choosing a coder
  Chapter 4 summary
  Chapter 4 exercises

5 Message Synthesis from Stored Human Speech Components
  5.1 Concatenation of whole words
    5.2.1 Simple waveform concatenation
    5.2.2 Concatenation of vocoded words
    5.2.3 Limitations of concatenating word-size units
  5.3 Concatenation of sub-word units: general principles
    5.3.1 Choice of sub-word unit
    5.3.2 Recording and selecting data for the units
    5.3.3 Varying durations of concatenative units
  5.4 Synthesis by concatenating vocoded sub-word units
  5.5 Synthesis by concatenating waveform segments
    5.5.1 Pitch modification
    5.5.2 Timing modification
    5.5.3 Performance of waveform concatenation
  5.6 Variants of concatenative waveform synthesis
  5.7 Hardware requirements
  Chapter 5 summary
  Chapter 5 exercises

6 Phonetic Synthesis by Rule
  6.1 Acoustic-phonetic rules
  6.3 Rules for formant synthesizers
  6.4 Table-driven phonetic rules
    6.4.1 Simple transition calculation
    6.4.2 Overlapping transitions
    6.4.3 Using the tables to generate utterances
  6.5 Optimizing phonetic rules
    6.5.1 Automatic adjustment of phonetic rules
    6.5.2 Rules for different speaker types
    6.5.3 Incorporating intensity rules
  6.6 Current capabilities of phonetic synthesis by rule
  Chapter 6 summary
  Chapter 6 exercises

7 Speech Synthesis from Textual or Conceptual Input
  7.1 Emulating the human speaking process
  7.3 Converting from text to speech
    7.3.1 TTS system architecture
    7.3.2 Overview of tasks required for TTS conversion
  7.4 Text analysis
    7.4.1 Text pre-processing
    7.4.2 Morphological analysis
    7.4.3 Phonetic transcription
    7.4.4 Syntactic analysis and prosodic phrasing
    7.4.5 Assignment of lexical stress and pattern of word accents
  7.5 Prosody generation
    7.5.1 Timing pattern
    7.5.2 Fundamental frequency contour
  7.6 Implementation issues
  7.7 Current TTS synthesis capabilities
  7.8 Speech synthesis from concept
  Chapter 7 summary
  Chapter 7 exercises

8 Introduction to Automatic Speech Recognition: Template Matching
  8.1 General principles of pattern matching
  8.3 Distance metrics
    8.3.1 Filter-bank analysis
    8.3.2 Level normalization
  8.4 End-point detection for isolated words
  8.5 Allowing for timescale variations
  8.6 Dynamic programming for time alignment
  8.7 Refinements to isolated-word DP matching
  8.8 Score pruning
  8.9 Allowing for end-point errors
  8.10 Dynamic programming for connected words
  8.11 Continuous speech recognition
  8.12 Syntactic constraints
  8.13 Training a whole-word recognizer
  Chapter 8 summary
  Chapter 8 exercises

9 Introduction to Stochastic Modelling
  9.1 Feature variability in pattern matching
  9.2 Introduction to hidden Markov models
  9.3 Probability calculations in hidden Markov models
  9.4 The Viterbi algorithm
  9.5 Parameter estimation for hidden Markov models
    9.5.1 Forward and backward probabilities
    9.5.2 Parameter re-estimation with forward and backward probabilities
    9.5.3 Viterbi training
  9.6 Vector quantization
  9.7 Multi-variate continuous distributions
  9.8 Use of normal distributions with HMMs
    9.8.1 Probability calculations
    9.8.2 Estimating the parameters of a normal distribution
    9.8.3 Baum-Welch re-estimation
    9.8.4 Model initialization
  9.10 Gaussian mixtures
    9.10.1 Calculating emission probabilities
    9.10.2 Re-estimation using the most likely state sequence
    9.10.4 Initialization of Gaussian mixture distributions
    9.10.5 Tied mixture distributions
  9.11 Extension of stochastic models to word sequences
  9.12 Implementing probability calculations
    9.12.1 Using the Viterbi algorithm with probabilities in logarithmic form
    9.12.2 Adding probabilities when they are in logarithmic form
  9.13 Relationship between DTW and a simple HMM
  9.14 State durational characteristics of HMMs
  Chapter 9 summary
  Chapter 9 exercises

10 Introduction to Front-End Analysis for Automatic Speech Recognition
  10.1 Pre-emphasis
  10.3 Frames and windowing
  10.4 Filter banks, Fourier analysis and the mel scale
  10.5 Cepstral analysis
  10.6 Analysis based on linear prediction
  10.7 Dynamic features
  10.8 Capturing the perceptually relevant information
  10.9 General feature transformations
  10.10 Variable-frame-rate analysis
  Chapter 10 summary
  Chapter 10 exercises

11 Practical Techniques for Improving Speech Recognition Performance
  11.1 Robustness to environment and channel effects
    11.2.1 Feature-based techniques
    11.2.2 Model-based techniques
    11.2.3 Dealing with unknown or unpredictable noise corruption
  11.3 Speaker-independent recognition
    11.3.1 Speaker normalization
  11.4 Model adaptation
    11.4.1 Bayesian methods for training and adaptation of HMMs
    11.4.2 Adaptation methods based on linear transforms
  11.5 Discriminative training methods
    11.5.1 Maximum mutual information training
    11.5.2 Training criteria based on reducing recognition errors
  11.6 Robustness of recognizers to vocabulary variation
  Chapter 11 summary
  Chapter 11 exercises

12 Automatic Speech Recognition for Large Vocabularies
  12.1 Historical perspective
  12.3 Speech transcription and speech understanding
  12.4 Speech transcription
  12.5 Challenges posed by large vocabularies
  12.6 Acoustic modelling
    12.6.1 Context-dependent phone modelling
    12.6.2 Training issues for context-dependent models
    12.6.3 Parameter tying
    12.6.4 Training procedure
    12.6.5 Methods for clustering model parameters
    12.6.6 Constructing phonetic decision trees
    12.6.7 Extensions beyond triphone modelling
  12.7 Language modelling
    12.7.1 N-grams
    12.7.2 Perplexity and evaluating language models
    12.7.3 Data sparsity in language modelling
    12.7.4 Discounting
    12.7.5 Backing off in language modelling
    12.7.6 Interpolation of language models
    12.7.7 Choice of more general distribution for smoothing
    12.7.8 Improving on simple N-grams
  12.8 Decoding
    12.8.1 Efficient one-pass Viterbi decoding for large vocabularies
    12.8.2 Multiple-pass Viterbi decoding
    12.8.3 Depth-first decoding
  12.9 Evaluating LVCSR performance
    12.9.1 Measuring errors
    12.9.2 Controlling word insertion errors
    12.9.3 Performance evaluations
  12.10 Speech understanding
    12.10.1 Measuring and evaluating speech understanding performance
  Chapter 12 summary
  Chapter 12 exercises

13 Neural Networks for Speech Recognition
  13.1 The human brain
  13.3 Connectionist models
  13.4 Properties of ANNs
  13.5 ANNs for speech recognition
    13.5.1 Hybrid HMM/ANN methods
  Chapter 13 summary
  Chapter 13 exercises

14 Recognition of Speaker Characteristics
  14.1 Characteristics of speakers
  14.2 Verification versus identification
    14.2.1 Assessing performance
    14.2.2 Measures of verification performance
  14.3 Speaker recognition
    14.3.1 Text dependence
    14.3.2 Methods for text-dependent/text-prompted speaker recognition
    14.3.3 Methods for text-independent speaker recognition
    14.3.4 Acoustic features for speaker recognition
    14.3.5 Evaluations of speaker recognition performance
  14.4 Language recognition
    14.4.1 Techniques for language recognition
    14.4.2 Acoustic features for language recognition
  Chapter 14 summary
  Chapter 14 exercises

15 Applications and Performance of Current Technology
  15.1 Why use speech technology?
  15.3 Speech synthesis technology
  15.4 Examples of speech synthesis applications
    15.4.1 Aids for the disabled
    15.4.2 Spoken warning signals, instructions and user feedback
    15.4.3 Education, toys and games
    15.4.4 Telecommunications
  15.5 Speech recognition technology
    15.5.1 Characterizing speech recognizers and recognition tasks
    15.5.2 Typical recognition performance for different tasks
    15.5.3 Achieving success with ASR in an application
  15.6 Examples of ASR applications
    15.6.1 Command and control
    15.6.2 Dictation
    15.6.4 Data entry and retrieval
    15.6.5 Applications of speaker and language recognition
  15.8 The future of speech technology applications
  Chapter 15 summary
  Chapter 15 exercises

16 Future Research Directions in Speech Synthesis and Recognition
  16.1 Speech synthesis
    16.2.1 Speech sound generation
    16.2.2 Prosody generation and higher-level linguistic processing
  16.3 Automatic speech recognition
    16.3.1 Advantages of statistical pattern-matching methods
    16.3.2 Limitations of HMMs for speech recognition
    16.3.3 Developing improved recognition models
  16.4 Relationship between synthesis and recognition
  16.5 Automatic speech understanding
  Chapter 16 summary
  Chapter 16 exercises

17 Further Reading
  17.1 Books
  17.2 Journals
  17.3 Conferences and workshops
  17.4 The Internet
  17.5 Reading for individual chapters

References
Solutions to Exercises
Glossary
Index