Energy-Based Models

with Applications to Speech and Language Processing

ICASSP2022 Tutorial

14:00-17:30 (UTC+8), 22 May, 2022


Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are radically different from other popular probabilistic models, which are self-normalized (i.e., their probabilities sum to one by construction), such as hidden Markov models (HMMs), auto-regressive models, Generative Adversarial Nets (GANs) and Variational Auto-encoders (VAEs). In recent years, EBMs have attracted increasing interest not only from core machine learning but also from application domains such as speech, vision and natural language processing (NLP), with significant theoretical and algorithmic progress. To the best of our knowledge, there has been no tutorial on EBMs with applications to speech and language processing. The sequential nature of speech and language also presents special challenges and requires treatment different from that of fixed-dimensional data (e.g., images).
This tutorial presents a systematic introduction to energy-based models, covering both algorithmic progress and applications in speech and language processing. It is organized into four chapters. First, we will introduce the basics of EBMs, including classic models, recent models parameterized by neural networks, and learning algorithms ranging from classic methods to the most advanced ones. The next three chapters will present how to apply EBMs in three different scenarios: 1) EBMs for language modeling, 2) EBMs for speech recognition and natural language labeling, and 3) EBMs for semi-supervised natural language labeling. In addition, we will introduce open-source toolkits to help the audience become familiar with the techniques for developing and applying energy-based models.
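
To make the contrast with self-normalized models concrete, the following is a minimal sketch of a toy EBM over three binary variables. The energy function, the parameter values and all function names here are illustrative assumptions, not taken from the tutorial materials. The sketch only shows that the unnormalized scores exp(-E(x)) do not sum to one, and that dividing by the partition function Z (computed here by brute-force enumeration, which is feasible only for tiny state spaces) yields a proper distribution.

```python
import itertools
import math


def energy(x, weights, coupling):
    """Hypothetical toy energy: lower energy means a more probable state."""
    linear = sum(w * xi for w, xi in zip(weights, x))
    pairwise = coupling * sum(x[i] * x[i + 1] for i in range(len(x) - 1))
    return -(linear + pairwise)


def partition_function(n, weights, coupling):
    """Sum exp(-E(x)) over all 2**n binary states (brute force)."""
    return sum(
        math.exp(-energy(x, weights, coupling))
        for x in itertools.product((0, 1), repeat=n)
    )


def probability(x, weights, coupling):
    """Normalized probability p(x) = exp(-E(x)) / Z."""
    z = partition_function(len(x), weights, coupling)
    return math.exp(-energy(x, weights, coupling)) / z


# Illustrative parameter values (assumed, not from the source).
weights = [0.5, -0.2, 0.1]
coupling = 0.3

states = list(itertools.product((0, 1), repeat=3))
unnormalized_total = sum(math.exp(-energy(x, weights, coupling)) for x in states)
normalized_total = sum(probability(x, weights, coupling) for x in states)

print("sum of exp(-E(x)):", unnormalized_total)  # equals Z, not 1
print("sum of p(x):      ", normalized_total)    # 1 up to rounding
```

For realistic models the enumeration in `partition_function` is intractable, which is why Chapter 1 covers learning by Monte Carlo methods and noise-contrastive estimation.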


Zhijian Ou

Tsinghua University, Beijing


ICASSP2022 Tutorial Energy-Based Models with Applications to Speech and Language Processing.pdf


Chapter 1: Basics for EBMs

Chapter 2: EBMs for language modeling

Chapter 3: EBMs for speech recognition and natural language labeling

Chapter 4: EBMs for semi-supervised natural language labeling

If you cannot access YouTube, the videos are also available on Bilibili.


Chapter 1: Basics for EBMs
  • Probabilistic graphical modeling (PGM) framework and EBM model examples (classic & modern)
  • Learning EBMs by Monte Carlo methods
  • Learning EBMs by noise-contrastive estimation (NCE)
Chapter 2: EBMs for language modeling
  • Trans-dimensional random field (TRF) LMs for speech recognition
  • Residual energy-based models for text generation
  • Electric: an energy-based cloze model for representation learning over text
Chapter 3: EBMs for speech recognition and natural language labeling
  • CRFs as conditional EBMs
  • CRFs for speech recognition
  • CRFs for sequence labeling in NLP
Chapter 4: EBMs for semi-supervised natural language labeling
  • Upgrading EBMs to Joint EBMs (JEMs) for fixed-dimensional data
  • Upgrading CRFs to Joint random fields (JRFs) for sequential data
  • JRFs for semi-supervised natural language labeling


  1. D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
  2. Eric Fosler-Lussier, et al. Conditional random fields in speech, audio, and language processing. Proceedings of the IEEE, 2013.
  3. Zhijian Ou. A Review of Learning with Deep Generative Models from Perspective of Graphical Modeling. arXiv:1808.01630, 2018.
  4. Bin Wang, Zhijian Ou, Zhiqiang Tan. Trans-dimensional Random Fields for Language Modeling. Annual Meeting of the Association for Computational Linguistics (ACL Long Paper), 2015.
  5. Bin Wang, Zhijian Ou, Zhiqiang Tan. Learning Trans-dimensional Random Fields with Applications to Language Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018, vol.40, no.4, pp.876-890.
  6. Bin Wang, Zhijian Ou. Language modeling with neural trans-dimensional random fields. ASRU, 2017.
  7. Bin Wang, Zhijian Ou. Learning neural trans-dimensional random field language models with noise-contrastive estimation. ICASSP, 2018.
  8. Bin Wang, Zhijian Ou. Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation. SLT, 2018.
  9. Silin Gao, Zhijian Ou, Wei Yang, Huifang Xu. Integrating discrete and neural features via mixed-feature trans-dimensional random field language models. ICASSP (Oral), 2020.
  10. Yunfu Song, Zhijian Ou. Learning Neural Random Fields with Inclusive Auxiliary Generators. arXiv:1806.00271, 2018.
  11. Hongyu Xiang, Zhijian Ou. CRF-based Single-stage Acoustic Modeling with CTC Topology. ICASSP (Oral Paper), 2019.
  12. Kai Hu, Zhijian Ou, Min Hu, Junlan Feng. Neural CRF Transducers for Sequence Labeling. ICASSP, 2019.
  13. Yunfu Song, Zhijian Ou, Zitao Liu, Songfan Yang. Upgrading CRFs to JRFs and its benefits to sequence modeling and labeling. ICASSP, 2020.
  14. Yunfu Song, Huahuan Zheng, Zhijian Ou. An empirical comparison of joint-training and pre-training for domain-agnostic semi-supervised learning via energy-based models. IEEE Machine Learning for Signal Processing Workshop (MLSP), 2021.
  15. Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. ICLR, 2020.
  16. Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. ICLR, 2020.
  17. Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Pre-training transformers as energy-based cloze models. EMNLP, 2020.