Fully Differentiable Neural Forced Alignment
via Soft Dynamic Programming

Precise phoneme- and word-level boundary timestamps from a waveform and a transcript — trained end-to-end through a differentiable alignment.

Give it audio plus a transcript as phonemes (.phn), words (.wrd), or even just plain text (.txt) — FALCON does the rest, returning both phoneme- and word-level alignment and a downloadable Praat .TextGrid.

Rotem Rousso  ·  Eyal Cohen  ·  Joseph Keshet

Speech, Language & Deep Learning Lab · arXiv:2606.25460

Watch it align

The cursor sweeps the waveform and each predicted phoneme lights up at its timestamp — it loops on its own. Press Play to hear the audio in sync, or click anywhere on the waveform to seek.


Utterance: “Don't ask me to carry an oily rag like that.” (TIMIT) — the boundaries shown are FALCON's predicted alignment.

Overview

FALCON is an end-to-end, fully differentiable neural architecture for phoneme alignment.

Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and making them more robust across diverse acoustic conditions and languages. Forced alignment has not seen comparable progress: traditional HMM–GMM frameworks remain widely adopted and highly competitive. FALCON consists of an encoder that processes the input signal and a decoder that produces alignment decisions. The encoder has two complementary branches, one dedicated to phoneme identity verification and the other to phoneme boundary detection. The decoder is a trainable module based on differentiable soft dynamic programming. The entire system is optimized end-to-end with a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. The approach outperforms the current state of the art in phoneme alignment on hand-annotated English benchmarks, generalizes well to word-level alignment, and transfers to unseen languages.

How it works

A spectral encoder and a contextual encoder turn the raw waveform into frame-level features; two branches verify phoneme identity and detect boundaries (the latter via a margin contrastive score), and a differentiable Soft-DP decoder reads both to emit the alignment. Gradients flow through the entire pipeline, so the encoder, classifier, and decoder are trained jointly.

FALCON architecture: spectral and contextual encoders feeding a soft dynamic-programming decoder
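
To make the idea of a soft dynamic-programming decoder concrete, here is a minimal sketch of the general technique, not FALCON's actual decoder. It assumes a hypothetical score matrix `scores[t][n]` (a log-score for frame `t` belonging to the `n`-th phoneme of the transcript) and a monotonic path that either stays in the current phoneme or advances to the next one at each frame. Replacing the hard `max` of the classic recursion with a smooth maximum (a temperature-scaled log-sum-exp, as in soft-DTW-style relaxations) gives a value that varies smoothly with every score. The function names and the `gamma` temperature are illustrative assumptions; in an autodiff framework the same recursion could be differentiated, which pure Python does not do.

```python
import math

NEG_INF = float("-inf")


def smooth_max(values, gamma):
    """gamma * log(sum(exp(v / gamma))); tends to max(values) as gamma -> 0."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    m = max(finite)  # subtract the max for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in finite))


def soft_alignment_value(scores, gamma=1.0):
    """Soft value of the best monotonic frame-to-phoneme alignment.

    scores[t][n] is a hypothetical log-score for frame t belonging to the
    n-th transcript phoneme. Assumes at least as many frames as phonemes.
    """
    T, N = len(scores), len(scores[0])
    V = [[NEG_INF] * N for _ in range(T)]
    V[0][0] = scores[0][0]  # the path must start in the first phoneme
    for t in range(1, T):
        for n in range(N):
            stay = V[t - 1][n]
            advance = V[t - 1][n - 1] if n > 0 else NEG_INF
            V[t][n] = scores[t][n] + smooth_max([stay, advance], gamma)
    return V[T - 1][N - 1]  # the path must end in the last phoneme


def hard_alignment(scores):
    """Classic DP with max instead of smooth max, plus backtracking."""
    T, N = len(scores), len(scores[0])
    V = [[NEG_INF] * N for _ in range(T)]
    came_from_advance = [[False] * N for _ in range(T)]
    V[0][0] = scores[0][0]
    for t in range(1, T):
        for n in range(N):
            stay = V[t - 1][n]
            advance = V[t - 1][n - 1] if n > 0 else NEG_INF
            came_from_advance[t][n] = advance > stay  # ties prefer staying
            V[t][n] = scores[t][n] + max(stay, advance)
    # Walk back from the last frame to recover the phoneme index per frame.
    path, n = [], N - 1
    for t in range(T - 1, -1, -1):
        path.append(n)
        if came_from_advance[t][n]:
            n -= 1
    return V[T - 1][N - 1], path[::-1]


# Toy scores (invented for illustration): 4 frames, 2 phonemes.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
best, path = hard_alignment(scores)
print(best, path)  # 0.0 [0, 0, 1, 1]
print(soft_alignment_value(scores, gamma=1.0))
```

On this toy input the hard path switches phonemes at frame 2, and the soft value sits slightly above the hard one because it also accounts for the two worse paths. Lowering `gamma` pulls the soft value toward the hard maximum, which is the usual trade-off between smooth gradients and sharp decisions.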

Example alignments

Each panel is the demo's own output — waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score — on a real test utterance.

English · TIMIT · read speech, phoneme-level.
Dutch · IFA · zero-shot cross-lingual, phoneme-level.
German · PHONDAT · zero-shot, phoneme-level.
Hebrew · zero-shot, word-level, romanized (no G2P model).

Run it in your browser

Upload audio + a transcript (.phn / .wrd / .txt), or click a built-in example, and get a boundary table, a Praat .TextGrid, and the time-aligned visualization.

▶ Open the live demo
Demo: boundary tables and downloadable TextGrid
Demo: time-aligned visualizations
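
A Praat .TextGrid is a plain-text file, so a boundary table can be turned into one with a few lines of code. The sketch below is not the demo's exporter; it assumes boundaries are available as `(start_seconds, end_seconds, label)` tuples and writes a single interval tier in Praat's long text format. The function name `to_textgrid`, the tier name, and the example intervals are hypothetical placeholders, not FALCON output.

```python
def to_textgrid(intervals, tier_name="phones"):
    """Render (start, end, label) tuples as a one-tier Praat TextGrid string.

    Assumes intervals are sorted, contiguous, and given in seconds.
    """
    xmin = intervals[0][0]
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Invented placeholder intervals, for illustration only.
example = [(0.0, 0.12, "d"), (0.12, 0.25, "ow"), (0.25, 0.31, "n")]
print(to_textgrid(example))
```

Writing the returned string to a file with a `.TextGrid` extension should give something Praat can open alongside the audio. A word tier would follow the same pattern as a second `item` with `size = 2`.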

Citation

@article{falcon2026,
  title   = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
  author  = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
  journal = {arXiv preprint arXiv:2606.25460},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.25460}
}