Precise phoneme- and word-level boundary timestamps from a waveform and a transcript — trained end-to-end through a differentiable alignment.
Give it audio plus a transcript as phonemes (.phn), words (.wrd), or even just plain text (.txt) — FALCON does the rest, returning both phoneme- and word-level alignment and a downloadable Praat .TextGrid.
Speech, Language & Deep Learning Lab · arXiv:2606.25460
The cursor sweeps the waveform and each predicted phoneme lights up at its timestamp — it loops on its own. Press Play to hear the audio in sync, or click anywhere on the waveform to seek.
Utterance: “Don't ask me to carry an oily rag like that.” (TIMIT) — the boundaries shown are FALCON's predicted alignment.
FALCON is an end-to-end, fully differentiable neural architecture for phoneme alignment.
Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and making them more robust across diverse acoustic conditions and languages. Forced alignment has not seen comparable progress: traditional HMM–GMM frameworks remain widely adopted and highly competitive. FALCON consists of an encoder that processes the input signal and a decoder that produces alignment decisions. The encoder has two complementary branches, one dedicated to phoneme identity verification and the other to phoneme boundary detection. The decoder is a trainable module based on differentiable soft dynamic programming. The entire system is optimized end-to-end with a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. The proposed approach outperforms the current state of the art in phoneme alignment on hand-annotated English benchmarks, generalizes well to word-level alignment, and generalizes to unseen languages.
A spectral encoder and a contextual encoder turn the raw waveform into frame-level features; two branches verify phoneme identity and detect boundaries (the latter via a margin contrastive score), and a differentiable Soft-DP decoder reads both to emit the alignment. Gradients flow through the entire pipeline, so the encoder, classifier, and decoder are trained jointly.
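To make the idea of a differentiable Soft-DP decoder concrete, here is a minimal sketch of a generic soft dynamic program for monotonic alignment. It is not FALCON's decoder: the function names, the `gamma` temperature, and the use of a single score matrix (ignoring the boundary branch) are all assumptions made for illustration. The only change from ordinary Viterbi alignment is that the hard `max` in the recursion is replaced by a smooth log-sum-exp, which is what lets gradients flow through the alignment in an autodiff framework.

```python
import math

NEG_INF = -math.inf

def smooth_max(a, b, gamma):
    """Smooth maximum gamma * log(exp(a/gamma) + exp(b/gamma)), computed stably."""
    m = max(a, b)
    if m == NEG_INF:
        return NEG_INF
    return m + gamma * math.log(math.exp((a - m) / gamma) + math.exp((b - m) / gamma))

def soft_dp_align(scores, gamma=0.1):
    """Hypothetical helper: monotonic alignment of T frames to K phonemes.

    scores[t][k] is the score of frame t belonging to the k-th transcript
    phoneme. Returns (soft_value, starts), where starts[k] is the first
    frame assigned to phoneme k.
    """
    T, K = len(scores), len(scores[0])
    if T < K:
        raise ValueError("need at least one frame per phoneme")

    # V[t][k]: smoothed best score of any monotonic path ending at (t, k).
    V = [[NEG_INF] * K for _ in range(T)]
    V[0][0] = scores[0][0]
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = V[t - 1][k]                              # remain in phoneme k
            move = V[t - 1][k - 1] if k > 0 else NEG_INF    # advance from k-1
            V[t][k] = scores[t][k] + smooth_max(stay, move, gamma)

    # Hard backtrace over the smoothed table. This approaches the exact
    # Viterbi path only as gamma goes to zero.
    starts = [0] * K
    k = K - 1
    for t in range(T - 1, 0, -1):
        if k > 0 and V[t - 1][k - 1] > V[t - 1][k]:
            starts[k] = t
            k -= 1
    return V[T - 1][K - 1], starts

# Toy input: 6 frames, 3 phonemes, with frames 0-1, 2-4, and 5 favoring
# phonemes 0, 1, and 2 respectively.
frame_labels = [0, 0, 1, 1, 1, 2]
toy_scores = [[0.0 if k == lab else -5.0 for k in range(3)] for lab in frame_labels]
value, starts = soft_dp_align(toy_scores)
print(starts)  # [0, 2, 5]
```

Smaller values of `gamma` make the recursion behave more like a hard maximum, while larger values spread credit across competing paths. In a trainable system the same recursion would be written with tensor operations so that the loss can backpropagate into the encoder.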
Each panel is the demo's own output — waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score — on a real test utterance.
Upload audio + a transcript (.phn / .wrd / .txt), or click a built-in example, and get a boundary table, a Praat .TextGrid, and the time-aligned visualization.
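For readers who want to post-process a boundary table themselves, the following is a minimal sketch of writing interval boundaries to a Praat TextGrid in the long text format. The function name `to_textgrid`, the tier name `phones`, and the example intervals are hypothetical, and the file the demo produces may be laid out differently (for example, with an additional word tier).

```python
def to_textgrid(intervals, tier_name="phones"):
    """Hypothetical helper: render (start_s, end_s, label) tuples as a
    single-tier Praat TextGrid string in the long text format."""
    if not intervals:
        raise ValueError("need at least one interval")
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"

# Made-up boundaries in seconds, for illustration only.
example = [(0.0, 0.12, "d"), (0.12, 0.3, "ow"), (0.3, 0.41, "n")]
textgrid_text = to_textgrid(example)
```

Writing `textgrid_text` to a file with a `.TextGrid` extension should give something Praat can open alongside the audio. Intervals are assumed to be contiguous and sorted by time.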
@article{falcon2026,
title = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
author = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
journal = {arXiv preprint arXiv:2606.25460},
year = {2026},
url = {https://arxiv.org/abs/2606.25460}
}