FALCON — Fully Differentiable Neural Forced Alignment

Watch it align

The cursor sweeps the waveform and each predicted phoneme lights up at its timestamp — it loops on its own. Press Play to hear the audio in sync, or click anywhere on the waveform to seek.


Utterance: “Don't ask me to carry an oily rag like that.” (TIMIT) — the boundaries shown are FALCON's predicted alignment.

Overview

FALCON is an end-to-end, fully differentiable neural architecture for phoneme alignment.

Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and making them more robust across diverse acoustic conditions and languages. Forced alignment has not seen comparable progress: traditional HMM–GMM frameworks remain widely adopted and highly competitive. FALCON addresses this gap with a model that consists of an encoder, which processes the input signal, and a decoder, which produces alignment decisions. The encoder has two complementary branches: one verifies phoneme identity and the other detects phoneme boundaries. The decoder is a trainable module based on differentiable soft dynamic programming. The entire system is optimized end-to-end with a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. The approach outperforms the current state of the art in phoneme alignment on hand-annotated English benchmarks at tolerances of 25 ms and above, achieves strong word-level results, and generalizes to unseen languages.

How it works

A spectral encoder and a contextual encoder turn the raw waveform into frame-level features; two branches verify phoneme identity and detect boundaries (the latter via a margin contrastive score), and a differentiable Soft-DP decoder reads both to emit the alignment. Gradients flow through the entire pipeline, so the encoder, classifier, and decoder are trained jointly.
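The role of a margin contrastive score in boundary detection can be illustrated with a generic pairwise hinge loss. This page does not state FALCON's loss formula, so everything below is an assumption made for illustration: the function name, the `margin` parameter, and the choice to compare every boundary frame with every steady-state frame. The sketch only shows the general idea of pushing boundary-frame scores above steady-state scores by a fixed margin.

```python
def margin_contrastive_loss(frame_scores, is_boundary, margin=1.0):
    """Generic pairwise hinge loss (an illustration, not FALCON's loss).

    frame_scores: one boundary score per frame (higher = more boundary-like).
    is_boundary:  one bool per frame, True where a reference boundary falls.
    The loss is zero once every boundary frame scores at least `margin`
    above every steady-state frame.
    """
    boundary = [s for s, b in zip(frame_scores, is_boundary) if b]
    steady = [s for s, b in zip(frame_scores, is_boundary) if not b]
    if not boundary or not steady:
        return 0.0  # nothing to contrast
    total = sum(max(0.0, margin - (p - n)) for p in boundary for n in steady)
    return total / (len(boundary) * len(steady))


# Hypothetical scores for three frames, with the boundary at the middle frame.
print(margin_contrastive_loss([0.1, 2.0, 0.0], [False, True, False]))
```

In a trainable system the same expression would be written with tensor operations so that its gradient reaches the boundary branch.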

FALCON architecture: spectral and contextual encoders feeding a soft dynamic-programming decoder
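The Soft-DP idea can be illustrated with a small, framework-free sketch. It is not FALCON's decoder: this page does not give the recurrence, so the stay-or-advance transitions, the temperature `gamma`, and the function names below are assumptions chosen for illustration. The sketch replaces the hard max of a Viterbi-style monotonic alignment with a smooth log-sum-exp maximum, which is what makes the alignment score differentiable with respect to the frame-level scores.

```python
import math


def smooth_max(values, gamma):
    """gamma * log(sum(exp(v / gamma))); approaches max(values) as gamma -> 0."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    top = max(finite)  # subtract the max for numerical stability
    return top + gamma * math.log(sum(math.exp((v - top) / gamma) for v in finite))


def soft_alignment_score(frame_scores, gamma=1.0):
    """Smoothed score over all monotonic alignments (illustrative sketch).

    frame_scores[t][k] is the log-score of frame t belonging to the k-th
    phoneme of the transcript. Paths start in phoneme 0 at the first frame,
    end in the last phoneme at the last frame, and at each step either stay
    in the current phoneme or advance to the next one.
    """
    num_frames, num_phones = len(frame_scores), len(frame_scores[0])
    prev = [-math.inf] * num_phones
    prev[0] = frame_scores[0][0]
    for t in range(1, num_frames):
        cur = [-math.inf] * num_phones
        for k in range(num_phones):
            stay = prev[k]
            advance = prev[k - 1] if k > 0 else -math.inf
            cur[k] = frame_scores[t][k] + smooth_max([stay, advance], gamma)
        prev = cur
    return prev[-1]


# Hypothetical scores: 4 frames, 2 phonemes, true boundary before frame 2.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
print(soft_alignment_score(scores, gamma=1.0))   # smooth score
print(soft_alignment_score(scores, gamma=1e-3))  # close to the hard Viterbi score
```

With `gamma = 1.0` the recurrence is the familiar forward algorithm; as `gamma` shrinks, the result approaches the hard Viterbi score. Written in an autodiff framework, the same recurrence would let gradients reach whatever network produced `frame_scores`.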

Results

Results are from the paper (arXiv:2606.25460). Specialist models are trained on the target English corpus; the joint model is a single model trained on TIMIT and Buckeye together. Multilingual results are zero-shot: the models see no target-language training data. Accuracy is the percentage of reference boundaries matched within the given tolerance t, in milliseconds.
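The tolerance-based accuracy used throughout the tables can be sketched as follows. The sketch assumes that predicted and reference boundaries are paired one-to-one by index (forced alignment works from a known phoneme sequence) and that times are given in seconds. The paper's exact matching rule may differ, and the function name and example times are hypothetical.

```python
def boundary_accuracy(reference_s, predicted_s, tolerance_ms):
    """Percentage of reference boundaries matched within tolerance_ms.

    reference_s, predicted_s: boundary times in seconds, paired by index
    (an assumption of this sketch).
    """
    if len(reference_s) != len(predicted_s):
        raise ValueError("reference and predicted boundaries must be paired")
    if not reference_s:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1
        for ref, pred in zip(reference_s, predicted_s)
        if abs(ref - pred) * 1000.0 <= tolerance_ms
    )
    return 100.0 * hits / len(reference_s)


# Hypothetical boundaries with errors of 5, 30, 60, and 0 ms.
reference = [0.10, 0.50, 1.20, 2.00]
predicted = [0.105, 0.53, 1.26, 2.00]
for tolerance in (10, 25, 50, 100):
    print(tolerance, boundary_accuracy(reference, predicted, tolerance))
```

Because each tolerance column counts every boundary matched at a tighter tolerance as well, accuracy can only stay the same or rise from left to right in a table row.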

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

Dataset	Model	t≤10 ms	t≤25 ms	t≤50 ms	t≤100 ms
TIMIT	MFA	38.6	72.3	81.1	84.6
	FALCON specialist	37.66	83.88	94.85	98.62
	FALCON joint	34.70	82.62	94.91	98.60
Buckeye	MFA	35.3	60.6	68.9	72.7
	FALCON specialist	29.69	69.93	90.07	97.40
	FALCON joint	28.87	69.40	89.53	97.13

Phoneme-Level: Unseen Multilingual Generalization Accuracy [%]

Test set	Model	t≤10 ms	t≤15 ms	t≤20 ms	t≤25 ms	t≤50 ms	t≤100 ms
Dutch — IFA	FALCON joint	26.85	36.16	44.56	51.17	69.94	84.11
	FALCON specialist	26.86	35.79	43.85	50.34	68.68	83.22
	MFA	11.01	14.70	19.05	21.80	33.90	51.02
German — PHONDAT	FALCON joint	25.63	34.12	41.87	49.07	70.04	84.58
	FALCON specialist	25.08	33.37	40.76	47.43	68.27	82.44
	MFA	20.60	31.75	37.17	45.83	66.78	79.19
Hebrew	FALCON joint	21.98	30.10	36.91	42.78	63.07	80.41
	FALCON specialist	21.03	27.78	34.30	39.79	59.38	77.76

Word-Level Alignment Accuracy [%]: Comparative Analysis

Dataset	Model	t≤10 ms	t≤25 ms	t≤50 ms	t≤100 ms
TIMIT	FALCON spec (MFA-G2P)	49.22	81.79	93.04	98.37
	FALCON joint (MFA-G2P)	49.50	80.60	92.86	98.46
	MFA	41.60	72.80	89.40	97.40
	MMS	18.60	43.50	75.70	94.70
	WhisperX	22.40	52.70	82.40	94.20
	Nvidia-Canary-1b	9.23	23.11	44.23	72.81
Buckeye	FALCON spec (MFA-G2P)	50.06	77.85	91.51	96.63
	FALCON joint (MFA-G2P)	50.42	77.98	91.01	96.55
	MFA	39.80	69.90	84.90	91.80
	MMS	25.00	52.70	75.00	87.90
	WhisperX	18.80	43.10	67.40	77.40
	Nvidia-Canary-1b	8.06	18.83	36.31	63.29

Word-Level: Unseen Multilingual Generalization Accuracy [%]

Dataset	Model	t≤10 ms	t≤25 ms	t≤50 ms	t≤100 ms
German — PHONDAT	FALCON (MFA-G2P)	44.20	68.48	86.12	95.11
	MFA	29.9	65.4	82.1	94.3
	MMS	21.8	44.3	74.9	91.8
Dutch — IFA	FALCON (MFA-G2P)	26.38	45.15	61.16	76.49
	MFA	4.7	7.3	11.6	19.0
	MMS	16.0	37.9	62.9	76.6
Hebrew	FALCON	31.91	56.72	75.18	87.89
	MMS	14.3	41.3	76.5	94.7

Example alignment

The panel is the demo's own output — waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score — on a real TIMIT test utterance.

English alignment (TIMIT): read speech, phoneme-level. “Don't ask me to carry an oily rag like that.”

Run it in your browser

Upload audio + a transcript (.phn / .wrd / .txt), or click a built-in example, and get a boundary table, a Praat .TextGrid, and the time-aligned visualization.


Demo — boundary tables and downloadable TextGrid
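For readers who want to post-process a boundary table themselves, here is a minimal sketch that serializes labeled intervals into Praat's long TextGrid text format. The function name, the tier name, and the example labels and times are illustrative assumptions, not output of the demo, and the demo's own files may be laid out differently (for example, with several tiers).

```python
def to_textgrid(intervals, tier_name="phones"):
    """Serialize (start_s, end_s, label) tuples as a one-tier Praat TextGrid.

    intervals must be sorted and contiguous; times are in seconds.
    """
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, label) in enumerate(intervals, start=1):
        escaped = label.replace('"', '""')  # Praat doubles embedded quotes
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Made-up intervals for illustration only.
example = [(0.0, 0.12, "d"), (0.12, 0.25, "ow"), (0.25, 0.31, "n")]
print(to_textgrid(example))
```

Saving the returned string to a file with a `.TextGrid` extension should let it be opened in Praat next to the corresponding audio.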

Citation

@article{falcon2026,
  title   = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
  author  = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
  journal = {arXiv preprint arXiv:2606.25460},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.25460}
}