Available Tools

Word Duration

DeepWDM generates automatic measurements of word durations. The input to the system is a sound file of arbitrary length containing a single word. The system outputs a TextGrid with the duration of the word. The tool has three RNN models available for this task. The trained model does not require phonetic or orthographic transcription of the words as input.
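A system of this kind typically reduces to a frame-level decision followed by a simple span computation. The sketch below is our own illustration of that last step, not DeepWDM's code: given per-frame word/non-word labels from a segmentation model (we assume a 10 ms hop, which may differ from the tool's actual setting), the word duration is the span of the contiguous run of word frames.

```python
FRAME_STEP_SEC = 0.01  # assumed 10 ms analysis hop (illustrative, not DeepWDM's actual value)

def word_duration(frame_labels, frame_step=FRAME_STEP_SEC):
    """Return (onset_sec, offset_sec, duration_sec) for a single word,
    given per-frame binary labels (1 = word, 0 = silence)."""
    word_frames = [i for i, label in enumerate(frame_labels) if label == 1]
    if not word_frames:
        raise ValueError("no word frames detected")
    onset = word_frames[0] * frame_step
    offset = (word_frames[-1] + 1) * frame_step
    return onset, offset, offset - onset

# e.g. 5 silence frames, 30 word frames, 5 silence frames:
labels = [0] * 5 + [1] * 30 + [0] * 5
print(word_duration(labels))  # ≈ (0.05, 0.35, 0.30): a 300 ms word
```

In a real pipeline the resulting onset/offset pair would be written out as an interval tier in the TextGrid.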

DeepWDM: the word tooth pronounced by a male.


Vowel Duration

AutoVowelDuration generates automatic measurements of vowel durations. The system takes as input an arbitrary-length segment of an acoustic signal containing a single vowel preceded and followed by consonants, and outputs the duration of the vowel. A DNN-based algorithm (Adi et al., 2015) and a structured prediction-based algorithm (Adi et al., 2016) were designed for this task. Critically, the trained model can estimate vowel durations without phonetic or orthographic transcription.

The vowel duration annotator has been compared to the current state-of-the-art alignment system for segment-level annotation: the FAVE forced aligner (Rosenfelder et al., 2011), based on the Penn Phonetics Lab Forced Aligner (Yuan & Liberman, 2008). Our automatic annotation system performs with a high degree of accuracy: it matches the vowel boundaries selected by manual annotation more closely than the FAVE aligner does, and it recovers the same parameters of a statistical model based on manual measurements (Adi et al., 2015; Adi et al., 2016).
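The comparison above boils down to measuring how far automatic vowel boundaries fall from manual ones. The following is a sketch of that kind of evaluation (our own illustration, with an assumed 10 ms tolerance, not the papers' exact scripts): mean absolute boundary deviation, plus the share of boundaries within tolerance of the manual annotation.

```python
def boundary_agreement(auto, manual, tol=0.010):
    """auto, manual: parallel lists of (onset_sec, offset_sec) per vowel token.
    Returns (mean absolute deviation in sec, fraction of boundaries within tol)."""
    deviations = []
    for (a_on, a_off), (m_on, m_off) in zip(auto, manual):
        deviations += [abs(a_on - m_on), abs(a_off - m_off)]
    mean_dev = sum(deviations) / len(deviations)
    within_tol = sum(d <= tol for d in deviations) / len(deviations)
    return mean_dev, within_tol

# Two hypothetical tokens, automatic vs. manual boundaries (seconds):
auto = [(0.102, 0.251), (0.300, 0.480)]
manual = [(0.100, 0.250), (0.305, 0.475)]
mean_dev, within = boundary_agreement(auto, manual)
print(mean_dev, within)  # ≈ 0.00325 sec mean deviation, all boundaries within 10 ms
```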

Adi, Y., Keshet, J., & Goldrick, M. (2015). Vowel duration measurement using deep neural networks. In IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP).
Adi, Y., Keshet, J., Cibelli, E., Gustafson, E., Clopper, C., & Goldrick, M. (2016). Automatic measurement of vowel duration via structured prediction. Manuscript submitted for publication.
Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5), 3878.
Rosenfelder, I., Fruehwald, J., Evanini, K., & Yuan, J. (2011). FAVE (Forced Alignment and Vowel Extraction) Program Suite. http://fave.ling.upenn.edu

AutoVowelDuration: the word goose pronounced by a male.

AutoVOT and DeepVOT

Many studies are concerned with the measurement of voice onset time (VOT) of stop consonants. While most work on automatic measurement of VOT has focused on positive VOT, common in American English stops, in many languages VOT can be negative, reflecting a period of voicing prior to the release of the stop burst. This tool addresses accurate estimation of both positive and negative VOT. The input to the algorithm is a speech segment of arbitrary length containing a single word-initial stop consonant, and the output is the burst onset, the onset of voicing of the following vowel, and the time of the prevoicing onset (if prevoicing is present). Manually labeled data was used to train a recurrent neural network that models the dynamic temporal behavior of the input signal and outputs the events' onsets (and the duration of VOT).
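The mapping from the three predicted events to a signed VOT value is conventional, and the sketch below illustrates it (our own code, not the tool's): positive VOT is the lag from the burst to the voicing of the following vowel; when prevoicing is present, VOT is negative and measured from the prevoicing onset to the burst.

```python
def vot_from_events(burst_onset, voicing_onset, prevoicing_onset=None):
    """All event times in seconds; returns VOT in milliseconds."""
    if prevoicing_onset is not None:
        # Prevoiced stop: voicing begins before the burst, so VOT < 0.
        return (prevoicing_onset - burst_onset) * 1000.0
    # Plain/aspirated stop: lag from burst to vowel voicing, VOT > 0.
    return (voicing_onset - burst_onset) * 1000.0

print(vot_from_events(0.120, 0.185))         # ≈ +65 ms (aspirated stop)
print(vot_from_events(0.120, 0.125, 0.050))  # ≈ -70 ms (prevoiced stop)
```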

Comparisons to manually-annotated data show that the VOT system is able to meet or exceed performance of most existing systems (Henry, Sonderegger, & Keshet, 2012). Because the VOT system is capable of flexible input, it is robust in measurements where variable speaking rates, VOT durations, or speech errors are anticipated (Goldrick et al., 2016).

Adi, Y., Keshet, J., Dmitrieva, O., & Goldrick, M. (2016). Automatic measurement of voice onset time and prevoicing using recurrent neural networks. In The 17th Annual Conference of the International Speech Communication Association (Interspeech), San Francisco, CA.
Goldrick, M., Keshet, J., Gustafson, E., Heller, J., & Needle, J. (2016). Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production. Cognition, 149, 31-39.
Sonderegger, M., & Keshet, J. (2012). Automatic discriminative measurement of voice onset time. Journal of the Acoustical Society of America, 132(6), 3965-3979.
Henry, K., Sonderegger, M., & Keshet, J. (2012). Automatic measurement of positive and negative voice onset time. In The 13th Annual Conference of the International Speech Communication Association (Interspeech), Portland, OR.

AutoVOT: the word peach pronounced by a male.


DeepFormants

Formant frequency estimation and tracking are among the most fundamental problems in speech processing. Speech analyses require the accurate estimation of formants at a stationary point in time (e.g., the midpoint of a vowel), as well as dynamic tracking of frequencies across a signal. Traditionally, formant estimation and tracking have been done with ad-hoc signal processing methods. DeepFormants (Dissen & Keshet, 2016) is instead based on machine learning techniques trained on an annotated corpus of read speech (Deng et al., 2006). Our acoustic signal representation is composed of LPC-based cepstral coefficients with a range of model orders (number of LPC poles) and pitch-synchronous cepstral coefficients. Two deep network architectures are used as learning algorithms: a deep feed-forward network for the estimation task and a recurrent neural network for the tracking task. The performance of our methods compares favorably with mainstream LPC-based implementations and state-of-the-art tracking algorithms. Preliminary results suggest that, like existing systems such as DARLA and FAVE (Reddy & Stanford, 2015), our system can capture dialect-level differences in the tense-lax distinction of front vowels between speakers of Northern and Southern varieties of American English.
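The LPC-based representation above starts from linear-prediction coefficients. A minimal pure-Python Levinson-Durbin recursion (a textbook sketch for illustration, not DeepFormants' feature extractor) recovers those coefficients from the signal's autocorrelation; the sanity check at the bottom fits a model of order 2 to a synthetic second-order autoregressive signal with known coefficients.

```python
import random

def autocorr(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns predictor coefficients
    a[1..p] such that x[n] ≈ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

# Sanity check on a synthetic AR(2) signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(20000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
a1, a2 = levinson_durbin(autocorr(x, 2), 2)
print(a1, a2)  # a1 ≈ 1.3, a2 ≈ -0.6
```

Production formant trackers then derive formant candidates from the LPC polynomial's roots; the cepstral coefficients the paragraph mentions are a further transformation of these coefficients.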

Formants - Table 2
Performance of the formant estimation. Mean absolute error in Hz relative to the hand annotations.

Formants - Table 3
Performance of the formant tracker. Mean absolute error in Hz relative to the hand annotations.
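The metric behind both tables is straightforward; a short sketch of our own, matching the captions (not the paper's evaluation script):

```python
def mean_abs_error_hz(estimated, reference):
    """Mean absolute error in Hz between estimated and hand-annotated
    formant values, measured over parallel per-frame (or per-vowel) lists."""
    assert len(estimated) == len(reference)
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)

# Hypothetical F1 estimates against hand annotations (Hz):
f1_est = [512.0, 498.0, 505.0]
f1_ref = [500.0, 500.0, 500.0]
print(mean_abs_error_hz(f1_est, f1_ref))  # ≈ 6.33 Hz
```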

Dissen, Y., & Keshet, J. (2016). Formant estimation and tracking using deep learning. In The 17th Annual Conference of the International Speech Communication Association (Interspeech), San Francisco, CA.
Deng, L., Cui, X., Pruvenok, R., Chen, Y., Momen, S., & Alwan, A. (2006). A database of vocal tract resonance trajectories for research in speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1.
Reddy, S., & Stanford, J. N. (2015). Toward completely automated vowel extraction: Introducing DARLA. Linguistics Vanguard.