DeepWDM generates automatic measurements of word durations. The input to the system is a sound file of arbitrary length containing a single word, and the output is a TextGrid with the duration of the word. The tool provides three trained RNN models for this task. The trained models do not require phonetic or orthographic transcription of the words as input.
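The TextGrid output can be illustrated in a few lines of Python. The sketch below writes Praat's plain-text TextGrid format with a single interval tier marking the word; the tier name, label, and function name are illustrative assumptions, not necessarily what DeepWDM actually emits:

```python
def word_textgrid(total_dur, word_start, word_end, label="word"):
    """Render a Praat TextGrid with one interval tier marking a word.

    Times are in seconds. The tier holds up to three intervals: silence
    before the word, the word itself, and silence after it.
    (Illustrative sketch; DeepWDM's real output format may differ.)
    """
    intervals = [(0.0, word_start, ""),
                 (word_start, word_end, label),
                 (word_end, total_dur, "")]
    # Drop zero-length edge intervals (word at the very start/end of file).
    intervals = [iv for iv in intervals if iv[1] > iv[0]]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {total_dur}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        '        name = "word"',
        '        xmin = 0',
        f'        xmax = {total_dur}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (lo, hi, text) in enumerate(intervals, 1):
        lines += [f'        intervals [{i}]:',
                  f'            xmin = {lo}',
                  f'            xmax = {hi}',
                  f'            text = "{text}"']
    return "\n".join(lines) + "\n"
```

Opening the resulting file in Praat shows a single "word" tier whose labeled interval spans the detected word, from which the duration can be read off directly.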
AutoVowelDuration generates automatic measurements of vowel durations. The system takes as input an arbitrary-length segment of an acoustic signal containing a single vowel that is preceded and followed by consonants, and outputs the duration of the vowel. A DNN-based algorithm (Adi et al., 2015) and a structured prediction-based algorithm (Adi et al., 2016) were designed for this task. Critically, the trained model can estimate vowel durations without phonetic or orthographic transcription.
The vowel duration annotator has been compared to the current state-of-the-art alignment system for segment-level annotation: the FAVE forced aligner (Rosenfelder et al., 2011), based on the Penn Phonetics Lab Forced Aligner (Yuan & Liberman, 2008). Our automatic annotation system performs with a high degree of accuracy: it matches the vowel boundaries selected by manual annotation more closely than the FAVE aligner, and it recovers the same parameters of a statistical model fit to manual measurements (Adi et al., 2015; Adi et al., 2016).
Many studies are concerned with the measurement of voice onset time (VOT) of stop consonants. While most work on automatic measurement of VOT has focused on positive VOT, common in American English stops, in many languages VOT can be negative, reflecting a period of voicing prior to the release of the stop burst. This tool addresses the accurate estimation of both positive and negative VOT. The input to the algorithm is a speech segment of arbitrary length containing a single word-initial stop consonant, and the output is the burst onset, the onset of voicing of the following vowel, and the time of the prevoicing onset (if prevoicing is present). Manually labeled data was used to train a recurrent neural network that models the dynamic temporal behavior of the input signal and outputs the events' onsets (and the duration of the VOT).
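Given the predicted event times, the VOT itself follows directly: it is the lag from the burst onset to the onset of voicing, and it is negative when prevoicing precedes the burst. A minimal sketch (the function and argument names are illustrative, not the tool's actual API):

```python
def vot_from_events(burst_onset, voicing_onset, prevoicing_onset=None):
    """Compute VOT (in seconds) from predicted event times.

    Positive VOT: voicing of the following vowel begins after the burst.
    Negative VOT: voicing starts before the burst (prevoicing), so the
    lag from prevoicing onset to burst onset is reported as negative.
    (Illustrative helper; the tool's real output fields may differ.)
    """
    if prevoicing_onset is not None:
        return prevoicing_onset - burst_onset   # negative: voicing leads the burst
    return voicing_onset - burst_onset          # positive: aspiration interval
```

For example, a burst at 0.10 s with vowel voicing at 0.15 s gives a VOT of +50 ms, while prevoicing starting at 0.04 s before a burst at 0.10 s gives a VOT of -60 ms.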
Comparisons to manually annotated data show that the VOT system meets or exceeds the performance of most existing systems (Henry, Sonderegger, & Keshet, 2012). Because the VOT system accepts flexible input, it is robust in measurements where variable speaking rates, VOT durations, or speech errors are anticipated (Goldrick et al., 2016).
Formant frequency estimation and tracking are among the most fundamental problems in speech processing. Speech analyses require accurate estimation of formants at a stationary point in time (e.g., the midpoint of a vowel), as well as dynamic tracking of formant frequencies across a signal. Traditionally, formant estimation and tracking have been done using ad hoc signal-processing methods. DeepFormants (Dissen & Keshet, 2016) is instead based on machine learning techniques trained on an annotated corpus of read speech (Deng et al., 2006). Our acoustic signal representation is composed of LPC-based cepstral coefficients computed with a range of model orders (numbers of LPC poles), together with pitch-synchronous cepstral coefficients. Two deep network architectures serve as learning algorithms: a deep feed-forward network for the estimation task and a recurrent neural network for the tracking task. The performance of our methods compares favorably with mainstream LPC-based implementations and state-of-the-art tracking algorithms. Preliminary results suggest that, like existing systems such as DARLA and FAVE (Reddy & Stanford, 2015), our system can capture dialect-level differences in the tense-lax distinction of front vowels between speakers of Northern and Southern varieties of American English.
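As background on the representation, LPC-based cepstral coefficients can be computed generically by fitting a linear predictor with the Levinson-Durbin recursion and then converting the predictor polynomial to cepstra with the standard recursion. The NumPy sketch below illustrates that textbook pipeline; it is not DeepFormants' actual implementation, which additionally varies the model order and adds pitch-synchronous analysis:

```python
import numpy as np

def lpc(frame, order):
    """Fit an LPC model A(z) = 1 + a_1 z^-1 + ... + a_p z^-p to one
    frame via the Levinson-Durbin recursion on the autocorrelation."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update lower-order terms
        a[i] = k
        err *= 1.0 - k * k                   # residual prediction error
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients [1, a_1, ..., a_p] to the first
    n_ceps cepstral coefficients of the all-pole spectrum 1/|A(z)|^2."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

In practice, a frame of speech is windowed, `lpc` is run for each model order in the chosen range, and `lpc_to_cepstrum` turns each fitted polynomial into a fixed-length feature vector for the network.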