Travelogue: Defined.ai at ICASSP 2022 – Part 2: My Top 3 Papers of ICASSP 2022
Welcome back, dear readers, to Defined.ai’s highlights of ICASSP 2022!
In part II of our adventure chronicle, I’ll be getting nerdy and highlighting some of the best papers—in my humble opinion—that we read and discussed in Singapore. While there was an excellent official “best papers” track at the conference, here I focus on my personal favorites: papers with objective metrics for text-to-speech (TTS) systems, since Quality of Experience (QoE) is one of our key offerings at Defined.ai, and, of course, data papers. (We are a data company after all!)
Below is a summary of our top three papers, with some thoughts on their promising and innovative contributions to speech processing and acoustics.
1. Speaker Generation
Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby and David Kao
Google Research, USA
This article caught my eye for two reasons:
- First, objective measures for text-to-speech (TTS) need to be explored further. At Defined.ai, we do a great deal of Mean Opinion Score (MOS) testing; reliable objective measures would make for great quality checking, but we still don’t have them. This article takes a stab at it.
- Second, the article is very similar to a paper my student and I published at Interspeech last year, which the authors also cite.
The goal of the article is to generate novel speakers in a self-contained TTS system. Their system, dubbed TacoSpawn, is a Tacotron-based system that takes the speaker embeddings from the training corpus, fits a parametric distribution over them in finite-dimensional space, and samples from that distribution at generation time. The main difference from our article is that our goal was zero-shot learning: a small sample of the target speaker’s voice is seen at inference time, and a TTS voice is built from it.
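To make the idea concrete, here is a minimal sketch of the “fit a distribution over speaker embeddings, then sample” step. This is not the authors’ code: the embedding matrix is a random placeholder, and I use a single diagonal Gaussian where TacoSpawn fits a learned parametric prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the speaker embedding table learned during TTS training:
# one row per training speaker (here 1100 speakers, 128 dimensions).
train_speaker_embeddings = rng.normal(size=(1100, 128))

# Fit a simple diagonal Gaussian to the training embeddings. TacoSpawn fits a
# learned parametric prior; a single Gaussian is the simplest stand-in.
mu = train_speaker_embeddings.mean(axis=0)
sigma = train_speaker_embeddings.std(axis=0)

# Sample embeddings for novel speakers that do not exist in the training set,
# then feed them to the synthesizer in place of a real speaker's embedding.
novel_speaker_embeddings = rng.normal(loc=mu, scale=sigma, size=(10, 128))
print(novel_speaker_embeddings.shape)  # (10, 128) -> ten generated "speakers"
```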
Essentially, the flow is quite similar. What I most liked about this article was the dissection of what they call “d-vectors.” From my reading of the article and a discussion with one of the authors, Matt, this is essentially the same metric as the Speaker Embedding Cosine Similarity (SECS) that we used. What is interesting is that they dissect this metric and compare it for both speaker generation performance and speaker fidelity.
For speaker generation performance, they compare:
- how close a typical training speaker is to other nearby training speakers (s2s)
- how close a typical generated speaker is to nearby training speakers (g2s)
- how close a typical generated speaker is to nearby generated speakers (g2g)
For speaker fidelity, they measure how similar synthesized audio from a typical training speaker is to ground truth from the same speaker (s2t-same), and how similar synthesized audio from a typical training speaker is to ground-truth audio from other nearby speakers (s2t). They test their approach on libriclean—all clean subsets of the LibriTTS corpus—and on two proprietary corpora, enus1100 and en1468.
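As a rough illustration of how those comparisons work, the sketch below computes SECS-style cosine similarities and averages each embedding’s similarity to its nearest neighbors in a pool. The helper names and random embeddings are mine; the paper’s exact definitions of “typical” and “nearby” differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """SECS-style score: cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_nearest_similarity(queries, pool, k=5, exclude_self=False):
    """Average similarity of each query embedding to its k nearest neighbors
    in the pool -- roughly the flavor of the s2s / g2s / g2g comparisons."""
    scores = []
    for q in queries:
        sims = sorted((cosine_similarity(q, p) for p in pool), reverse=True)
        if exclude_self:          # drop the query itself when pool == queries
            sims = sims[1:]
        scores.append(np.mean(sims[:k]))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 128))      # embeddings of training speakers
generated = rng.normal(size=(20, 128))   # embeddings of generated speakers

s2s = mean_nearest_similarity(train, train, exclude_self=True)
g2s = mean_nearest_similarity(generated, train)
g2g = mean_nearest_similarity(generated, generated, exclude_self=True)
print(f"s2s={s2s:.3f}  g2s={g2s:.3f}  g2g={g2g:.3f}")
```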

The results bring some interesting insights, such as s2t-same being lower than s2t, which shows that speaker identity is being learned effectively and the model is performing well. The other metrics are virtually the same across the board, which tells us that the generated speakers have a diversity similar to that of the training set.
Another interesting finding, consistent with our own, was that the objective metrics correlate fairly well, but not perfectly, with MOS. This suggests they are reliable for predicting diverse and natural-sounding voices, but it may not be fair to draw direct conclusions about naturalness when scores are fairly close.
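If you want to sanity-check that correlation on your own evaluations, it is a one-liner with SciPy. The paired scores below are made-up numbers purely to show the mechanics:

```python
from scipy.stats import pearsonr, spearmanr

# Made-up paired scores per voice: an objective similarity metric and the
# crowd-sourced MOS collected for the same stimuli.
objective = [0.62, 0.71, 0.55, 0.80, 0.68, 0.74]
mos = [3.4, 3.9, 3.1, 4.2, 3.7, 3.8]

print("Pearson r:   ", pearsonr(objective, mos)[0])
print("Spearman rho:", spearmanr(objective, mos)[0])
```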

This is encouraging news for objective metrics, but we still have some ground to cover before they can be used in production. You still need ground truth for comparison with any of these metrics, and in this article the authors also use some metadata that we hadn’t, namely gender and locale. They do not explain what impact this had on the generation, but I would expect it contributed to the positive results. Still, it’s exciting to see this correlation, and it makes objective TTS metrics an area to watch in the future.
2. Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions
Chunxi Liu, Michael Picheny, Leda Sari, Pooja Chitkara, Alex Xiao, Xiaohui Zhang, Mark Chou, Andres Alvarado, Caner Hazirbas, Yatharth Saraf
Meta AI, USA
This paper addresses the ever-important issue of bias in Automatic Speech Recognition (ASR)—specifically, gender, age, and skin tone. It has long been assumed that ASR applications, like any ML application that relies on data distributions, are not equally fair across such factors, but testing these assumptions is not always trivial.
The main contribution of this paper is a dataset for benchmarking bias. Here, Meta researchers labeled an already-released dataset, dubbed Casual Conversations, which is composed of 45k videos of approximately one minute each from 3,011 participants. This totals about 846 hours of speech data. The participants explicitly provided their age and gender, while skin tones were manually labeled along the Fitzpatrick Scale. For those unfamiliar, it is a scale from 1 – 6, where 1 is someone who always burns/never tans with ultraviolet light exposure (very fair skin) and 6 is someone who never burns (very dark skin).
I really like this methodology because it is objective. Anyone who understands the scale can look at a picture or video and have little doubt as to the correct classification. On the other hand, what I don’t like about it is that it can’t be used as a good proxy for population groups. There are plenty of people who are ethnic Europeans who have relatively dark skin and people of African descent who have lighter skin. Hispanic, Arabic, and Asian groups can also be placed across the full spectrum.
I am not an anthropologist, but my linguistics background strongly suggests that skin color does not influence speech. What does influence speech are social and regional dialects, among a great deal of other well-studied—and some less-studied—factors. As a speech engineer, I would posit that speakers of American inner-city or rural dialects would be harder to recognize than the likes of George W. Bush or Barack Obama, even though they may fall into the same Fitzpatrick categories.
Still, this is a good proxy for a superficial look at performance—just mind the caveats mentioned above. Some processing was done, such as removing videos with no primary speaker and converting the videos into audio segments containing only the primary speaker’s speech, leaving 572.6 hours for the final dataset. The transcriptions were also normalized, with consistent treatment of disfluencies, acronyms, and so on; these details can be found in the paper.
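To give a flavor of what that kind of normalization involves (the paper defines its own rules; this sketch and its token lists are mine):

```python
import re

# Illustrative only: lowercase, collapse spelled-out acronyms, strip
# punctuation, and drop filler words before scoring WER.
DISFLUENCIES = {"uh", "um", "er", "hmm"}

def normalize(transcript: str) -> str:
    text = transcript.lower()
    text = re.sub(r"\b([a-z])\.", r"\1", text)   # "f.b.i." -> "fbi"
    text = re.sub(r"[^a-z'\s]", " ", text)       # drop remaining punctuation
    tokens = [t for t in text.split() if t not in DISFLUENCIES]
    return " ".join(tokens)

print(normalize("Um, I work at the F.B.I. -- er, I mean the lab."))
# -> "i work at the fbi i mean the lab"
```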
The authors then built four models:
- LibriSpeech model: an RNN-T trained on LibriSpeech, a large corpus built from audiobook recordings;
- Video model (supervised): a model trained on 14k hours of manually transcribed social media videos;
- Video model (semi-supervised): a model trained on 2 million hours of unlabeled social media videos together with the 14k hours of manually transcribed data;
- Video model (semi-supervised teacher): a final teacher model trained on the more than two million hours of social media video.
They used SpecAugment on all of the models to add robustness.

The results show a large bias towards female speakers in general. It’s also really interesting that elderly speakers were well recognized by these models. Maybe it has something to do with the elderly population that typically reads audiobooks or is active on social media, which could be a scientific article in itself. In the industry, we typically assume that the main consumer groups of virtual assistants and similar products (25 – 40, give or take, depending on the application) are well represented. Social media seems to favor even younger generations of late, so the result for older generations was a shocker.
Skin type follows the general trend one would expect, where speakers with darker skin types have a higher Word Error Rate (WER) than those with lighter skin. It was also interesting that the LibriSpeech model had more occurrences of significant WER differences. The authors’ assumption is that the LibriSpeech dataset is less diverse than the social media dataset, which makes sense.
The authors also conducted experiments with fine-tuning on in-domain data, splitting it by skin type. While this sounds like a good idea, the results were mixed, mainly due to the limited amounts of data available for the under-represented groups and the large variability of the data.
All in all, this article is well-written and easy to follow. The authors acknowledge the shortcomings well and provide an interesting benchmark set which should prompt a great deal of future investigation. Taken at face value, this could be a great sanity check for any ASR model you may have in production.
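If you want to run that sanity check on a model of your own, slicing WER by demographic attribute takes only a few lines. Below is a minimal sketch using the open-source jiwer package; the record structure, field names, and example utterances are hypothetical, not from the paper:

```python
from collections import defaultdict
import jiwer  # pip install jiwer

# Hypothetical evaluation records: reference transcript, ASR hypothesis, and
# demographic labels for the speaker (field names and values are made up).
records = [
    {"ref": "turn the lights off", "hyp": "turn the light off",
     "gender": "female", "skin_type": 2},
    {"ref": "play some jazz music", "hyp": "play some jazz music",
     "gender": "male", "skin_type": 5},
    # ... one record per evaluated utterance
]

def wer_by(records, key):
    """Compute WER separately for each value of a demographic attribute."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        refs, hyps = groups[r[key]]
        refs.append(r["ref"])
        hyps.append(r["hyp"])
    return {group: jiwer.wer(refs, hyps) for group, (refs, hyps) in groups.items()}

print(wer_by(records, "gender"))     # e.g. {"female": 0.25, "male": 0.0}
print(wer_by(records, "skin_type"))  # e.g. {2: 0.25, 5: 0.0}
```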
3. MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification
Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan “Honza” Černocký
Brno University of Technology, Czechia
This was another great data paper. Commercially, there are many use cases for multi-channel, far-field speaker verification, but the truth is that collecting data for it is costly for most organizations.
In this paper we again see a new benchmark being presented, and it comes with a number of contributions. Firstly, it provides 4-microphone-array recordings with about 77 hours per microphone. It contains both background noise and reverberation labeled with reference signals, which means that beyond being important for the verification task itself, the data can also be used for other research purposes like dereverberation, denoising, and speech enhancement. The dataset is public, which makes it easy to use and to compare across works as a benchmark. It also extends previous work by including both single- and multi-channel enrollment segments.
The training data is simulated for the microphone array, which is a compromise in favor of cost. It is based on VoxCeleb2, which is heavily used for training speaker embeddings, while the evaluation data includes other sources like LibriSpeech. The researchers selected the cleanest VoxCeleb2 recordings, those with signal-to-noise ratio (SNR) values above 20 dB. This preselection makes it possible to control the noise inserted from sources like the MUSAN noise dataset (5 h), the FMA music dataset (66.3 h), noise recordings from Freesound.org, and self-recorded noises (20.1 h).
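As a side note, inserting noise at a controlled SNR is straightforward once you have clean speech. Here is a generic sketch (my own, not the MultiSV simulation pipeline), with random arrays standing in for real audio:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, then mix.
    A generic sketch of this kind of noise insertion, not MultiSV's pipeline."""
    # Trim or tile the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage with random signals standing in for real audio.
rng = np.random.default_rng(0)
clean = rng.normal(size=16000)   # 1 s of "speech" at 16 kHz
noise = rng.normal(size=8000)
noisy = mix_at_snr(clean, noise, snr_db=20.0)
```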
Development and evaluation trials were based on the VOiCES challenge rules where speech was retransmitted from Librispeech together with babble, television, music, or no background noise. The conditions based on enrollment properties for the experiments were as follows:
- CE (clean enrollment)
- SRE (single-channel retransmitted enrollment)
- MRE (multi-channel retransmitted enrollment)
- and MRE hard (similar to MRE, but with noise playing in the background)

Experiments were performed on two baseline systems. Both use the same embedding extractor and minimum variance distortionless response (MVDR) beamforming; they differ in the front-end. The first is a mask-based beamforming model that estimates per-channel speech and noise masks from the magnitude spectra. The second includes a time-domain enhancement model trained with an SNR loss, whose architecture is a scaled-down Conv-TasNet to reduce the number of trainable parameters.
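For readers unfamiliar with mask-based MVDR, the sketch below shows the textbook reference-channel formulation in NumPy, assuming the speech and noise masks have already been produced by a neural network. It is a generic illustration, not the authors’ implementation:

```python
import numpy as np

def mvdr_weights(stft: np.ndarray, speech_mask: np.ndarray, noise_mask: np.ndarray,
                 ref_channel: int = 0) -> np.ndarray:
    """Textbook mask-based MVDR (reference-channel formulation).

    stft: complex STFT of shape (channels, frames, freq_bins)
    speech_mask, noise_mask: real masks of shape (frames, freq_bins)
    Returns beamforming weights of shape (freq_bins, channels).
    """
    C, T, F = stft.shape
    weights = np.zeros((F, C), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]   # (C, T) for this frequency bin
        # Mask-weighted spatial covariance matrices for speech and noise.
        phi_ss = (speech_mask[:, f] * X) @ X.conj().T / max(speech_mask[:, f].sum(), 1e-8)
        phi_nn = (noise_mask[:, f] * X) @ X.conj().T / max(noise_mask[:, f].sum(), 1e-8)
        phi_nn += 1e-6 * np.eye(C)   # regularize before inversion
        numerator = np.linalg.solve(phi_nn, phi_ss)   # phi_nn^{-1} phi_ss
        weights[f] = numerator[:, ref_channel] / (np.trace(numerator) + 1e-8)
    return weights

def apply_beamformer(stft: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply per-frequency weights; output has shape (frames, freq_bins)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)

# Usage sketch: enhanced = apply_beamformer(multichannel_stft,
#                                           mvdr_weights(multichannel_stft, m_s, m_n))
```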
There are two evaluation sets:
- v1, which is analogous to the VOiCES prescription with an L-shaped array. There is a small problem in that some microphones may have a negative SNR due to a nearby distractor speaker, a refrigerator, or the microphone being behind a corner.
- This is corrected in v2 which replaces the problematic microphones with non-problematic ones. It can be said that v1 is more suitable for meeting-like scenarios, whereas v2 could be used for investigation scenarios, like hidden microphones covering a large space. v1 can generally be considered more difficult due to the problematic cases mentioned above.
This work is important because it releases a ready-to-use benchmark dataset with built-in baselines and can be used for a number of use cases including far-field and multi-speaker verification, as well as audio experiments for speech enhancement, noise robustness, and dereverberation. All of this is made available at: https://github.com/BUTSpeechFit/MultiSV.