A 2022 Speech AI Guide: Key Application Methods and Technologies
A detailed look at AI and ML for speech recognition models.
Customers have been speaking to the machines, websites, and technology in their lives for a long time now. For example, “Why are you taking me this way?” to their in-car GPS, or “How can I contact you?” while browsing a company website. They’re one-sided conversations at best, unless speech recognition technology is being used in one of a variety of ways.
Done badly, speech recognition can frustrate users and alienate customers, but done right it is a powerful tool that can alter the way people interact with machines – and therefore the way customers interact with the products and services around them.
In every industry, at various points along the customer journey, there are opportunities to employ speech technology to simplify operations, facilitate interactions with the customer, and ultimately, increase accessibility for a wider, diverse audience of potential customers.
An intro to speech AI
Speech recognition technology is creating a world in which:
- barriers to completing a purchase are removed: a potential customer buys what they need by simply reading off their shopping list during their morning commute.
- human error is greatly reduced: a doctor, stressed and overworked seeing COVID-19 patients, dictates notes, capturing critical data and saving time spent on each patient.
- self-service is available for buying new products or upgrading an existing service: a new or existing customer can simply go to a company’s site, speak about what they are interested in buying or upgrading, and seamlessly complete the purchase without needing to interact with a human.
What is speech recognition?
Speech recognition refers to technologies that identify and make sense of human speech using large amounts of data and statistics. It works by way of a system of programs and machine learning algorithms that interact with each other, such as pronunciation and acoustic models that “hear” and recognize spoken language, and language models that determine what was said by forming the likeliest interpretation. The technology is also known as Automatic Speech Recognition (ASR); it has existed since the late 20th century, but its accuracy and efficiency have skyrocketed in the 21st thanks to modern AI approaches such as deep learning.
How does speech recognition work?
It works, basically, by taking recordings of human speech gathered into datasets, and breaking these down into smaller and smaller bits of information – from full speech to topical utterances and their corresponding transcriptions (text). From these audio samples and text transcriptions, the technology learns to recognize and interpret more complex speech patterns, vocabulary, and meaning.
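To make that concrete, here is a minimal sketch (in Python, with a synthetic waveform standing in for a real recording) of that first step: cutting an utterance into short frames, turning each frame into simple acoustic features, and keeping its text transcription as the training label. Real systems use richer features such as MFCCs or representations learned by a neural network.

```python
# A minimal, illustrative sketch: frame an utterance, compute toy acoustic
# features, and pair the result with its transcription label.
import numpy as np

SAMPLE_RATE = 16_000                      # 16 kHz is a common rate for speech audio
FRAME_LEN = int(0.025 * SAMPLE_RATE)      # 25 ms analysis window
HOP_LEN = int(0.010 * SAMPLE_RATE)        # 10 ms step between windows

def frame_signal(waveform: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into overlapping 25 ms frames."""
    n_frames = 1 + (len(waveform) - FRAME_LEN) // HOP_LEN
    return np.stack([waveform[i * HOP_LEN : i * HOP_LEN + FRAME_LEN]
                     for i in range(n_frames)])

def log_spectrum_features(frames: np.ndarray) -> np.ndarray:
    """A toy acoustic feature: the log magnitude spectrum of each frame."""
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(FRAME_LEN), axis=1))
    return np.log(spectrum + 1e-8)

# One labelled utterance in a (hypothetical) training set.
waveform = np.random.randn(SAMPLE_RATE * 2)      # 2 seconds of stand-in "audio"
transcription = "please upgrade my mobile plan"  # the paired text label

features = log_spectrum_features(frame_signal(waveform))
print(features.shape)  # (number of frames, features per frame) fed to the model
```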
While that is typically adequate in a lab setting, “understanding” speech in the real world is considerably more complicated. Machines also need to understand utterances spoken in a variety of settings, for example when people are speaking in noisy environments (on the metro or bus, or around construction sites), or when a person is speaking from afar or at a low volume.
Additionally, the technology must consider variations in accent, syntax, local expressions, and different ways of saying the same things within each individual language. Add in the more subjective elements such as intent and sentiment, and it becomes even more complicated for a speech recognition system to perform at its best.
The key uses of speech recognition
Here are some real-life examples of speech recognition technology being used in a few key industries around the world.
Speech AI in Call Centers
The market for AI in call centers is already a large one, and it’s growing fast. Forecasts predict that the call center AI market size will grow to USD 2.8 billion by 2024, with a growth rate of 28.5% from 2019-2024. This reflects the growth of customer service through social media and an increase in the amount of customer data available from social and other platforms.
The conversation about speech technology use in call centers is also quickly evolving. Traditionally, AI implementation has been focused on applying it within the context of cost savings for the organization.
There is a shift now happening as companies begin to recognize the importance of customer experience and customer loyalty. One recent study showed that companies most frequently cite customer service as the main driver for AI implementation, above cost reduction or revenue growth.
From call center routing, to answering basic queries, speech recognition technology in call centers is becoming an essential part of operations. While the cost savings benefits are great from an operational perspective, the real magic of using speech technology in call centers lies in its ability to improve customer experience across the board.
Speech AI in Banking
Major customer priorities within banking right now are security and customer experience. The use of AI in banking, particularly ASR systems, can assist with both.
From the security side, many banks are using ASR to enable payments within mobile and online banking. Voice authentication within mobile banking applications is a major use case, giving customers an easy method of identity verification that complements passwords and 2-factor authentication without their typical hassle.
From the customer service side, an ASR can handle mobile banking tasks and routine customer service inquiries directly, so clients no longer have to wait in long service or support queues to speak to human agents for relatively straightforward resolutions. The result is a much more streamlined process.
Speech AI in Telecommunications
As in many other industries, the biggest value that speech recognition technology can bring to the telecommunications industry revolves around conversational AI. Speech recognition systems that can recognize and interact with casual conversation, and increasingly understand human speech, enhance and add value to existing telecommunication services. They also help to improve overall customer experience, enhance targeted marketing efforts, and allow for self-service. How?
Customers are able to find what they need in a shorter amount of time, and in many cases, can sign up for new services or add-ons without even interacting with a human. Employing self-service virtual assistants, powered by ASR technology, helps with all of the above. For example, in 2017, European telecommunications giant Vodafone introduced their chatbot TOBi that is able to handle customer service requests, from giving information to aiding in the purchase of services.
Speech AI in Healthcare
In a world turned upside down and pushed towards technology adoption by the COVID-19 pandemic, the possibilities for speech recognition technology in the healthcare sector exploded.
Speech recognition has become an essential tool for healthcare providers to spend much less time on data entry, and more time on treating patients. It has facilitated remote screening of symptoms, delivering critical information to patients in times of high confusion, and overall reducing exposure for healthcare providers while still allowing them to provide their patients with essential care. ASR has already played a large role in remote healthcare, and will only continue to evolve.
It also reduces the administrative time spent on Electronic Health Records, removing some of the burden doctors carry from hours at the computer entering data and allowing them to focus on the patient instead. As speech recognition becomes more specialized, dictation AI will further understand common and medical vocabulary, styles of speaking, and more, paving the way for more advanced note-taking that reduces the amount of data entry work required while still recording critical patient data.
Speech collection: defining your needs
The most important thing to remember when preparing your ASR model, whether building it from scratch or fine-tuning an existing one, is that you get out what you put in. High-quality data is the most important element of a successful ASR.
So, it follows that choosing the right training data is the next logical step in ensuring that your system is prepared to perform in the best possible way.
Getting started, the first question is: what type of datasets will you use? Off-the-shelf, or custom?
Off the shelf vs custom datasets
Off-the-shelf data has the advantage of quick turnaround time, as these datasets are already collected, transcribed, (typically) vetted, and ready for use. They can be used immediately for training. Defined.ai’s off-the-shelf datasets are available in a variety of languages, domains, and recording environments, and are ethically sourced, in compliance with international data privacy best practices.
Custom speech data collection
With customized speech data collection, there are two basic types of datasets: Monologue and Dialogue.
As the names suggest, the difference between the two is simply how many speakers are recorded.
Within those two types, it is then necessary to decide what type of conversation you need. Will the speaker or speakers read from a script or speak freely? This is where you have the choice between Scripted or Spontaneous speech data.
In scripted, the speaker(s) will read directly from a specific script, while in spontaneous data collection, the speaker(s) will speak freely about a certain topic or within a specified domain. (For example: making a call to customer support to upgrade a service).
Within dialogue speech collection, you can specify further: should the dialogue happen between two humans, or one human and a machine, the machine being an Interactive Voice Response System or IVR?
Once you have these parameters defined, the data collection begins.
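As a simple illustration, those parameters can be written down as a small specification before collection starts. The class, field, and value names below are purely hypothetical and not part of any Defined.ai tooling.

```python
# An illustrative (not production) way to capture a speech collection spec.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SpeakerSetup(Enum):
    MONOLOGUE = "monologue"   # a single recorded speaker
    DIALOGUE = "dialogue"     # two parties in conversation

class SpeechStyle(Enum):
    SCRIPTED = "scripted"         # speakers read from a prepared script
    SPONTANEOUS = "spontaneous"   # speakers talk freely within a domain

class DialoguePartner(Enum):
    HUMAN = "human"   # human-to-human conversation
    IVR = "ivr"       # a human talking to an Interactive Voice Response system

@dataclass
class CollectionSpec:
    setup: SpeakerSetup
    style: SpeechStyle
    domain: str                                # e.g. "telecom customer support"
    language: str                              # e.g. "en-US"
    partner: Optional[DialoguePartner] = None  # only meaningful for dialogues

# Example: spontaneous dialogues between a customer and an IVR about service upgrades.
spec = CollectionSpec(
    setup=SpeakerSetup.DIALOGUE,
    style=SpeechStyle.SPONTANEOUS,
    domain="customer support: service upgrades",
    language="en-US",
    partner=DialoguePartner.IVR,
)
print(spec)
```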
Transcription and Validation
Once the data is collected, the process doesn’t end there. The datasets must first be transcribed (audio and the corresponding transcription are key to AI training). A complete speech dataset contains not only the audio data but transcription “labels” that help train the speech recognition model to properly identify what words are in phrases by what they sound like. The audio, transcriptions, and other aspects of the data are then validated to ensure quality.
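As a rough sketch of what that validation step can involve, the snippet below runs a few automatic checks over a manifest of audio/transcription pairs. The manifest fields, file names, and thresholds are assumptions for illustration, not a description of Defined.ai’s actual pipeline.

```python
# Illustrative checks over a manifest of audio/transcription pairs.
from typing import Dict, List

def validate_entry(entry: Dict) -> List[str]:
    """Return a list of problems found in one audio/transcription pair."""
    problems = []
    if not entry.get("audio_path", "").endswith((".wav", ".flac")):
        problems.append("audio file missing or in an unexpected format")
    duration = entry.get("duration_sec", 0.0)
    if not 0.5 <= duration <= 60.0:
        problems.append(f"suspicious duration: {duration}s")
    text = entry.get("transcription", "").strip()
    if not text:
        problems.append("empty transcription label")
    elif duration and len(text.split()) / duration > 6.0:
        problems.append("transcript implausibly dense for the audio length")
    return problems

manifest = [  # hypothetical entries
    {"audio_path": "utt_0001.wav", "duration_sec": 4.2,
     "transcription": "i would like to upgrade my plan"},
    {"audio_path": "utt_0002.wav", "duration_sec": 3.1, "transcription": ""},
]

for entry in manifest:
    issues = validate_entry(entry)
    print(entry["audio_path"], "->", "OK" if not issues else "; ".join(issues))
```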
Where to find speech recognition data?
The bottom line: data is no longer an obscure unit of information; it is contextualized by the process that produced it and the people who contributed to it.
“Crowdsourcing” data is a great way to improve the diversity of AI training datasets. Known contributors can be actively targeted to optimize diversity to train models that speak to everyone, everywhere. Or in other words, using a diverse crowd allows us to collect, annotate and validate datasets with a wide distribution of demographics.
At Defined.ai, we have made diversity a core pillar in our product offering. With our global crowd of over 500,000 contributors and market-leading workflow automations, we can provide the diverse training data required to fuel speech recognition, natural language processing (NLP), and computer vision technologies.

Besides our large, global crowd (who represent over 50 languages and dialects from over 70 countries), we are currently using (or working to implement) algorithms to ensure diversity of data in the following areas:
Gender: Automatically ensuring a mix of gender representation in each dataset.
Speaker uniqueness and consistency: Ensuring speaker diversity and consistency, by automatically detecting different voices in a large dataset, along with age distributions and accents.
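As a rough illustration of what such a check might do, the sketch below scans per-utterance metadata and flags a dataset in which one speaker or one gender dominates. The field names and thresholds are assumed, not taken from a production system.

```python
# An illustrative balance check over per-utterance speaker metadata.
from collections import Counter

def diversity_report(metadata, max_speaker_share=0.05, max_gender_share=0.6):
    """metadata: list of dicts with 'speaker_id' and 'gender' per utterance."""
    total = len(metadata)
    speakers = Counter(m["speaker_id"] for m in metadata)
    genders = Counter(m["gender"] for m in metadata)
    warnings = []
    for spk, count in speakers.items():
        if count / total > max_speaker_share:
            warnings.append(f"speaker {spk} contributes {count / total:.0%} of utterances")
    for gender, count in genders.items():
        if count / total > max_gender_share:
            warnings.append(f"gender '{gender}' makes up {count / total:.0%} of the data")
    return {"speakers": len(speakers), "gender_mix": dict(genders), "warnings": warnings}

sample = [  # hypothetical metadata for four utterances
    {"speaker_id": "s1", "gender": "female"},
    {"speaker_id": "s2", "gender": "male"},
    {"speaker_id": "s1", "gender": "female"},
    {"speaker_id": "s3", "gender": "female"},
]
print(diversity_report(sample))
```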
Quality control in gathering data
As mentioned above, quality control is an extremely important element in the speech data collection process. At Defined.ai we use several standards to ensure that clients are getting high quality data. They include:
Dynamic quality checks
As those in the crowd are completing tasks, we constantly evaluate the quality of data output. Dynamic quality checks occur during the course of tasks completed by crowd workers, where real-time quality checks are inserted into the task flow to assess data consistency and quality. Because these checks are indistinguishable from any given task, they give us an objective way of seamlessly measuring data accuracy and consistency without sacrificing quality or disrupting the task workflow, and they help us manage the time and expectations of our crowd.
Ground Truth Data
This refers to datasets that are the gold standard for us, or ground truth data to measure all other datasets against. This means that we have standards of what “good quality” is – a benchmark for high quality data.
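A simplified sketch of how the two ideas above can work together: known-answer “gold” items are mixed invisibly into a worker’s task queue, and accuracy on those items becomes the quality score. The structures, ratio, and scoring below are illustrative only.

```python
# Illustrative hidden gold-item checks for crowd quality control.
import random

gold_items = [  # tasks whose correct transcription is already known and trusted
    {"audio_id": "gold_01", "expected": "turn on the lights"},
    {"audio_id": "gold_02", "expected": "cancel my subscription"},
]
regular_items = [{"audio_id": f"task_{i:03d}", "expected": None} for i in range(10)]

def build_task_queue(regular, gold, gold_ratio=0.2):
    """Interleave hidden gold checks so they look like any other task."""
    n_gold = max(1, int(len(regular) * gold_ratio))
    queue = regular + random.choices(gold, k=n_gold)
    random.shuffle(queue)
    return queue

def score_worker(answers, queue):
    """Accuracy measured on gold items only; regular items have no reference yet."""
    checked = [(a, t) for a, t in zip(answers, queue) if t["expected"] is not None]
    correct = sum(a.strip().lower() == t["expected"] for a, t in checked)
    return correct / len(checked) if checked else None

queue = build_task_queue(regular_items, gold_items)
answers = ["turn on the lights"] * len(queue)  # pretend the worker answered every task this way
print(f"gold-item accuracy: {score_worker(answers, queue):.0%}")
```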
What makes a good speech dataset?
Now the model is built, and as you begin training it on some basic speech data, it’s time to evaluate: is the data good?
Word Error Rate
Word Error Rate (WER) is among the most common ways to measure the performance of your ASR system. It is, very simply, the number of errors the ASR makes when transcribing a block of audio – the words it substitutes, deletes, or inserts – divided by the number of words in the reference transcription. The lower the WER, the better the ASR system is at recognizing words.
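For reference, WER can be computed with a standard word-level edit distance. The short sketch below uses invented example sentences:

```python
# Word Error Rate via word-level edit distance (substitutions, deletions, insertions).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1    # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "please upgrade my mobile plan today"
hypothesis = "please upgrade my mobile plans"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # 2 errors / 6 words ≈ 0.33
```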
While the WER is one good metric indicative of how your system performs, it doesn’t delve into exactly why it is performing that way. Improving WER means looking at a variety of factors, and it again comes down to using high-quality, diverse training datasets. Here are a few considerations when sourcing these quality datasets.
Background noise
In an ideal world, each customer using your ASR would be perfectly audible, without background noise or other poor listening conditions. The truth, however, is that imperfect conditions are a reality. Many people using your voice search, calling in to your customer service line, or using your voice application will be in situations with lots of background noise, cross talk, or poor internet or phone connections.
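One common way to prepare a model for those conditions is to mix background noise into clean training audio at a controlled signal-to-noise ratio (SNR). The sketch below uses synthetic signals as stand-ins for real recordings.

```python
# Illustrative noise augmentation: mix noise into clean audio at a chosen SNR.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `speech`."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))
    return speech + scaled_noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))  # stand-in for clean speech
background = rng.normal(size=16_000)                         # stand-in for street noise
noisy = mix_at_snr(clean, background, snr_db=10)             # a 10 dB SNR training example
# `noisy` keeps the same transcription label as `clean`, so the model learns to cope with noise.
```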
Industry Jargon
You wouldn’t train a banking customer service ASR on data collected for a medical helpline, would you? Use training datasets that are representative of your domain or industry. There is specific industry jargon that only people in your industry use. Also, being trained in jargon will help with understanding intent, a key part of ensuring that your ASR can understand customers.
Accents/slang variations
There will always be variation in the way people say things, as accents vary across countries or even states or cities. Diverse datasets will include a variety of accents and slang that users might not give a second thought about in their day to day speech, but can trip up an ASR system that’s never “heard” or trained on them before.
Diversity in Data Is Key
Diverse data includes representation of all genders, age groups, accents, ethnicities, and any other factors that dictate the way people speak.
Diversity in data should become a priority for companies training an ASR model for several key reasons:
- Brand reputation: models that are unable to understand and respond to all users will damage the company’s brand image.
- Customer retention: if customers feel they aren’t being heard, they will go to a competitor.
- Customer acquisition: by using diverse datasets, companies are ensuring valued customer segments are not being overlooked.
- Ethics in AI: diverse datasets address the larger issue of AI bias in ASR models and beyond.
The voice of the user
Aside from literally ensuring that the voice of the customer is heard, speech recognition technology saves time, provides essential support to already overburdened systems and increases accessibility for people all around.
A clear line is rapidly developing between two types of businesses: those willing to be at the frontier of speech technology adoption and those that must rush to catch up later. The competitive advantage clearly lies with the former.
Your users will always voice their wants, needs, and expectations. One question then remains: is your organization, along with your AI systems, prepared to listen?