Blog post

How to Ensure Quality When Crowdsourcing AI Training Data

For artificial intelligence (AI) to function as envisaged, it needs to be fueled by high-quality data. However, accessing this high-quality data is one of the biggest barriers to the adoption of AI.

Machine learning (ML) teams were early to harness the power of crowdsourcing to obtain the swathes of data needed to successfully train AI models. However, although crowdsourcing is a brilliant solution, it presents its own challenges.

Gathering and annotating data from a global crowd of contributors means that it is extremely difficult to control the quality of the data collected.

Human error and spammers are just some of the issues that can affect the quality of the data and severely impact algorithm accuracy, frustrating end-users and deterring widespread adoption. In fact, a report by the Harvard Business Review claims that “poor data quality is enemy number one to the widespread, profitable use of machine learning”.

So how can we track and measure the quality of speech training data when gathered from hundreds or thousands of contributors during crowdsourcing? Read on to find out.

Ensuring a Large Enough Crowd

The crowd quality team is continuously working to understand who its annotators are and what type of work they are best suited for. The reality is that not everybody is suited to every type of job. Most contributor tasks follow an 80:20 principle: 20% of contributors complete 80% of the work, while the remaining 80% of contributors complete only 20%.

Davide Rovati, Senior Project Manager, says this level of engagement is normal. However, it does mean a crowd needs to be large enough to account for it.

“Not everyone is suited for every task. Not everyone is interested in every task for many different reasons,” he said.

Rovati explains that different jobs carry different cognitive loads. Named entity tagging and transcription jobs have a heavy cognitive load, so not many people have the patience to complete them. However, these jobs pay better, so the people who can do them find them far more engaging. On the other side of the coin, speech collection is much easier, but less well paid. “The key is to ensure the crowd-base is large enough to handle all cases,” said Rovati.
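The 80:20 pattern described above is easy to verify on real task logs. The sketch below uses hypothetical per-contributor totals, purely for illustration:

```python
# Hypothetical per-contributor totals of completed units; real numbers vary.
contributions = [820, 610, 400, 95, 30, 20, 15, 5, 3, 2]

total = sum(contributions)
# Take the most active 20% of contributors (at least one person).
top = sorted(contributions, reverse=True)[: max(1, len(contributions) // 5)]
share = sum(top) / total
print(f"Top 20% of contributors produced a {share:.2f} share of the units")
```

With these made-up numbers, the two most active contributors (the top 20% of a ten-person crowd) account for well over two-thirds of the output, which is the shape of engagement the 80:20 principle predicts.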

Fair Pay

One way to recruit enough people to a crowd is to pay them fairly. In fact, the topic of ethical human-in-the-loop platforms is gaining momentum. By optimizing data collection and labelling pipelines, the company not only keeps the price low for the client, but also keeps the pay fair for the contributor.

Testing the Crowd for Quality: Language & Qualification Tests

When it comes to data generation, Rui Correia, Lead Data Scientist, explains that the first step is to align the characteristics of a piece of audio, for example, with the client’s requirements.

To ensure native language proficiency, contributors undergo rigorous language tests, created and validated by native speakers to cover colloquial speech and idiomatic nuances.

Contributors also need to pass qualification tests to ensure they’ve understood what they need to do after accessing the task’s instructions. Once they are accepted into the job, contributors are continually tested to ensure they are performing according to the desired standards.

Testing the Work for Quality: Gold-Standard Tests

Gold-standard tests are one of the most important ways we ensure quality datasets. These tests are created and agreed upon by domain experts or linguists and are based on a ‘ground truth’ for a given project.

While collecting data from contributors, the platform strategically and silently inserts gold-standard tests (essentially real-time audits, or RTAs) into project workflows as a way of controlling quality. These tests automatically assess whether contributors truly understand the job and can produce data that meets the KPIs the customer has defined.

If a contributor fails a succession of RTAs, it means they are either not paying attention, don’t understand the job, or have given up. Such contributors are blocked from making further contributions.
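The blocking rule described above can be sketched as a simple check over a contributor’s audit history. The consecutive-failure threshold here is a hypothetical policy knob, not a published value:

```python
def should_block(results, max_consecutive_failures=3):
    """Block a contributor after a run of consecutive gold-test failures.

    `results` is a chronological list of booleans: True means the
    contributor passed a hidden gold-standard item (real-time audit),
    False means they failed it.
    """
    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak >= max_consecutive_failures:
            return True
    return False

# A contributor who fails three audits in a row gets blocked;
# isolated failures between passes do not trigger the block.
should_block([True, False, False, False])          # → True
should_block([False, True, False, True, False])    # → False
```

In practice the threshold and the audit frequency would be tuned per job, since a heavy cognitive load makes occasional failures more likely even for good contributors.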

“Our gold-standard tests assess both the crowd member and the datasets,” said Chris Tung, Director of Customer Solutions. “This not only ensures crowd members are executing according to the instructions, but also ensures the end product is meeting the KPIs defined in the job deliverables.”

Processes like the gold-standard tests ensure specialist-level performance, but at scale.

But why not hire professionals to complete recording and annotation jobs? Tung says the idea will never work.

“We collected and transcribed almost 2,000 hours of German speech dialog for one client. Consider how many German specialists there are in the world (not many), and how much they would charge ($30–$40 per hour). It would have taken us months to find the specialists and we would have spent the entire budget paying them.

“Instead, we give the job to the crowd, and our quality system will make sure only the people who are performing at an acceptable level will be allowed to perform the job,” he said.

Human Validation

Although machine learning is fundamental to the company’s processes, it also employs humans to carry out validation tasks, ensuring data collection and annotation reflect both the real world and natural human behavior.

One contributor records an audio clip and another contributor validates that recording. The validator receives the original sentence along with the recording and has to say whether the two match. This voting system is a long-established practice in crowdsourcing.
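When several validators judge the same recording, the verdict reduces to a majority vote. A minimal sketch, with hypothetical vote data:

```python
from collections import Counter

def validate_recording(votes):
    """Majority vote over validator judgements.

    `votes` holds True/False judgements of whether the recording
    matches the prompt sentence. Returns the accepted verdict and
    the fraction of validators who agreed with it.
    """
    counts = Counter(votes)
    verdict, n = counts.most_common(1)[0]
    return verdict, n / len(votes)

# Four of five validators say the recording matches the sentence.
verdict, agreement = validate_recording([True, True, False, True, True])
# verdict → True, agreement → 0.8
```

A real pipeline would also weight votes by each validator’s track record, which the machine-learning QA section below touches on.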

Machine Learning QA

With proprietary algorithms constantly enriched by the wealth of curated data that passes through them (around 500,000 units a day), the company uses machine learning to identify patterns and monitor crowd behavior down to the millisecond. For example, tasks completed too quickly might mean a contributor isn’t spending enough time on them, while tasks that take too long might mean the contributor is distracted or doesn’t understand the question.

The company also uses techniques like redundancy to ensure expert-level performance. Here, different people are asked the same question until the answer is agreed upon. In practice, one contributor transcribes ‘I’, another ‘I’m’, and another ‘eye’. Which is the correct word? The question is put to further contributors until the right answer emerges.
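The redundancy loop above can be sketched as collecting answers until one of them reaches a consensus threshold. The thresholds here are illustrative, not production values:

```python
from collections import Counter

def resolve_by_redundancy(answers, min_votes=3, threshold=0.6):
    """Accept an answer once enough contributors agree on it.

    Consumes contributor responses one at a time; once at least
    `min_votes` answers are in and the leading answer holds at least
    `threshold` of the votes, it is accepted.
    """
    seen = []
    for answer in answers:
        seen.append(answer)
        if len(seen) >= min_votes:
            top, n = Counter(seen).most_common(1)[0]
            if n / len(seen) >= threshold:
                return top
    return None  # no consensus reached with the answers available

# "I" vs "I'm" vs "eye": two more votes settle the disagreement.
resolve_by_redundancy(["I", "I'm", "eye", "I", "I"])  # → "I"
```

The design choice is that asking several non-experts is cheaper than finding one expert, and the aggregation recovers the expert-level answer.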

Real-time audits and, in the case of automatic speech recognition (ASR), word confidence levels also rate contributors on the quality of their work. So if a sufficient number of highly rated contributors provide the same transcription, for example, an algorithm will determine that that transcription is correct.

There are some cases in which a highly rated contributor’s transcription is at odds with all the others provided. In these cases, too, an algorithm decides which transcription is correct. Algorithms like these are crucial for collecting quality data at scale.
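One way to sketch this rating-weighted resolution: instead of counting raw votes, sum the historical quality ratings behind each candidate transcription. The transcriptions and ratings below are hypothetical:

```python
from collections import defaultdict

def weighted_transcription(candidates):
    """Pick the transcription with the highest summed contributor rating.

    `candidates` maps each submitted transcription to the historical
    quality ratings (0..1) of the contributors who produced it.
    """
    totals = defaultdict(float)
    for text, ratings in candidates.items():
        totals[text] = sum(ratings)
    return max(totals, key=totals.get)

best = weighted_transcription({
    "turn left at the lights": [0.95, 0.9],       # two highly rated contributors
    "turn left at the light":  [0.6, 0.55, 0.5],  # three weaker ones agree too
})
# best → "turn left at the lights"
```

Here two strong contributors outweigh three weak ones, which is how a lone highly rated transcriber can still lose to a consensus of well-rated peers, and vice versa.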

Christopher Shulby, Director of Machine Learning Engineering, says: “There are few great transcribers, for example, in existence. In these cases, we can employ multiple weak transcribers who, thanks to our algorithms aggregating results, actually score higher than one really good transcriber. It’s an excellent solution to ensure scalability, and is what sets us apart from our competitors.”

The company also employs language models to test the plausibility of contributor outputs and flag outputs that need further review, and it aligns acoustic models and pronunciation dictionaries with contributor outputs to ensure accuracy.
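A language-model plausibility check can be illustrated with a toy stand-in: score a candidate sentence by how many of its word bigrams appear in a reference corpus, and flag low scorers for review. A real system would use a proper language model; this sketch and its corpus are assumptions for illustration only:

```python
from collections import Counter

def bigram_plausibility(sentence, corpus):
    """Fraction of the sentence's word bigrams ever seen in the corpus.

    Low scores suggest an implausible output that should be routed
    to further human review.
    """
    seen = Counter()
    for line in corpus:
        words = line.split()
        seen.update(zip(words, words[1:]))
    words = sentence.split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return sum(1 for b in bigrams if b in seen) / len(bigrams)

corpus = ["turn left at the lights", "turn right at the corner"]
ok = bigram_plausibility("turn left at the corner", corpus)   # every bigram seen
odd = bigram_plausibility("eye turn the left", corpus)        # no bigram seen
```

The plausible recombination scores 1.0 while the garbled transcription scores 0.0, so a threshold between the two would send only the latter back for review.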

Contributors’ Agreement

Given the subjective and varied nature of human perception, there can sometimes be ambiguity as to how something should be labeled. For objective tasks, such as named entity tagging, the company employs agreement statistics such as Cohen’s Kappa or Krippendorff’s Alpha to reduce ambiguity in task design and to understand individual contributors’ performance (typically requiring a Kappa above 0.8 for outstanding work). For tasks based on opinion, the company uses metrics such as the Pearson coefficient and intra-annotator agreement to guarantee consistency.
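Cohen’s Kappa corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch over hypothetical entity-tagging labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance given each
    annotator's label distribution.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Two annotators tagging tokens as entity ("ENT") or not ("O").
a = ["ENT", "O", "O", "ENT", "O", "O", "ENT", "O"]
b = ["ENT", "O", "O", "ENT", "O", "ENT", "ENT", "O"]
kappa = cohens_kappa(a, b)  # → 0.75
```

These two annotators agree on 7 of 8 tokens (87.5% raw agreement), but chance alone would produce 50% agreement here, so the kappa of 0.75 falls short of the 0.8 bar mentioned above and would prompt a look at the task design or the annotators.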

Transparency – Quality Data

To train bias-free AI models, machine learning teams need access to transparent, bias-aware data. DefinedData, our online training data marketplace, offers unprecedented levels of metadata about each dataset available. Teams building or augmenting AI models are able to view the gender, age, and phonetic distribution of each dataset, as well as detailed demographic information.

With this type of information at their fingertips, developers can choose specific data relevant to the target audience of their AI models. To view these datasets, click here.

Sourcing High-Quality Data

As demonstrated above, there are many advantages to using the services of an experienced data provider:

1. Tried and Tested Processes
With tried and tested processes and workflows already in place, data providers can collect data much faster.

2. Crowd Knowledge
If companies attempt to set up their own crowdsourcing platforms, they run the risk of recruiting spammers or contributors who consistently provide poor results. The problem is that companies don’t know the history of the people in their crowd. Established data partners, however, have created systems that store the historical performance of contributors, who are scored according to the number of jobs they have completed and the quality of the work submitted.

3. Quality Assurance Tests
Data providers have numerous quality processes in place to ensure datasets are of the quality required to train highly accurate AI models. Data processes are measured and monitored using practices and processes like gold standard datasets, machine learning QA, human validation, language and qualification tests, and contributors’ agreements.

To access high-quality data now, visit our online catalog of ready-to-use datasets here. Or contact us to request a custom-built AI training dataset.

