
Speed Your Time to Market with Off-the-Shelf Data You Can Trust

With DefinedCrowd’s new offering, high-quality data is now just a click away 

When launching AI products, speed to market is critical to gaining a competitive advantage. However, collecting, annotating and validating the data required to train successful AI can take weeks, if not months.

For companies looking to build baseline models, adapt existing models with domain corpora, extend models with extra languages, or test and evaluate current models, off-the-shelf data provides a viable alternative to custom-built datasets.  

To help these companies rapidly bring their AI initiatives to market, DefinedCrowd is proud to announce the launch of DefinedData, an online marketplace of AI datasets available for on-demand purchase.

The need for continuous access to highly accurate data 

“Machine learning teams building AI models have always faced one particularly pressing problem, and that is continuous access to highly accurate data. When large enterprise tech firms want to launch their AI initiatives into the market quickly, they simply don’t have the time to collect and validate the data required to do so. DefinedData aims to solve this problem by providing customers with access to an extensive library of speech datasets that will rapidly accelerate their AI programs,” said DefinedCrowd’s VP of engineering, Andrew Michalik.   

DefinedData will provide customers with on-demand access to high-quality data available in multiple languages, domains, recording types and pricing options. The initial offering includes scripted recordings in English, Italian, Portuguese, German, French and Dutch across the healthcare, entertainment, hospitality, automotive and generic domains.

By May 2021, the library is expected to grow to include over 25,000 hours of both scripted and spontaneous recordings in the above languages as well as in Spanish, Hindi and Japanese. Additional domains will include banking, insurance, telecom, retail and IVR. 

Customers can instantly request samples, or request to purchase one or more datasets for immediate use.

Quality is key to success 

“Although strong algorithms require lots of data, they also require accurate and high-quality data,” said Michalik. “After all, the success of AI models depends on the quality of data used to fuel them. For those firms considering off-the-shelf data, they can rest assured that quality is at the core of everything we do at DefinedCrowd.”

To ensure the highest levels of accuracy and authenticity, the primary quality control metric for DefinedCrowd’s speech datasets is Word Error Rate (WER), which is less than 5% for scripted recordings and less than 10% for spontaneous recordings. For speech collection, quality is ensured by measuring accuracy levels in gender distribution, age distribution, noise conditions (noisy vs. silent), nativeness (native vs. non-native, and the fluency level of non-native speakers), domain (accuracy in staying on topic) and segmentation (for spontaneous collections).
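
For readers unfamiliar with the metric, Word Error Rate is commonly defined as the number of word substitutions, deletions and insertions needed to turn a transcript into its reference, divided by the number of words in the reference. The post does not describe how DefinedCrowd computes it, so the following is only a minimal sketch of that common definition. The function name, the lowercase-and-split normalization and the sample sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption, not a documented
    DefinedCrowd procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five.
reference = "the patient needs a refill"
transcript = "the patient need a refill"
print(f"WER: {word_error_rate(reference, transcript):.0%}")
```

Under this definition, a dataset-level figure such as "less than 5%" would typically be obtained by summing the errors and reference word counts over many recordings, although the aggregation method used here is not stated.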

“Whether you’re building a prototype, or minimum viable product; evaluating or benchmarking current models; building synthetic datasets; or simply needing quality data fast, our continually updated library of datasets will help you quickly achieve your AI goals,” concluded Michalik.  

With DefinedData, accessing high-quality data has never been easier.  

