Empowering European Businesses and Governments to Accelerate AI Projects

We’re Trailblazing Crowdsourced Language Data for Europe’s AI Revolution

For speech and NLP-based systems to function effectively, they need to be trained on high-quality language data relevant to the geographic location in which the AI system operates. For example, a chatbot in Germany, Portugal, or France would be far more helpful to users if it could understand German, European Portuguese, or French, respectively, instead of American English.

However, the language datasets required to train these algorithms are currently unavailable. Businesses or government institutions aiming to launch AI initiatives into the market would need to collect, annotate and validate customized datasets, an expensive and time-consuming undertaking. This shortage of off-the-shelf datasets in a variety of European languages has severely hampered the ability of these institutions and businesses to adapt to and compete in an increasingly AI-driven world.

This is all about to change.

At WebSummit 2020, DefinedCrowd announced the 2021 release of a series of European off-the-shelf language datasets, annotated and validated by a global crowd of over 420,000 contributors.

Available through DefinedData, DefinedCrowd’s online catalog of off-the-shelf datasets, this expansion gives companies developing speech and NLP-based systems in European markets the confidence to move their products to market quickly without compromising quality. The expansion will begin with the launch of a European Portuguese dataset and will complement the 70 datasets already available for download.

Martin Andreas Stein, VP/GM of DefinedData, believes this new release is set to be a game-changer. “We are constantly expanding our high-quality datasets to enable companies and ML products with European-focused audiences to reduce their time to market,” said Martin. “With the acceleration of the digitalization we’re currently witnessing, speed and quality are key for success, and we believe these new datasets will empower European markets to truly compete in a fast-paced industry.”

Daniela Braga, CEO and founder of DefinedCrowd, agrees. “In these remarkably uncertain times, one constant is that technology is helping us tackle the issues of tomorrow,” she said. “We’ve seen the digitization of services, powered by AI and machine learning, create an outsized, positive impact in healthcare as we confront the technological hardships of responding to COVID-19. Our high-quality language data can help countless businesses and governmental organizations adapt to our AI-enabled world.”

Keen to check out our online catalog of off-the-shelf datasets? Browse through our 70 AI training datasets here and keep checking as we add new releases!

