Consent is key: generative AI and why Defined.ai data is different
While our last blog focused on the transformative powers of generative AI and large language models, an equally urgent, if currently less publicized, discussion is happening behind the scenes. For all the stunning advances in expression, understanding, and productivity that models from OpenAI’s GPT family and image generators like Stable Diffusion and DALL-E 2 promise society, there are simple questions we cannot yet answer: is any of this legal? And even if it (more or less) is, should it be?
Those answers are being hashed out today, online and, soon, in courtrooms in the US and the UK. In the US, digital artists on platforms like ArtStation have united in protest against having their artwork scraped as training data for models like Midjourney and Stable Diffusion, and class action lawsuits have been filed against Microsoft, GitHub, and OpenAI for scraping artists’ works and programmers’ code for similar training use. Across the Atlantic in the UK, Getty Images has announced its intent to pursue legal action against Stable Diffusion’s creator, Stability AI, for the unauthorized use of Getty’s library of stock images.
Outpacing the law
The issue is that technology has often outpaced the law, so much of the legality of modern AI development and use remains hazy at best. While it’s generally understood that deep learning models require vast amounts of data to train on before they can produce useful, let alone commercially viable, outputs, collecting that data is often a burdensome obstacle for AI developers that aren’t well-resourced tech companies, academic institutions, or nation states.
Models like GPT-3 and Stable Diffusion are no different, given that their training data is essentially internet-scale. While the internet makes scraping all that online text and imagery possible (if time-consuming), almost all of that content was created and uploaded by humans, with only a minuscule fraction falling under open-source or Creative Commons licenses. Therein lies the problem: the majority of the data scraped by AI developers for model training lacks opt-in consent or usage permissions from content creators, and its use neither publicly credits nor compensates them.
Further complicating the issue is that the Fair Use doctrine in the US has long protected creators who use copyrighted works for new forms of expression, often critique or parody. For better or worse, much of the AI community has relied on Fair Use in lieu of licensing data to power AI development and research. But can and should Fair Use cover AI training data?
Across the Atlantic, an even thornier question arises with respect to the EU’s General Data Protection Regulation (GDPR), which protects internet users by compelling businesses to ask them to opt into data collection, rather than assuming that users will seek out settings to opt out of every online service or platform. Are these models and the AI companies that build them thus in violation of EU law if they use image data—like your photos on social media or your art from your online portfolio—for model training? If they are in violation, how will the EU move against generative AI to enforce its laws?
Both questions remain unresolved but may influence how people and businesses adopt and use generative AI in the years ahead.
Too late to close the barn doors?
These are all open questions being negotiated in US, UK, and EU courts in 2023. Until definitive legal answers arrive, however, what large language models and image generators have shown us is that emergent properties of language and imagery come from training models on massive amounts of data. The genie, as it were, may already be out of the bottle, especially if the AI industry wants to continue cultivating these types of models, which it almost certainly will, considering that massive models like those described above are now christened “foundation models” and venture capital is zeroing in on generative AI as the next big technology boom.
To wit, some in the AI community already argue that a looser regard for antiquated copyright laws is precisely what has enabled such rapid development. The Mark A. Lemley and Bryan Casey paper “Fair Learning”, for instance, posits that without fair use of publicly available, though perhaps not freely usable, data, AI development would be severely hampered.
Furthermore, venture capital has found its way to generative AI and the companies that develop it. With exponential boosts in productivity and the momentous potential to change how people use technology, it’s no wonder there’s excitement around developing and deploying services and platforms built on generative AI.
The Defined.ai difference
Regardless of how battles over the legality of generative AI settle, the technology will likely persist in some form, whether through community-run, open-source, opt-in datasets or through the current loose interpretation of US copyright law. The EU will be a thornier arena, however, given the GDPR, the Digital Services Act, and the forthcoming AI Act.
How will your business capitalize on this revolutionary technology while staying an ethical, global business, regardless of how generative AI’s legal status ultimately resolves? Defined.ai, as always, has the answer and is more than happy to help.
Since our earliest days, we’ve prided ourselves on building specifically tailored, high-quality datasets for our clients’ AI initiatives. This not only means we provide clean, well-structured datasets without the noisy, useless filler endemic to open-source datasets; we also provide detailed demographic metadata to ensure balance and avoid bias because we’ve always believed AI should be fair and equitable.
Most importantly however, we also ensure the full consent of our data contributors. It’s our belief that if data is the lifeblood of AI, then people are the lifeblood of data, and they should not only have the choice to contribute their data, but they should know what they’re contributing to and why, and they should be paid fairly for it.
At the very least, these policies resolve the legal compliance issues described above. More importantly, they treat data contributors the way they ought to be treated: as human beings deserving of dignity, which is ultimately what the artists, programmers, and writers now suing generative AI companies are seeking.
While society and the courts battle over generative AI, we at Defined.ai are confident we’ve always had the solution. When the time comes for your business to train its own large language model or image generator, why not do it with high-quality data that’s ethical and globally compliant? You can rest assured that’s the only kind of high-quality data you’ll find on our AI Data Marketplace.