The doctor is always in: is Generative AI ready for Medicine?
In our previous post, we explored how generative AI will supercharge businesses, but likely not at the expense of most white-collar workers. Humans remain our best “sources of truth,” and given generative AI’s tendency to produce confident falsehoods, the technology is poised to become a powerful tool wielded by trained professionals, scaling capability and productivity to new heights.
For example, this will likely translate to better marketing with steady, high-quality AI-produced content vetted and refined by a fully human staff. As exciting as this may be for many businesses, generative AI’s most important contributions will inarguably be to industries that have a much more significant impact on life and society, like medicine.
When GPT meets Medicine
Enthusiasm for tools like ChatGPT is relatively new, but research and investment in bringing transformer technology to important fields like medicine have been under way for years. A notable recent project is BioMedLM, a model developed jointly by the Stanford Center for Research on Foundation Models (CRFM) and MosaicML.
As its former name “PubMedGPT” implies, BioMedLM is a large language model (LLM) trained on the abstracts and full texts of papers on PubMed, the US National Institutes of Health’s online biomedical database. A 2.7-billion parameter generative pre-trained transformer, BioMedLM has either approached or set state-of-the-art metrics* for medical question-answering and text generation. Specifically, BioMedLM scored 50.3 percent on the US Medical Licensing Examination (USMLE), a rigorous multipart exam that students in medical school must pass to become doctors.
*NOTE: As of publishing, ChatGPT and the medical-specific language model Flan-PaLM have bested BioMedLM in performance on the USMLE. As with all things AI, this is further evidence that progress is blindingly fast. Nonetheless, the import of BioMedLM’s research remains notable; read on to learn more.
Stanford’s CRFM and MosaicML stress, however, that despite these promising early results, BioMedLM isn’t ready to begin practicing medicine. Still, why is it so exciting, and what do these results imply for the future?
The answer is two-fold.
BIGGER != BETTER
At 2.7B parameters, BioMedLM is smaller than many of the models it was benchmarked against, and orders of magnitude smaller than the models that have since gone on to best it on the USMLE.
For example, the next-best (or, in some metrics, slightly superior) model during benchmarking was Meta’s Galactica, a 120B parameter model roughly 44 times larger than BioMedLM. OpenAI’s general chatbot, ChatGPT, is based on the 175B parameter GPT-3.5, which is roughly 65 times larger. The LLM that performed the highest on the USMLE, the medical-specific Flan-PaLM, is a 540B parameter model, 200 times larger than BioMedLM.
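For readers who want to check the size comparisons themselves, here is a minimal sketch of the arithmetic. The parameter counts are the ones quoted above; the dictionary and function names are our own, chosen purely for illustration, and ratios are rounded to the nearest whole number.

```python
# Parameter counts in billions, as quoted in the text above.
PARAMS_BILLIONS = {
    "BioMedLM": 2.7,
    "Galactica": 120,
    "GPT-3.5": 175,
    "Flan-PaLM": 540,
}


def size_ratio(model: str, baseline: str = "BioMedLM") -> float:
    """Return how many times larger `model` is than `baseline`."""
    return PARAMS_BILLIONS[model] / PARAMS_BILLIONS[baseline]


if __name__ == "__main__":
    for name in ("Galactica", "GPT-3.5", "Flan-PaLM"):
        # Round to the nearest whole multiple for readability.
        print(f"{name} is about {round(size_ratio(name))} times larger than BioMedLM")
```

Parameter count is only one measure of a model’s size, so these ratios say nothing on their own about training data or compute.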
Simply put, BioMedLM shows that bigger doesn’t automatically mean better performance, even though the primary revelation of LLMs has been how proficient language models become when trained on gargantuan datasets. This matters given the massive investment in data and computational resources necessary to train LLMs, and it leads us to BioMedLM’s second important takeaway.
For readers wondering what a “parameter” refers to, here’s an illustrative, if overly simplified, example. If a deep learning algorithm were a cake recipe, e.g.,
Cake = (A*Ingredient1) + (B*Ingredient2) + (C*Ingredient3) + … etc.
the variable amounts A, B, C, etc. (dictating how much of each ingredient to use) would be your coefficients, or parameters. In the case of LLMs, there would thus be billions of ingredients at varying amounts comprising the entire model/recipe.
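The cake analogy can be written out as a short sketch. Everything here is hypothetical and chosen for illustration: the function names, the three coefficient values, and the ingredient amounts. The second function uses the standard parameter count for a fully connected neural network layer (one weight per input-output pair, plus one bias per output) to show how counts grow quickly with layer size.

```python
def weighted_sum(parameters, ingredients):
    """Cake = A*Ingredient1 + B*Ingredient2 + C*Ingredient3 + ..."""
    if len(parameters) != len(ingredients):
        raise ValueError("need exactly one parameter per ingredient")
    return sum(p * x for p, x in zip(parameters, ingredients))


def dense_layer_param_count(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


if __name__ == "__main__":
    # Hypothetical coefficients A, B, C: these are the "parameters".
    parameters = [2.0, 0.5, 1.5]
    # Hypothetical ingredient amounts: these are the inputs.
    ingredients = [1.0, 4.0, 2.0]

    print("cake value:", weighted_sum(parameters, ingredients))
    print("parameters in this recipe:", len(parameters))
    print("parameters in a 1,000-by-1,000 layer:",
          dense_layer_param_count(1000, 1000))
```

Training a model amounts to adjusting those coefficients until the recipe’s output is as close as possible to the desired one; an LLM does the same thing with billions of them.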
The economic benefits of (re)training domain-specific LLMs
There’s a reason why LLMs are also classed as “foundation models”: it would be infeasible for everyone to train one of their own. Instead, LLMs built by well-resourced research institutes and tech companies can serve as “foundations” that enterprises can retrain on smaller, targeted datasets.
BioMedLM’s foundational model was 2019’s GPT-2, which, while still powerful, is less capable than the current generation of OpenAI’s GPT-3 and the forthcoming GPT-4. BioMedLM’s impressive performance showed that LLMs retrained on domain-specific data can perform as well as or better than general language models or models specifically built for a given language task. Using this approach, enterprises gain the powerful benefits of an LLM tailored to their needs at a fraction of the cost of developing and training one from scratch.
As mentioned in our first blog post on generative AI, we’re seeing the rapid emergence of the next AI paradigm: foundation models and their proliferation through tailored use. All that businesses need to do to capitalize on them is to have the right kind of data to bootstrap their own implementation. As always, that’s where Defined.ai’s extensive library of high-quality, ethically-sourced, and bias-aware data can help. Don’t take our word for it, though; sample some from our AI Data Marketplace.
While BioMedLM may not be ready for deployment in actual clinical settings, LLMs like it soon will be. It’s only a matter of time before medical, educational, financial, and legal AI becomes an everyday fixture in our lives.
What does your foundation model roadmap look like? Reach out to Defined.ai today to let us know what your implementation plans are, and how we can help you be the “foundation model first mover” in your industry.