LLMs meet the crowd: an interview with Chat GPT – Part I

20 Mar 2023

14 min read

Ian Turner

Director of Strategic AI Partnerships

At Defined.ai, it is essential that we keep our fingers on the pulse of the technology industry. Large Language Models (LLMs), particularly ChatGPT, are all the rage in the news right now. John Oliver recently did an expose on the capabilities of generative AI and the implications of its application and its implementation, and people are understandably concerned about if AI has finally come to take over the world (or at least take over our jobs). It might comfort you to know that while LLMs provide unprecedented capabilities in replicating human competencies, they still have a long way to go and rely heavily on humans to curate and improve them. This reliance on a “human in the loop” represents major ethical considerations, as well as opportunities.

In honor of Defined.ai’s new API-based solution for HITL annotations, we wanted to learn more about what implications a crowd-based workforce could have on the tech giants building these new models. We wanted to learn from the best, and what better way to learn more about this topic than to ask the source itself?

Today, we will be chatting with ChatGPT.

We will be asking ChatGPT to share its unique perspective on the important issue in artificial intelligence as it pertains to the “Human in the loop” (HITL) annotations. With its vast knowledge base and advanced language processing capabilities, ChatGPT is the perfect partner for this conversation, and we are excited to explore this topic with such an advanced AI model.

*Defined.ai provides fast, easily installed API solutions for acquiring thousands of human annotations for LLMs and other AI applications, in hours not weeks. Interested in discovering more about our offerings? Schedule a free demonstration here.

Our interview with ChatGPT

Defined.ai (DAI): Hello chat GPT, my name is Ian Turner and I’m the Director for Strategic Partnerships at Defined.ai, how are you?

Chat GPT (CGPT): Hello Ian Turner, it’s nice to meet you! As an AI language model, I don’t have feelings or emotions, but I’m functioning well and ready to assist you in any way I can. How can I help you today?

DAI: I just wanted to know more about you. Is it true that you use “human in the loop” to improve your results?

CGPT: Yes, that’s correct! As an AI language model, I am trained on large datasets of text, but I also use a technique called “human-in-the-loop” to improve my performance. This means that human experts review and provide feedback on my responses, allowing me to learn and improve over time.

For example, when a user asks me a question, I provide an answer based on my existing knowledge and training. However, if the user or another expert notices an error or omission in my response, they can provide feedback, which helps me to correct and improve my understanding of the topic. This process of continuous learning and improvement helps to ensure that my responses are accurate, up-to-date, and useful to users like you.

DAI: What are some examples of annotations that the human in the loop would provide? Please be as exhaustive as possible. Also if possible please provide actual samples of the annotation data in something like json.

CGPT: Sure, I can provide some examples of the types of annotations that are commonly used in the “human-in-the-loop” process for language models like me.

1. Correction of errors: Humans might notice errors in my responses, such as factual inaccuracies, spelling or grammar mistakes, or misunderstandings of the context. They can then provide feedback to correct those errors and help me to improve.

Here’s an example of feedback in JSON format:

Adding missing information: Sometimes, my responses might be incomplete, or I might not have all the necessary information to answer a question. In such cases, humans can provide additional information or context to help me provide a more complete and accurate response.

Here’s an example of feedback in JSON format:

Clarification of ambiguity: Sometimes, my responses might be ambiguous or unclear, leading to confusion or misinterpretation. In such cases, humans can provide clarification or additional context to help me provide a clearer and more accurate response.

Here’s an example of feedback in JSON format:

In this example, the human noticed that my response was ambiguous, as it was unclear whether the user was asking about the political or economic capital of the United States. The human provided a clarification to help me provide a more accurate response.

These are just a few examples of the types of annotations that humans might provide to help me improve my performance. The actual format and content of the annotations may vary depending on the specific use case and the needs of the users and developers involved.

DAI: Interesting!

CGPT: I’m glad you find it interesting! If you have any other questions or if there’s anything else I can help you with, please don’t hesitate to ask.

DAI: Considering a generalized use case, how many crowd members would we need to make a useful dataset of human annotations?

CGPT: The number of crowd members needed to make a useful dataset of human annotations depends on various factors, such as the complexity of the task, the desired level of accuracy, and the quality of the crowd members. In general, more crowd members can lead to a more diverse and comprehensive set of annotations, but it can also increase the cost and time required to obtain and process the data.

For simple tasks, such as identifying and correcting spelling errors, a small number of crowd members (e.g., 3-5) may be sufficient to obtain reliable annotations. However, for more complex tasks that require subjective judgment or domain expertise, a larger number of crowd members (e.g., 10-20 or more) may be needed to ensure the quality and consistency of the annotations.

It’s also important to note that the quality of the crowd members can have a significant impact on the usefulness of the dataset. Crowd members who are experienced, motivated, and well-trained in the task at hand are more likely to produce accurate and high-quality annotations than those who are inexperienced or uninterested.

In general, it’s best to start with a small pilot study and gradually increase the number of crowd members as needed based on the quality of the annotations and the desired level of confidence in the results.

DAI: Wow so 10 to 20 annotators sometimes, but how many samples should I annotate? Please give specific examples.

CGPT: The number of samples to annotate also depends on various factors, such as the complexity of the task, the desired level of accuracy, and the amount of variation in the data. Generally, the more samples you annotate, the more accurate and robust your dataset will be, but there are diminishing returns to additional annotations beyond a certain point.

The number of samples to annotate also depends on the statistical power and confidence level you want to achieve. A commonly used statistical measure is the inter-annotator agreement (IAA), which measures the degree of consensus among multiple annotators. The level of agreement needed depends on the task and the desired level of accuracy, but a common threshold is a minimum IAA score of 0.7, which indicates substantial agreement.

Here are some specific examples of the number of samples to annotate for different types of tasks:

Text classification: For a simple binary classification task (e.g., spam vs. non-spam), you may need to annotate around 500-1000 samples to achieve a reasonable level of accuracy. For a more complex classification task with multiple categories, you may need to annotate several thousand samples.

Named entity recognition: For a task of identifying named entities (e.g., person, organization, location) in text, you may need to annotate around 1000-2000 samples to achieve a reasonable level of accuracy, depending on the complexity of the entity types and the variation in the data.

Sentiment analysis: For a task of determining the sentiment of text (e.g., positive, negative, neutral), you may need to annotate around 1000-2000 samples to achieve a reasonable level of accuracy, depending on the complexity of the sentiment categories and the variation in the data.

Machine translation: For a task of translating text from one language to another, you may need to annotate several thousand samples to achieve a reasonable level of accuracy, depending on the complexity of the languages and the variation in the data.

It’s worth noting that these numbers are just general guidelines, and the actual number of samples needed may vary depending on the specific task and the quality of the annotations. In general, it’s a good practice to start with a small pilot study and gradually increase the number of samples and annotators as needed based on the quality of the annotations and the desired level of confidence in the results.

DAI: Wow so for some of these over 40,000 annotations is needed.

CGPT: Yes, that’s correct. For some complex tasks that require a high degree of accuracy and variability, a large number of annotations may be needed to achieve a reliable and robust dataset. For example, machine learning models that require large amounts of training data, such as natural language processing and computer vision models, often require tens of thousands or even millions of annotated samples to achieve optimal performance.

However, it’s important to note that the number of annotations needed depends on various factors, such as the complexity of the task, the desired level of accuracy, and the quality of the annotators. In some cases, a smaller number of high-quality annotations may be sufficient to achieve a high level of accuracy and reliability.

In addition, there are also techniques such as active learning and semi-supervised learning that can help reduce the number of annotations needed by selecting the most informative samples for annotation and leveraging unlabeled data to improve the model’s performance.

Ultimately, the number of annotations needed should be determined based on the specific task and the goals of the project, and it’s important to balance the need for accuracy and reliability with practical considerations such as cost, time, and resources.

Stay tuned for the second part of our Chat GPT interview

On the second part of our conversation with ChatGPT, we went through the process of crowdsourcing these annotators, how long it could take, and how to accomplish it in an ethical way. Stay tuned for the rest of this great conversation.

In the meantime, take the opportunity to learn more about how Defined.ai’s data annotation API works and how it can benefit your AI projects by reading one of our latest blog articles on how you can transform LLMs in 13 lines of code. You can even book a free demo with the author, Ian Turner.