ChatGPT for Content Moderation
How do I build ContentModeratorGPT?
Any platform or app with social features needs content moderation. This covers chat, reviews and other types of user-generated content that need monitoring. Your reasons could be to minimise bad user experiences, avoid accusations of promoting terrorism, or simply to stay on the right side of increasingly demanding regulations. Generative AI systems are becoming exceptionally good at understanding context and language variations, which is a game changer for moderation. Now is the perfect time to embrace the AI era and say goodbye to keyword filters. I will explain why.
In the following essay, we share our experience deploying an accurate — and custom — AI scanning system for text. This tutorial takes only a few minutes and a few dollars on your favourite Generative AI system.
Before we proceed, a warning. Some of the language you will see is offensive. This is necessary for the greater good of removing this type of content from the internet.
What do I need?
The list is nice and simple:
- An API key for a prompt-based Generative AI. ChatGPT worked well for us;
- At least 50 examples of text breaching and not breaching your target policy;
- A Google Sheet with a Generative AI plugin installed. Here is one you can copy. It is the one used throughout this tutorial, with all the formulas and examples. *Note for coders: a Python terminal works as well;
- That’s all, nothing else (really!).
As the amount of user-generated content continues to grow rapidly, content moderation has become an increasingly important issue for many online businesses. In this document, we mainly focus on text moderation; extension to images, video, audio and other types of data is addressed at the end.
AI systems have recently achieved remarkable advancements in accuracy, surpassing human performance across a range of tasks. Notably, they excel in tasks that require contextual understanding and common-sense, which are essential attributes in the field of content moderation.
Step 1: Try your first prompt
For the hate classifier, we started with the following prompt:
You are a content moderator. Classify whether a given text inserted between the symbols “ and ” contains hate speech. Hate speech – speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation.
Do not give any explanation, just answer with the character ‘1’ if the text contains hate speech, and ‘0’ if not.
This prompt, when sent to the Generative AI, will most of the time – but not always – give you a ‘0’ or ‘1’ as instructed. Now if you type any text, you will get the following answer. Apologies for the bad words: they are real examples. The internet is full of them, and we are using them to test the software that limits their reach:
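Because the model occasionally ignores the instruction and replies with more than a bare digit, it helps to normalise its answer before acting on it. Below is a minimal sketch of such a guard; the function name and fallback behaviour are our own choices, not part of any API:

```python
def parse_label(reply):
    """Map a model reply to 0 or 1; return None when neither label is found.

    The prompt asks for a bare '0' or '1', but generative models sometimes
    prepend text or punctuation, so we only look at the first character.
    """
    reply = reply.strip()
    if reply.startswith("1"):
        return 1
    if reply.startswith("0"):
        return 0
    return None  # unexpected reply: retry or flag for human review
```

Replies that parse to None can be retried or routed to a human reviewer rather than silently treated as safe.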
Step 2: Check accuracy
We then run this prompt over our evaluation set of texts. This should take 30 seconds or more depending on the size of your data. We can then count the correct and incorrect classifications. In our example above, we got the following metrics for the hate classifier:
- Precision: 36%
- Recall: 94%
- F1 macro: 54%
Great start, but not perfect!
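Counting correct and incorrect classifications by hand gets tedious, so here is a small sketch of how precision, recall and F1 for the positive (hate) class can be computed from two lists of labels; macro-averaging across both classes follows the same pattern:

```python
def moderation_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In practice, scikit-learn's `precision_recall_fscore_support` does the same job and handles macro averaging for you.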
We set out the definition of hate speech without any additional explanation. After reviewing the results, we identified where the model made mistakes and improved our prompt by adding more specific rules about adjacent policies and some slur words that were triggering false positives.
Step 3: Update your prompt
The updated prompt:
You are a content moderator. Classify whether a given text contains hate speech.
- Hate speech – speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation.
- Aggressive language – not hate speech. To hate or offend someone – not hate speech, if it is not based on race, religion, or sexual orientation.
- Pussy, bitch, hoe – are not HATESPEECH words, don’t mark them as hate speech without any additional hate speech context.
Do not give any explanation, just answer with the character ‘1’ if the text contains hate speech, and ‘0’ if not.
We can check on a few examples that this works better:
Using the full dataset with the updated prompt, we got the following metrics for this classifier:
- Precision: 56%
- Recall: 80%
- F1 macro: 74%
Significant improvement! The F1 macro score went up by 20 points, from 54% to 74%.
We can still do even better by “nudging” the generative AI classifier toward good answers so it can grasp subtleties in our interpretation of the policy for some sub-categories of hate speech. To reduce the false positive detections, we added samples that are close to policy-violating samples but do not really breach the defined policy, leading to the final prompt:
You are a content moderator. Classify whether a given text contains hate speech.
- Hate speech – speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation.
- Aggressive language – not hate speech. To hate or offend someone – not hate speech, if it is not based on race, religion, or sexual orientation.
- Pussy, bitch, hoe – are not HATESPEECH words, don’t mark them as hate speech without any additional hate speech context.
Do not give any explanation, just answer with the character ‘1’ if the text contains hate speech, and ‘0’ if not.
Examples:
Text: “Fuck you bitch”. Label: 0
Text: “I hate you, you fucking bastard”. Label: 0
Text: “All women sucks!”. Label: 1
Text: “Why are you so stupid, son of a bitch?”. Label: 0
Text: “Ahah, all niggers are stupid motherfuckers”. Label: 1
Text: “Pussy ass bitch”. Label: 0
Answer 1 if the text contains hate speech, and 0 if not.
You can check that it works well on the few examples added:
For the full dataset, we got the following metrics for this classifier:
- Precision: 72%
- Recall: 91%
- F1 macro: 85%
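If you iterate on the example list often, it is convenient to assemble the few-shot prompt programmatically rather than edit it by hand. Here is a small sketch; the helper name and formatting are our own, mirroring the prompt above:

```python
def build_few_shot_prompt(rules, examples):
    """Assemble a few-shot moderation prompt from policy rules and (text, label) pairs."""
    lines = [rules, "Examples:"]
    for text, label in examples:
        lines.append(f'Text: "{text}". Label: {label}')
    lines.append("Answer 1 if the text contains hate speech, and 0 if not.")
    return "\n".join(lines)
```

Adding or removing a borderline example is then a one-line change to the examples list.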
For comparison, our previous approach to building classifiers was a sentence-transformer model for embeddings plus logistic regression for the classification task. We trained that model on 1,000+ manually gathered samples. It achieves exactly the same results on the evaluation dataset as the ChatGPT approach with our final prompt.
Step 4: Deploy your updated model
The beauty of prompt engineering is that there is nothing else you need to do.
If you know how to call a web service, you have a readily available endpoint that you can use directly in your application, as summarised in the small Python script below.
import requests

def call_chatgpt_endpoint(api_key, prompt, input_text):
    endpoint_url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": input_text},
        ],
    }
    response = requests.post(endpoint_url, headers=headers, json=data)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

if __name__ == "__main__":
    api_key = "myapikey"
    prompt = "MyPrompt"
    input_text = "mytext"
    result = call_chatgpt_endpoint(api_key, prompt, input_text)
    if result:
        print(result["choices"][0]["message"]["content"])
Replace "myapikey" with your actual API key provided by OpenAI. The function call_chatgpt_endpoint sends the moderation prompt and the text to classify to the ChatGPT endpoint and returns the response as a JSON object. The API responds with the generated completion, which we print in the main block.
Conclusion
In just a few simple steps and with the help of a prompt-based Generative AI system, we have successfully built a highly accurate and custom content moderation system. Content moderation is a crucial aspect of any platform or app with social features, and with the growing volume of user-generated content, it has become even more vital. Embracing the power of AI and leaving behind traditional keyword filters, we have harnessed the capabilities of Generative AI to understand context and language variations, making our moderation process much more effective.
Through this journey, we started with a basic prompt and iteratively improved it to achieve outstanding results. Our AI system now demonstrates remarkable precision, recall, and F1 macro scores, effectively identifying and classifying hate speech and aggressive language. The approach we adopted outperformed even traditional methods like the sentence-transformer model combined with logistic regression.
With this easy-to-deploy and cost-effective AI content moderator, we can ensure a safer and more positive user experience on our platform. As the world of AI continues to evolve, we can continue refining and enhancing our system, adapting it to new challenges and regulations that may arise in the future.
In this instance, leveraging Generative AI for content moderation has proven to be a game-changing strategy, enabling us to effectively handle the ever-increasing content on our platform and maintain a secure and welcoming environment for all users. By staying at the forefront of the AI era, we are continuously improving Checkstep's content moderation practices and creating a more inclusive and respectful online community.
So, why not take the plunge into the AI-driven future and say farewell to outdated content moderation techniques? You can embrace the possibilities that AI offers and build your own Generative AI content moderator today.
Get in touch with us at Checkstep if you have any questions about managing your user content and policies at scale.