ARTICLE SUMMARY: In a world where AI is hailed as transformative, the hidden biases within its training data remain a critical challenge. The article highlights the role of biased training data in perpetuating inequalities and stereotypes in AI systems. Drawing on the author's own experience, it shows how limited training data produces outputs that are anything but inclusive. These biases have real-world consequences, as seen in AI-driven car accidents and healthcare disparities. The article encourages awareness and action, emphasizing the importance of diversifying training data, scrutinizing AI results, and reporting biases to developers.
First, a story - Last night I was putting together a Facebook post asking people about their first experience with AI. (Click here to see the Facebook post.) I was once again marveling at how amazing AI can be when I came across a startling example of how AI is also failing us in a big way.
I was using an AI tool (#Midjourney) to create an image to accompany the FB post. (I actually used another tool, #Claude 2.0, to help me create the text prompt I used for Midjourney.) In less than a minute Midjourney created the following four images. While the initial results were decent, they immediately struck me as far too patriarchal: overwhelmingly white and male.
I revised the prompt, changing the word "man" to "woman," and Midjourney really struggled. I assume it was hampered by the "1950s" reference or simply did not have enough data to know what to do next. It gave me 1) an image of a man in a suit shaking hands with a robot; 2) a picture of a woman standing next to a robot that had the head of a living man; and 3) a picture of a man in a suit standing next to a robot that had the head of a living woman. The fourth picture showed a man marrying a woman. The only robot in the frame was the minister performing the ceremony. It clearly could not meet my simple request.
Ever the optimist, I tried again, this time emphasizing a woman wearing a dress, shaking hands with a 1950s, mechanical-faced robot. Somehow this made it even worse. Despite the prompt requesting a dress, every woman depicted was wearing a business suit. Several of the women appeared very androgynous; even the robot had a feminine figure. Another picture again replaced the face of the robot with an actual man's face, and the fourth picture showed yet another man and woman getting married. (NOTE: If you can't already see where this is going, you probably don't want to read any further.)
After trying several more prompts, I kept getting human-faced robots, many more wedding ceremonies, and lots of gender-neutral characters (all wearing suits), until I finally got one that worked. It happened to be a robot giving a woman flowers, as if they were going on a date. (Strangely enough, my FB post was about people's first experiences with AI, so this image actually worked... if you don't mind being insulted by stereotypical gender roles or blatant patriarchy.) I don't even want to mention what happened when I changed the prompt from "woman" to "black woman" (maybe I'll tackle that issue in another article).
So, with this example in mind, I wanted to write this article about the role training data plays in all AI tools.
Introduction: In our quest to harness the potential of artificial intelligence (AI), we often overlook a critical aspect – the training data that forms its foundation. While AI promises to revolutionize industries, its reliance on training data introduces a hidden challenge: the inherent bias lurking within the data. As we explore the marvels of AI, it's imperative to understand how these biases impact our AI-driven world.
GIGO – The Acronym that Echoes Across IT: During one of my earliest programming attempts, I was creating a simple game on my Apple IIe when, all of a sudden, portions of my code began dancing across the screen. I called an older friend to help me troubleshoot, and he taught me one of my first IT acronyms, GIGO - Garbage-In, Garbage-Out. In other words, the quality of what goes in directly determines the quality of what comes out: poor code (or poor data) yields poor results.
Training Data: The Backbone of AI - At its core, AI relies on training data – a diverse range of information fed into algorithms to teach them various tasks. However, the source of this data is far from neutral. For instance, ChatGPT was trained largely on text scraped from the internet, with a knowledge cutoff in 2021. This means its knowledge is a reflection of what (mostly English-speaking) internet users chose to share. It's GIGO all over again: the quality and composition of the input determine the quality of the output.
Data Bias – The Unintentional Inequality: AI's training data often comes from the internet, creating an illusion of diversity that doesn't mirror reality. The result is biased AI systems that perpetuate stereotypes and disparities. To exemplify this, let's go back to my earlier experience with Midjourney. Succinctly put, the training data simply doesn't contain enough images of women with robots, or of women in the 1950s. Most images from the '50s that involved robots (movie posters, comic strips, advertisements, etc.) featured men, and the men wore suits. If an image showed a man and a woman together, they were probably getting married, or the man was posed with a female robot. Given the skewed training data, we shouldn't be surprised by the outcome. Midjourney simply reinforced gender roles and societal norms because that's what the training data supported.
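To make that concrete, here is a deliberately crude Python sketch. It bears no resemblance to how Midjourney actually works, and every caption and count in it is invented for illustration; it only shows how a system that can do nothing more than echo its training distribution will reproduce whatever skew that distribution contains.

```python
from collections import Counter
import random

# Invented, deliberately skewed "training data": captions of 1950s robot
# imagery, dominated by men in suits because that's what the era's media
# depicted. (All numbers here are made up for illustration.)
training_captions = (
    ["man in a suit shaking hands with a robot"] * 90
    + ["man marrying a woman, robot minister"] * 7
    + ["woman shaking hands with a robot"] * 3
)

def naive_generate(prompt, data, n=4):
    """Return n 'images' (captions) drawn purely from the training data.
    Captions matching every prompt word are preferred; if none match,
    the whole data set is used -- a crude stand-in for a model falling
    back on its majority patterns."""
    matches = [c for c in data if all(word in c for word in prompt.split())]
    pool = matches if matches else data
    return random.choices(pool, k=n)

random.seed(0)
print(Counter(naive_generate("woman robot", training_captions)))
# Only 10 of the 100 captions mention both a woman and a robot, and 7 of
# those are weddings -- so asking for "woman robot" mostly returns wedding
# imagery, much like the results described above. Garbage in, garbage out.
```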
Implications in the Real World: These biases extend beyond the digital realm, impacting our physical world in profound ways. Consider, for example, who has posted and is posting data online. What percentage of the information on the internet comes from Western civilization? How well are third-world countries represented? Many training sets heavily underrepresent, or outright exclude, data that isn't presented in English. These gaps mean that flawed training data can have far-reaching, real-world consequences.
And what about biased data we can't see? Consider these real, documented instances of AI data bias. The automated car that struck and killed Elaine Herzberg in Tempe, AZ failed to recognize her as a pedestrian, partly because she was pushing a laden bicycle outside of a painted crosswalk, a scenario its training data for "pedestrian" did not adequately cover. Other examples include Amazon's AI recruiting algorithm, which was shown to discriminate against women, and a US healthcare algorithm reported to be roughly 20% less accurate for black patients at risk of developing sepsis. While today's AI is more advanced, the principle remains: bias in training data is real, and it leads to skewed, sometimes even fatal, outcomes.
Addressing the Biases – A Call to Action: Acknowledging these biases is the first step towards rectification. Initiatives are underway to diversify training data, challenge stereotypes, and develop more inclusive algorithms. As crazy as it seems, in some cases where diverse data is lacking, AI is creating its own data – an innovative approach, to be sure. But how comfortable are you knowing that AI engines that are trained on biased data are creating more data (presumably driven by algorithms that are designed to lessen the bias)?
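For what it's worth, the simplest version of "AI creating its own data" is just rebalancing: generating or duplicating samples for underrepresented groups until the data set looks even. The sketch below (all names and counts are invented) shows both the idea and the catch I'm describing: the new samples can only remix whatever the original, biased data already contained.

```python
import random

def rebalance(samples_by_group):
    """Pad every group up to the size of the largest one by creating
    trivial 'synthetic' variants of its existing samples. Real systems
    use generative models for this; the principle is the same."""
    target = max(len(samples) for samples in samples_by_group.values())
    balanced = {}
    for group, samples in samples_by_group.items():
        synthetic = [random.choice(samples) + " (synthetic variant)"
                     for _ in range(target - len(samples))]
        balanced[group] = samples + synthetic
    return balanced

# Invented counts mirroring the Midjourney example above.
data = {
    "men with robots": ["caption"] * 90,
    "women with robots": ["caption"] * 3,
}
balanced = rebalance(data)
print({group: len(samples) for group, samples in balanced.items()})
# {'men with robots': 90, 'women with robots': 90} -- but the 87 new
# "women with robots" items are just variations of the original 3,
# so the data set is bigger without being any more diverse.
```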
How You Can Reduce the Impact of Data Training Bias:
Be aware of the potential bias in all systems (AI or otherwise).
Make sure you know what training data was used to create the AI systems you use.
Be critical of the results of AI systems. It is still very important to keep a “human in the loop.”
Report bias to the developers of AI systems.
Consider creating a new position or department tasked specifically with testing for bias in any systems you use regularly (a minimal example of such a check is sketched below).
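To give a flavor of what that kind of testing might look like, here is a minimal sketch of one common check: comparing a system's positive-outcome rate across groups (demographic parity). The group names, numbers, and the 0.8 threshold (the familiar "four-fifths rule" from employment testing) are all assumptions for illustration, not a complete audit.

```python
from collections import defaultdict

def selection_rates(records):
    """records: iterable of (group, decision) pairs, with decision 1 for a
    positive outcome (e.g., shortlisted) and 0 otherwise."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}

def flag_disparity(rates, threshold=0.8):
    """Flag any group whose rate falls below threshold * the best group's rate."""
    best = max(rates.values())
    return {group: rate for group, rate in rates.items() if rate < threshold * best}

# Hypothetical audit log of an AI screening tool's decisions.
audit_log = ([("men", 1)] * 60 + [("men", 0)] * 40
             + [("women", 1)] * 35 + [("women", 0)] * 65)

rates = selection_rates(audit_log)
print(rates)                  # {'men': 0.6, 'women': 0.35}
print(flag_disparity(rates))  # {'women': 0.35} -- below 80% of the men's rate
```

Even a check this simple would surface the kind of disparity described in the Amazon recruiting example above; a dedicated role can run it routinely rather than discovering it by accident.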
Conclusion: As we continue to integrate AI into various aspects of our lives, it's essential to recognize the complexities associated with training data biases. Understanding their presence empowers us to demand transparency, actively engage in developing more inclusive AI, and contribute to a future where AI serves as a force for good. The natural flow of interacting with AI tools sometimes lets us forget that we are dealing with a model built on a fixed training data set, one that is imperfect, skewed, and full of errors and bias. Unfortunately, automating the output of these tools not only propagates the errors they contain; the new content we publish also becomes input for future AI training, thereby compounding the problem.
For more insights into the world of AI and its implications, subscribe to my newsletter and stay informed about our evolving technological landscape.
If you're interested in learning more about how AI operates and how you can leverage its powers for good instead of evil, check out my upcoming workshops.