BIAS Blog 2: Harms in Machine Learning
Exploring how AI systems turn people into data, and what it takes to bring empathy and accountability back into technology.
Case Study: Understanding Potential Sources of Harm throughout the Machine Learning Life Cycle
This case study breaks down seven ways machine learning systems can introduce harm, from historical bias to deployment bias, and explains how they emerge throughout the entire life cycle of an ML project. What stood out to me is that bias isn’t one mistake; it’s a chain of decisions, assumptions, and oversights that become harmful when they interact with the real world.
Breaking Down Bias: How Harms Emerge in the ML Life Cycle
When I first read the MIT article, I honestly didn’t expect it to make me reflect so much on how machine learning actually works behind the scenes. I used to think of bias in AI as a one-dimensional “bad data problem,” but this case study made me realize it’s more like seven different fault lines (historical, representation, measurement, aggregation, learning, evaluation, and deployment bias), each capable of causing its own kind of damage.
What surprised me is how ordinary the origins of some of these harms are. Historical bias isn’t someone deliberately coding stereotypes; it’s a model inheriting the world as it is. For example, the article mentions gendered word embeddings, but a new example that came to mind is facial recognition systems trained on old policing datasets: they reflect policing patterns, not actual crime patterns, and that distinction matters.
Representation bias also hit home for me. The authors mentioned how Western-centric datasets skew performance, but I’ve seen an example myself: when I used an app that tries to identify cultural clothing, it labeled multiple Central Asian outfits as “Middle Eastern” or “Indian.” Not malicious, just ignorant, but still harmful.
Measurement bias made me think about something even more personal. The article talks about GPA as an oversimplified measure of student success, and I agree. But another example is using “number of questions answered in class” to measure engagement. Some students are quieter but deeply engaged, and a metric like that would misrepresent them completely.
Aggregation bias also triggered a memory: in a project I worked on that predicted student workload, a single model was used across every major. Unsurprisingly, STEM-heavy majors were misclassified because their patterns didn’t match those of humanities students. That’s a perfect example of a model assuming everyone behaves the same.
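That failure mode is easy to see in a toy sketch. The numbers below are entirely invented (hypothetical credit loads and weekly hours for two major groups), but they show how a single pooled fit can be wrong for every group at once:

```python
# Toy sketch of aggregation bias (all data invented): one pooled model
# vs. per-group models when the groups follow different patterns.
def fit_slope(xs, ys):
    # Least-squares slope through the origin: sum(x*y) / sum(x*x)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

stem_credits = [3, 4, 3, 4]
stem_hours   = [9, 12, 9, 12]   # roughly 3 hours of work per credit
hum_credits  = [3, 4, 3, 4]
hum_hours    = [3, 4, 3, 4]     # roughly 1 hour of work per credit

# One model for everyone: splits the difference and fits NO ONE
pooled = fit_slope(stem_credits + hum_credits, stem_hours + hum_hours)
print(pooled)                              # 2.0 hours/credit

# Per-group models recover each group's actual pattern
print(fit_slope(stem_credits, stem_hours))  # 3.0
print(fit_slope(hum_credits, hum_hours))    # 1.0
```

The pooled slope of 2.0 over-predicts one group and under-predicts the other, which is exactly the misclassification I saw in that project.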
Learning bias reminded me of optimizing only for accuracy in my CS classes. Even in simple projects, I’ve caught myself thinking, “As long as the accuracy is high, it’s good,” without realizing that accuracy can hide whom the model fails.
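Here is how that hiding works in practice, with made-up labels and predictions: a model can be 90% accurate overall while being completely wrong for a small subgroup.

```python
# Toy illustration (made-up labels and predictions): overall accuracy
# can look great while a minority subgroup is badly served.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 90 samples from group A, 10 from group B (hypothetical split)
groups = ["A"] * 90 + ["B"] * 10
y_true = [1] * 100
y_pred = [1] * 90 + [0] * 10   # right on all of A, wrong on all of B

print(accuracy(y_true, y_pred))  # 0.9 overall: looks "good"

for g in ("A", "B"):
    idx = [i for i, grp in enumerate(groups) if grp == g]
    acc = accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
    print(g, acc)  # A: 1.0, B: 0.0 — the 90% hides a total failure on B
```

The fix isn’t a fancier metric so much as a habit: always slice the evaluation by subgroup before calling a model “good.”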
Evaluation bias is something I see all the time: models being tested on ideal, clean data and then breaking when used in the real world. The article gave examples, but another one is AI voice assistants: they work flawlessly in testing but struggle with accents, background noise, or code-switching.
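A deliberately naive sketch (the recognizer and test phrases are invented) makes the gap concrete: a system evaluated only on clean input can score perfectly and still fail on realistic input it was never tested against.

```python
# Toy sketch of evaluation bias (invented example): a naive exact-match
# "command recognizer" looks perfect on a clean benchmark, then collapses
# on realistic, slightly noisy utterances.
COMMANDS = {"play music", "set alarm", "call mom"}

def recognize(utterance):
    # Exact-match only: no tolerance for fillers, punctuation, politeness
    return utterance if utterance in COMMANDS else None

clean_test = ["play music", "set alarm", "call mom"]
noisy_test = ["uh play music", "set alarm!", "call mom please"]

clean_acc = sum(recognize(u) is not None for u in clean_test) / len(clean_test)
noisy_acc = sum(recognize(u) is not None for u in noisy_test) / len(noisy_test)
print(clean_acc, noisy_acc)  # 1.0 0.0 — the clean benchmark hid the gap
```

Real speech systems are far more robust than an exact matcher, but the evaluation lesson is the same: the benchmark has to look like the world the system will actually face.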
Deployment bias was one of the most unsettling parts. A system can work “perfectly” in testing and still cause harm once humans start using it in unpredictable ways. One example that came to mind is people using ChatGPT to get medical advice despite disclaimers; it’s a mismatch between intended use and actual use, and it can be dangerous.
How These Harms Show Up in My Own Projects
Thinking about my own work, I realized how easily these biases sneak into even small student projects.
For example, in a business analytics class, my team built a model to predict customer churn for a fictional company. Looking back:
- Historical bias: We trained on outdated customer behavior that didn’t reflect actual modern usage patterns.
- Representation bias: The dataset overrepresented older customers, so predictions for younger users were terrible.
- Measurement bias: We treated “time spent on website” as a proxy for engagement, which isn’t always true.
- Aggregation bias: We used one model across every customer segment, even though the behaviors were clearly different.
- Learning bias: We optimized for overall accuracy without checking subgroup performance.
- Evaluation bias: We tested on extremely clean data the professor gave us, not real-world messy data.
- Deployment bias: We imagined the model as a guide for marketing decisions, but in the real world, someone could misuse it to deny discounts or target vulnerable groups.
It’s wild how all seven sources of harm existed in a project we thought was “simple.”
While reading, I realized there’s another type of harm we don’t talk about enough: psychological or emotional harm. Models that classify appearance, body type, personality traits, or attractiveness can quietly damage self-esteem, especially for younger users.
Another is feedback-loop harm. Once a model makes a prediction (e.g., recommending certain posts), future data reflects that prediction, not reality. The system slowly reshapes behavior. These harms aren’t strictly “technical,” but they’re real.
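A minimal simulation (the setup is invented, and real recommenders are far more complex) shows how a feedback loop locks in an arbitrary early lead:

```python
# Minimal feedback-loop sketch (invented setup): a recommender that always
# promotes the currently most-clicked item. Later "data" then reflects the
# system's own choices, not underlying user preference.
def run_feedback_loop(clicks, steps):
    history = [dict(clicks)]
    for _ in range(steps):
        shown = max(clicks, key=clicks.get)  # always show the current leader
        clicks[shown] += 1                   # the shown item collects the click
        history.append(dict(clicks))
    return history

# Two equally good posts; post "a" just happens to start one click ahead.
final = run_feedback_loop({"a": 1, "b": 0}, steps=10)[-1]
print(final)  # {'a': 11, 'b': 0} — a tiny head start becomes total dominance
```

Post “b” never gets shown, so the logs will forever “confirm” that nobody wanted it. That’s the sense in which the model reshapes reality instead of measuring it.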
Additional Mitigations We Should Explore
The article suggests many mitigations, but here are a few I think matter:
- User-facing transparency prompts, where the model explains uncertainty or data gaps
- Community audits, especially from the groups affected
- Model warnings when predictions are based on low-quality or unrepresentative data
- Refusing to deploy a model unless real-world evaluation is performed with diverse users
- Designing models to allow “opt-out data zones,” especially for sensitive groups
A lot of harm comes from pretending a model is more capable than it is.
A Checklist for Anyone Starting an ML Project
Here’s the checklist I wish I had in my first CS classes:
- Before data collection: Whose data is missing? Could it encode past inequality? Are labels oversimplifying something complex?
- During development: Did we test performance across subgroups? Do our metrics reflect fairness, not just accuracy? Does one model fit all groups?
- During evaluation: Are benchmarks and test data realistic and diverse? Are we measuring potential harms?
- Before deployment: Who could misuse this model? What happens if it’s used out of context? Who is most affected if it fails?
The Question I’d Ask
How can we design systems that allow communities most affected by algorithmic decisions to participate in the design or auditing process itself?
This question matters because most harms aren’t technical; they’re social. And social problems cannot be solved by engineers alone.
Final Thoughts
Writing this blog made me realize how often we (especially CS students) talk about “building things that work” without asking for whom they work, and whom they harm. This case study didn’t just give me vocabulary like “aggregation bias” or “deployment bias”; it gave me a much clearer sense of responsibility.
I don’t think the point is to never make mistakes in ML. But I do think the point is to stay aware that even the smallest choices (what data we use, what metrics we choose, how we deploy) have real consequences for real people.
And that awareness is where ethical tech actually begins.
