Imagine two students, Riya and Arjun, apply for the same college scholarship. An AI system evaluates their applications. Both have similar grades and achievements, but the AI rejects Riya while accepting Arjun.

Why? The AI was trained on historical data where fewer women received scholarships. It learned this pattern and now unfairly discriminates against female applicants – even though it wasn’t programmed to do so.

This is a real ethical concern with AI models. When we evaluate AI, we don’t just ask “Is it accurate?” We must also ask:

  • Is it fair to everyone?
  • Does it discriminate against any group?
  • Can we trust its decisions?
  • Who is responsible when it makes mistakes?

These questions form the foundation of Ethical AI Evaluation – ensuring our AI systems don’t just work well, but work FAIRLY for all people.

Let’s dive in!


Learning Objectives

By the end of this lesson, you will be able to:

  • Understand why ethical evaluation of AI models is important
  • Identify different types of bias in AI systems
  • Explain how bias enters AI models through data and algorithms
  • Understand the concept of fairness in AI
  • Recognize the impact of biased AI on different groups
  • Apply ethical frameworks to evaluate AI models
  • Suggest ways to reduce bias and improve fairness in AI

Why Ethical Evaluation Matters

In previous lessons, we learned to evaluate AI models using metrics like accuracy, precision, recall, and F1 score. These metrics tell us how well a model performs overall. But there’s a crucial question they don’t answer: “Is this model fair to everyone?”

Technical metrics measure performance, but they can hide serious problems. A model might work wonderfully for some groups of people while failing badly for others. If we only look at overall numbers, we might never discover these hidden problems – until real people are harmed.

Beyond Accuracy

A model can be highly accurate but still deeply unfair. Consider this scenario:

Model      Overall Accuracy    Accuracy for Group A    Accuracy for Group B
Model X    90%                 95%                     70%

Model X has 90% overall accuracy – sounds good! But when we look deeper, we see it works much better for Group A (95%) than Group B (70%). Is this acceptable? Would you want to use this model if you were in Group B?

This is why ethical evaluation matters. We need to look beyond overall metrics to understand how AI affects different groups of people.
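The gap in the table above is easy to verify with a few lines of code. This is a minimal sketch with hypothetical evaluation counts chosen to match Model X (Group A is larger, which is why the overall number looks healthy):

```python
# Sketch: overall accuracy can hide a large per-group gap.
# Counts are hypothetical, chosen to reproduce Model X in the table.

def accuracy(correct, total):
    """Fraction of predictions that were correct."""
    return correct / total

group_a = {"correct": 380, "total": 400}   # 95% accuracy
group_b = {"correct": 70,  "total": 100}   # 70% accuracy

overall = accuracy(group_a["correct"] + group_b["correct"],
                   group_a["total"] + group_b["total"])

print(f"Overall: {overall:.0%}")  # 90% — looks fine on its own
print(f"Group A: {accuracy(group_a['correct'], group_a['total']):.0%}")
print(f"Group B: {accuracy(group_b['correct'], group_b['total']):.0%}")
```

Only the last two lines reveal the problem; the first line alone would pass most reviews.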

Real-World Consequences

When AI makes unfair decisions, the consequences aren’t just theoretical – real people suffer real harm. Here are documented cases of AI bias causing problems:

AI Application        Unfair Outcome                           Real Impact
Hiring AI             Rejects qualified women                  Career opportunities lost, wage gaps widen
Loan approval         Denies loans to certain neighborhoods    Families can’t buy homes, economic inequality grows
Healthcare AI         Misses diseases in certain groups        Health conditions worsen, lives endangered
Criminal justice      Higher risk scores for minorities        People unfairly denied bail or parole
Facial recognition    Lower accuracy for darker skin           Wrongful identification, false arrests

These aren’t hypothetical scenarios – they’re real cases that have been documented and studied. AI bias isn’t just a technical problem; it’s a social justice problem that affects people’s lives, opportunities, and wellbeing.


Understanding Bias in AI

Before we can fix bias, we need to understand what it is and where it comes from. Bias in AI is often misunderstood – many people think biased AI was deliberately programmed to discriminate. The reality is more subtle and, in some ways, more troubling.

AI systems learn from data and examples. If those examples contain bias – even unintentional bias – the AI will learn and reproduce that bias. The AI isn’t “deciding” to be unfair; it’s simply learning patterns from what it’s been shown.

What is Bias?

Bias in AI refers to systematic errors that result in unfair outcomes for certain groups of people. The key word is “systematic” – these aren’t random mistakes that affect everyone equally. They’re patterns of errors that consistently disadvantage specific groups.

It’s important to understand: AI doesn’t become biased on purpose. Bias creeps in through:

  1. The data we use to train it
  2. The way we design algorithms
  3. The features we choose to include
  4. The way we evaluate performance

Types of Bias

Understanding the different sources of bias helps us identify and address them. Here are the main types you need to know:

1. Data Bias (Historical Bias)

What it is: The training data reflects historical inequalities, prejudices, or past discrimination that existed in society.

Example: A hiring AI trained on 10 years of company hiring data. If the company historically hired more men for technical roles (because of past societal biases), the AI learns to prefer male candidates – even if the company now wants to hire fairly.

The problem: The AI perpetuates past discrimination, essentially “freezing” historical unfairness into its decision-making. Even if the company’s policies change, the AI keeps making decisions based on the biased past.

Historical Data:
90% men hired → AI learns pattern → Predicts men are "better candidates"
                                    (Even when women are equally qualified)

2. Sampling Bias

What it is: The training data doesn’t represent all groups equally – some groups are overrepresented while others are underrepresented.

Example: A facial recognition system trained mostly on lighter-skinned faces. Because the training data has fewer examples of darker-skinned faces, the system doesn’t learn to recognize their features as well.

The problem: Underrepresented groups experience worse AI performance. The system works great for some people and poorly for others – through no fault of their own.

Training Data Composition           Recognition Accuracy
Light skin: 80% of training data    Light-skin faces: 99% accuracy
Dark skin: 20% of training data     Dark-skin faces: 65% accuracy

3. Measurement Bias

What it is: The way we measure or collect data is itself flawed or unfair for certain groups.

Example: Using arrest records as a proxy for “likelihood to commit crime.” This seems logical, but there’s a problem: if certain communities are policed more heavily, their residents will have more arrests – not because they commit more crimes, but because they’re watched more closely. The measurement itself is biased.

The problem: Biased measurements create biased predictions. The AI learns that certain groups are “more likely” to commit crimes based on biased police data, then makes predictions that reinforce those same biases.

4. Algorithm Bias

What it is: The algorithm itself makes assumptions or uses features that indirectly disadvantage certain groups.

Example: An algorithm that uses zip code as a feature for loan approval might seem neutral. But because neighborhoods are often segregated by race and income (due to historical discrimination in housing), zip code becomes a proxy for race. The algorithm discriminates without ever using race directly.

The problem: Even “neutral” features can encode discriminatory patterns. By using zip code, the AI is effectively using race and income – just in a hidden way.

5. Evaluation Bias

What it is: Testing the model on data that doesn’t represent all the groups who will actually use it.

Example: Testing a medical AI only on data from one hospital that serves a predominantly white, affluent population. The AI might work great in testing – but fail badly when deployed in a diverse community hospital.

The problem: We don’t discover poor performance for some groups until the AI is already deployed in the real world, causing real harm.


How Bias Affects Model Evaluation

Now that we understand what bias is and where it comes from, let’s see how it affects our evaluation metrics. The key insight is that overall metrics can hide serious problems.

Think of it like grading a class. If you only report the class average, you might miss that some students are struggling badly while others excel. Similarly, overall accuracy can mask significant disparities between groups.

The Problem with Average Metrics

When we calculate overall accuracy, we might miss serious problems for specific groups:

Example: Disease Detection AI

Group      Population Size    AI Accuracy
Group A    900 people         95%
Group B    100 people         50%
Overall    1000 people        90.5%

The overall accuracy (90.5%) looks good! But the AI fails for HALF of Group B – potentially missing disease in 50 people. If we only checked overall metrics, we’d never discover this problem.

This happens because Group A is much larger. Their good results dominate the average, hiding Group B’s terrible experience.
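The arithmetic behind this is simply a population-weighted mean, which is worth seeing explicitly (the group sizes and accuracies are the ones from the table above):

```python
# Overall accuracy is a population-weighted mean of per-group
# accuracies, so the larger group dominates the result.
sizes = {"Group A": 900, "Group B": 100}
acc   = {"Group A": 0.95, "Group B": 0.50}

total = sum(sizes.values())
overall = sum(sizes[g] * acc[g] for g in sizes) / total

print(f"{overall:.1%}")  # 90.5% — Group B's 50% barely moves the average
```

With 900 of the 1000 people in Group A, Group B would need to fail almost completely before the overall number dipped below 85%.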

Disaggregated Evaluation

The solution is disaggregated evaluation – checking model performance separately for different groups:

  • Different genders (male, female, non-binary)
  • Different age groups (children, adults, elderly)
  • Different ethnicities and races
  • Different income levels
  • Different geographic regions
  • Different abilities (people with disabilities)

This reveals hidden disparities that overall metrics mask. A model that looks great overall might be failing certain communities – and disaggregated evaluation is how we find out.
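The idea can be captured in a few lines. Below is a minimal sketch of disaggregated evaluation; the labels, predictions, and group memberships are hypothetical toy data:

```python
# Disaggregated evaluation: compute accuracy separately for each group
# instead of reporting one overall number.

def disaggregated_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy} computed over each group's rows only."""
    scores = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(1 for i in idx if y_true[i] == y_pred[i])
        scores[g] = correct / len(idx)
    return scores

# Hypothetical toy data: 8 people, 4 in each group
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

scores = disaggregated_accuracy(y_true, y_pred, groups)
print(scores)  # {'A': 0.75, 'B': 0.5} — Group B fares much worse
```

The same loop works for any metric: swap the accuracy computation for precision, recall, or F1 and run it per group.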

Example: Facial Recognition Audit

A famous research study examined commercial facial recognition systems used by major tech companies. The overall accuracy was very high. But when researchers checked performance for different groups:

Demographic Group          Error Rate
Lighter-skinned males      0.8%
Lighter-skinned females    1.4%
Darker-skinned males       12.0%
Darker-skinned females     34.7%

Overall accuracy might be 95%+, but darker-skinned women experienced 34.7% errors – more than one in three! This is completely unacceptable for real-world use, especially in applications like security or law enforcement.

Without disaggregated evaluation, this problem would have remained hidden.


Fairness in AI: Different Definitions

When we say an AI system should be “fair,” what exactly do we mean? It turns out that fairness can be defined in several different ways – and these definitions sometimes conflict with each other.

Understanding different fairness definitions helps us make better decisions about how to evaluate and design AI systems. There’s no single “correct” definition – different situations call for different approaches.

What is Fairness?

Fairness means the AI treats all groups equitably. But “equitably” can mean different things depending on what you consider important:

Definition 1: Demographic Parity

Meaning: The AI should make positive predictions at the same RATE for all groups, regardless of their qualifications.

Example: If 30% of Group A applicants are approved for loans, then 30% of Group B applicants should also be approved.

Formula:

P(Positive | Group A) = P(Positive | Group B)

Strength: Ensures equal representation in outcomes.

Limitation: Doesn’t consider if groups actually differ in qualifications. If 50% of Group A is qualified but only 20% of Group B is qualified, demographic parity would mean approving unqualified applicants or rejecting qualified ones.
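As a check, demographic parity needs only the predictions, not the true labels. A minimal sketch (the prediction lists are hypothetical):

```python
# Demographic parity compares positive-prediction RATES between groups,
# ignoring qualifications entirely.

def positive_rate(preds):
    """Fraction of predictions that are positive (1 = approved)."""
    return sum(preds) / len(preds)

def demographic_parity_gap(preds_a, preds_b):
    """Absolute difference in positive-prediction rates."""
    return abs(positive_rate(preds_a) - positive_rate(preds_b))

# Hypothetical loan decisions for ten applicants per group
preds_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 30% approved
preds_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 10% approved

gap = demographic_parity_gap(preds_a, preds_b)
print(f"parity gap: {gap:.0%}")  # 20 percentage points
```

Perfect demographic parity would mean a gap of zero; in practice a small tolerance is usually chosen for the application.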

Definition 2: Equal Opportunity

Meaning: Among people who DESERVE a positive outcome, the AI should correctly identify them at the same rate across all groups.

Example: If 80% of qualified Group A candidates are correctly identified (true positive rate), 80% of qualified Group B candidates should also be correctly identified.

Formula:

True Positive Rate (Group A) = True Positive Rate (Group B)

Strength: Focuses on equal treatment of deserving individuals.

Limitation: Only looks at true positives – doesn’t consider false positive rates.
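Unlike demographic parity, this check needs the true labels, because it only looks at people who actually deserved a positive outcome. A minimal sketch with hypothetical data:

```python
# Equal opportunity compares TRUE POSITIVE RATES: among truly qualified
# people (label 1), what fraction does the model correctly approve?

def true_positive_rate(y_true, y_pred):
    approved = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(approved) / len(approved)

# Hypothetical labels/predictions; each group has 4 qualified applicants
tpr_a = true_positive_rate([1, 1, 1, 1, 0], [1, 1, 1, 1, 0])  # 4 of 4
tpr_b = true_positive_rate([1, 1, 1, 1, 0], [1, 1, 0, 0, 0])  # 2 of 4

print(tpr_a, tpr_b)  # 1.0 vs 0.5 — equal opportunity is violated
```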

Definition 3: Equalized Odds

Meaning: The AI should have BOTH equal true positive rates AND equal false positive rates across groups.

Example: Both the chance of correctly approving qualified applicants AND the chance of wrongly approving unqualified applicants should be equal across groups.

Strength: Most comprehensive definition of equal treatment.

Limitation: Very strict; often mathematically impossible to achieve perfectly in practice.
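Because equalized odds constrains two rates at once, checking it means computing both per group. A minimal sketch with hypothetical data:

```python
# Equalized odds requires BOTH rates to match across groups:
#   TPR — qualified people correctly approved
#   FPR — unqualified people wrongly approved

def rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)   # (TPR, FPR)

tpr_a, fpr_a = rates([1, 1, 0, 0], [1, 1, 1, 0])  # TPR 1.0, FPR 0.5
tpr_b, fpr_b = rates([1, 1, 0, 0], [1, 0, 0, 0])  # TPR 0.5, FPR 0.0

# Equalized odds holds only if BOTH pairs match:
print(tpr_a == tpr_b and fpr_a == fpr_b)  # False
```

Here the groups differ on both rates, so the model fails equalized odds even though it would be easy to tune it to pass one rate or the other alone.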

The Fairness Trade-off

Here’s a challenging truth: Different fairness definitions can conflict with each other. You often cannot satisfy all fairness criteria simultaneously.

For example, achieving demographic parity might mean rejecting some qualified Group A members to make room for less qualified Group B members – violating equal opportunity. Or achieving equal opportunity might result in very different approval rates – violating demographic parity.

Choosing which fairness definition to prioritize depends on:

  • The specific application and its purpose
  • The consequences of different types of errors
  • Societal values and what we consider most important
  • Legal requirements and regulations

This is a value judgment, not just a technical decision.


Applying Bioethics to Model Evaluation

How do we think systematically about ethical issues in AI? One useful approach is to apply principles from bioethics – the field that deals with ethical issues in medicine and biology. These principles have been developed over decades and provide a helpful framework.

Remember the four principles of bioethics from earlier AI ethics discussions? Let’s apply them specifically to how we evaluate AI models:

1. Respect for Autonomy

Principle: People should understand and have control over AI decisions affecting them. They deserve to make informed choices.

In Model Evaluation, ask:

  • Can people understand why the AI made a decision about them?
  • Can they challenge or appeal AI decisions they believe are wrong?
  • Is the AI’s decision-making process transparent enough to explain?
  • Do people have meaningful alternatives if they don’t want AI making decisions about them?

Evaluation Question: “Can affected individuals understand and contest this AI’s decisions?”

2. Do Not Harm (Non-maleficence)

Principle: AI should not cause harm to anyone. If harm is unavoidable, it should be minimized.

In Model Evaluation, ask:

  • Does the AI harm certain groups more than others?
  • What are the consequences of false positives and false negatives for different groups?
  • Could the AI’s errors have serious, irreversible consequences?
  • Are vulnerable groups protected from disproportionate harm?

Evaluation Question: “Does this AI cause disproportionate harm to any group?”

3. Maximum Benefit (Beneficence)

Principle: AI should actively benefit everyone, not just some groups. The benefits should be widely shared.

In Model Evaluation, ask:

  • Does the AI provide benefits equally across groups?
  • Are improvements in AI performance helping all users or just some?
  • Is the AI addressing the real needs of diverse populations?
  • Who benefits most, and who benefits least?

Evaluation Question: “Does this AI benefit all groups fairly?”

4. Give Justice

Principle: Benefits and burdens should be distributed fairly. Those who bear risks should also receive benefits.

In Model Evaluation, ask:

  • Are the AI’s benefits accessible to all groups?
  • Do certain groups bear more risk from AI errors?
  • Does the AI perpetuate or reduce existing inequalities?
  • Are the most vulnerable groups protected?

Evaluation Question: “Does this AI distribute benefits and risks fairly?”


Case Studies: When AI Evaluation Failed

Learning from real failures helps us understand what can go wrong and how to prevent it. These case studies show how proper ethical evaluation could have caught problems before they caused harm.

Case Study 1: Healthcare Algorithm Bias

The Situation: A major healthcare AI system was used by hospitals across the United States to identify patients who need extra care and intervention. It assigned risk scores to millions of patients.

What Went Wrong: The AI used healthcare costs as a proxy for health needs. This seems reasonable – sicker people should need more healthcare spending, right? But historically, less money was spent on Black patients due to systemic barriers to healthcare access, discrimination, and economic factors. The AI learned that Black patients “need less care” – when actually they had less ACCESS to care.

The Result: At the same risk score, Black patients were actually much sicker than white patients. The AI systematically underestimated the health needs of Black patients, meaning they were less likely to be referred for additional care even when they needed it more.

Evaluation Failures:

  • Only looked at overall accuracy – didn’t check performance across racial groups
  • Didn’t question whether the training data proxy (cost = need) was fair
  • Didn’t consider historical inequalities affecting the data

Lesson: Always evaluate performance separately for different groups, and critically examine whether your training data reflects reality or historical unfairness.

Case Study 2: Hiring AI Discrimination

The Situation: A major tech company built an AI to screen job resumes and identify top candidates, hoping to make hiring more efficient and objective.

What Went Wrong: The AI was trained on 10 years of the company’s hiring data. Since the tech industry historically hired mostly men, the AI learned to associate male characteristics with being a good candidate. It penalized resumes that mentioned “women’s” (like “women’s chess club captain”) and favored candidates from all-male colleges.

The Result: The AI systematically ranked women lower than equally qualified men. Far from removing human bias from hiring, the AI amplified historical discrimination.

Evaluation Failures:

  • Tested for overall accuracy, not fairness across genders
  • Didn’t check if the AI treated male and female candidates equally
  • Didn’t consider that historical hiring data encoded past discrimination

Lesson: High accuracy doesn’t mean fairness. Always test for bias across different groups, especially protected characteristics.

Case Study 3: Facial Recognition Failures

The Situation: Commercial facial recognition systems were deployed in various applications including law enforcement, airport security, and phone unlocking.

What Went Wrong: Systems trained primarily on lighter-skinned faces had much higher error rates for darker-skinned individuals, especially women. But companies didn’t discover this because testing datasets were also predominantly lighter-skinned.

The Result: Wrongful identifications, including documented cases of innocent people being arrested based on faulty facial recognition matches. People lost hours or days of their lives being detained for crimes they didn’t commit.

Evaluation Failures:

  • Tested on datasets that weren’t diverse
  • Didn’t disaggregate performance by demographic groups
  • Deployed systems despite incomplete testing

Lesson: Test on diverse, representative data and always report disaggregated metrics. Don’t deploy systems that haven’t been properly evaluated across all groups who will use them.


Strategies for Ethical Model Evaluation

Understanding the problems isn’t enough – we need concrete strategies to prevent them. Here are practical approaches for ensuring AI evaluation is ethical and thorough.

These strategies should be built into the AI development process from the beginning, not added as an afterthought.

Strategy 1: Use Diverse, Representative Data

For Training:

  • Ensure data includes all groups the AI will serve
  • Balance representation of different demographics
  • Audit historical data for embedded biases before using it
  • Consider whether past data reflects how things SHOULD be, not just how they WERE

For Testing:

  • Test on data that represents all user groups proportionally
  • Include edge cases and minority groups
  • Don’t just use convenient or easily available data
  • Actively seek out underrepresented groups for testing

Strategy 2: Evaluate Disaggregated Metrics

Don’t just calculate:

  • Overall accuracy: 92%

Also calculate:

  • Accuracy for Group A: 95%
  • Accuracy for Group B: 88%
  • Accuracy for Group C: 75% ← Problem identified!

Check all relevant metrics (accuracy, precision, recall, F1) separately for each relevant group. If you find significant disparities, investigate and address them before deployment.
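One practical way to operationalize this strategy is to set a minimum acceptable threshold and flag any group falling below it. A sketch, where both the threshold and the per-group numbers are illustrative assumptions:

```python
# Strategy 2 in practice: flag any group whose metric falls below an
# agreed-upon floor, so disparities are caught before deployment.

MIN_ACCURACY = 0.85  # assumed acceptable floor for this application

per_group_accuracy = {"Group A": 0.95, "Group B": 0.88, "Group C": 0.75}

flagged = {g: a for g, a in per_group_accuracy.items()
           if a < MIN_ACCURACY}
print(flagged)  # {'Group C': 0.75} — investigate before deployment
```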

Strategy 3: Define Fairness Criteria Upfront

Before building the model, decide:

  • What does “fair” mean for this specific application?
  • Which groups need protection from discrimination?
  • What level of disparity is acceptable vs. unacceptable?
  • Which fairness definition (demographic parity, equal opportunity, etc.) is most appropriate?

Document these decisions and evaluate the final model against them. Having clear criteria prevents the temptation to rationalize unfair results after the fact.

Strategy 4: Involve Diverse Stakeholders

Include in the evaluation process:

  • People from communities affected by the AI
  • Ethicists and social scientists who study fairness
  • Diverse team members with different perspectives and life experiences
  • External auditors who can provide independent assessment

Different viewpoints reveal blind spots. What seems obviously fair to one person might seem obviously unfair to another – especially someone from a different background.

Strategy 5: Consider Context and Consequences

Ask:

  • What happens when the AI makes mistakes? How serious are the consequences?
  • Are consequences equal across groups, or do some groups suffer more from errors?
  • Could errors cause serious, irreversible harm to anyone?
  • Who bears the risk of AI failures? Is that fair?

The same error rate might be acceptable for a music recommendation system but completely unacceptable for a medical diagnosis system.

Strategy 6: Ensure Transparency and Explainability

  • Can we explain why the AI made a specific decision?
  • Can affected individuals understand the reasoning?
  • Are the evaluation methods and results public and auditable?
  • If someone challenges a decision, can we provide a meaningful explanation?

Transparency enables accountability. If we can’t explain how the AI works, we can’t properly evaluate whether it’s fair.

Strategy 7: Plan for Continuous Monitoring

Bias can emerge or worsen over time as:

  • User populations change
  • Real-world conditions shift
  • The AI’s decisions affect future data (feedback loops)
  • New groups start using the system

Regularly re-evaluate deployed models. A system that was fair at launch might become unfair as circumstances change.
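Continuous monitoring can be as simple as recomputing the per-group gap on each new batch of production data and alerting when it widens. A sketch, where the batches and the threshold are illustrative assumptions:

```python
# Monitoring sketch: alert when the gap between the best- and
# worst-served group exceeds a chosen threshold.

MAX_GAP = 0.10  # widest acceptable per-group accuracy gap (assumed)

def needs_review(per_group_accuracy, max_gap=MAX_GAP):
    """True if any two groups differ by more than max_gap."""
    values = per_group_accuracy.values()
    return max(values) - min(values) > max_gap

month_1 = {"Group A": 0.92, "Group B": 0.90}   # fair at launch
month_6 = {"Group A": 0.93, "Group B": 0.78}   # gap has widened

print(needs_review(month_1), needs_review(month_6))  # False True
```

Running a check like this on a schedule turns "re-evaluate regularly" from a good intention into an automatic alarm.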


Ethical Evaluation Checklist

Use this checklist when evaluating AI models to ensure you’ve considered all important ethical dimensions:

Data Evaluation

  • [ ] Is training data representative of all groups the AI will serve?
  • [ ] Have you audited the data for historical biases?
  • [ ] Is testing data diverse and representative?
  • [ ] Do you know the demographic composition of your data?

Performance Evaluation

  • [ ] Have you calculated metrics separately for different groups?
  • [ ] Are there significant disparities between groups?
  • [ ] Have you checked multiple metrics (accuracy, precision, recall, F1)?
  • [ ] Do all groups meet minimum acceptable performance thresholds?

Fairness Evaluation

  • [ ] Have you defined what “fair” means for this application?
  • [ ] Does the model meet your chosen fairness criteria?
  • [ ] Have you checked for different types of bias?
  • [ ] Have you considered conflicting fairness definitions?

Impact Evaluation

  • [ ] What happens when the model makes errors?
  • [ ] Who is harmed by errors, and how severely?
  • [ ] Are consequences equal across groups?
  • [ ] Have vulnerable populations been specifically considered?

Process Evaluation

  • [ ] Have diverse stakeholders been involved in evaluation?
  • [ ] Can decisions be explained and appealed?
  • [ ] Is there a plan for ongoing monitoring?
  • [ ] Who is accountable if problems emerge?

The Role of Human Oversight

As powerful as AI becomes, human oversight remains essential for ethical AI systems. This is especially true for high-stakes decisions that significantly affect people’s lives.

Think of AI as a powerful tool that assists human decision-making, not a replacement for it. Just as a calculator doesn’t replace mathematical understanding, AI shouldn’t replace human judgment in important matters.

When Human Oversight is Critical

AI should ASSIST, not REPLACE, human judgment in areas with serious consequences:

Application          Why Human Oversight Matters
Criminal justice     Imprisonment affects lives permanently; requires human accountability
Medical diagnosis    Health outcomes depend on catching AI errors; doctors provide context
Loan approval        Financial futures are at stake; humans can consider special circumstances
Hiring decisions     Career opportunities shouldn’t depend solely on algorithms
Child welfare        Children’s safety requires human judgment and empathy

What Human Oversight Provides

Humans bring capabilities that AI currently lacks:

  • Verification: Checking AI recommendations against common sense and experience
  • Context: Understanding unique circumstances AI might miss
  • Accountability: Someone responsible for decisions
  • Appeal process: A path for challenging wrong decisions
  • Empathy: Understanding the human impact of decisions

The Spectrum of Human Involvement

The more serious the consequences, the more human involvement is required:

Consequence Level    Example                  Human Involvement
Low stakes           Movie recommendations    AI decides, human reviews occasionally
Medium stakes        Spam filtering           AI decides, human can override easily
High stakes          Loan applications        AI recommends, human decides
Critical stakes      Criminal sentencing      AI provides information, human decides with full discretion

Quick Recap

Let’s summarize the key concepts we’ve learned about ethical AI evaluation:

Why Ethical Evaluation Matters:

  • Technical metrics don’t show if AI is fair to everyone
  • Overall accuracy can hide disparities between groups
  • Real people suffer when AI makes unfair decisions

Types of Bias:

  • Data/Historical Bias: Training data reflects past discrimination
  • Sampling Bias: Some groups underrepresented in data
  • Measurement Bias: How we measure things is flawed
  • Algorithm Bias: Features indirectly encode discrimination
  • Evaluation Bias: Testing doesn’t represent all users

Fairness Definitions:

  • Demographic Parity: Equal positive prediction rates across groups
  • Equal Opportunity: Equal true positive rates for deserving individuals
  • Equalized Odds: Equal TPR and FPR across groups
  • Different definitions can conflict – choice depends on values

Bioethics Framework:

  • Respect for Autonomy: Can people understand and contest decisions?
  • Do Not Harm: Does AI cause disproportionate harm?
  • Maximum Benefit: Does AI benefit all groups fairly?
  • Give Justice: Are risks and benefits distributed fairly?

Key Strategies:

  • Use diverse, representative data
  • Evaluate disaggregated metrics for all groups
  • Define fairness criteria before building
  • Involve diverse stakeholders
  • Ensure transparency and explainability
  • Maintain human oversight for high-stakes decisions
  • Monitor continuously after deployment

Key Takeaway: Ethical AI evaluation goes beyond technical metrics. We must ask not just “Does it work?” but “Does it work FAIRLY for EVERYONE?” Building trustworthy AI requires considering the human impact of our systems on all people, especially vulnerable groups.


Activity: Evaluate a Loan Approval AI

Here’s a scenario to practice ethical evaluation:

Scenario: A bank uses an AI to approve or reject loan applications. You’re given data about its performance:

Metric                    Urban Applicants    Rural Applicants
Approval rate             65%                 45%
Accuracy                  88%                 78%
False rejection rate      8%                  22%
Share of training data    85%                 15%

Questions:

  1. Is there evidence of bias in this system? Explain.
  2. What types of bias might be present based on the data?
  3. Applying the “Give Justice” principle, is this fair?
  4. What additional information would you want before making a final judgment?
  5. What recommendations would you make to the bank?



Chapter-End Exercises

A. Fill in the Blanks

  1. ________ in AI refers to systematic errors that result in unfair outcomes for certain groups.
  2. When AI learns unfair patterns from past discrimination in data, it’s called ________ bias.
  3. ________ bias occurs when training data doesn’t represent all groups equally.
  4. ________ evaluation checks model performance separately for different demographic groups.
  5. ________ parity requires equal positive prediction rates across all groups.
  6. The “Do Not ________” principle asks whether AI causes disproportionate harm to any group.
  7. ________ bias can occur when the features we use indirectly encode discrimination.
  8. Facial recognition systems have shown higher error rates for ________-skinned individuals.
  9. High overall accuracy can ________ serious problems affecting specific groups.
  10. Human ________ is essential for high-stakes AI decisions.

B. Multiple Choice Questions

  1. Why is ethical evaluation of AI important?
    • a) To make AI faster
    • b) To ensure AI treats all groups fairly
    • c) To increase computing power
    • d) To reduce training time
  2. What is historical/data bias?
    • a) AI that is programmed to discriminate
    • b) Training data that reflects past inequalities
    • c) Using old computers
    • d) Having too much data
  3. What does disaggregated evaluation mean?
    • a) Combining all results together
    • b) Checking performance separately for different groups
    • c) Using only one metric
    • d) Ignoring group differences
  4. Which scenario demonstrates sampling bias?
    • a) Equal data from all groups
    • b) Training mostly on one demographic
    • c) Using current data only
    • d) Testing the model thoroughly
  5. What is demographic parity?
    • a) Different accuracy for different groups
    • b) Equal positive prediction rates across groups
    • c) More data for some groups
    • d) Ignoring demographics
  6. Which bioethics principle asks “Does AI benefit all groups fairly?”
    • a) Respect for Autonomy
    • b) Do Not Harm
    • c) Maximum Benefit
    • d) Give Justice
  7. Why can zip code be a source of bias?
    • a) Zip codes are hard to collect
    • b) Zip codes can correlate with race due to historical segregation
    • c) Zip codes are too specific
    • d) Zip codes change frequently
  8. What should companies do about AI bias?
    • a) Ignore it because AI is objective
    • b) Test for bias and work to reduce it
    • c) Only use AI for simple tasks
    • d) Blame the data
  9. When is human oversight most important for AI?
    • a) Music recommendations
    • b) Weather predictions
    • c) Criminal justice decisions
    • d) Spam filtering
  10. What does “continuous monitoring” mean in ethical AI?
    • a) Checking the model once before deployment
    • b) Regular re-evaluation of deployed models
    • c) Monitoring computer performance
    • d) Continuous training of new models

C. True or False

  1. High overall accuracy guarantees an AI system is fair.
  2. Historical bias occurs when training data reflects past discrimination.
  3. Sampling bias means having too much training data.
  4. Different fairness definitions can sometimes conflict with each other.
  5. Demographic parity means equal accuracy for all groups.
  6. Using zip code as a feature can never cause discrimination.
  7. Disaggregated evaluation helps reveal hidden performance disparities.
  8. Testing on diverse data helps identify bias before deployment.
  9. Once deployed, AI systems don’t need further evaluation.
  10. Human oversight is unnecessary if an AI has high accuracy.

D. Definitions

Define the following terms in 30-40 words each:

  1. Bias (in AI)
  2. Historical/Data Bias
  3. Sampling Bias
  4. Fairness (in AI)
  5. Disaggregated Evaluation
  6. Demographic Parity
  7. Algorithmic Accountability

E. Very Short Answer Questions

Answer in 40-50 words each:

  1. Why isn’t overall accuracy enough to evaluate if an AI is fair?
  2. How can historical bias enter an AI system? Give an example.
  3. What is sampling bias and how does it affect AI performance?
  4. Why is disaggregated evaluation important for detecting bias?
  5. How can using zip code as a feature lead to discrimination?
  6. How does the “Do Not Harm” principle apply to AI evaluation?
  7. Describe the healthcare AI bias case study and what went wrong.
  8. Why might different fairness definitions conflict with each other?
  9. When is human oversight most critical for AI decision-making?
  10. What should be done before building an AI to ensure fairness?

F. Long Answer Questions

Answer in 75-100 words each:

  1. Explain three different types of bias that can affect AI systems. Give examples of each.
  2. Describe the healthcare AI bias case study. What was the AI trying to do, what went wrong, and what lessons should we learn?
  3. What does “fairness” mean in AI? Explain at least two different definitions of fairness and why they might conflict.
  4. How can the four bioethics principles (Respect for Autonomy, Do Not Harm, Maximum Benefit, Give Justice) be applied to evaluate AI systems?
  5. Describe five strategies that organizations should follow to ensure ethical AI evaluation.
  6. What is disaggregated evaluation and why is it important? Give an example showing how overall metrics can hide bias.
  7. Discuss the role of human oversight in AI decision-making. When is it most important and what does it provide?


Answer Key

A. Fill in the Blanks – Answers

  1. Bias
    Explanation: Bias refers to systematic errors causing unfair outcomes for certain groups.
  2. historical/data
    Explanation: Historical bias occurs when training data contains past discrimination patterns.
  3. Sampling
    Explanation: Sampling bias occurs when some groups are underrepresented in training data.
  4. Disaggregated
    Explanation: Disaggregated evaluation checks performance separately for each group.
  5. Demographic
    Explanation: Demographic parity requires equal positive prediction rates across all groups.
  6. Harm
    Explanation: The “Do Not Harm” principle evaluates whether AI causes disproportionate harm.
  7. Algorithm
    Explanation: Algorithm bias can occur through features that indirectly encode discrimination.
  8. darker
    Explanation: Studies show facial recognition has higher errors for darker-skinned individuals.
  9. hide/mask
    Explanation: High overall accuracy can hide serious problems for specific groups.
  10. oversight
    Explanation: Human oversight is essential for high-stakes AI decisions.

B. Multiple Choice Questions – Answers

  1. b) To ensure AI treats all groups fairly
    Explanation: Ethical evaluation ensures AI doesn’t discriminate or harm certain groups.
  2. b) Training data that reflects past inequalities
    Explanation: Historical bias comes from data that encodes past discrimination.
  3. b) Checking performance separately for different groups
    Explanation: Disaggregated evaluation reveals hidden disparities between groups.
  4. b) Training mostly on one demographic
    Explanation: Sampling bias occurs when some groups are underrepresented in data.
  5. b) Equal positive prediction rates across groups
    Explanation: Demographic parity means equal rates of positive predictions for all groups.
  6. c) Maximum Benefit
    Explanation: Maximum Benefit asks whether AI benefits all groups fairly.
  7. b) Zip codes can correlate with race due to historical segregation
    Explanation: Neighborhood segregation means zip code can be a proxy for race/income.
  8. b) Test for bias and work to reduce it
    Explanation: Companies should actively evaluate for and address bias in AI systems.
  9. c) Criminal justice decisions
    Explanation: High-stakes decisions with serious consequences require human oversight.
  10. b) Regular re-evaluation of deployed models
    Explanation: Continuous monitoring means regularly checking deployed AI for emerging bias.

C. True or False – Answers

  1. False
    Explanation: High overall accuracy can hide poor performance for specific groups.
  2. True
    Explanation: Historical bias occurs when training data reflects past discrimination.
  3. False
    Explanation: Sampling bias concerns unequal representation of groups in the data, not the total amount of data.
  4. True
    Explanation: Different fairness definitions can conflict – satisfying one may violate another.
  5. False
    Explanation: Demographic parity means equal PREDICTION RATES, not equal accuracy.
  6. False
    Explanation: Zip code can correlate with race/income due to segregation.
  7. True
    Explanation: Checking each group separately reveals problems hidden in averages.
  8. True
    Explanation: Diverse testing data reveals problems across different groups.
  9. False
    Explanation: Models need continuous monitoring as conditions change.
  10. False
    Explanation: High-stakes decisions need human review to catch errors.

D. Definitions – Answers

  1. Bias (in AI): Systematic errors in AI systems that result in unfair outcomes for certain groups. Bias can enter through training data, algorithm design, or evaluation methods, causing the AI to treat different groups inequitably.
  2. Historical/Data Bias: Bias that occurs when training data reflects past inequalities or discrimination. The AI learns these patterns and perpetuates historical unfairness, even if the current intent is to be fair.
  3. Sampling Bias: Bias that occurs when training data doesn’t represent all groups equally. Underrepresented groups experience worse AI performance because the model hasn’t learned enough about their characteristics.
  4. Fairness (in AI): The principle that AI systems should treat all groups equitably. Fairness can be defined in multiple ways, including equal prediction rates, equal accuracy, or equal error rates across different demographic groups.
  5. Disaggregated Evaluation: The practice of evaluating AI model performance separately for different groups (by gender, race, age, etc.) rather than only calculating overall metrics. This reveals hidden performance disparities.
  6. Demographic Parity: A fairness criterion requiring that positive predictions occur at equal rates across all demographic groups. For example, if 30% of one group receives loan approval, 30% of other groups should too.
  7. Algorithmic Accountability: The principle that those who create and deploy AI systems should be responsible for their outcomes. It includes transparency about how systems work, monitoring for bias, and addressing harms caused.
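The demographic parity check defined above (D.6) can be sketched in a few lines of Python. The group labels and predictions below are made-up illustrative data, not figures from the lesson:

```python
# Sketch: checking demographic parity on a set of loan decisions.
# 1 = approved, 0 = rejected; the two groups below are hypothetical.

def positive_rate(predictions):
    """Fraction of predictions that are positive (approved)."""
    return sum(predictions) / len(predictions)

group_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # 60% approved
group_b = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]   # 30% approved

rate_a = positive_rate(group_a)
rate_b = positive_rate(group_b)

# Demographic parity asks: are positive prediction rates (roughly) equal?
gap = abs(rate_a - rate_b)
print(f"Group A approval rate: {rate_a:.0%}")
print(f"Group B approval rate: {rate_b:.0%}")
print(f"Parity gap: {gap:.0%}")  # a large gap signals a parity violation
```

In practice a tolerance is chosen (for example, flag the model if the gap exceeds a few percentage points) rather than demanding exactly equal rates.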

E. Very Short Answer Questions – Answers

  1. Why accuracy alone isn’t enough: Accuracy averages performance across all groups, potentially hiding significant disparities. A model with 90% overall accuracy might have 95% accuracy for one group but only 70% for another – unfair despite good overall performance.
  2. Historical bias explained: Historical bias enters AI when training data reflects past discrimination. If a hiring AI trains on data where mostly men were hired, it learns to prefer men – perpetuating historical inequality even without explicit programming.
  3. Sampling bias example: A facial recognition system trained on 80% lighter-skinned faces will perform poorly on darker-skinned faces due to insufficient examples. The AI doesn’t learn diverse facial features, causing higher error rates for underrepresented groups.
  4. Disaggregated evaluation importance: Disaggregated evaluation checks model performance separately for each group (gender, race, age, etc.). It’s important because overall metrics can hide serious problems – a model might work well on average but fail badly for specific groups.
  5. Zip code causing discrimination: Zip codes correlate with race and income due to historical neighborhood segregation. An AI using zip code might learn that certain areas (predominantly minority/low-income) are “high risk,” effectively discriminating without using race directly.
  6. Do Not Harm in evaluation: This bioethics principle requires evaluating whether AI causes harm to any group. Questions include: Does the AI harm certain groups more? What are consequences of errors? Do some groups bear disproportionate risk?
  7. Healthcare AI case study: The AI used healthcare costs as a proxy for health needs. Historically, less was spent on Black patients. The AI learned Black patients “need less care” when actually they had less access. It systematically underestimated their health needs.
  8. Why fairness definitions conflict: Demographic parity (equal approval rates) might conflict with equal opportunity (equal true positive rates) when groups have different qualification rates. Achieving one often means sacrificing another – requiring careful value-based choices.
  9. When human oversight matters most: Human oversight is critical for high-stakes decisions with serious consequences – criminal justice, medical diagnosis, loan approval. Errors in these areas can devastate lives, requiring human review to catch AI mistakes.
  10. Before building for fairness: Define what “fair” means for this specific application. Identify which groups need protection, what disparities are unacceptable, and how you’ll measure fairness. Document these decisions to guide development and evaluation.

F. Long Answer Questions – Answers

  1. Three Types of AI Bias:
    • Historical Bias: Training data reflects past discrimination. Example: Hiring AI trained on historical data where 90% of hires were men learns to prefer male candidates, perpetuating past inequality.
    • Sampling Bias: Training data underrepresents certain groups. Example: Facial recognition trained mostly on lighter-skinned faces has a 35% error rate for darker-skinned women vs. 1% for lighter-skinned men.
    • Algorithm Bias: Algorithm assumptions disadvantage groups. Example: Using zip code as a feature encodes racial and economic segregation, causing discrimination against minority neighborhoods even without using race directly.
  2. Healthcare AI Case Study:
    A major healthcare AI assigned risk scores to identify patients needing extra care. It used healthcare costs as a proxy for health needs. Problem: Historically, less money was spent on Black patients due to access barriers. The AI learned Black patients “cost less” = “need less care.” Result: At equal risk scores, Black patients were significantly sicker than white patients. The AI systematically underestimated Black patients’ needs. Lesson: Don’t use biased proxy measures; evaluate performance across racial groups; question whether training data reflects reality or historical inequality.
  3. Fairness in AI:
    Fairness means AI treats all groups equitably, but definitions vary. Demographic Parity requires equal positive prediction rates – if 30% of Group A is approved, 30% of Group B should be too. Equal Opportunity requires equal true positive rates – if 80% of qualified Group A members are identified, 80% of qualified Group B members should be too. These can conflict: if Group A has 50% qualified members and Group B has 30%, demographic parity (equal approval rates) would mean different accuracy for qualified members. Choosing between definitions requires value judgments about what equality means.
  4. Four Bioethics Principles Applied:
    • Respect for Autonomy: Can affected individuals understand AI decisions? Can they appeal? Is the process transparent?
    • Do Not Harm: Does AI harm any group disproportionately? What are the consequences of errors for different groups? Could errors cause serious harm?
    • Maximum Benefit: Does AI benefit all groups equally? Are improvements helping everyone? Does it address diverse needs?
    • Give Justice: Are benefits accessible to all? Do some groups bear more risk? Does AI reduce or perpetuate inequalities?
  5. Five Strategies for Ethical AI Evaluation:
    1. Diverse Data: Ensure training data includes applicants from all demographics, genders, and backgrounds with balanced representation.
    2. Disaggregated Evaluation: Check accuracy, precision, and recall separately for different genders, races, and ages to identify disparities.
    3. Define Fairness Upfront: Decide what “fair” means – equal approval rates? Equal accuracy for qualified candidates? Document the criteria.
    4. Involve Stakeholders: Include ethicists, diverse team members, and representatives of affected communities in evaluation.
    5. Human Review: Require human review for high-stakes decisions, with appeal processes for those affected.
  6. Disaggregated Evaluation with Example:
    Disaggregated evaluation checks performance separately for each group, revealing hidden problems. Example: A loan approval AI has 90% overall accuracy.

    | Group   | Population | Accuracy |
    |---------|------------|----------|
    | Urban   | 800        | 95%      |
    | Rural   | 200        | 70%      |
    | Overall | 1000       | 90%      |

    The 90% average hides that rural applicants experience accuracy 25 percentage points lower. If we only checked overall metrics, we’d miss this unfairness. Disaggregated evaluation reveals the disparity, enabling targeted improvements for underserved groups.
  7. Role of Humans in AI Decision-Making:
    Humans should oversee AI, especially for high-stakes decisions. AI should assist, not replace, human judgment in areas with serious consequences: criminal justice (imprisonment affects lives), medical diagnosis (health outcomes), loan approval (financial futures). Human oversight provides: verification of AI recommendations, ability to catch AI errors, consideration of context AI might miss, accountability for decisions, and appeal processes. For lower-stakes applications (spam filtering, recommendations), less oversight is needed since errors are easily corrected. The key principle: the more serious the consequences, the more human involvement required.
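The urban/rural example in answer F.6 can be reproduced with a short sketch. The per-group counts below are chosen to mirror that example (800 urban, 200 rural); the correct-prediction counts are the illustrative values implied by the stated accuracies:

```python
# Sketch: disaggregated accuracy using the urban/rural counts from answer F.6.
# "correct" is the number of correct predictions per group (illustrative).

groups = {
    "Urban": {"total": 800, "correct": 760},   # 95% accurate
    "Rural": {"total": 200, "correct": 140},   # 70% accurate
}

# Overall accuracy pools all groups together...
total = sum(g["total"] for g in groups.values())
correct = sum(g["correct"] for g in groups.values())
print(f"Overall accuracy: {correct / total:.0%}")

# ...while disaggregated evaluation reports each group separately.
for name, g in groups.items():
    print(f"{name} accuracy: {g['correct'] / g['total']:.0%}")
```

Because the urban group is four times larger, its strong performance dominates the pooled figure, which is exactly how averaging can mask a struggling group.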

Activity Answer

  1. Evidence of bias: Yes – significant disparities between urban and rural applicants (65% vs 45% approval, 8% vs 22% false rejection rate, 88% vs 78% accuracy)
  2. Types of bias: Likely sampling bias (rural applicants are only 15% of training data) and possibly historical bias (past lending patterns may have favored urban areas)
  3. Give Justice analysis: Not fair – rural applicants are rejected at much higher rates and falsely rejected nearly 3x more often. They bear disproportionate burden of AI errors.
  4. Additional information needed:
    • Actual qualification rates for both groups
    • Default rates by group (are rural loans actually riskier?)
    • What features the AI uses
    • Historical lending patterns
    • Why the training data has so few rural examples
  5. Recommendations:
    • Audit training data for rural representation and add more rural examples
    • Investigate why rural false rejection rate is so high
    • Consider retraining with more balanced data
    • Implement human review for rural applications
    • Set fairness targets and monitor regularly
    • Consider whether using urban/rural-correlated features is appropriate
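The disparity figures quoted in the activity answer (65% vs 45% approval, 8% vs 22% false rejection, 88% vs 78% accuracy) can be checked with a quick calculation; this sketch just quantifies the gaps from those stated rates:

```python
# Sketch: quantifying the urban/rural disparities from the activity answer.
urban = {"approval": 0.65, "false_rejection": 0.08, "accuracy": 0.88}
rural = {"approval": 0.45, "false_rejection": 0.22, "accuracy": 0.78}

# How many times more often are rural applicants falsely rejected?
fr_ratio = rural["false_rejection"] / urban["false_rejection"]
print(f"Rural false rejections are {fr_ratio:.2f}x the urban rate")  # ~2.75x, i.e. "nearly 3x"

# Gaps in approval and accuracy, in percentage points
approval_gap = (urban["approval"] - rural["approval"]) * 100
accuracy_gap = (urban["accuracy"] - rural["accuracy"]) * 100
print(f"Approval gap: {approval_gap:.0f} points; accuracy gap: {accuracy_gap:.0f} points")
```

Turning reported rates into explicit ratios and gaps like this makes it easier to set concrete fairness targets (for example, capping the false-rejection ratio) for the monitoring recommended above.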

This lesson is part of the CBSE Class 10 Artificial Intelligence curriculum. For more AI lessons with solved questions and detailed explanations, visit iTechCreations.in

Previous Chapter: Confusion Matrix, Precision, Recall & F1 Score Explained

Next Chapter: No-Code AI Tools for Statistical Data Analysis

