
Imagine you’re preparing for your final exams. You study hard, solve practice questions, and feel confident. But here’s the important question: How do you KNOW you’re actually ready for the exam?
You could solve the same practice questions again and again until you get 100% – but that doesn’t prove you’ve learned the concepts. You might have just memorized the answers! The real test is when you face NEW questions you’ve never seen before.
This is exactly the challenge with AI models too.
When we train an AI model, it learns from data. But how do we know if it has actually learned useful patterns – or just memorized the training examples? How do we know it will work well on NEW, unseen data in the real world?
This is where Model Evaluation comes in – the process of testing how well our AI model performs.
Let’s dive in!
Learning Objectives
By the end of this lesson, you will be able to:
- Understand why evaluating AI models is important
- Explain the difference between training data and testing data
- Describe the Train-Test Split concept and why it’s necessary
- Understand the problem of overfitting and underfitting
- Explain what model accuracy means
- Understand the importance of evaluating models on unseen data
- Apply the concept of train-test split in AI project development
Why Do We Need to Evaluate AI Models?
Before deploying any AI model in the real world, we must answer a critical question: “Will this model actually work when it encounters new situations?” This isn’t just a nice-to-have – it’s absolutely essential. An AI model that seems to work perfectly during development might fail catastrophically when used in practice, potentially causing serious problems.
Think about it this way: Would you trust a self-driving car that was only tested on the same five roads? Would you trust a medical diagnosis AI that was only verified on the patients it was trained on? Of course not! We need to thoroughly evaluate our models to ensure they’ll perform well in situations they’ve never seen before.
The Fundamental Question
When we build an AI model, we want it to work in the real world – on data it has never seen before. But how do we know it will work?
The Problem: A model might perform perfectly on the data it was trained on but fail miserably on new data. This is like a student who memorizes answers but can’t solve new problems. The model looks great on paper, but it’s actually useless in practice.
The Solution: We need to TEST the model on data it has NEVER seen during training. This gives us an honest picture of how the model will perform in the real world.
Real-World Example: Medical Diagnosis
Let’s see why proper evaluation matters with a critical example – an AI model trained to detect cancer from X-ray images:
Scenario 1 (Bad Evaluation):
- Model is trained on 1,000 X-ray images
- We test it on the SAME 1,000 images
- It gets 99% accuracy! Amazing!
- We deploy it in hospitals, confident it will save lives
- But it fails on new patients’ X-rays – disaster! Cancers are missed, healthy patients are misdiagnosed
Scenario 2 (Good Evaluation):
- Model is trained on 800 X-ray images
- We test it on 200 DIFFERENT images it never saw during training
- It gets 85% accuracy on these new images
- We now have an honest estimate of real-world performance
- We can make informed decisions about deployment and improvement
The second approach tells us the TRUTH about how the model will perform. The first approach gives us a false sense of security that could lead to real harm.
Understanding Training Data vs Testing Data
To properly evaluate AI models, we need to understand two fundamental types of data: training data and testing data. These serve very different purposes, and keeping them separate is one of the most important principles in machine learning.
Think of it like preparing for an exam. The practice problems you solve while studying are like training data – they help you learn. The actual exam questions are like testing data – they reveal how much you’ve truly understood. If the exam just repeated the practice problems exactly, it wouldn’t really test your knowledge, would it?
What is Training Data?
Training Data is the data we use to TEACH the model. The model learns patterns from this data by adjusting its internal parameters.
Think of it as:
- The textbook you study from
- The practice problems you solve repeatedly
- The examples your teacher explains in class
During training:
- Model sees both inputs (features) and outputs (labels)
- Model adjusts its parameters to learn the relationship between inputs and outputs
- Model tries to minimize errors on this data
- Goal: Learn patterns that connect inputs to correct outputs
For example, when training a spam filter, the training data would include thousands of emails along with labels saying “spam” or “not spam.” The model studies these examples and learns what characteristics make an email likely to be spam.
What is Testing Data?
Testing Data is the data we use to EVALUATE the model after training is complete. The model has NEVER seen this data before – it’s completely new.
Think of it as:
- The actual exam paper you see for the first time
- Questions you’ve never practiced before
- The real test of whether you learned the concepts or just memorized examples
During testing:
- Model sees only inputs (features) – NOT the correct answers
- Model makes predictions based on what it learned
- We compare these predictions with the actual correct outputs
- Goal: Measure how well the model performs on completely new data
For the spam filter, testing data would be a fresh set of emails the model has never seen. We ask the model to predict whether each email is spam, then check how many predictions were correct.
Key Difference
Understanding the difference between training and testing data is crucial:
| Aspect | Training Data | Testing Data |
|---|---|---|
| Purpose | Teach the model patterns | Evaluate model performance |
| When used | During the learning phase | After learning is complete |
| Model’s exposure | Sees it many times, learns from it | Sees it only once, for final testing |
| Labels used for | Learning patterns and adjusting | Checking if predictions are correct |
| Analogy | Practice problems you study | Actual exam questions |
The Train-Test Split
Now that we understand the difference between training and testing data, the question is: where do we get testing data that the model has never seen? The answer is elegant and simple – we divide our available data into two parts before training even begins!
This process is called Train-Test Split, and it’s one of the most fundamental techniques in machine learning. It ensures we always have fresh, unseen data for evaluation.
What is Train-Test Split?
Train-Test Split is the process of dividing your complete dataset into two separate parts:
- Training Set: The larger portion, used to train the model
- Testing Set: The smaller portion, kept completely hidden and used only for final evaluation
The key insight is that once data goes into the testing set, the model must NEVER see it during training. This separation ensures honest evaluation.
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE DATASET │
│ (100% of data) │
├───────────────────────────────────────┬─────────────────────┤
│ TRAINING SET │ TESTING SET │
│ (70-80%) │ (20-30%) │
│ │ │
│ Used to TEACH the model │ Used to EVALUATE │
│ Model sees this during learning │ Model never sees │
│ │ this until testing │
└───────────────────────────────────────┴─────────────────────┘
Common Split Ratios
Different situations call for different split ratios. Here are the most commonly used ones:
| Split Ratio | Training | Testing | When to Use |
|---|---|---|---|
| 80:20 | 80% | 20% | Most common, works well for general purposes |
| 70:30 | 70% | 30% | When you want more thorough testing |
| 90:10 | 90% | 10% | When you have limited data and need more for training |
| 60:40 | 60% | 40% | When a highly reliable performance estimate is critical |
The 80:20 split is most commonly used in practice because it provides a good balance – enough training data for the model to learn well, and enough testing data for reliable evaluation.
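In code, the split is just a shuffle followed by a slice. Below is a minimal sketch in plain Python; the function name, the seed, and the toy dataset are illustrative choices, and libraries such as scikit-learn offer a ready-made `train_test_split` that does the same job:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the data, then slice off the last test_ratio portion for testing."""
    shuffled = data[:]                      # copy so the original list is untouched
    random.Random(seed).shuffle(shuffled)   # shuffle BEFORE splitting to avoid imbalanced splits
    split_point = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:split_point], shuffled[split_point:]

# 100 labeled examples, e.g. (feature, label) pairs
dataset = [(i, "spam" if i % 3 == 0 else "not spam") for i in range(100)]
train_set, test_set = train_test_split(dataset, test_ratio=0.2)

print(len(train_set), len(test_set))  # 80 20
```

Note that the shuffle happens before the slice: without it, any ordering in the original data (for example, all spam emails first) would produce unrepresentative sets.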
Example: Fruit Classification
Let’s see how train-test split works with a concrete example. Suppose you have 100 images of fruits (apples, oranges, bananas) that you want to use to train and test a fruit classifier:
Original Dataset: 100 images
After 80:20 Split:
- Training Set: 80 images (used to teach the model to recognize fruits)
- Testing Set: 20 images (kept hidden, used only for final evaluation)
ORIGINAL DATA (100 images)
🍎🍎🍎🍎🍎🍊🍊🍊🍊🍊🍌🍌🍌🍌🍌🍎🍎🍎🍎🍎
🍊🍊🍊🍊🍊🍌🍌🍌🍌🍌🍎🍎🍎🍎🍎🍊🍊🍊🍊🍊
🍌🍌🍌🍌🍌🍎🍎🍎🍎🍎🍊🍊🍊🍊🍊🍌🍌🍌🍌🍌
🍎🍎🍎🍎🍎🍊🍊🍊🍊🍊🍌🍌🍌🍌🍌🍎🍎🍎🍎🍎
🍊🍊🍊🍊🍊🍌🍌🍌🍌🍌🍎🍎🍎🍎🍎🍊🍊🍊🍊🍊
↓
SPLIT (80:20)
↓
┌─────────────────────────────────────┬────────────────┐
│ TRAINING SET (80) │ TEST SET (20) │
│ 🍎🍎🍎🍎🍊🍊🍊🍊🍌🍌🍌🍌... │ 🍎🍎🍊🍊🍌... │
│ │ │
│ Model LEARNS from these │ Model TESTED │
│ │ on these │
└─────────────────────────────────────┴────────────────┘
The model will learn to recognize fruits using the 80 training images. Then, to test how well it learned, we’ll show it the 20 testing images it has never seen and check its accuracy.
Why is Train-Test Split Important?
You might wonder: “Why go through all this trouble? Why not just use all the data for training and then test on the same data?” The answer reveals some fundamental truths about how machine learning works.
Train-test split isn’t just a technical detail – it’s essential for building AI systems that actually work in the real world. Here’s why:
Reason 1: Prevent Memorization (Overfitting)
If we test on training data, the model might just memorize specific answers instead of learning general patterns. This is a crucial problem called “overfitting.”
Example:
- A model trained on 100 cat images might memorize specific pixel patterns in those exact images
- When tested on the SAME images: 100% accuracy (it just remembers them!)
- When tested on NEW cat images: 60% accuracy (it never actually learned what a cat looks like)
Without separate test data, we would never discover this problem! Train-test split reveals whether the model truly learned or just memorized.
Reason 2: Estimate Real-World Performance
The testing set simulates how the model will perform on completely new data in the real world. This is incredibly valuable because once deployed, models will only encounter data they’ve never seen.
Real-world scenario:
- You deploy a spam filter trained on your dataset
- Every day, new emails arrive that weren’t in your training data
- The testing set accuracy is your best estimate of how well the filter will handle these new emails
Reason 3: Compare Different Models
When choosing between multiple models or approaches, testing set performance helps us pick the best one fairly.
Example:
- Model A: 95% training accuracy, 75% test accuracy
- Model B: 85% training accuracy, 82% test accuracy
Which is better? Model B! Even though Model A looks better on training data, Model B actually performs better on unseen data – which is what matters in the real world.
Reason 4: Detect Problems Early
Testing on separate data reveals issues before deployment, when they can still be fixed:
- Model performs well on training but poorly on testing → Overfitting problem – need to simplify
- Model performs poorly on both → Underfitting – need more features or complexity
- Model performs well on both → Ready for deployment!
Without proper evaluation, we might deploy a flawed model and only discover problems when it’s causing real harm.
The Problem of Overfitting and Underfitting
Two of the most important concepts in machine learning are overfitting and underfitting. Understanding these problems helps us build models that actually work well on new data.
Think of it like studying for an exam. Some students memorize answers without understanding concepts (overfitting). Other students don’t study enough and understand neither answers nor concepts (underfitting). The goal is to understand the concepts deeply enough to answer both familiar and new questions (good fit).
What is Overfitting?
Overfitting occurs when a model learns the training data TOO well – including noise, random variations, and patterns that don’t actually generalize to new data. The model essentially memorizes the training examples instead of learning the underlying concepts.
Analogy: Think of a student who memorizes the exact wording of practice problems. They can solve those specific problems perfectly, but give them a similar problem with slightly different numbers or wording, and they’re lost. They memorized instead of learning.
Signs of Overfitting:
- Very high training accuracy (e.g., 99%)
- Much lower testing accuracy (e.g., 70%)
- Large gap between training and testing performance
OVERFITTING
Training Accuracy: ████████████████████ 99%
Testing Accuracy: ██████████████ 70%
↑
Big gap = Problem!
Why it happens:
- Model is too complex for the amount of data available
- Not enough training examples to learn general patterns
- Training continues for too long (model starts memorizing)
- Model has too many features or parameters
What is Underfitting?
Underfitting occurs when a model is too simple to capture the patterns in the data. It hasn’t learned enough – neither from training data nor about the general problem.
Analogy: Think of a student who barely studied and doesn’t understand the subject. They perform poorly on practice problems AND on the exam. They haven’t learned the material at all.
Signs of Underfitting:
- Low training accuracy (e.g., 60%)
- Low testing accuracy (e.g., 55%)
- Both are poor, but relatively similar to each other
UNDERFITTING
Training Accuracy: ████████████ 60%
Testing Accuracy: ██████████ 55%
↑
Both low = Model too simple!
Why it happens:
- Model is too simple for the complexity of the data
- Not enough features to capture important patterns
- Training didn’t continue long enough
- The problem is more complex than the model can handle
The Goal: Good Fit
Good Fit is when the model learns genuine, generalizable patterns that work well on both training data and new data. This is what we’re aiming for!
GOOD FIT
Training Accuracy: █████████████████ 85%
Testing Accuracy: ████████████████ 82%
↑
Close values, both good = Success!
A good fit means the model has truly learned the underlying patterns. It performs well on training data (it learned something) and almost as well on testing data (what it learned generalizes to new situations).
Comparison Table
| Aspect | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Accuracy | Low | High | Very High |
| Testing Accuracy | Low | High (close to training) | Much lower than training |
| Problem | Model too simple | Just right | Model too complex |
| What to do | Add complexity, more features | Deploy the model! | Simplify, get more data |
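The diagnosis in the table above can be sketched as a simple rule of thumb. The thresholds below (a 10-point gap counts as "large", 70% counts as "low") are illustrative assumptions, not fixed standards:

```python
def diagnose_fit(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    """Rough heuristic: classify a model as overfitting, underfitting, or a good fit."""
    if train_acc - test_acc > gap_threshold:
        return "overfitting"   # high training accuracy, much lower test accuracy
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting"  # both accuracies are poor
    return "good fit"          # both good and close together

print(diagnose_fit(0.99, 0.70))  # overfitting
print(diagnose_fit(0.60, 0.55))  # underfitting
print(diagnose_fit(0.85, 0.82))  # good fit
```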
The Evaluation Process
Now let’s put everything together and see the complete process for evaluating an AI model. Following these steps systematically ensures we get an honest assessment of our model’s capabilities.
Step-by-Step Model Evaluation
Step 1: Collect Data
Start by gathering your complete dataset with features and labels. Ensure your data is representative of the real-world situations your model will encounter.
Step 2: Split Data
Divide into training set (80%) and testing set (20%). This must happen BEFORE any training begins, and the testing set must be kept completely separate.
Step 3: Train the Model
Use ONLY training data to teach the model. The model learns patterns, adjusts weights, and optimizes for accuracy on this data alone.
Step 4: Test the Model
Feed test data inputs to the trained model. The model makes predictions without seeing the correct answers. This is the moment of truth!
Step 5: Evaluate Performance
Compare the model’s predictions with the actual test labels. Calculate accuracy and analyze the results.
Step 6: Analyze Results
Look at the numbers carefully:
- High test accuracy → Model is working well
- Low test accuracy → Need improvements
- Big gap between training and test accuracy → Overfitting
Visual Process
Here’s a visual representation of the complete evaluation workflow:
┌──────────────┐
│ COLLECT │
│ ALL DATA │
└──────┬───────┘
│
▼
┌──────────────┐
│ SPLIT │
│ 80 : 20 │
└──────┬───────┘
│
┌───┴───┐
│ │
▼ ▼
┌──────┐ ┌──────┐
│TRAIN │ │ TEST │
│ SET │ │ SET │
└──┬───┘ └──┬───┘
│ │
▼ │ (Keep aside for now)
┌──────┐ │
│TRAIN │ │
│MODEL │ │
└──┬───┘ │
│ │
▼ │
┌──────────┐│
│ TRAINED ││
│ MODEL │◄┘ Now test with test set
└──┬───────┘
│
▼
┌──────────┐
│ EVALUATE │
│ ACCURACY │
└──────────┘
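The whole workflow above can be traced in a few lines of Python. This is a deliberately toy setup: the data is generated from a known rule, and the "model" is nothing more than a single learned threshold, so every name and number here is illustrative:

```python
import random

# Step 1: Collect data – each example is (spammy_word_count, label).
rng = random.Random(0)
data = []
for _ in range(200):
    count = rng.randint(0, 9)
    label = "spam" if count >= 5 else "not spam"  # ground-truth rule
    data.append((count, label))

# Step 2: Split 80:20 BEFORE any training.
rng.shuffle(data)
train, test = data[:160], data[160:]

def accuracy_at(threshold, examples):
    """Fraction of examples a 'spam if count >= threshold' rule classifies correctly."""
    correct = sum(1 for count, label in examples
                  if ("spam" if count >= threshold else "not spam") == label)
    return correct / len(examples)

# Step 3: Train – choose the threshold that works best on TRAINING data only.
best_threshold = max(range(10), key=lambda t: accuracy_at(t, train))

# Steps 4-6: Test on the held-out set and evaluate the result.
test_accuracy = accuracy_at(best_threshold, test)
print(best_threshold, round(test_accuracy, 2))  # high test accuracy: the rule generalizes
```

Because the test set was generated by the same underlying rule but never touched during "training", its accuracy is an honest estimate of how the learned threshold generalizes.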
What is Model Accuracy?
When we evaluate a model, we need a way to measure how well it’s performing. The most basic and commonly used measure is accuracy. It tells us, simply: “Out of all the predictions the model made, how many were correct?”
Definition
Accuracy is the percentage of correct predictions out of all predictions made by the model.
Formula:
Accuracy (%) = (Number of Correct Predictions ÷ Total Number of Predictions) × 100
This formula is straightforward: if the model makes 100 predictions and 85 are correct, the accuracy is 85%.
Example Calculation
Let’s say our fruit classifier is tested on 20 images from the testing set. Here’s how we would calculate its accuracy:
| Image # | Predicted | Actual | Correct? |
|---|---|---|---|
| 1 | Apple | Apple | ✓ |
| 2 | Orange | Orange | ✓ |
| 3 | Banana | Banana | ✓ |
| 4 | Apple | Apple | ✓ |
| 5 | Orange | Apple | ✗ |
| 6 | Banana | Banana | ✓ |
| 7 | Apple | Apple | ✓ |
| 8 | Orange | Orange | ✓ |
| 9 | Banana | Orange | ✗ |
| 10 | Apple | Apple | ✓ |
| … | … | … | … |
Results:
- Total predictions: 20
- Correct predictions: 16
- Incorrect predictions: 4
Accuracy Calculation:
Accuracy = (16 / 20) × 100 = 80%
The model correctly classifies 80% of the test images. This tells us that on new, unseen fruit images, we can expect the model to be correct about 80% of the time.
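The same calculation takes only a few lines of code. The sketch below mirrors the table above (16 correct out of 20 predictions); the label strings are just placeholders:

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual) * 100

# 20 test predictions, 16 of them correct – mirrors the table above
predicted = ["apple"] * 16 + ["orange"] * 4
actual    = ["apple"] * 16 + ["banana"] * 4

print(accuracy(predicted, actual))  # 80.0
```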
Interpreting Accuracy
What’s considered “good” accuracy depends heavily on the specific task:
| Accuracy Range | General Interpretation | Considerations |
|---|---|---|
| 90-100% | Excellent | Check for overfitting if training accuracy is similar |
| 80-90% | Good | Acceptable for many applications |
| 70-80% | Acceptable | May need improvement for critical applications |
| 60-70% | Poor | Significant improvement needed |
| Below 60% | Very Poor | Rethink the entire approach |
Important Note: What’s “acceptable” depends entirely on the application:
- For medical diagnosis, even 95% might not be enough – lives are at stake
- For music recommendations, 70% might be perfectly fine – a wrong suggestion isn’t dangerous
- For self-driving cars, 99% might still be too risky – mistakes can be fatal
Importance of Test Data Remaining Unseen
One principle is so important that it deserves special emphasis: Test data must NEVER be seen by the model during training. This is sometimes called the “golden rule” of machine learning evaluation.
This might seem obvious, but it’s surprisingly easy to accidentally violate this rule. And when you do, your evaluation becomes meaningless.
The Golden Rule
Test data must NEVER be seen by the model during training.
This is crucial for honest evaluation. If test data “leaks” into training, even indirectly:
- Evaluation becomes meaningless (you’re testing on what you trained on)
- Accuracy numbers become unreliable (probably too optimistic)
- Real-world performance will be worse than expected (leading to nasty surprises)
Data Leakage
Data Leakage happens when information from the test set influences the training process in any way. This can happen more easily than you might think:
Types of leakage:
- Direct Leakage: Accidentally using test data for training (obviously wrong, but happens)
- Indirect Leakage: Making decisions based on test set performance and then retraining (more subtle, but equally problematic)
Example of Indirect Leakage Problem:
Wrong Process:
1. Split data into training and testing sets
2. Train model on training data
3. Test model → 75% accuracy
4. Tune model settings based on test results
5. Test again → 80% accuracy
6. Tune more based on test results
7. Test again → 85% accuracy
8. Repeat until test accuracy looks good
Problem: Model has now been influenced by test data!
The 85% accuracy is not reliable.
Real-world performance will likely be lower.
Each time you adjust based on test results, you’re essentially “training” on the test set indirectly. The test set is no longer truly unseen.
Correct Process:
Right Process:
1. Split data into train, validation, and test sets
2. Train model on training data
3. Tune and adjust using validation data (not test data!)
4. Repeat tuning with validation data until satisfied
5. Final evaluation ONCE on test data
6. Report this number as your expected real-world performance
Advanced: Train-Validation-Test Split
For more rigorous evaluation and development, we often use THREE sets instead of two. This solves the problem of needing to tune and adjust our model while still having completely unseen data for final evaluation.
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE DATASET │
├───────────────────────────┬───────────────┬─────────────────┤
│ TRAINING SET │ VALIDATION │ TEST SET │
│ (60-70%) │ (10-15%) │ (15-20%) │
│ │ │ │
│ Used to TRAIN │ Used to TUNE │ Used for FINAL │
│ the model │ the model │ evaluation │
└───────────────────────────┴───────────────┴─────────────────┘
Training Set (60-70%): Used to teach the model – same as before.
Validation Set (10-15%):
- Used to tune the model (adjust settings, choose features)
- Can be used multiple times during development
- Helps catch overfitting while you’re still developing
- It’s okay if the model’s performance on this set influences your decisions
Test Set (15-20%):
- Used ONLY for final evaluation
- Used only ONCE at the very end
- Gives the most honest estimate of real-world performance
- Never let this influence any development decisions
This three-way split allows you to iterate and improve your model (using validation data) while still having completely uncontaminated data for final evaluation (the test set).
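The two-way split sketch extends naturally to three sets. A minimal illustration, with 65:15:20 proportions chosen as an example:

```python
import random

def train_val_test_split(data, val_ratio=0.15, test_ratio=0.20, seed=7):
    """Shuffle, then cut the data into three disjoint sets: train / validation / test."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    test = shuffled[:n_test]                    # final evaluation only, touched ONCE
    val = shuffled[n_test:n_test + n_val]       # tuning and model selection
    train = shuffled[n_test + n_val:]           # everything else goes to training
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 650 150 200
```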
Practical Example: Building a Spam Filter
Let’s trace through the entire evaluation process with a realistic example – building an AI spam filter for email.
Step 1: Collect Data
We gather 10,000 emails that have been labeled as either “Spam” or “Not Spam” by humans. This is our complete dataset.
Step 2: Split Data (80:20)
We divide the data before any training begins:
- Training set: 8,000 emails
- Testing set: 2,000 emails (kept completely separate – we won’t touch these until the end)
Step 3: Train the Model
We use the 8,000 training emails to teach our model. The model learns patterns like:
- Suspicious words (“FREE,” “WINNER,” “URGENT”)
- Sender characteristics (unknown senders, suspicious domains)
- Email formatting (lots of capital letters, unusual punctuation)
- Link patterns (shortened URLs, suspicious destinations)
Step 4: Evaluate on Training Data (Optional Check)
As a sanity check, we test on training data first:
- Accuracy on training data: 95%
- This tells us the model did learn something from the training data
Step 5: Evaluate on Test Data (Critical!)
Now the real test – we feed the 2,000 test emails to the model:
- Model predicts “Spam” or “Not Spam” for each email
- Model never saw these emails during training
- We compare predictions with the actual labels
Step 6: Calculate Test Accuracy
Results:
- Correct predictions: 1,700 out of 2,000
- Test Accuracy: (1,700 / 2,000) × 100 = 85%
Step 7: Analyze the Results
- Training accuracy: 95%
- Test accuracy: 85%
- Gap: 10 percentage points
The 10-point gap suggests some overfitting – the model learned some patterns specific to the training data that don't generalize perfectly. However, 85% is still quite good for a spam filter.
Step 8: Make a Decision
Based on our analysis:
- 85% accuracy is acceptable for a spam filter
- The model is ready for deployment with this expected performance
- We might want to collect more data and retrain to reduce the overfitting gap
Common Mistakes to Avoid
Learning from others’ mistakes is a great way to avoid making them yourself. Here are the most common evaluation mistakes and how to avoid them:
Mistake 1: Testing on Training Data
Wrong: “My model got 99% accuracy!” (tested on training data)
Why it’s wrong: Training accuracy doesn’t tell you how the model will perform on new data. It only tells you how well the model memorized what it was taught.
Right: Always report TEST accuracy as your measure of performance.
Mistake 2: Peeking at Test Data
Wrong: Looking at test data to “understand your data better” or “see what’s in there”
Why it’s wrong: Once you’ve seen the test data, it can influence your decisions, even subconsciously. This contaminates the evaluation.
Right: Keep test data completely hidden until final evaluation. Use training data or validation data to understand your dataset.
Mistake 3: Repeatedly Testing and Tuning
Wrong: Test → Tune → Test → Tune → Test (on the same test set)
Why it’s wrong: Each time you tune based on test results, you’re indirectly training on the test set. Your “final” accuracy will be unrealistically optimistic.
Right: Use a separate validation set for tuning. Touch the test set only ONCE for final evaluation.
Mistake 4: Imbalanced Splits
Wrong: All spam emails end up in training, all normal emails in testing (or vice versa)
Why it’s wrong: The test set won’t be representative of what the model will encounter in the real world.
Right: Randomly shuffle data before splitting to ensure both sets have similar proportions of each category.
Mistake 5: Too Small Test Set
Wrong: Using only 5% for testing (50 examples out of 1,000)
Why it’s wrong: With too few test examples, accuracy can vary wildly just by chance. Results aren’t statistically reliable.
Right: Use at least 20% for testing to get reliable accuracy estimates.
Quick Recap
Let’s summarize the key concepts we’ve learned about model evaluation:
Why Evaluate Models?
- To know how well the model will perform on new, real-world data
- To prevent deploying models that only work on training data
- To compare different models and approaches fairly
- To catch problems (overfitting/underfitting) before deployment
Training vs Testing Data:
- Training data teaches the model – it sees this data repeatedly
- Testing data evaluates the model – it sees this data only once, at the end
- Test data must remain completely unseen during training
Train-Test Split:
- Divide data into training (80%) and testing (20%) sets
- Split BEFORE training begins
- Common ratios: 80:20, 70:30, or 90:10
Overfitting vs Underfitting:
- Overfitting: Model memorizes training data but fails on new data (high training accuracy, low test accuracy)
- Underfitting: Model is too simple to learn the patterns (low accuracy on both)
- Good fit: Model performs well on both training and testing data
Accuracy:
- Percentage of correct predictions: (Correct / Total) × 100
- Calculate on TEST data, not training data
- What’s “good” depends on the application
Golden Rule: Never let the model see test data during training!
Activity: Evaluate Your Understanding
Now it’s time to apply what you’ve learned! Consider this scenario:
Scenario: You’re building an AI model to predict whether students will pass or fail based on their attendance and assignment scores. You have data for 500 students.
Questions:
- How would you split this data using 80:20 ratio? How many students would be in each set?
- If your model gets 92% accuracy on training data but only 68% accuracy on test data, what problem does this indicate? What would you do to fix it?
- If your model gets 55% accuracy on both training and test data, what problem does this indicate? What would you do to fix it?
- Why shouldn’t you keep testing and adjusting your model based on test set performance?
- What would be a “good fit” scenario for this model? Give example accuracy values.
Next Lesson: Confusion Matrix, Precision, Recall & F1 Score Explained Simply
Previous Lesson: Neural Networks Explained Simply: How AI Thinks and Makes Decisions
Chapter-End Exercises
A. Fill in the Blanks
- ______ is the process of checking how well an AI model performs.
- Training data is used to ______ the model, while testing data is used to evaluate it.
- The common ratio for train-test split is ______ for training and 20% for testing.
- When a model memorizes training data but fails on new data, it's called ______.
- When a model is too simple to capture patterns, it's called ______.
- ______ is the percentage of correct predictions out of total predictions.
- Test data must remain ______ during the training process.
- The gap between training accuracy and testing accuracy can indicate ______.
- ______ set is used for tuning the model without using the test set.
- The goal is to achieve a ______ fit where the model works well on both training and testing data.
B. Multiple Choice Questions
- Why do we need to evaluate AI models?
- a) To make training faster
- b) To know how well the model will perform on new data
- c) To reduce the cost of computation
- d) To increase the amount of data
- What is the purpose of testing data?
- a) To train the model
- b) To evaluate model performance on unseen data
- c) To store the model
- d) To increase accuracy
- What is the most common train-test split ratio?
- a) 50:50
- b) 80:20
- c) 95:5
- d) 100:0
- What does overfitting mean?
- a) Model is too simple
- b) Model performs poorly everywhere
- c) Model memorizes training data but fails on new data
- d) Model needs more layers
- What are the signs of underfitting?
- a) High training accuracy, low test accuracy
- b) Low training accuracy, low test accuracy
- c) High accuracy on both sets
- d) No errors at all
- How is accuracy calculated?
- a) Total predictions ÷ Correct predictions
- b) (Correct predictions ÷ Total predictions) × 100
- c) Training accuracy + Test accuracy
- d) Number of features × 100
- Why should test data remain unseen during training?
- a) To save storage space
- b) To get an honest evaluation of real-world performance
- c) To make training faster
- d) To reduce data size
- What indicates a good fit?
- a) 99% training accuracy, 50% test accuracy
- b) 40% accuracy on both sets
- c) High training accuracy, similar high test accuracy
- d) No training needed
- What is data leakage?
- a) Losing data from storage
- b) Test data influencing the training process
- c) Having too much data
- d) Splitting data incorrectly
- What is the purpose of a validation set?
- a) To train the model
- b) To replace the test set
- c) To tune the model without touching test data
- d) To store extra data
C. True or False
- Testing data can be used during training to improve the model.
- Train-test split helps detect if a model is overfitting.
- High training accuracy guarantees the model will work well on new data.
- Overfitting occurs when a model is too simple.
- Accuracy is calculated as (Correct predictions / Total predictions) × 100.
- It’s okay to repeatedly test and tune using the same test set.
- The testing set is typically smaller than the training set.
- Underfitting means the model performs poorly on both training and testing data.
- Validation set is different from test set and used for tuning.
- Training accuracy is the best measure of real-world performance.
D. Definitions
Define the following terms in 30-40 words each:
- Model Evaluation
- Training Data
- Testing Data
- Train-Test Split
- Overfitting
- Underfitting
- Accuracy
E. Very Short Answer Questions
Answer in 40-50 words each:
- Why is it important to evaluate AI models before deployment?
- What is the difference between training data and testing data?
- If you have 1000 images, how would you split them using 80:20 ratio?
- What are the signs that indicate a model is overfitting?
- What are the signs that indicate a model is underfitting?
- How do you calculate the accuracy of a model?
- Why must test data remain unseen during training?
- What is the purpose of a validation set?
- How can you identify if a model has achieved a good fit?
- What is data leakage and why is it a problem?
F. Long Answer Questions
Answer in 75-100 words each:
- Explain the concept of train-test split. Why is it necessary and what are the common split ratios used?
- Compare and contrast overfitting and underfitting. What are the causes and signs of each?
- Describe the complete process of evaluating an AI model. Include all steps from data collection to final analysis.
- Explain why test data must remain unseen during training. What problems occur if test data is used during training?
- What is the difference between training accuracy and testing accuracy? Why might they be different?
- A model shows 98% training accuracy but only 65% testing accuracy. What does this indicate and what steps would you take to address this issue?
- Explain the concept of train-validation-test split. When would you use three sets instead of two?
Next Lesson: Confusion Matrix, Precision, Recall & F1 Score Explained Simply
Previous Lesson: Neural Networks Explained Simply: How AI Thinks and Makes Decisions
Answer Key
A. Fill in the Blanks – Answers
- Model Evaluation
  Explanation: Model evaluation is the process of assessing how well a model performs.
- teach/train
  Explanation: Training data is used to teach the model patterns and relationships.
- 80%
  Explanation: The most common split is 80% for training and 20% for testing.
- overfitting
  Explanation: Overfitting occurs when a model memorizes training data but fails to generalize.
- underfitting
  Explanation: Underfitting occurs when a model is too simple to capture data patterns.
- Accuracy
  Explanation: Accuracy is the basic metric measuring the percentage of correct predictions.
- unseen/hidden
  Explanation: Test data must not be seen by the model during training for honest evaluation.
- overfitting
  Explanation: A large gap between training and testing accuracy is a sign of overfitting.
- Validation
  Explanation: The validation set is used for tuning without contaminating the test set.
- good
  Explanation: A good fit means the model performs well on both training and testing data.
B. Multiple Choice Questions – Answers
- b) To know how well the model will perform on new data
  Explanation: The main purpose of evaluation is to estimate real-world performance.
- b) To evaluate model performance on unseen data
  Explanation: Testing data measures how well the model generalizes to new situations.
- b) 80:20
  Explanation: 80% training and 20% testing is the most commonly used ratio.
- c) Model memorizes training data but fails on new data
  Explanation: Overfitting means excellent training performance but poor test performance.
- b) Low training accuracy, low test accuracy
  Explanation: Underfitting shows poor performance on both sets because the model is too simple.
- b) (Correct predictions ÷ Total predictions) × 100
  Explanation: Accuracy is the percentage of correct predictions out of all predictions.
- b) To get an honest evaluation of real-world performance
  Explanation: Unseen test data provides an unbiased estimate of how the model will perform.
- c) High training accuracy, similar high test accuracy
  Explanation: A good fit means strong performance on both sets with a minimal gap.
- b) Test data influencing the training process
  Explanation: Data leakage contaminates the evaluation process.
- c) To tune the model without touching test data
  Explanation: The validation set allows adjustments while preserving test-set integrity.
C. True or False – Answers
- False
  Explanation: Testing data must NEVER be used during training – only for final evaluation.
- True
  Explanation: By separating test data, we can detect if the model overfits to the training data.
- False
  Explanation: High training accuracy alone doesn’t guarantee good performance on new data.
- False
  Explanation: Overfitting occurs when a model is TOO COMPLEX; underfitting occurs when it is too simple.
- True
  Explanation: Accuracy = (Number of correct predictions / Total predictions) × 100.
- False
  Explanation: Repeatedly tuning on the same test set causes data leakage and makes evaluation unreliable.
- True
  Explanation: Typically 20-30% of the data goes to testing, leaving 70-80% for training.
- True
  Explanation: Underfitting means the model hasn’t learned enough, performing poorly everywhere.
- True
  Explanation: The validation set is for tuning; the test set is for final evaluation only.
- False
  Explanation: TESTING accuracy estimates real-world performance; training accuracy can be misleading.
D. Definitions – Answers
- Model Evaluation: The process of assessing how well a machine learning model performs on unseen data. It involves testing the trained model on data it hasn’t seen during training to estimate real-world performance and identify potential problems.
- Training Data: The portion of a dataset used to teach a machine learning model. The model learns patterns and relationships from this data by adjusting its parameters. Typically comprises 70-80% of the total dataset.
- Testing Data: The portion of a dataset kept completely separate and used only for final evaluation. The model never sees this data during training. It provides an honest estimate of how the model will perform on new, real-world data.
- Train-Test Split: The process of dividing a dataset into two parts: a training set for teaching the model and a testing set for evaluation. Common ratios include 80:20 or 70:30, ensuring fair assessment of model performance.
- Overfitting: A problem where a model learns the training data too well, including noise and irrelevant patterns. It performs excellently on training data but poorly on testing data, failing to generalize to new situations.
- Underfitting: A problem where a model is too simple to capture the underlying patterns in data. It performs poorly on both training and testing data, indicating the model needs more complexity or features.
- Accuracy: A metric measuring model performance as the percentage of correct predictions. Calculated as (Number of Correct Predictions / Total Predictions) × 100. Higher accuracy indicates better performance.
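The train-test split defined above can be sketched in a few lines of Python. This is a minimal, illustrative version using only the standard library (the function name `train_test_split` mirrors the scikit-learn helper real projects usually use, but this sketch is not that API):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the data, then split it into training and testing sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffle first so the split is unbiased
    split_point = int(len(items) * (1 - test_ratio))
    return items[:split_point], items[split_point:]

# Example: 1000 images split 80:20, as in the definitions above
images = [f"img_{i}.jpg" for i in range(1000)]
train_set, test_set = train_test_split(images, test_ratio=0.2)
print(len(train_set), len(test_set))  # 800 200
```

The key property to notice: every item lands in exactly one of the two sets, so the test set stays completely unseen during training.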
E. Very Short Answer Questions – Answers
- Why evaluate before deployment: Evaluation reveals how a model will perform on new, unseen data. Without proper evaluation, a model might fail in the real world despite appearing successful during development. Evaluation prevents deploying unreliable models and helps identify problems before they cause harm.
- Training vs Testing data: Training data is used to teach the model – it learns patterns from these examples. Testing data is used to evaluate the model – it measures performance on unseen data. Training data is seen multiple times; testing data is seen only once for final evaluation.
- 80:20 split example: If you have 1000 images, an 80:20 split divides them into 800 training images and 200 testing images. The model learns from 800 images and is evaluated on the remaining 200 images it has never seen.
- Signs of overfitting: High accuracy on training data (e.g., 98%) but significantly lower accuracy on testing data (e.g., 72%). The large gap indicates the model memorized training examples rather than learning generalizable patterns.
- Signs of underfitting: Low accuracy on both training data (e.g., 60%) and testing data (e.g., 55%). Similar poor performance on both indicates the model is too simple to capture the underlying patterns in the data.
- Calculating accuracy: Accuracy = (Correct predictions / Total predictions) × 100. Example: If a model makes 100 predictions and 85 are correct, accuracy = (85/100) × 100 = 85%.
- Why keep test data unseen: If the model sees test data during training, evaluation becomes meaningless – like giving students the exam answers beforehand. Test data must remain hidden to provide honest estimates of real-world performance.
- Purpose of validation set: The validation set allows tuning and adjusting the model without using the test set. This preserves the test set for final, unbiased evaluation while still enabling model improvement during development.
- Identifying good fit: A good fit shows high accuracy on both training and testing data with minimal gap between them. For example, 88% training accuracy and 85% testing accuracy indicates the model learned genuine patterns.
- Data leakage problem: Data leakage occurs when test data influences training, either directly or indirectly. It makes accuracy numbers unreliable and optimistic. Real-world performance will be worse than the leaked evaluation suggests.
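The accuracy calculation worked through above is simple enough to express directly in code. Here is a plain-Python sketch (no libraries), reusing the 85-out-of-100 example:

```python
def accuracy(predictions, actuals):
    """Accuracy = (correct predictions / total predictions) x 100."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(predictions) * 100

# Worked example from the answer above: 85 correct predictions out of 100
preds  = ["cat"] * 85 + ["dog"] * 15   # 15 wrong guesses (hypothetical labels)
labels = ["cat"] * 100
print(accuracy(preds, labels))  # 85.0
```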
F. Long Answer Questions – Answers
- Train-Test Split Explained:
  Train-test split divides a dataset into two parts: a training set (to teach the model) and a testing set (to evaluate performance). This separation is necessary because models might memorize training data without learning generalizable patterns, and testing on the same data used for training gives falsely optimistic results. Common ratios include 80:20 (most common), 70:30 (more testing), and 90:10 (limited-data scenarios). The training set should be larger to provide enough examples for learning, while the test set must be large enough for reliable evaluation.
- Overfitting vs Underfitting:
  Overfitting occurs when models are too complex – they memorize training data, including noise, rather than learning patterns. Signs: very high training accuracy (98%) but much lower testing accuracy (70%). Causes: too many features, too few examples, training too long. Underfitting occurs when models are too simple to capture patterns. Signs: low accuracy on both training (60%) and testing (58%). Causes: insufficient features, too simple a model, inadequate training. Overfitting needs simplification; underfitting needs more complexity.
- Complete Evaluation Process:
  The evaluation process involves: (1) Collect data – gather a complete dataset with features and labels. (2) Split data – divide into training (80%) and testing (20%) sets, keeping the test set completely separate. (3) Train the model – use only training data to teach the model patterns. (4) Test the model – feed test inputs to the trained model without showing it the labels. (5) Calculate accuracy – compare predictions with the actual test labels. (6) Analyze results – check for overfitting (a gap between training and testing accuracy) and decide whether the model is ready for deployment.
- Importance of Unseen Test Data:
  Test data must remain unseen to provide an honest evaluation. If the model sees test data during training, it may learn patterns specific to that data, inflating accuracy scores. This “data leakage” means the reported accuracy doesn’t reflect real-world performance: when deployed, the model faces truly new data and performs worse than expected. Additionally, repeatedly tuning based on test results indirectly leaks information. The test set should be used only once, for final evaluation.
- Training vs Testing Accuracy:
  Training accuracy measures performance on data the model learned from – it can be artificially high due to memorization. Testing accuracy measures performance on unseen data – it reflects real-world capability. They differ because models may overfit, memorizing training examples instead of learning patterns. A large gap (e.g., 95% training, 70% testing) indicates overfitting – the model works well on familiar data but fails on new data. A small gap with both values high indicates good generalization.
- Addressing 98% Training, 65% Testing:
  This scenario clearly indicates overfitting – the model memorized the training data but can’t generalize, and the 33-percentage-point gap is very large. To address this: (1) Simplify the model – use fewer parameters or features. (2) Get more training data – more examples help the model learn general patterns. (3) Use regularization techniques – penalize overly complex models. (4) Apply early stopping – stop training before overfitting sets in. (5) Use cross-validation – evaluate on multiple splits. The goal is to reduce the gap while maintaining reasonable accuracy.
- Train-Validation-Test Split Scenario:
  Use a three-way split when developing complex models that require multiple rounds of tuning. Scenario: building a medical-diagnosis AI where accuracy is critical. Training set (60%): teaches the model medical patterns. Validation set (15%): used repeatedly to tune hyperparameters, test different architectures, and catch overfitting during development. Test set (25%): used only ONCE, for final evaluation before deployment. This approach preserves test-set integrity while allowing iterative improvement using validation feedback.
Activity Answers
- Split for 500 students: Training set = 400 students (80%), Testing set = 100 students (20%)
- 92% training, 68% test: This indicates overfitting – the 24-percentage-point gap shows the model memorized the training data. Fix by: simplifying the model, getting more data, or using regularization.
- 55% on both: This indicates underfitting – the model is too simple to learn the patterns. Fix by: adding more features, using a more complex model, or training longer.
- Why not repeated testing: Each time you tune based on test results, you indirectly train on test data, causing data leakage and unreliable evaluation.
- Good fit example: Training accuracy of 85% and testing accuracy of 82% – both high, small gap, indicating genuine learning.
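The split arithmetic used in these activities (400/100 for 500 students, or 60/15/25 for a three-way split) can be checked with a tiny helper. This is a hypothetical utility written for this lesson, not a library function:

```python
def split_sizes(total, train=0.8, test=0.2, validation=0.0):
    """Return (train, validation, test) sample counts for the given ratios."""
    assert abs(train + validation + test - 1.0) < 1e-9, "ratios must sum to 1"
    n_train = int(total * train)
    n_val = int(total * validation)
    # Give any leftover samples (from rounding) to the test set
    return n_train, n_val, total - n_train - n_val

print(split_sizes(500))                                          # (400, 0, 100)
print(split_sizes(1000, train=0.6, validation=0.15, test=0.25))  # (600, 150, 250)
```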
This lesson is part of the CBSE Class 10 Artificial Intelligence curriculum. For more AI lessons with solved questions and detailed explanations, visit iTechCreations.in