How to Train a YOLO Model for Digital Meter Reading
“Can we train a YOLO model using an NVIDIA GeForce RTX 4050 GPU with just ~975 temperature panel images?”
This question from my colleague sparked an interesting journey into computer vision and object detection. What started as a simple inquiry turned into a comprehensive exploration of training YOLO models for digit recognition on digital displays - and yes, it absolutely works on an RTX 4050!
The Challenge
My colleague shared a dataset of approximately 975 images of temperature panel digital meter readings from refrigeration units. The goal was to automatically extract temperature readings from images like these, where each display shows temperatures for different rooms (Meat Room, Fish Room, Vegetable Room, and Dairy Room).
The first question was: Which YOLO version should we use? After some research and YouTube surfing, I discovered that YOLO11 (YOLOv11) is the latest and most optimized version as of 2024. Being relatively new to model training myself, I turned to Claude for guidance and learned the entire process step by step.
The Game Plan
Before diving into training, we needed a solid strategy. Here’s the approach that worked:
1. Data Selection Strategy
Instead of annotating all 975 images (which would take forever!), we selected approximately 20% (~195 images) for annotation. The key was ensuring our subset included:
- All digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
- The negative sign (-) for sub-zero temperatures
- Various display conditions (bright, dim, different angles)
- Different temperature ranges to ensure model generalization
Pro Tip: Quality over quantity! A well-curated subset with diverse examples trains better than randomly selected images.
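If you want a reproducible first pass at that 20% subset, a seeded random sample is an easy starting point. Here's a minimal sketch, assuming a flat folder of JPEGs (the folder names are placeholders); we still swapped images in and out by hand afterwards to cover the rare digits and display conditions listed above.

import random
import shutil
from pathlib import Path

SOURCE_DIR = Path("raw_images")          # all ~975 images (placeholder path)
SUBSET_DIR = Path("annotation_subset")   # the ~20% we send to Label Studio
SUBSET_DIR.mkdir(exist_ok=True)

images = sorted(SOURCE_DIR.glob("*.jpg"))
random.seed(42)  # reproducible selection
subset = random.sample(images, k=int(len(images) * 0.2))

for img in subset:
    shutil.copy(img, SUBSET_DIR / img.name)

print(f"Selected {len(subset)} of {len(images)} images for annotation")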
Setting Up the Annotation Environment
Label Studio - Your New Best Friend
The best tool I found for annotation was Label Studio. Here’s why it’s perfect for this task:
- Web-based interface (no complex software installation)
- Supports object detection with bounding boxes
- Exports directly to YOLO format
- Easy to set up with Docker
Getting Label Studio Running
Create a docker-compose.yml file:
version: "3.8"
services:
label-studio:
image: heartexlabs/label-studio:latest
container_name: label-studio
ports:
- "8080:8080"
volumes:
- ./mydata:/label-studio/data
environment:
- DATA_UPLOAD_MAX_NUMBER_FILES=1000
command: label-studio start --username admin@example.com --password password123
stdin_open: true
tty: true
Start it up:
docker-compose up -d
Visit http://localhost:8080 and log in with:
- Username: admin@example.com
- Password: password123
The Annotation Marathon
This is where patience becomes your superpower. Here’s the workflow:
1. Project Setup
- Create a new project in Label Studio
- Upload your curated image subset
- Select “Object Detection with Bounding Boxes” as your labeling setup
2. Define Your Classes
Our classes were:
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 (individual digits)
- negative_sign (the minus symbol)
3. The Annotation Process
Now comes the time-consuming but crucial part:
- Draw bounding boxes around each digit and negative sign
- Assign the correct class to each box
- Be consistent with box sizing (tight around digits works best)
- Take breaks! Your eyes and focus matter for quality annotations
Reality Check: This took me several coffee-fueled sessions. Each image typically had 8-12 digits to annotate, so budget your time accordingly.
4. Export Your Dataset
Once annotation is complete:
- Click the “Export” button
- Select “YOLO” format
- Download the zip file containing images and labels
Congratulations! You now have a YOLO-ready dataset.
Understanding YOLO Format Structure
Before diving into training, let’s understand what Label Studio actually exported and why it’s crucial for model training.
YOLO Dataset Structure
When you export from Label Studio in YOLO format, you get:
dataset/
├── images/ # Your annotated images
│ ├── image001.jpg
│ ├── image002.jpg
│ └── ...
├── labels/ # YOLO format annotation files
│ ├── image001.txt
│ ├── image002.txt
│ └── ...
└── classes.txt # Class definitions file
What’s in the Labels Folder?
Each .txt file in the labels folder corresponds to an image and contains that image's bounding box annotations in YOLO format. For example, image001.txt might contain:
2 0.456 0.234 0.087 0.156
1 0.567 0.234 0.098 0.167
10 0.398 0.234 0.045 0.089
5 0.678 0.456 0.089 0.178
YOLO Format Explained:
- Column 1: Class ID (0=digit "0", 1=digit "1", ..., 9=digit "9", 10=negative_sign)
- Column 2: X-center coordinate (normalized 0-1)
- Column 3: Y-center coordinate (normalized 0-1)
- Column 4: Width of bounding box (normalized 0-1)
- Column 5: Height of bounding box (normalized 0-1)
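To make the normalization concrete, here is a small helper (an illustration, not part of the original pipeline) that converts one label line back into pixel coordinates:

def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to a pixel-space bounding box."""
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    # YOLO stores the box center; convert to corner coordinates
    return int(class_id), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

# Example: the digit "2" annotation above, assuming a 1280x720 image
print(yolo_to_pixels("2 0.456 0.234 0.087 0.156", 1280, 720))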
Why This Format Matters
- Normalization: All coordinates are normalized to 0-1 range, making the model resolution-independent
- Center-based: YOLO uses center coordinates rather than top-left corner, which is more intuitive for object detection
- One file per image: Each image has its corresponding label file, making dataset management simple
- Class mapping: The classes.txt file maps class IDs to human-readable names
How YOLO Actually Identifies Digits
Important Clarification: YOLO doesn’t identify digits based on their position or location in the image. Instead, it uses structure-based recognition - learning the visual patterns and features that make each digit unique.
Structure-Based Detection (The Core Method)
When YOLO11 detects a digit, it:
Learns Visual Features: During training, the model learns distinctive patterns for each digit shape:
- Digit “0”: Oval/circular shape with enclosed area
- Digit “1”: Vertical line, sometimes with small top segment
- Digit “8”: Two enclosed loops stacked vertically
- Digit “2”: Curved top, horizontal middle, curved bottom
Feature Recognition: Uses convolutional neural networks to identify:
- Edge patterns (straight lines, curves)
- Geometric relationships between segments
- Spatial arrangements of display elements
- Unique structural characteristics of each digit
Location-Independent Detection: The model can recognize digits anywhere in the image, regardless of position. A “5” in the top-left corner is identified the same way as a “5” in the bottom-right.
The Two-Stage Process
Our temperature monitoring system uses a two-stage approach:
Stage 1: Digit Recognition (Structure-Based)
Input: Image pixel data
↓
YOLO11 Analysis: "I see a curved shape with two enclosed areas"
↓
Output: "This is digit '8' with 95% confidence at coordinates (x,y)"
Stage 2: Room Assignment (Position-Based)
Detected digit "8" at coordinates (150, 200)
↓
Check which quadrant contains point (150, 200)
↓
Result: "Digit '8' belongs to Fish Room temperature display"
Why This Distinction Matters
Common Misconception: “YOLO memorizes where each digit appears”
- ❌ Wrong: YOLO doesn’t learn “digit 2 always appears in top-left”
- ✅ Correct: YOLO learns “this curved shape pattern = digit 2”
Real Example from Our System:
# YOLO identifies digits by visual structure
detected_digits = [
    {'digit': '2', 'confidence': 0.94, 'position': (100, 150)},  # Meat Room
    {'digit': '2', 'confidence': 0.96, 'position': (400, 150)},  # Fish Room
    {'digit': '2', 'confidence': 0.93, 'position': (100, 350)},  # Vegetable Room
]

# Our code uses position to assign rooms (quadrants of a hypothetical 800x600 frame)
image_width, image_height = 800, 600

for digit in detected_digits:
    x, y = digit['position']
    if x < image_width / 2 and y < image_height / 2:
        room = "Meat Room"       # top-left quadrant
    elif x >= image_width / 2 and y < image_height / 2:
        room = "Fish Room"       # top-right quadrant
    elif x < image_width / 2:
        room = "Vegetable Room"  # bottom-left quadrant
    else:
        room = "Dairy Room"      # bottom-right quadrant
    print(f"Digit {digit['digit']} -> {room}")
The beauty of this approach is that YOLO can detect the same digit anywhere on the display, while our logic determines which temperature reading it belongs to based on spatial layout.
This makes the system robust - even if displays are positioned differently or digits appear in unexpected locations, YOLO will still recognize them correctly based on their visual structure.
Process Architecture Overview
Here’s the complete workflow we followed:
📁 Raw Images (975 images)
↓
🎯 Data Curation (Select ~20% diverse subset)
↓
📝 Manual Annotation (Label Studio)
├── Object Detection Setup
├── Class Definition (0-9, negative_sign)
└── Bounding Box Drawing
↓
📦 YOLO Format Export
├── images/ folder
├── labels/ folder (annotations)
└── classes.txt file
↓
📊 Data Analysis & Balance Check
└── Label distribution analysis
↓
🔧 Dataset Preparation
├── Train/Validation Split
└── YOLO config (data.yaml)
↓
🎓 Model Training (YOLO11)
├── Multiple epochs
├── Early stopping
└── Model validation
↓
🔍 Model Testing & Issues Discovery
├── Inference on test images
└── Performance analysis
↓
⚖️ Class Imbalance Issues Found
├── Some digits misclassified
└── Need for rebalancing
↓
🔄 Iterative Improvement
├── Additional annotations
├── Data augmentation
└── Retraining
↓
🚀 Production Model
└── Temperature extraction pipeline
Training the YOLO11 Model
Environment Setup
First, set up your Python environment:
# Create virtual environment
conda create -n yolo-env python=3.9
conda activate yolo-env
# Install required packages
pip install ultralytics torch torchvision opencv-python pyyaml
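Before launching a long training run, it's worth a quick sanity check that PyTorch can actually see the GPU:

import torch

print(torch.__version__)
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4050 Laptop GPU"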
Hardware Requirements
Good news for RTX 4050 users! Here’s what I found:
- GPU: RTX 4050 works perfectly (8GB VRAM is sufficient)
- RAM: 16GB system RAM recommended
- Storage: ~5GB for dataset and model files
- Training Time: ~2-4 hours depending on epochs and dataset size
The Training Script
Here’s a simplified version of the training process:
from ultralytics import YOLO
import torch

# Load YOLO11 model
model = YOLO('yolo11n.pt')  # nano version for faster training

# Training configuration
results = model.train(
    data='path/to/your/data.yaml',  # Dataset config file
    epochs=100,
    imgsz=640,
    batch=16,
    patience=20,
    device='cuda' if torch.cuda.is_available() else 'cpu',
    project='trained_models',
    name='digit_detector'
)
Dataset Configuration
Create a data.yaml file:
path: /path/to/your/dataset
train: images/train
val: images/val
names:
  0: "0"
  1: "1"
  2: "2"
  3: "3"
  4: "4"
  5: "5"
  6: "6"
  7: "7"
  8: "8"
  9: "9"
  10: "negative_sign"
Training Insights and Results
What I Learned
Model Size Selection:
- YOLO11n (nano): Fastest training, good for prototyping
- YOLO11s (small): Best balance of speed and accuracy for digit detection
- YOLO11m (medium): Better accuracy, longer training time
Training Parameters That Worked:
- Epochs: 100-200 (with early stopping)
- Batch Size: 16 (perfect for RTX 4050)
- Image Size: 640px (YOLO standard)
- Patience: 20-30 epochs (prevents overfitting)
Performance Results
With our ~195 annotated images:
- Training Accuracy: ~98%
- Validation mAP50: ~95%+
- Inference Speed: ~45 FPS on RTX 4050
- Model Size: ~6MB (YOLO11n) to ~40MB (YOLO11m)
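For completeness, running inference on a new image takes only a few lines with the Ultralytics API (the weights path assumes the project/name values from the training script above; the test image name is a placeholder):

from ultralytics import YOLO

# Load the best checkpoint saved by the training run
model = YOLO('trained_models/digit_detector/weights/best.pt')

results = model.predict('test_display.jpg', conf=0.5)
for box in results[0].boxes:
    class_id = int(box.cls)    # 0-9 for digits, 10 for negative_sign
    confidence = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"class={class_id} conf={confidence:.2f} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")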
The Reality Check: Challenges We Faced
Initial Training Results
After our first training run, we were excited! The model seemed to work, but when we tested it on real images, we discovered several issues:
Common Misclassifications:
- Digit “2” was frequently read as “8”
- Digit “6” confused with “5”
- Some digits were completely ignored (not detected)
- Negative signs were inconsistently detected
Discovering the Root Cause: Class Imbalance
The problem wasn’t our model architecture or training parameters—it was data imbalance. Some digits appeared much more frequently in our training set than others.
Building a Data Analysis Tool
To understand our data distribution, we created a label analysis script to examine our annotations:
from collections import Counter
import glob

def analyze_label_distribution(labels_dir):
    """Analyze distribution of classes in YOLO label files."""
    class_counts = Counter()

    # Process each label file
    for label_file in glob.glob(f"{labels_dir}/*.txt"):
        with open(label_file, 'r') as f:
            for line in f:
                if line.strip():
                    class_id = int(line.split()[0])
                    class_counts[class_id] += 1

    # Generate report
    total_annotations = sum(class_counts.values())
    print(f"Total annotations: {total_annotations}")
    for class_id, count in sorted(class_counts.items()):
        percentage = (count / total_annotations) * 100
        print(f"Class {class_id}: {count:4d} ({percentage:5.1f}%)")

    # Identify imbalance
    max_count = max(class_counts.values())
    min_count = min(class_counts.values())
    imbalance_ratio = max_count / min_count
    print(f"\nImbalance ratio: {imbalance_ratio:.1f}:1")
    if imbalance_ratio > 10:
        print("⚠️ High class imbalance detected!")

    return class_counts

# Usage
class_distribution = analyze_label_distribution("path/to/labels")
What We Discovered
Running our analysis revealed shocking imbalances:
Class Distribution Analysis:
Class 0 (digit "0"): 5 annotations (0.3%)
Class 1 (digit "1"): 567 annotations (33.4%) ← Most frequent
Class 2 (digit "2"): 23 annotations (1.4%) ← This explains the "2" vs "8" issue!
Class 3 (digit "3"): 89 annotations (5.2%)
Class 4 (digit "4"): 145 annotations (8.5%)
...
Imbalance ratio: 113.4:1 ⚠️ High class imbalance detected!
The “Aha!” Moment: Digit “1” appeared in 33.4% of all annotations, while “0” appeared in only 0.3%. No wonder the model struggled with rare digits!
The Iterative Improvement Process
Round 1: Targeted Annotation
- Identified underrepresented classes (digits “0”, “2”, “6”)
- Went back to our original 975 images
- Specifically selected images containing these rare digits
- Added ~50 more annotations focused on balancing classes
Round 2: Data Augmentation Strategy
# Applied targeted augmentation to minority classes
augmentation_config = {
    'rare_digits': ['0', '2', '6'],
    'augmentation_factor': 3,  # 3x more augmentations for rare digits
    'techniques': ['rotation', 'brightness', 'noise'],
}
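The config above is only a summary. As a sketch of what the label-preserving techniques might look like with OpenCV and NumPy (the rare_ filename convention is hypothetical; geometric augmentations like rotation also move the bounding boxes, so for those we leaned on YOLO's built-in training augmentation, e.g. the degrees hyperparameter):

import glob
import shutil
import cv2
import numpy as np

def augment_image(image, brightness_range=0.3, noise_std=8.0):
    """Random brightness shift plus Gaussian noise; boxes stay valid."""
    factor = 1.0 + np.random.uniform(-brightness_range, brightness_range)
    bright = np.clip(image.astype(np.float32) * factor, 0, 255)
    noise = np.random.normal(0.0, noise_std, image.shape)
    return np.clip(bright + noise, 0, 255).astype(np.uint8)

# Write 3 augmented copies of each rare-digit image, reusing its label file
for path in glob.glob("dataset/images/train/rare_*.jpg"):
    image = cv2.imread(path)
    for i in range(3):  # augmentation_factor = 3
        out = path.replace(".jpg", f"_aug{i}.jpg")
        cv2.imwrite(out, augment_image(image))
        src_label = path.replace("images", "labels").replace(".jpg", ".txt")
        dst_label = out.replace("images", "labels").replace(".jpg", ".txt")
        shutil.copy(src_label, dst_label)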
Round 3: Retraining with Balanced Data
- New class distribution much more balanced
- Improved from 113:1 ratio to 8:1 ratio
- Retrained model with same parameters
Results After Rebalancing
Before balancing:
- Overall accuracy: ~85%
- Digit “2” accuracy: ~45% (frequently misread as “8”)
- Digit “0” accuracy: ~30% (often not detected)
After balancing:
- Overall accuracy: ~98%
- Digit “2” accuracy: ~94%
- Digit “0” accuracy: ~91%
Lessons Learned About Class Imbalance
- Monitor class distribution early: Run analysis before training, not after
- Quality over quantity: 10 well-distributed samples per class beats 100 imbalanced samples
- Real-world bias: Digital displays show some digits more than others (temperature ranges)
- Iterative approach works: Multiple small improvements beat one massive fix
Real-World Application
Temperature Extraction Pipeline
After training, the complete pipeline looks like this:
1. Image Input: Raw photo of temperature display
2. Digit Detection: Model finds all digits and negative signs
3. Spatial Sorting: Arrange detections by position (left-to-right, top-to-bottom)
4. Temperature Assembly: Combine digits into temperature readings (see the sketch below)
5. Room Assignment: Map temperatures to specific rooms based on display layout
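Steps 3 and 4 are plain Python once YOLO has done its job. A minimal sketch, assuming each room's digits sit on a single row and reusing the detection dict shape from the earlier example:

def assemble_reading(detections):
    """Combine one display region's digit detections into a reading."""
    # Sorting left-to-right reconstructs the number as displayed
    ordered = sorted(detections, key=lambda d: d['position'][0])
    return ''.join(d['digit'] for d in ordered)

# Example: three detections from a single room's display region
room_detections = [
    {'digit': '3', 'position': (210, 140)},
    {'digit': '-', 'position': (150, 142)},
    {'digit': '2', 'position': (180, 141)},
]
print(assemble_reading(room_detections) + "°C")  # -23°C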
Sample Results
Input: Digital display image
Output:
- Meat Room: -23°C
- Fish Room: 15°C
- Vegetable Room: 2°C
- Dairy Room: 8°C
Key Takeaways and Tips
What Worked Well
- Small curated dataset: 20% well-chosen images beat 100% random selection
- Class balance: Ensure all digits are represented in training data
- Consistent annotation: Take your time with bounding box quality
- YOLO11: Latest version provided excellent out-of-the-box performance
Common Pitfalls to Avoid
- Rushing annotation: Poor annotations = poor model performance
- Ignoring rare digits: Make sure digits like “0” have enough examples
- Over-fitting: Use validation set and early stopping
- Inconsistent lighting: Include various lighting conditions in training data
RTX 4050 Optimization Tips
- Batch size: Start with 16, reduce if you get CUDA out-of-memory errors
- Mixed precision: Enable for faster training (amp=True)
- Cache: Disable image caching to save VRAM (cache=False)
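Putting those flags together, an RTX 4050-friendly training call might look like this (a sketch; tune the batch size to your VRAM):

from ultralytics import YOLO

model = YOLO('yolo11n.pt')
results = model.train(
    data='path/to/your/data.yaml',
    epochs=100,
    imgsz=640,
    batch=16,     # drop to 8 on CUDA out-of-memory errors
    amp=True,     # mixed precision for faster training
    cache=False,  # keep images out of RAM/VRAM
    device=0,     # first CUDA device
)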
Final Thoughts
Training a YOLO model for digit detection is absolutely achievable with consumer hardware like the RTX 4050. The key is understanding that computer vision is as much about data quality as it is about model architecture.
Time investment:
- Annotation: 1-2 days (depending on your coffee intake ☕)
- Training: 2-4 hours
- Testing and refinement: Half day
Is it worth it? Absolutely! This project taught me that modern deep learning tools have democratized computer vision. With some patience, a decent GPU, and good coffee, you can build production-ready models for real-world applications.
The digital meter reading model now successfully processes thousands of temperature readings, saving hours of manual data entry and reducing human error. Sometimes the best projects start with a simple colleague question: “Can we train a model for this?”
Answer: Yes, we absolutely can!
This project used YOLO11, Label Studio, and Python. Total cost: $0 (using open-source tools). Total learning: Priceless.