Non-English texts still detected after fine-tuning text detection model #1906
4 comments · 22 replies
-
Hi @Yuvaraj-off 👋,

In general I don't think this will work this way - our dataset for the detection pre-training also contains only Latin text, but the model learns really well to generalize to almost any kind of text. Keep in mind these are all CNN-based architectures, so there is no "textual understanding". It could maybe work by providing negative samples where non-English text appears in the image but only the English text is annotated - but no guarantee.

Have you fine-tuned our model or trained from scratch?

Best,
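To illustrate the negative-sample idea: keep images containing non-English text in the training set, but strip the non-English polygons from the labels so those regions act as background. A minimal sketch - the `is_english` heuristic and the annotation layout are assumptions, not docTR's label format:

```python
def is_english(word: str) -> bool:
    # Crude heuristic: every character is ASCII. Substitute your
    # annotation tool's language tags if you have them.
    return word.isascii()

def filter_labels(annotations):
    """Keep only English-annotated polygons per image.

    annotations: {image_name: [(word, polygon), ...]}
    returns:     {image_name: [polygon, ...]}
    """
    out = {}
    for name, words in annotations.items():
        out[name] = [poly for word, poly in words if is_english(word)]
    return out
```

Images whose list ends up shorter than the original still contribute: the unannotated text regions become negative evidence during training.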
-
Thanks for the quick response and insights, @felixdittrich92! We've trained our model from scratch and are now focusing on using our own dataset annotated exclusively for English text. We'll update you on the results once we test this approach.

Additionally, we're exploring the possibility of using a YOLO model for text detection. Would it be feasible to integrate a YOLO-based text detection model with our existing docTR pipeline? We'd love to hear your insights on this!
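An external detector can plausibly feed docTR because docTR exposes a standalone `recognition_predictor` that accepts a list of word crops. A rough sketch, assuming the `ultralytics` package for YOLO and absolute `xyxy` pixel boxes; the weight filename is a placeholder:

```python
import numpy as np

def crops_from_boxes(img, boxes):
    """Cut word crops out of an image given absolute (x1, y1, x2, y2) boxes,
    clamping to image bounds and dropping degenerate boxes."""
    h, w = img.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), w), min(int(y2), h)
        if x2 > x1 and y2 > y1:
            crops.append(img[y1:y2, x1:x2])
    return crops

def run_pipeline(image_path):
    # Heavy imports kept inside so the helper above stays dependency-free.
    from ultralytics import YOLO              # assumption: YOLO as text detector
    from doctr.models import recognition_predictor

    detector = YOLO("yolo_text_det.pt")       # hypothetical fine-tuned weights
    reco = recognition_predictor(pretrained=True)

    result = detector(image_path)[0]
    img = result.orig_img                     # numpy array kept by ultralytics
    boxes = result.boxes.xyxy.cpu().numpy()
    crops = crops_from_boxes(img, boxes)
    return reco(crops)                        # list of (word, confidence) pairs
```

You lose docTR's built-in page reconstruction this way, so you'd need to sort the boxes into reading order yourself if that matters downstream.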
-
Hi @felixdittrich92,

We've been working in parallel on retraining the db_mobilenet_v3_large model using our updated dataset, where we re-annotated only the English texts. The initial results on straight (non-rotated) images have been promising. However, when we tested the model on rotated images, the performance dropped significantly - it began detecting all texts regardless of language, and even included a lot of noise/junk.

To address this, we'd like to retrain the model using rotated versions of our dataset as well. Could you let us know if there are recommended approaches or existing options for augmenting our dataset with rotated images (while keeping the annotations aligned correctly)?

Thanks in advance!
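On rotation augmentation: docTR's reference detection script ships a rotation option (if I remember correctly a `--rotation` flag; worth confirming with `--help` on your version). If you prefer to pre-generate rotated copies yourself, the key point is to apply the exact same transform to the polygon points as to the pixels. A minimal NumPy sketch for 90° clockwise rotations - the function name and conventions are mine, not docTR API:

```python
import numpy as np

def rotate90_cw(img, polys):
    """Rotate an image 90° clockwise and remap polygon points to match.

    img:   (H, W[, C]) array
    polys: list of (N, 2) arrays of absolute (x, y) points
    """
    h = img.shape[0]
    out_img = np.rot90(img, k=-1).copy()  # k=-1 -> clockwise
    out_polys = []
    for poly in polys:
        poly = np.asarray(poly, dtype=float)
        x, y = poly[:, 0], poly[:, 1]
        # A pixel at (x, y) lands at (H - 1 - y, x) after a clockwise turn.
        out_polys.append(np.stack([h - 1 - y, x], axis=1))
    return out_img, out_polys
```

Arbitrary angles work the same way (rotate the image, multiply each point by the same rotation matrix), but then you also have to handle the enlarged canvas and clip polygons that leave the frame, which is where most alignment bugs creep in.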
-
Hi @felixdittrich92,

Thank you for your detailed guidance on retraining the text detection model - it was incredibly helpful! I'm now planning to move forward with fine-tuning the recognition model (CRNN). Before I begin, I have a couple of quick questions:

• Training dataset: Could you share which datasets were originally used to train the CRNN model included in docTR?
• Image augmentations: For fine-tuning, I'd like to customize certain image transformations or augmentations. Is there built-in support in the training scripts or documentation that allows easy modification of these pre-processing steps?

Appreciate your help in advance!
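On the augmentation question: docTR's reference recognition script builds its pre-processing pipeline inside references/recognition/train_pytorch.py, so that file is the natural place to swap transforms in or out (the exact pipeline varies by version). As a stand-in illustration only, here is a minimal, dependency-free sketch of the kind of augmentation chain you might plug in - function names and parameters are assumptions, not docTR API:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_brightness(img, max_delta=0.2):
    # img: float array in [0, 1]; shift all pixels by a random delta.
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(img + delta, 0.0, 1.0)

def add_gaussian_noise(img, std=0.05):
    # Per-pixel Gaussian noise, clipped back into [0, 1].
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 1.0)

def augment(img):
    # Chain the transforms, mirroring how a Compose-style pipeline applies
    # each step in sequence.
    for fn in (jitter_brightness, add_gaussian_noise):
        img = fn(img)
    return img
```

In practice you would express the same chain with the training script's own transform objects rather than raw NumPy, so that resizing and normalization stay consistent with what the model saw at pre-training time.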
-
We fine-tuned db_mobilenet_v3_large using the pdfa-eng-wds dataset to detect only English text, as suggested in issue #1564. Although we only trained for 10 epochs and our accuracy isn't the best yet, we expected the detections to be limited to English text; instead, the model still detects non-English text and other junk. Are we missing something here?
Steps Taken
• Used db_mobilenet_v3_large as the base model and trained it on pdfa-eng-wds.
• The training results are below:
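For reference, the fine-tuning run described above maps onto docTR's reference detection script. A rough command sketch - the dataset paths are placeholders and flag names can differ between docTR versions, so check the script's `--help`:

```shell
# Fine-tune the pre-trained detection model on an English-only dataset.
# Paths are placeholders; verify flag names against --help.
python references/detection/train_pytorch.py db_mobilenet_v3_large \
  --train_path /data/pdfa-eng-wds/train \
  --val_path /data/pdfa-eng-wds/val \
  --epochs 10 \
  --pretrained
```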