Hi @dasantosa 👋, making docTR multilingual is the highest priority this year, but it will still take some time. We already have tickets tracking this: #1699 and #988. You're right that I only trained parseq on this dataset, but the good news is that the dataset used for fine-tuning is 100% synthetically generated, so I can share it and you can fine-tune vitstr on it yourself :)
dataset: https://drive.google.com/file/d/1TNQN8uBMiGjzf2GM41BWDICefLubau5q/view?usp=sharing
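Once you have a fine-tuned checkpoint, plugging it back into the predictor could look roughly like this (a minimal sketch: the checkpoint filename and the "multilingual" vocab are assumptions, use whatever vocab you trained with, and passing a model instance as `reco_arch` needs a reasonably recent docTR version):

```python
import torch

from doctr.datasets import VOCABS
from doctr.models import ocr_predictor, vitstr_small

# Rebuild the architecture with the vocab used during fine-tuning
# ("multilingual" is an example - it must match the training vocab).
reco_model = vitstr_small(pretrained=False, vocab=VOCABS["multilingual"])

# Load your fine-tuned weights (hypothetical checkpoint path).
state_dict = torch.load("vitstr_small_multilingual.pt", map_location="cpu")
reco_model.load_state_dict(state_dict)

# The detection arch stays on its pretrained weights; the recognition
# model instance is passed in directly.
predictor = ocr_predictor(det_arch="db_resnet50", reco_arch=reco_model, pretrained=True)
```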
🚀 The feature
I'm using DocTR for text detection and recognition and I'm having trouble with the recognition of the "ñ"/"Ñ" characters. I've been using the vitstr_small model with its default weights, which were trained on the French vocabulary.
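For context, a quick check of the built-in vocabularies suggests the character is simply missing from the French charset (a rough sketch; the exact vocab keys may differ across docTR versions):

```python
from doctr.datasets import VOCABS

# The pretrained vitstr_small weights use the French vocabulary,
# which, as far as I can tell, does not include "ñ"/"Ñ".
print("ñ" in VOCABS["french"])   # expected: False
print("ñ" in VOCABS["spanish"])  # expected: True
```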
I've searched the issues for a similar problem and found that there is a multilingual model available on Hugging Face. The problem is that those weights are for the ParseQ model, which is a bit slower than ViTSTR.
My question is: is it possible to load per-language weights instead of a multilingual model to make inference more efficient? Or is there a plan to release multilingual weights for the other models?
Thanks!
Motivation, pitch
Use ViTSTR models efficiently for each language.
Alternatives
No response
Additional context
No response