"Clear cache and re-download language data files" not working #12312

Open
ericjacolin opened this issue May 18, 2025 · 5 comments
Assignees
Labels
bug It's a bug desktop All desktop platforms high High priority issues

Comments

@ericjacolin

Operating system

Linux

Joplin version

3.3.12

Desktop version info

Joplin 3.3.12 (prod, linux)

Device: linux, 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Client ID: c9d562b412294cb3a9cfa76b0c20e0ed
Sync Version: 3
Profile Version: 47
Keychain Supported: No
Alternative instance ID: -

Revision: 4d790b6

Backup: 1.4.3
Csv Import: 1.0.1
Freehand Drawing: 3.0.1
Outline: 1.5.13
Search & Replace: 2.2.0

Current behaviour

  1. I activated OCR for the first time, the scan worked (about 600 images) but only in English
  2. I have a lot of Chinese content so I added Tesseract chi_sim language file (as well as French) with the procedure described in https://joplinapp.org/help/apps/ocr/
  3. I tried to "Clear cache and re-download language data files", but this had no effect
  4. The cached language files (English) are still there

Expected behaviour

I would expect the cache to clear and rebuild with the new set of languages

Logs

No response

@ericjacolin ericjacolin added the bug It's a bug label May 18, 2025
@laurent22 laurent22 added desktop All desktop platforms high High priority issues labels May 19, 2025
@personalizedrefrigerator
Collaborator

Thank you for reporting this!

For comparison, "Clear cache and re-download language data files" seems to work for me (on Fedora 42 Linux, with Joplin 3.3.12 (dev)):

  • Shows a confirmation dialog.
  • After pressing OK, restarts Joplin.
  • Logs the following:
    OcrDriverTesseract: Clearing cached language data...
    OcrDriverTesseract: Clearing language data with key ./eng.traineddata
    
  • At this point, inspecting the keyval-store indexedDB table (using the "Application" tab in the development tools) reveals that the cached ./eng.traineddata model is no longer present.

Follow-up questions:

  • After clicking "Clear cache and re-download language data files", is "OcrDriverTesseract: Clearing language data with key ./eng.traineddata" added to Joplin's log file?
    • Logs can either be accessed from Joplin's development tools (Help > Toggle developer tools, includes recent logs) or from the log.txt file in the profile directory (Help > Open profile directory, includes older logs).
    • If there is an error, it should log "OCR: Failed to clear language data cache." followed by the error message.
  • Are new images (not scanned previously) processed using the new OCR models?

Note

"Clear cache and re-download language data files" deletes the local eng.traineddata (and other) models, but does not re-OCR existing attachments.
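
To illustrate the note above, here is a minimal sketch of what clearing cached language data amounts to. It is not Joplin's implementation: the `ModelCache` type and the `clearLanguageData` function are hypothetical, and a plain `Map` stands in for the IndexedDB keyval-store table. Only cache entries are removed; nothing in it touches OCR results already stored for existing attachments.

```typescript
// Hypothetical stand-in for the key-value cache (the real store is an IndexedDB table).
type ModelCache = Map<string, Uint8Array>;

// Removes every cached *.traineddata entry and returns the keys that were removed.
function clearLanguageData(cache: ModelCache): string[] {
	const removed: string[] = [];
	// Copy the keys first so that deleting while looping is safe.
	for (const key of Array.from(cache.keys())) {
		if (key.endsWith('.traineddata')) {
			cache.delete(key);
			removed.push(key);
		}
	}
	return removed;
}

const modelCache: ModelCache = new Map<string, Uint8Array>([
	['./eng.traineddata', new Uint8Array(0)],
	['unrelated-key', new Uint8Array(0)],
]);
const removedKeys = clearLanguageData(modelCache);
// removedKeys is ['./eng.traineddata']; 'unrelated-key' stays in the cache.
```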

@personalizedrefrigerator
Collaborator

personalizedrefrigerator commented May 19, 2025

Proposed changes:

  • (Maybe) Get rid of "Clear cache and re-download language data files". Instead, automatically remove the models for non-active languages (e.g. after a week or two).
    • We could also base this on available disk space.
  • Add a new option, "Re-OCR all attachments", maybe with a prompt. This would be shown when the user changes the OCR URL.
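
As a rough sketch of the first proposal, the function below picks the models that belong to non-active languages and have not been used for a given number of days. Everything here is an assumption made for illustration: the `CachedModel` shape, the `findStaleModels` name, and the idea that a last-used timestamp is tracked at all.

```typescript
// Hypothetical record describing one cached language model.
interface CachedModel {
	language: string;
	lastUsedMs: number; // Unix time in milliseconds
}

const dayMs = 24 * 60 * 60 * 1000;

// Returns the languages whose models are inactive and unused for at least maxAgeDays.
function findStaleModels(
	models: CachedModel[],
	activeLanguages: string[],
	nowMs: number,
	maxAgeDays: number,
): string[] {
	const active = new Set(activeLanguages);
	return models
		.filter(m => !active.has(m.language) && nowMs - m.lastUsedMs >= maxAgeDays * dayMs)
		.map(m => m.language);
}

const nowMs = 30 * dayMs;
const staleLanguages = findStaleModels(
	[
		{ language: 'eng', lastUsedMs: 0 }, // old but active: kept
		{ language: 'fra', lastUsedMs: 0 }, // inactive and unused for 30 days: stale
		{ language: 'deu', lastUsedMs: 29 * dayMs }, // inactive but used 1 day ago: kept
	],
	['eng'],
	nowMs,
	14,
);
// staleLanguages is ['fra']
```

A disk-space-based variant could reuse the same filter and replace the age check with a size budget.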

@ericjacolin
Author

Thanks @personalizedrefrigerator this is very helpful.
Inspecting IndexedDB, I see that after clearing languages, the only language that Joplin will reload from the local traineddata folder is eng. It will not load chi_sim or fra.

@personalizedrefrigerator
Collaborator

Inspecting IndexedDB, I see that after clearing languages, the only language that Joplin will reload from the local traineddata folder is eng. It will not load chi_sim or fra.

Thank you for the additional information!

Is the UI language set to English? If so, changing Joplin's UI language in settings > general might help. Currently, Joplin uses the global locale setting to select the OCR language:

const language = toIso639Alpha3(Setting.value('locale'));

As a result, if the UI language is set to English, the OCR service will try to use an eng language model.
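
To make the consequence concrete, here is a small stand-in for that conversion. It is not the real `toIso639Alpha3`: the `ocrLanguageFromLocale` name, the lookup table, the locale formats handled, and the fallback to `eng` are all assumptions made for illustration. The point is that one locale yields exactly one OCR language.

```typescript
// Hypothetical lookup table: two-letter language code -> three-letter code.
const alpha3ByAlpha2 = new Map<string, string>([
	['en', 'eng'],
	['fr', 'fra'],
	['de', 'deu'],
]);

// Illustrative stand-in for the locale-to-language conversion.
function ocrLanguageFromLocale(locale: string): string {
	// Keep only the language part of a locale such as "en_GB" or "fr-FR".
	const [alpha2 = ''] = locale.toLowerCase().split(/[_-]/);
	// Assumption: fall back to English when the locale is not in the table.
	return alpha3ByAlpha2.get(alpha2) ?? 'eng';
}

const fromEnglishUi = ocrLanguageFromLocale('en_GB'); // 'eng'
const fromFrenchUi = ocrLanguageFromLocale('fr_FR'); // 'fra'
```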

@ericjacolin
Author

Thanks. I may have misunderstood the purpose of the feature.
My locale is indeed English, and I was expecting to be able to scan documents in both English and Simplified Chinese, using tesseract -l eng+chi_sim behind the scenes.
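
For reference, the `eng+chi_sim` form is the "+"-separated language list that the Tesseract command line accepts for `-l`. A minimal sketch of building it from a list of language codes follows. The `buildTesseractLanguageArg` helper is hypothetical and does not reflect current Joplin behaviour, which (per the comment above) selects a single language from the locale.

```typescript
// Joins language codes into the "+"-separated form used with `tesseract -l`.
function buildTesseractLanguageArg(languages: string[]): string {
	// Trim entries, drop empty ones, and remove duplicates while preserving order.
	const unique = Array.from(
		new Set(languages.map(l => l.trim()).filter(l => l.length > 0)),
	);
	if (unique.length === 0) {
		throw new Error('At least one language is required');
	}
	return unique.join('+');
}

const languageArg = buildTesseractLanguageArg(['eng', 'chi_sim', 'eng']); // 'eng+chi_sim'
```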

3 participants