Enhancement: Add a database caching for improved performance #9784


Open
wants to merge 8 commits into base: dev

Conversation


@Merinorus Merinorus commented Apr 24, 2025

/!\ I tested these changes manually; integration tests might be needed.
Edit 2025-05-05: Added Integration tests and simplified code.

Proposed change

DB queries made by the Django ORM may now be cached in Redis with django-cachalot. This is an opt-in setting.

Limitations:

  • When Redis is not available (e.g., debug mode with a local memory cache), the read cache is disabled.
  • Whenever a row is modified by a Django application (Paperless, Celery...), the whole table is removed from the cache (Limitation of django-cachalot).
  • /!\ WARNING: If you modify the database through a python manage.py command or externally, you must invalidate the cache with python manage.py invalidate_db_cache.

New python dependency: django-cachalot

Generally, each API request that relies on the SQL database goes through the Django ORM, which leads to roughly 50 to 150 SQL queries per request. The Django ORM is definitely not optimized for raw database performance. Adding a database read cache greatly improves response time without having to rewrite our application code, as discussed in #8478 (comment).
Various performance issues #6092 can probably be improved with this PR.

How does it work?

Each SQL query generated in the Django application context (Paperless, Celery, management commands, Django migrations, etc.) is intercepted and hashed. If a Redis key with this hash exists, the query has already been cached, so the result is retrieved directly from Redis without querying the database. Otherwise, the DB is queried and the result is stored in Redis:

$ redis-cli
127.0.0.1:6379> keys *
...
5787) ":1:cachalot_5c38aa7b3f3a5603529be7692074bf7f2277d926"
5788) ":1:cachalot_fa58c5c66d04cc82dbf01deafc964537b35a177e"
5789) ":1:cachalot_3876b992ed26c20ee2e593268a20f00c904ce93f"
5790) ":1:cachalot_62c3de81dc044de0e6633ca81de8f9079336268d"
5791) ":1:cachalot_5ac725258589bc93664c4f20d6e654dd5fca107a"
...
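The lookup path can be sketched in Python. This is a simplified illustration of the idea, not django-cachalot's actual implementation: a plain dict stands in for Redis, and real cachalot keys also cover query parameters and the database alias.

```python
import hashlib

# A plain dict stands in for Redis in this sketch.
cache: dict[str, object] = {}
KEY_PREFIX = ":1:cachalot_"

def run_query(sql: str) -> str:
    # Placeholder for an actual database round trip.
    return f"rows for: {sql}"

def cached_query(sql: str) -> object:
    # Hash the SQL text to build the cache key, mirroring the
    # ":1:cachalot_<sha1>" keys shown in the redis-cli output above.
    key = KEY_PREFIX + hashlib.sha1(sql.encode()).hexdigest()
    if key in cache:            # cache hit: skip the database entirely
        return cache[key]
    result = run_query(sql)     # cache miss: query the DB, then store
    cache[key] = result
    return result
```

Repeated identical queries then hit Redis instead of the database, which is where the 50-150 queries per request collapse to a handful of real DB round trips.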

Each database modification (UPDATE, INSERT, DELETE, ALTER, CREATE or DROP) is intercepted, and all hashed queries that relate to the affected table(s) are removed from the Redis cache. This ensures that no stale data is stored in the cache.

For instance, if you update one document's title, all cached queries that select from or join the documents table are invalidated. This may sound inefficient, but it is the safest way to avoid stale data in the cache.
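This table-level invalidation can be illustrated with a small sketch (hypothetical names; the real library tracks which tables each cached query reads from and drops every matching key on any write):

```python
# Map each table to the set of cache keys whose queries touch it.
keys_by_table: dict[str, set[str]] = {}
cache: dict[str, object] = {}

def store(key: str, value: object, tables: list[str]) -> None:
    cache[key] = value
    for table in tables:
        keys_by_table.setdefault(table, set()).add(key)

def invalidate(table: str) -> None:
    # Any write to `table` drops every cached query that reads from it.
    for key in keys_by_table.pop(table, set()):
        cache.pop(key, None)

store("k1", "doc list", ["documents_document", "documents_tag"])
store("k2", "tag count", ["documents_tag"])
invalidate("documents_document")   # e.g. a title UPDATE
```

After the `invalidate` call, the document-list query is gone from the cache while the tag count, which never touched the documents table, survives.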

More information on how Django-Cachalot works: https://django-cachalot.readthedocs.io/en/latest/introduction.html

Performance impact

After more than one month of personal testing, this feature allows me to keep using Paperless-Ngx instead of giving up. On machines with a weak CPU or slow disk IO, this is probably a night-and-day difference.

I've checked with the Django debug toolbar: even after a modification, most of the SQL queries are still cached (user rights, tags, correspondents, etc.). This greatly reduces the number of SQL queries for the average user workflow. Here are a few examples of operations that are greatly accelerated:

  • Loading a list of documents (heavy SQL overhead due to checked user rights on each thumbnail: 50 thumbnails x 4 SQL requests = 200 fewer SQL requests)
  • Navigating through documents one by one (about 50 queries per document, most of them repeated queries related to user and group permissions)
  • Saving a modified document (usually more than 100 SQL read requests for each modification if audit trail is enabled)
  • Statistics view (20 SQL queries, each count is cached separately, so a document modification won't invalidate the cache for tags, document type count, etc.)
  • Managing views: tags, document types, storage paths, correspondents, etc. Note: some tabs still rely on Python application code to render views with a lot of computation (e.g., counting the number of documents associated with each tag), and it is still slow with weak CPUs (Intel Atom, Raspberry Pi, etc.)

The expected additional RAM usage depends on the TTL you set. Once the cache expires, the RAM usage should be negligible.
The default 1-hour TTL is a very safe setting that won't increase RAM usage a lot, and will still help improve performance.

If you want to benefit as much as possible from this cache, you can set a longer TTL (I've set one month). In that case, you might run a separate Redis instance dedicated to the cache, with a key eviction policy: the main Redis instance has no eviction policy by default, so an out-of-memory situation would block new insertions (not only cache entries, but also scheduled tasks, document consumption, etc.). You may also disable disk persistence, since losing the cache when the system shuts down is acceptable.
Here is the docker command I use for my second Redis instance:

["redis-server", "--appendonly", "no", "--save", "", "--maxmemory", "512mb", "--maxmemory-policy", "allkeys-lru"]

Type of change

  • Bug fix: non-breaking change which fixes an issue.
  • New feature / Enhancement: non-breaking change which adds functionality. Please read the important note above.
  • Breaking change: fix or feature that would cause existing functionality to not work as expected.
  • Documentation only.
  • Other. Please explain:

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have included testing coverage for new code in this PR, for backend and / or front-end changes.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

@Merinorus Merinorus requested a review from a team as a code owner April 24, 2025 09:27
@github-actions github-actions bot added documentation Improvements or additions to documentation backend non-trivial Requires approval by several team members labels Apr 24, 2025
Contributor

Hello @Merinorus,

Thank you very much for submitting this PR to us!

This is what will happen next:

  1. CI tests will run against your PR to ensure quality and consistency.
  2. Next, human contributors from paperless-ngx review your changes.
  3. Please address any issues that come up during the review as soon as you are able to.
  4. If accepted, your pull request will be merged into the dev branch and changes there will be tested further.
  5. Eventually, changes from you and other contributors will be merged into main and a new release will be made.

You'll be hearing from us soon, and thank you again for contributing to our project.

@Merinorus Merinorus changed the title Add django-cachalot as database ORM read cache Add a database read cache Apr 24, 2025
@Merinorus Merinorus changed the title Add a database read cache Performance: Add a database read cache Apr 24, 2025
@shamoon shamoon changed the title Performance: Add a database read cache Enhancement: Add a database caching for improved performance Apr 24, 2025
@shamoon shamoon marked this pull request as draft April 24, 2025 14:39
@shamoon
Member

shamoon commented Apr 24, 2025

It’s a cool idea and I appreciate the effort. Looking forward to trying it out.

/!\ I tested these changes manually; integration tests might be needed.

Indeed, integration tests are needed (note the failing test), just marked it as a draft in the meantime.

Copy link
Member

@shamoon shamoon left a comment


Generally this seems to just work, which is great. I don't notice much difference with a normal-sized install in terms of speed, at least not as an end-user.

I had initially wondered about enabling this by default, but I'm not sure that's a good idea.

So what does happen when you hit OOM?

I fixed the lock file and added tests.

Will wait for our main backend dev to take a look.

@shamoon shamoon marked this pull request as ready for review April 24, 2025 19:13
@shamoon shamoon force-pushed the feature-django-cachalot branch from 3f193b4 to c1f34a2 Compare April 24, 2025 19:13
@shamoon shamoon added enhancement New feature or enhancement and removed documentation Improvements or additions to documentation labels Apr 24, 2025
@Merinorus
Author

Merinorus commented Apr 25, 2025

Thank you for your feedback!

There are two types of OOM situations:

  • If you set a memory limit in Redis without an eviction policy (e.g., redis-server --maxmemory 64mb) and the memory is full, Redis won't accept new key insertions until keys expire at the end of their TTL. So any new data insertion will raise a Redis error: a new task, a new document, any SQL query that needs to be cached, etc. I guess no one has arbitrarily set a memory limit on Redis for now.
  • If you don't set a memory limit on Redis but the system is OOM, it's probably not because of this cache feature. You never want an OOM situation on Linux: it's unpredictable, and the system may hang for minutes or hours until, if you're lucky, the kernel kills some applications.

For the average user with the default Redis instance (no memory limit, no eviction policy), just enabling this feature would be safe. This cache would increase memory use by dozens of MB at most, maybe up to 200-300 MB with a one-month TTL, whereas consuming a new document might temporarily eat a few GB of RAM.

On second thought, I think a second Redis instance might be useful for users who want fine control over their RAM usage. E.g., you can set a 32 MB or 64 MB limit with LRU eviction and still get most of the cache benefits. But honestly, I don't see the point when the main Django application may consume almost 1 GB of RAM while idling.

The risk of enabling this feature by default is that some people might have crons or custom SQL scripts writing to the database. If they update the Paperless-Ngx app without checking the changelog, they may end up with stale data in the cache. Also, while I haven't found any bugs so far, it's quite a sensitive change. Maybe there is a table that is updated outside of the application context that I'm not aware of.
I was thinking of having this feature as optional as long as it's considered "experimental", so we could collect more user feedback. If everything is good, I would enable it by default on the next "major" version that brings breaking changes. That might be safer rather than taking the risk of having many issues raised due to potentially inconsistent data bugs.

Note: I see you added a "pngx" prefix to the cache keys. I forgot to mention the PAPERLESS_DB_READ_CACHE_REDIS_PREFIX environment variable, which can be set like PAPERLESS_REDIS_PREFIX. It defaults to None. I'll add it to the doc.
Also, if we have multiple paperless instances, the Redis cache must be shared between all instances, otherwise, all of them will end up with stale data. More information: https://django-cachalot.readthedocs.io/en/latest/limits.html
I should also update the doc with this warning.

@Merinorus Merinorus force-pushed the feature-django-cachalot branch from c1f34a2 to 99004dc Compare April 28, 2025 14:51
@Merinorus Merinorus force-pushed the feature-django-cachalot branch 3 times, most recently from 55139a0 to 0d816af Compare April 29, 2025 14:23
Member

@stumpylog stumpylog left a comment


Besides the failing tests, I believe this can be simplified

@Merinorus Merinorus changed the title Enhancement: Add a database caching for improved performance [WIP] Enhancement: Add a database caching for improved performance Apr 29, 2025
@Merinorus
Author

Merinorus commented Apr 29, 2025

I've added integration tests, but I hadn't noticed that Redis is not used in CI. I'll fall back to local memory for tests, like the Django default cache. @stumpylog I'm currently simplifying the code; as you said, there is a lot of unneeded logic.

Edit 2025-05-05: I simplified the code, it's ready for another review.

@Merinorus Merinorus force-pushed the feature-django-cachalot branch 4 times, most recently from 9f8ddee to 38a3722 Compare April 30, 2025 17:27
@Merinorus Merinorus requested a review from stumpylog April 30, 2025 17:40
@Merinorus Merinorus changed the title [WIP] Enhancement: Add a database caching for improved performance Enhancement: Add a database caching for improved performance Apr 30, 2025
@shamoon shamoon force-pushed the dev branch 3 times, most recently from a3ed776 to 1b4a52a Compare May 2, 2025 06:34
@Merinorus Merinorus force-pushed the feature-django-cachalot branch 8 times, most recently from 9117117 to 59c7ac6 Compare May 5, 2025 17:45
@Merinorus Merinorus force-pushed the feature-django-cachalot branch 3 times, most recently from d513f45 to 0c28c0d Compare May 23, 2025 10:38
@Merinorus
Author

Merinorus commented May 23, 2025

If everything works as expected, the DB read cache could be enabled by default in a future version.

I've renamed some settings to have a more generic name. Indeed, we could use these settings to manage other types of cache.
For instance, the read cache TTL is 1 hour, but I could reduce it to 50 minutes to match the thumbnail and suggestion caches. Then we could use the same setting for all of them, instead of the currently fixed duration.

Merinorus and others added 8 commits May 27, 2025 17:25
DB queries made by the Django ORM may now be cached in Redis with django-cachalot.
This is an optional setting.

Limitations:
- When Redis is not available (eg: debug mode with local memory cache), the read cache is disabled.
- Whenever a row is modified by a django application (Paperless, Celery...), the whole table is removed from the cache (Limitation of django-cachalot).
- /!\ WARNING: If you modify the database through a `python manage.py` command or externally,
you must invalidate the cache with `python manage.py invalidate_db_cache`.
- The allowed cache TTL is now between 1 second and 31536000 seconds (one year). Any value below 1 second will be set to the default TTL (1 hour).
- Remove PAPERLESS_DB_READ_CACHE_REDIS_PREFIX: the cache prefix now uses the PAPERLESS_REDIS_PREFIX value so it's the same as the scheduler Redis prefix
- Fix cache not invalidated when a custom prefix was used
- Add integration tests:
  - Cache disabled with default settings
  - Cache enabled with custom settings
- Remove some unit test already covered by the integration tests
- Remove custom manage.py command, use invalidate_cachalot instead
- Simplify the caching logic
- Use LocMemCache for debugging or testing, like the Django cache
- Simplify settings.py file and encapsulate cache settings for easier testing
The "cachalot" cache currently stores SQL read query results from django-cachalot.
Since this cache can use a distinct Redis eviction policy and is intended for storing
read-only, evictable data, it could now also hold suggestions and similar results,
hence the more generic name.
The read cache is used by Django-Cachalot, but this can be extended to any other type of read cache.
@Merinorus Merinorus force-pushed the feature-django-cachalot branch from 0c28c0d to c255997 Compare May 27, 2025 15:25
Labels
backend enhancement New feature or enhancement non-trivial Requires approval by several team members
3 participants