Enhancement: Add database caching for improved performance #9784
Conversation
Hello @Merinorus, thank you very much for submitting this PR to us! This is what will happen next:
You'll be hearing from us soon, and thank you again for contributing to our project.
It's a cool idea and I appreciate the effort. Looking forward to trying it out.
Indeed, integration tests are needed (note the failing test); I've marked it as a draft in the meantime.
Generally this seems to kind of just work, which is great. I don't notice much difference with a normal-sized install in terms of speed, at least not as an end-user.
I had initially wondered about enabling this by default, but I'm not sure that's a good idea.
So what does happen when you hit OOM?
I fixed the lock file and added tests.
Will wait for our main backend dev to take a look.
Thank you for your feedback! There are two types of OOM situations:
For the average user with the default Redis instance (no memory limit and no eviction policy), just enabling this feature would be safe. This cache would increase memory use by dozens of MB at most, maybe up to 200-300 MB if you set a one-month TTL, while consuming a new document might temporarily eat a few GB of RAM.

On second thought, a second Redis instance might be useful for users who want fine control over their RAM usage. E.g., you can set a 32 MB or 64 MB limit with LRU eviction and still keep most of the cache benefits. Honestly, though, I don't see the point when the main Django application may consume almost 1 GB of RAM while idling.

The risk of enabling this feature by default is that some people might have crons or custom SQL scripts writing to the database. If they update the Paperless-Ngx app without checking the changelog, they may end up with stale data in the cache. Also, while I haven't found any bug so far, it's quite a sensitive change: maybe there is a table, updated outside of the application context, that I'm not aware of.

Note: I see you added a "pngx" prefix to the cache keys. I forgot I set a
Besides the failing tests, I believe this can be simplified
I've added integration tests, but I hadn't noticed that Redis is not used in CI. I'll fall back to local memory for tests, like the default Django cache. @stumpylog I'm currently simplifying the code; as you said, there is a lot of unneeded logic.
Edit 2025-05-05: I simplified the code; it's ready for another review.
If everything works as expected, the DB read cache could be enabled by default in a future version. I've renamed some settings to give them more generic names, since we could use these settings to manage other types of cache.
DB queries made by the Django ORM may now be cached in Redis with django-cachalot. This is an optional setting.

Limitations:
- When Redis is not available (e.g. debug mode with the local memory cache), the read cache is disabled.
- Whenever a row is modified by a Django application (Paperless, Celery...), the whole table is removed from the cache (a limitation of django-cachalot).
- /!\ WARNING: If you modify the database through a `python manage.py` command or externally, you must invalidate the cache with `python manage.py invalidate_db_cache`.
- The allowed cache TTL is now between 1 second and 31536000 seconds (one year). Any value below 1 second will be set to the default TTL (1 hour).
- Remove PAPERLESS_DB_READ_CACHE_REDIS_PREFIX: the cache prefix now uses the PAPERLESS_REDIS_PREFIX value, so it matches the scheduler Redis prefix.
- Fix: the cache was not invalidated when a custom prefix was used.
- Add integration tests:
  - cache disabled with default settings
  - cache enabled with custom settings
- Remove some unit tests already covered by the integration tests.
- Remove the custom manage.py command; use invalidate_cachalot instead.
- Simplify the caching logic.
- Use LocMemCache for debugging or testing, like the Django cache.
- Simplify the settings.py file and encapsulate cache settings for easier testing.
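The TTL clamping described in the first bullet could be implemented with a small helper along these lines (a sketch; the function name `normalize_ttl` and the fallback for non-numeric values are assumptions, only the 1-second/one-year bounds and the 1-hour default come from the changelog above):

```python
DEFAULT_TTL = 3600        # one hour, the documented default
MAX_TTL = 31_536_000      # one year, the documented upper bound


def normalize_ttl(value):
    """Clamp a configured cache TTL to [1 second, 1 year].

    Values below 1 second (or unparseable values, an assumption here)
    fall back to the default TTL, per the changelog entry above.
    """
    try:
        ttl = int(value)
    except (TypeError, ValueError):
        return DEFAULT_TTL
    if ttl < 1:
        return DEFAULT_TTL
    return min(ttl, MAX_TTL)
```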
The "cachalot" cache currently stores SQL read query results from django-cachalot. Since this cache can use a distinct Redis eviction policy and is intended for storing read-only, evictable data, it could now also hold suggestions and similar results, hence the more generic name.
The read cache is used by django-cachalot, but this can be extended to any other type of read cache.
/!\ I tested these changes manually; integration tests might be needed.
Edit 2025-05-05: Added Integration tests and simplified code.
Proposed change
DB queries made by the Django ORM may now be cached in Redis with django-cachalot. This is an opt-in setting.
Limitations:
- If you modify the database through a `python manage.py` command or externally, you must invalidate the cache with `python manage.py invalidate_db_cache`.

New Python dependency: django-cachalot
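For context, wiring django-cachalot into a Django project comes down to a cache alias plus a few settings. A hedged sketch, not the actual Paperless configuration (the alias name, Redis URL, and TTL value are illustrative; `CACHALOT_ENABLED`, `CACHALOT_TIMEOUT`, and `CACHALOT_CACHE` are settings from django-cachalot's documentation):

```python
# settings.py (sketch, not the actual Paperless-ngx settings)
INSTALLED_APPS = [
    # ... the project's other apps ...
    "cachalot",
]

CACHES = {
    "default": {"BACKEND": "django.core.cache.backends.locmem.LocMemCache"},
    # A separate alias so the read cache can live on its own Redis instance
    "read-cache": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://localhost:6379/1",  # illustrative URL
    },
}

CACHALOT_ENABLED = True        # the opt-in flag
CACHALOT_TIMEOUT = 3600        # cache TTL in seconds (1 hour)
CACHALOT_CACHE = "read-cache"  # which cache alias cachalot writes to
```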
Generally, each API request that relies on the SQL database goes through the Django ORM, which leads to about 50 to 150 SQL queries per request. The Django ORM is definitely not optimized for database performance. Adding a database read cache greatly improves response time without having to rewrite our application code, as discussed in #8478 (comment).
The various performance issues in #6092 can probably be improved with this PR.
How does it work?
Each SQL request generated in the Django application context (Paperless, Celery, management commands, Django migrations, etc.) is intercepted and hashed. If a Redis key with this hash exists, the SQL request has already been cached, so the value is retrieved directly from Redis without querying the database. Otherwise, the DB is queried and the result is stored in Redis.
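The lookup flow can be sketched as follows (illustrative only: django-cachalot actually hooks into the ORM's SQL compiler, and a plain dict stands in for Redis here so the sketch is self-contained):

```python
import hashlib
import json

cache = {}  # stands in for Redis in this sketch


def cached_query(sql, params, execute):
    """Return the result for (sql, params), hitting the database only on a miss."""
    key = hashlib.sha1(json.dumps([sql, list(params)]).encode()).hexdigest()
    if key in cache:
        return cache[key]            # hit: no database round-trip
    result = execute(sql, params)    # miss: run the query for real
    cache[key] = result
    return result
```

A second identical query is then served entirely from the store, without reaching the database.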
Each database modification (UPDATE, INSERT, DELETE, ALTER, CREATE or DROP) is intercepted, and all hashed queries that relate to the affected table(s) are removed from the Redis cache. This ensures that no stale data is stored in the cache.
For instance, if you update one document's title, all the requests that select or join from the `documents` table are invalidated from the cache. This may sound inefficient, but it seems safer to avoid stale data in the cache.

More information on how django-cachalot works: https://django-cachalot.readthedocs.io/en/latest/introduction.html
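The table-level invalidation can be sketched in the same spirit (an assumption-laden illustration: django-cachalot keeps comparable table-to-key bookkeeping internally, but the names and data structures here are made up):

```python
table_keys = {}  # table name -> set of cache keys whose queries read it


def remember(key, tables):
    """Record which tables a cached query result depends on."""
    for table in tables:
        table_keys.setdefault(table, set()).add(key)


def invalidate_table(table, cache):
    """On any write to `table`, drop every cached query that touched it."""
    for key in table_keys.pop(table, set()):
        cache.pop(key, None)
```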
Performance impact
After more than one month of personal testing, this feature allows me to keep using Paperless-Ngx instead of giving up. On machines with a weak CPU or slow disk IO, this is probably a night-and-day difference.
I've checked with the Django debug toolbar: even after a modification, most of the SQL queries are still cached (user rights, tags, correspondents, etc.). This greatly reduces the number of SQL queries for the average user workflow. Here are a few examples of operations that are greatly accelerated:
The expected additional RAM usage depends on the TTL you set. Once the cache expires, the RAM usage should be negligible.
The default 1-hour TTL is a very safe setting that won't increase RAM usage a lot, and will still help improve performance.
If you want to benefit as much as possible from this cache, you can specify a longer TTL (I've set one month). In that case, you might run a separate Redis instance dedicated to the cache, with a key eviction policy: the main Redis instance has no eviction policy by default, so an out-of-memory situation would block new insertions (not just cache entries, but also scheduled tasks, document consumption, etc.). You may also disable disk persistence; I don't mind losing the cache when the system is shut down.
Here is the docker command I use for my second Redis instance:
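The command itself wasn't captured here; a plausible equivalent matching the settings described above (memory cap with LRU eviction, persistence disabled; the container name, host port, memory limit, and image tag are illustrative) would be:

```shell
# Dedicated Redis instance for the read cache (illustrative values)
docker run -d --name paperless-redis-cache -p 6380:6379 redis:7 \
  redis-server \
  --maxmemory 64mb \
  --maxmemory-policy allkeys-lru \
  --save "" \
  --appendonly no
```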
Type of change
Checklist:
- `pre-commit` hooks, see documentation.