Persist / Restore of large file takes a lot of time #932


Open
JpEncausse opened this issue May 5, 2025 · 4 comments

@JpEncausse

Describe the bug

I have a small database of 7,700 entries with embeddings:

  • the raw JSON file on disk takes 50 MB
  • the database persisted as JSON on disk takes 222 MB
  • the database persisted as binary on disk takes 262 MB

Restoring the database takes several minutes, whether from JSON or binary, and uses about 3 GB of memory.
I don't understand why it takes so long, since the index should already have been built when the database was persisted.

To Reproduce

  1. persist(db, 'binary') a database of 7,000 entries with embeddings
  2. restore('binary', config.payload)
  3. It takes about 10 minutes, the CPU keeps switching between 100% and 0% (which is odd), and memory usage grows until the process crashes

If I instead create a fresh database and insertMultiple() the raw JSON + embeddings, it takes less than 1 minute and about 1 GB of memory.
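Roughly what I'm doing, as a simplified sketch (the schema, field names, and file paths here are illustrative, not my exact config):

```js
import { create, insertMultiple } from '@orama/orama'
import { persist, restore } from '@orama/plugin-data-persistence'
import { readFile, writeFile } from 'node:fs/promises'

// Build the database: ~7,000 documents, each with a 640-dimension embedding.
const db = create({
  schema: {
    text: 'string',
    embedding: 'vector[640]',
  },
})
const docs = JSON.parse(await readFile('entries.json', 'utf8'))
await insertMultiple(db, docs)

// Persisting works, but the binary file ends up at ~262 MB on disk.
const payload = await persist(db, 'binary')
await writeFile('orama.bin', payload)

// This is the slow part: restoring takes ~10 minutes and ~3 GB of memory.
const data = await readFile('orama.bin', 'utf8') // or a Buffer, depending on the format
const restored = await restore('binary', data)
```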

Expected behavior

Loading the binary data should be straightforward. Even if memory allocation takes some time, it shouldn't take this long.

Environment Info

- Linux
- Node-RED v4
- Orama 3.1.6

Affected areas

Initialization

Additional context

No response

@micheleriva
Member

Hi @JpEncausse, thanks for opening this. What embedding model are you using, and how many dimensions does it have?

@JpEncausse
Author

Hello, the embedding model is OpenAI Large 3 with 640 dimensions.
My workaround is to simply rebuild the DB and fill it from a separate JSON file kept alongside it. Very fast.
But I assume that if the DB evolves, I'd have to maintain that JSON file.
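The workaround looks roughly like this (file name and schema are just examples, not my exact setup):

```js
import { create, insertMultiple } from '@orama/orama'
import { readFile } from 'node:fs/promises'

// Skip restore() entirely and rebuild the index from the raw documents.
// entries.json holds the documents plus their precomputed embeddings (~50 MB).
const docs = JSON.parse(await readFile('entries.json', 'utf8'))

const db = create({
  schema: {
    text: 'string',
    embedding: 'vector[640]',
  },
})

// Rebuilding this way takes < 1 minute and ~1 GB of memory,
// versus ~10 minutes and ~3 GB with restore('binary', ...).
await insertMultiple(db, docs)
```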

@micheleriva
Member

OK, so 640 dimensions * 4 bytes per dimension (f32) * 7,700 entries ≈ 19.7 MB of embeddings, so the embeddings themselves are not likely the problem. I'm wondering if using SeqProto (https://github.com/oramasearch/seqproto) would speed this up... it certainly compresses far better than JSON does. Let me talk with @allevo and see if we can use it.

@JpEncausse
Author

Yes, it's not that big, so I was surprised it was so slow and heavier than simply loading the JSON :-)
