Replies: 1 comment 2 replies
-
Hello, thank you for the feedback! Most tasks seem manageable with a 14B parameter model. I've done all the development so far by running the ollama_server.py script on my gaming machine (RTX 3060 with 12GB VRAM, 16GB RAM) and connecting via SSH from my MacBook, and it's worked quite well. But you raise an important issue: deepseek 7B does perform poorly, and agenticSeek also loads other Hugging Face models (router, TTS, STT, summarization), which can cause crashes for a lot of users.
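For anyone wanting to reproduce this setup, here is a minimal sketch of how the remote connection can be verified from the laptop side, assuming an SSH tunnel forwards Ollama's default port (11434) and using the official ollama Python client. The model tag deepseek-r1:14b and the host user@gaming-machine are illustrative placeholders, not the project's actual configuration:

```python
# Sketch: check that a remote Ollama server is reachable through an SSH tunnel.
# Assumes the tunnel was opened first with something like:
#   ssh -L 11434:localhost:11434 user@gaming-machine
# so the remote Ollama instance appears on localhost:11434.
import ollama

# Point the client at the forwarded port on the local end of the tunnel.
client = ollama.Client(host="http://localhost:11434")

# "deepseek-r1:14b" is a placeholder; use whatever model is pulled on the server.
response = client.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(response["message"]["content"])
```

If this prints a reply, the tunnel and server are working and agenticSeek can be pointed at the same address.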
-
Nice work, but running the full DeepSeek R1 will not be possible for 90-95%+ of users, so I wanted to know whether you have tested any good local LLM models under 32B parameters that can handle small to mid-level tasks without much trouble.
It would be best if you could list both reasoning and non-reasoning ones, because we don't need unnecessary reasoning for straightforward tasks.