how to calculate word weight #7855

silear · 2025-05-26T07:46:32Z

silear
May 26, 2025

I want to do a rag of vertical data. I see that many hyperparameters are defined in the code for calculating word weights in ragflow, for example, the weights of some words under ner.json. What is the consideration of this? If I want to use my own data, Which part of the weight adjustment will bring better results?

@dosu-bot · 2025-05-26T07:49:00Z

dosubot[bot]
bot May 26, 2025

Word weight calculation in RAGFlow is based on several factors: term frequency, document frequency, named entity recognition (NER), part-of-speech tagging, and inverse document frequency (IDF). The calculation is implemented in the weights function, which combines these elements to assign importance to each word. The final weights are normalized so that their sum is 1 for each set of terms being processed. This approach helps the system determine which words are most significant for retrieval and ranking purposes [source].

Hyperparameters play a key role in this process. For example:

The similarity_threshold sets the minimum similarity score required for a block to be considered relevant (default is 0.2).
The keywords_similarity_weight (or vector_similarity_weight in some contexts) determines how much weight is given to keyword similarity versus vector (embedding) similarity in the final retrieval score. The sum of these weights should be 1.0 [source].

ner.json is used to map specific words to entity categories such as 'toxic', 'stock', etc. In the weighting function, words that match certain NER categories are assigned different weights. For example, 'stock' entities might get a higher weight than generic words. This allows you to emphasize or de-emphasize certain types of entities in your retrieval process [source].

If you want to use your own data and optimize results:

Adjust the weights in ner.json to reflect the importance of different entity types in your vertical domain. For example, if company names are more important in your data, increase the weight for the 'corp' category.
Tune the keywords_similarity_weight and similarity_threshold hyperparameters to balance between keyword and vector similarity, and to control the strictness of retrieval.
Consider modifying the weighting logic for part-of-speech tags or frequency if your domain has unique language patterns.

Experimenting with these settings—especially the NER category weights and the similarity weights—will have the most direct impact on retrieval quality for your specific data. You may need to iterate and validate the results on your own dataset to find the optimal configuration.

To continue talking to Dosu, mention @dosu-bot.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other}

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

InfiniFlow

how to calculate word weight #7855

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

InfiniFlow

how to calculate word weight #7855

Uh oh!

silear May 26, 2025

Replies: 1 comment

Uh oh!

dosubot[bot] bot May 26, 2025

silear
May 26, 2025

dosubot[bot]
bot May 26, 2025