[Question]: When I manually insert text segment markers into the document and modify them to my chosen words when creating the knowledge base, the segmentation effect still does not meet my expectations. #7585

Open
4 tasks done
mu-dan opened this issue May 12, 2025 · 7 comments
Labels
🐞 bug Something isn't working, pull request that fix bug. 🙋‍♀️ question Further information is requested

Comments

@mu-dan

mu-dan commented May 12, 2025

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Describe your problem

When creating the knowledge base, I manually inserted text segment markers into the document and set the delimiter to my own marker, but the resulting segmentation still does not match what I expect.
Image

Image

For the uploaded document, I expect 76 segments, because my segment marker appears 76 times in the original document.
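
A quick way to confirm the expected count independently of any platform is to count the marker in the raw text. The following is a minimal sketch, assuming the document is available as a plain string and the marker is `##`; the function names and the sample text are hypothetical and only for illustration.

```python
def count_marker(text: str, marker: str) -> int:
    """Count non-overlapping occurrences of the marker in the text."""
    if not marker:
        raise ValueError("marker must be non-empty")
    return text.count(marker)


def expected_segments(text: str, marker: str) -> int:
    """Count the non-empty pieces produced by splitting on the marker alone."""
    return sum(1 for piece in text.split(marker) if piece.strip())


# Hypothetical sample: three markers, each followed by some content.
sample = "##Intro text\n##Second part\n##Third part\n"
print(count_marker(sample, "##"), expected_segments(sample, "##"))
```

If the two numbers differ for the real document, some markers are adjacent or have no content after them, which would already explain part of a mismatch.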

Image

Image

Image
However, after uploading, there are only 40 segments. I have also tried other embedding models, but the result still does not meet my expectation (on another platform, with the same embedding model, I get 76 segments).

@mu-dan mu-dan added the 🙋‍♀️ question Further information is requested label May 12, 2025
@dosubot dosubot bot added the 🐞 bug Something isn't working, pull request that fix bug. label May 12, 2025

dosubot bot commented May 12, 2025

The issue you're experiencing with the segmentation not meeting your expectations, despite manually inserting text segment markers, could be influenced by several factors in RAGFlow's configuration:

  1. Slicing Method: The 'General' slicing method splits the document into segments and combines adjacent segments until the token count exceeds the threshold specified by 'Chunk token number for text'. This might be affecting the number of segments you end up with, as it doesn't solely rely on the text segment identifier [1].

  2. Recommended Text Block Size: Although I couldn't find specific information on how this setting influences segmentation, it might be interacting with the slicing method to determine the final segment count.

  3. Text Segment Identifier: The identifier '##' is used for segmentation, but its effect might be overridden by the slicing method or other settings. Unfortunately, I couldn't find detailed information on how this identifier specifically influences the segmentation process.

  4. Embedding Model: Different embedding models can affect the segmentation outcome due to their interaction with the text processing settings. Consistency in embedding models across platforms is crucial, as different models might handle text segmentation differently [2][3].

To achieve your desired segmentation of 76 segments, you might need to adjust the 'Chunk token number for text' or explore other slicing methods if available. Additionally, ensure that the embedding model and other settings are consistent with those used on other platforms where you achieved the expected segmentation.
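
The merge behavior described in point 1 can be sketched as follows. This is an illustration of the idea only, not RAGFlow's actual implementation: the function name is hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer will not match.

```python
def split_then_merge(text: str, delimiter: str, max_tokens: int) -> list[str]:
    """Split on the delimiter, then greedily merge adjacent pieces.

    A new chunk is started only once the current chunk's token count
    exceeds max_tokens. Token counting here is a word-count approximation.
    """
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: list[str] = []
    current = ""
    current_tokens = 0
    for piece in pieces:
        # Flush the current chunk once it has grown past the threshold.
        if current and current_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = "", 0
        current = f"{current}\n{piece}" if current else piece
        current_tokens += len(piece.split())
    if current:
        chunks.append(current)
    return chunks


doc = "##one two three##four five##six seven eight nine##ten"
print(len(split_then_merge(doc, "##", 4)))  # larger threshold: pieces are merged
print(len(split_then_merge(doc, "##", 1)))  # tiny threshold: one chunk per piece
```

Under this model, four marked pieces collapse into two chunks with a threshold of 4, while a threshold of 1 keeps all four, which matches the pattern of 76 markers yielding only 40 segments when the threshold is large.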

@mu-dan
Author

mu-dan commented May 12, 2025

I have also changed the text block size and the text segment marker itself, but the final result still does not meet my expectations.

@KevinHuSh
Collaborator

KevinHuSh commented May 13, 2025

If you want to split the text by the delimiter only, turn the chunk token size down, for example to 2.
And since it is a multi-character delimiter, you should wrap it in a pair of backticks (`), like `##`
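
The difference between a bare, a quoted, and a backtick-wrapped delimiter can be sketched like this. It assumes the convention described in this comment (backtick-wrapped text is one multi-character delimiter, and every other character is its own single-character delimiter); the function names are hypothetical and this is not RAGFlow's actual parser.

```python
import re


def parse_delimiters(setting: str) -> list[str]:
    """Return the delimiters described by a settings string.

    Assumed convention: `...` is one multi-character delimiter; every
    remaining character is a separate single-character delimiter.
    """
    multi = re.findall(r"`([^`]+)`", setting)
    rest = re.sub(r"`[^`]+`", "", setting)
    singles = list(dict.fromkeys(rest))  # keep order, drop repeats
    return multi + singles


def split_on(text: str, delimiters: list[str]) -> list[str]:
    """Split text on any of the delimiters, trying longer ones first."""
    if not delimiters:
        return [text]
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return [p for p in re.split(pattern, text) if p.strip()]


print(parse_delimiters("`##`"))  # one multi-character delimiter
print(parse_delimiters("'##'"))  # quotes are not backticks: "'" and "#" become delimiters
print(split_on("a#b##c", parse_delimiters("`##`")))
print(split_on("a#b##c", parse_delimiters("##")))
```

Under this assumption, wrapping the marker in ordinary quotes makes every single `#` and every `'` a split point, so text that contains a lone `#` is cut in places the author never marked.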

@mu-dan
Author

mu-dan commented May 13, 2025

Still not working. Even with the quotes added, the same problem still occurs within the segments.

Image

Image

@mu-dan
Author

mu-dan commented May 13, 2025

When chunking by a single delimiter, the length of each block in my document varies. If I set the recommended text block size small, a block that should stay whole (one with many characters) tends to be split into two.
If I set it large, I get the situation described at the top: 40 segments.

@KevinHuSh
Collaborator

Be careful! It's not a pair of quotation marks; it's a pair of backticks (`).

@mu-dan
Author

mu-dan commented May 14, 2025

Image
That was my oversight, but after correcting it, the segmentation is still not ideal.
