[Question]: When I manually insert text segment markers into the document and modify them to my chosen words when creating the knowledge base, the segmentation effect still does not meet my expectations. #7585

mu-dan · 2025-05-12T08:26:40Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (Language Policy).
Non-English title submissions will be closed directly (Language Policy).
Please do not modify this template :) and fill in all the required fields.

Describe your problem

When I manually insert text segmentation markers into a document and set the delimiter to my own marker when creating the knowledge base, the segmentation still does not match what I expect.

For the uploaded document, I expect 76 segments, because the delimiter appears 76 times in the original document.

After uploading, however, I get only 40 segments. I have also tried other embedding models, but the result still falls short of what I expect (on another platform with the same embedding model, the document is split into 76 segments).

dosubot · 2025-05-12T08:29:35Z

The issue you're experiencing with the segmentation not meeting your expectations, despite manually inserting text segment markers, could be influenced by several factors in RAGFlow's configuration:

Slicing Method: The 'General' slicing method splits the document into segments and combines adjacent segments until the token count exceeds the threshold specified by 'Chunk token number for text'. This might be affecting the number of segments you end up with, as it doesn't solely rely on the text segment identifier ^[1].
Recommended Text Block Size: Although I couldn't find specific information on how this setting influences segmentation, it might be interacting with the slicing method to determine the final segment count.
Text Segment Identifier: The identifier '##' is used for segmentation, but its effect might be overridden by the slicing method or other settings. Unfortunately, I couldn't find detailed information on how this identifier specifically influences the segmentation process.
Embedding Model: Different embedding models can affect the segmentation outcome due to their interaction with the text processing settings. Consistency in embedding models across platforms is crucial, as different models might handle text segmentation differently ^[2]^[3].

To achieve your desired segmentation of 76 segments, you might need to adjust the 'Chunk token number for text' or explore other slicing methods if available. Additionally, ensure that the embedding model and other settings are consistent with those used on other platforms where you achieved the expected segmentation.
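The merging behavior described above can be illustrated with a minimal sketch. This is not RAGFlow's actual code: `count_tokens` and `merge_segments` are hypothetical helpers, and whitespace-separated words stand in for real tokens. It only shows why a large 'Chunk token number for text' can yield fewer chunks than there are delimiters.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (assumption): one token per word.
    return len(text.split())


def merge_segments(segments: list[str], chunk_token_num: int) -> list[str]:
    """Append segments to the current chunk until it exceeds the threshold."""
    chunks: list[str] = []
    current = ""
    for seg in segments:
        seg = seg.strip()
        if not seg:
            continue
        if current and count_tokens(current) > chunk_token_num:
            # The current chunk is already over the limit: close it, start anew.
            chunks.append(current)
            current = seg
        else:
            # Still under the limit: keep merging adjacent segments.
            current = f"{current}\n{seg}" if current else seg
    if current:
        chunks.append(current)
    return chunks


# 76 delimiter-separated segments of five words each.
segments = [f"section {i} has five words" for i in range(76)]

merged = merge_segments(segments, chunk_token_num=8)      # segments get paired up
unmerged = merge_segments(segments, chunk_token_num=2)    # every segment stands alone
print(len(merged), len(unmerged))
```

In this toy setup, a threshold larger than one segment causes neighbors to be merged, while a threshold smaller than any segment leaves one chunk per segment.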


mu-dan · 2025-05-12T08:38:43Z

I have also changed the text block size and the content of the text segmentation marker, but the final result still does not meet my expectations.

KevinHuSh · 2025-05-13T01:28:13Z

If you want to split the text by the delimiter only, turn the chunk token size down, for example to 2.
And since it is a multi-character delimiter, you should wrap it in backticks (`), like `##`.
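To make the backtick convention concrete, here is a small sketch of how a delimiter setting could be interpreted. It is an illustration only, not RAGFlow's implementation: `parse_delimiters` and `split_by_delimiters` are hypothetical names, and the rule assumed is that a backtick-wrapped run is one delimiter while every other character is a delimiter on its own.

```python
import re


def parse_delimiters(spec: str) -> list[str]:
    """Backtick-wrapped runs become one delimiter; other characters stand alone."""
    multi = re.findall(r"`([^`]+)`", spec)
    rest = re.sub(r"`[^`]+`", "", spec)
    singles = list(rest)
    # Longest first so that "##" is tried before "#".
    return sorted(set(multi + singles), key=len, reverse=True)


def split_by_delimiters(text: str, spec: str) -> list[str]:
    delimiters = parse_delimiters(spec)
    if not delimiters:
        return [text]
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


text = "intro # not a break ## part one"

# Backticks: "##" is treated as a single two-character delimiter.
print(split_by_delimiters(text, "`##`"))

# Quotation marks: "'" and "#" are each treated as single-character delimiters,
# so the lone "#" also splits the text.
print(split_by_delimiters(text, "'##'"))
```

Under these assumptions, wrapping the marker in quotation marks instead of backticks changes where the text is cut, which is one way the segment count can drift from the expected number.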

mu-dan · 2025-05-13T04:29:35Z

Still not working. Even with the quotes added, the same problem still occurs in the segmentation.

mu-dan · 2025-05-13T04:34:45Z

If I split by the delimiter only, the length of each block in the document varies, so setting the recommended text block size too small tends to split what should be a single segment (one with many characters) into two.
If I set it larger, I get the 40-segment result described at the top.

KevinHuSh · 2025-05-14T04:10:34Z

Be careful! It is a pair of backticks, not a pair of quotation marks.

mu-dan · 2025-05-14T06:50:36Z

That was my oversight, but after correcting it, the segmentation is still not ideal.

mu-dan added the 🙋‍♀️ question Further information is requested label May 12, 2025

dosubot bot added the 🐞 bug Something isn't working, pull request that fix bug. label May 12, 2025

This was referenced May 29, 2025

[Question]: The order of the chunk parsed by General is inconsistent with that of the original text #7943

Closed

[Question]: Why are the number of chunks the same after resolution for different chunk token numbers? #7945

Closed
