Legitimate duplicate text in textbox in docx is being unexpectedly removed #1668

Open
xenv opened this issue May 28, 2025 · 1 comment
Labels
bug Something isn't working

Comments

xenv commented May 28, 2025

Bug

This bug was introduced by #1538 (@AndrewTsai0406).

Input (in a textbox in docx):

abcd  
abc  
abcd  <-- valid duplicate
qqq  
aaa

Output currently:

abcd
abc
qqq 
aaa

The line containing the valid duplicate ("abcd", the third line of the input) is missing.
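The missing line is what one would expect if paragraphs were deduplicated by their text alone. The sketch below is only an illustration of that failure mode, not docling's actual code; the function name `dedupe_by_text` and the list-of-strings input are assumptions made for the example.

```python
def dedupe_by_text(lines: list[str]) -> list[str]:
    """Keep only the first occurrence of each line, compared by stripped text.

    Hypothetical helper: illustrates text-keyed deduplication in general,
    not the logic docling uses internally.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for line in lines:
        text = line.strip()
        if text in seen:
            # A later line with the same text is dropped, even if the
            # author of the document wrote it twice on purpose.
            continue
        seen.add(text)
        kept.append(text)
    return kept


# The textbox content from this report.
textbox_lines = ["abcd", "abc", "abcd", "qqq", "aaa"]
output = dedupe_by_text(textbox_lines)
print("\n".join(output))
```

With this input the second "abcd" is discarded, which matches the output shown in the report.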

Steps to reproduce

test-textbox.docx

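from docling.document_converter import DocumentConverter

# Convert the attached file and export the result to Markdown
converter = DocumentConverter()
doc = converter.convert("test-textbox.docx").document
original_markdown_text = doc.export_to_markdown()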

Docling version

2.34.0
...

Python version

3.13.3

@xenv xenv added the bug Something isn't working label May 28, 2025

xenv commented May 28, 2025

I commented out this line to fix the issue I encountered, although this approach creates quite a few blank lines. I'm not entirely sure why it was designed this way originally.
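One possible middle ground, offered as a sketch under assumptions rather than a proposed patch: if the deduplication exists because the same textbox can be encountered more than once while parsing (DOCX files commonly store a textbox both as a drawing and as a VML fallback inside `mc:AlternateContent`), the comparison could be made per textbox rather than per line. The function `dedupe_textboxes` and its list-of-lists input are hypothetical and do not correspond to any docling API.

```python
from typing import Iterable


def dedupe_textboxes(textboxes: Iterable[list[str]]) -> list[list[str]]:
    """Drop repeated textboxes while keeping repeated lines inside each one.

    Hypothetical helper: each textbox is a list of paragraph texts, and two
    textboxes count as duplicates only if all of their lines match in order.
    """
    seen: set[tuple[str, ...]] = set()
    kept: list[list[str]] = []
    for lines in textboxes:
        # The key is the whole sequence of lines, not any single line.
        key = tuple(line.strip() for line in lines)
        if key in seen:
            continue
        seen.add(key)
        kept.append(list(lines))
    return kept


# Assumed scenario: the same textbox is seen twice during parsing.
box = ["abcd", "abc", "abcd", "qqq", "aaa"]
result = dedupe_textboxes([box, list(box)])
print(result)
```

Here the second copy of the textbox is dropped, while the repeated "abcd" inside it survives. The trade-off is that two separate textboxes with identical content would also collapse into one, so a real fix would likely need positional or structural information as well.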
