[Bug]: Default Pptx reader with fs directory reader (azure in this case) not working, Path Error #18806

federicocaccialanzaabb · 2025-05-22T10:20:45Z

Bug Description

I recently found a bug while trying to load a .pptx file through a directory reader backed by an fs (Azure Blob Storage in my case).

The error thrown as a warning is the following: 'PurePosixPath' object has no attribute 'startswith'. Skipping...

This leads to the file not being processed.
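The error message can be reproduced with the standard library alone. The sketch below assumes that some code on the filesystem side calls a str method on the path it receives; strip_leading_slash is a hypothetical helper written for illustration, not a function from fsspec, adlfs, or LlamaIndex.

```python
from pathlib import PurePosixPath


def strip_leading_slash(path):
    # Hypothetical helper: like some filesystem backends, it assumes `path` is a str.
    return path[1:] if path.startswith("/") else path


file = PurePosixPath("placeholder_container_name/placeholder_file_name.pptx")

try:
    strip_leading_slash(file)
except AttributeError as err:
    # Path objects do not implement str methods such as startswith.
    error_message = str(err)
    print(error_message)

# Casting to str before handing the path over avoids the AttributeError.
cleaned = strip_leading_slash(str(file))
print(cleaned)
```

Path objects deliberately do not expose str methods, so any code path that expects a plain string needs an explicit str(file) at the boundary.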

The error is raised in the PptxReader class, in:
llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/slides/base.py

Reference in the LlamaIndex docs: PptxReader

In the load_data function, lines 97 to 101, the code is the following:

if fs:
    with fs.open(file) as f:
        presentation = Presentation(f)
else:
    presentation = Presentation(file)

It should instead be the following (casting the Path object to a string and reading the bytes afterwards):

if fs:
    with fs.open(str(file)) as f:
        presentation = Presentation(io.BytesIO(f.read()))
else:
    presentation = Presentation(file)

If you want, you can also look at how the PDF reader handles this.
Link: PdfReader

code:

fs = fs or get_default_fs()
with fs.open(str(file), "rb") as fp:
    # Load the file in memory if the filesystem is not the default one to avoid
    # issues with pypdf
    stream = fp if is_default_fs(fs) else io.BytesIO(fp.read())

    # Create a PDF object
    pdf = pypdf.PdfReader(stream)

It uses the is_default_fs(fs) method and then does more or less what I did.
The code could also be changed to use only Path from pathlib and no strings, but I think this fix is easier and less time-consuming.

Version

llama-index==0.12.35, llama-index-core==0.12.35, llama-index-readers-file==0.4.7

Steps to Reproduce

Just try to load a .pptx file in a pipeline or any code that uses an fs with a directory reader
(Azure for sure; I haven't tested others like AWS or GCP, but it should trigger whenever a Path rather than a string is passed).

Relevant Logs/Tracebacks

warnings.warn(
Failed to load file placeholder_container_name/placeholder_file_name.pptx with error: 'PurePosixPath' object has no attribute 'startswith'. Skipping...

and then the file is not processed
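Because the failure only surfaces as a warning, a skipped file is easy to miss. The sketch below assumes the directory reader catches per-file exceptions and warns instead of raising, which is what the log above suggests; load_one and load_all are hypothetical names, not LlamaIndex functions. It records the warnings and compares input and output counts to detect skipped files.

```python
import warnings
from pathlib import PurePosixPath


def load_one(path):
    # Hypothetical loader that assumes `path` is a str, so a Path object fails.
    if not path.startswith("placeholder_container_name/"):
        raise ValueError("unexpected container")
    return [f"document from {path}"]


def load_all(paths):
    """Load every path, warning and skipping on failure (assumed behaviour)."""
    docs = []
    for path in paths:
        try:
            docs.extend(load_one(path))
        except Exception as err:
            warnings.warn(f"Failed to load file {path} with error: {err}. Skipping...")
    return docs


paths = [PurePosixPath("placeholder_container_name/placeholder_file_name.pptx")]

# Record warnings instead of printing them so they can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    docs = load_all(paths)

skipped = len(paths) - len(docs)
print(f"{len(docs)} loaded, {skipped} skipped, {len(caught)} warning(s)")
```

In a real pipeline, a similar count check or running Python with -W error would turn the silent skip into a visible failure.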

dosubot · 2025-05-22T10:25:20Z

The issue you're encountering with the PptxReader and Azure Blob Storage is due to the use of a Path object where a string is expected, resulting in the error: 'PurePosixPath' object has no attribute 'startswith'. This happens because the current implementation tries to open the file using fs.open(file), which can pass a Path object to libraries that expect a string path.

Your proposed fix—casting the Path object to a string and reading the bytes with io.BytesIO—is consistent with how the PdfReader handles similar scenarios. The PdfReader converts the file path to a string and uses io.BytesIO(fp.read()) when the filesystem is not the default one, which avoids compatibility issues with non-local filesystems like Azure Blob Storage. This approach is considered best practice for handling file-like objects across different filesystems and aligns with the current implementation in the codebase (see PDFReader implementation).

The relevant section of the PptxReader should be updated as follows:

if fs:
    with fs.open(str(file), "rb") as f:
        presentation = Presentation(io.BytesIO(f.read()))
else:
    presentation = Presentation(file)

This change should resolve the error and bring the PptxReader in line with the PdfReader's handling of file streams.

I did not find any existing issues or discussions in the repository specifically about this 'PurePosixPath' error, nor did I identify recent commits that introduced this problem. The root cause appears to be a general mismatch between expected types when working with non-local filesystems.

If you implement this fix, it should address the issue for Azure and likely for other remote filesystems as well.

logan-markewich · 2025-05-23T22:26:18Z

@federicocaccialanzaabb feel free to open a PR!

federicocaccialanzaabb · 2025-05-27T09:12:11Z

@logan-markewich opened here: PR 18862

federicocaccialanzaabb added the labels bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) on May 22, 2025

federicocaccialanzaabb mentioned this issue on May 27, 2025 in "Fix/llama index readers file pptx" #18862 (merged)

AstraBert closed this as completed May 30, 2025
