Run MarkItDown On Local And Check The Accuracy Of Some Files Types Like PPT, Excel, Images

Link github:

https://github.com/microsoft/markitdown

git clone [email protected]:microsoft/markitdown.git

cd markitdown

pip install -e packages/markitdown

git clone git@github.com:microsoft/markitdown.git

Now try it

First I try it with this PDF document

PDF document

Then, this is result


It looks good!

Then, I try it with PPTX file

This is the result. It seems to be able to get quite complete information on the slide including links, page numbers, comments, etc. but it cannot get information on images, shapes or charts.

Finally, I try it with Excel file

It seems to be able to get the content of all sheets. However it takes blank cells and sets the value to NaN. Also it can’t get the text on the image.

I think this output needs to be processed further if I want to use it.

Thanks for reading!