GitHub link:
https://github.com/microsoft/markitdown
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
Now let's try it.
First, I tried it with a PDF document.
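For anyone who prefers to run the conversion from Python rather than the command line, here is a minimal sketch. It relies on the `MarkItDown().convert(...)` call and the `text_content` attribute of its result. The helper names `convert_to_markdown` and `save_markdown`, and the optional `converter` parameter, are my own assumptions for illustration and are not part of the library.

```python
from pathlib import Path


def convert_to_markdown(path, converter=None):
    """Return the Markdown text MarkItDown produces for one file."""
    if converter is None:
        # Imported lazily so this module still loads where markitdown is absent.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(str(path))
    return result.text_content


def save_markdown(path, converter=None):
    """Write the Markdown next to the input file and return the output path."""
    source = Path(path)
    target = source.with_suffix(".md")  # e.g. report.pdf -> report.md
    target.write_text(convert_to_markdown(source, converter), encoding="utf-8")
    return target
```

Calling `save_markdown("example.pdf")` (a placeholder file name) should leave an `example.md` file beside the PDF. The same two helpers should work for the PPTX and Excel files as well, since the file type is detected by the library.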
Here is the result.
It looks good!
Next, I tried it with a PPTX file.
This is the result. It seems to extract fairly complete information from the slides, including links, page numbers, and comments, but it cannot extract information from images, shapes, or charts.
Finally, I tried it with an Excel file.
It seems to get the content of all sheets. However, it reads blank cells as NaN. It also can’t get the text inside images.
I think this output needs further processing before I can use it.
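As one example of that further processing, here is a rough sketch that blanks out the NaN cells in the Excel output. It assumes each sheet comes back as a Markdown pipe table (rows that start and end with `|`). The function name `blank_nan_cells` is hypothetical and this is plain string handling, not something markitdown provides.

```python
def blank_nan_cells(markdown: str) -> str:
    """Replace table cells that contain exactly "NaN" with empty cells."""
    cleaned = []
    for line in markdown.splitlines():
        stripped = line.strip()
        # Only touch lines that look like Markdown table rows.
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            cells = ["" if cell == "NaN" else cell for cell in cells]
            line = "| " + " | ".join(cells) + " |"
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "\n".join(
    [
        "## Sheet1",
        "| Name | Score |",
        "| --- | --- |",
        "| Ann | NaN |",
        "| NaN | 3 |",
    ]
)
print(blank_nan_cells(sample))
```

Two caveats: a cell whose real value is the text "NaN" is blanked too, and a cell containing an escaped pipe (`\|`) would be split incorrectly, so this is only a starting point.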
Thanks for reading!