docx parsing time is slow compared to docx2txt tool #162

Open
hwo411 opened this issue Dec 20, 2023 · 0 comments

hwo411 commented Dec 20, 2023

Hello!

We recently stress-tested the library in our app and noticed that its docx parsing is much slower than other tools on fairly large files.

Example docx file:
https://tolstoy.ru/upload/iblock/b22/voina-i-mir.docx

The tools we compared the library to:

  1. https://docx2txt.sourceforge.net/
  2. https://github.com/jgm/pandoc

On my laptop (Ryzen 5800H, 64GB RAM) the library parses this file in around 40 seconds.
Pandoc performs similarly.

But docx2txt parses it in under a second.

On our servers the difference is much bigger, since we aren't running powerful hardware there yet.

Is there something that can be improved in the docx parsing to make it comparable to docx2txt? At first glance the output is similar, so docx2txt doesn't seem to be trading quality for speed.
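
For reference, here is a minimal Python sketch of the kind of lightweight extraction that can run this fast. It rests on an assumption about the general approach (open the docx as a zip archive, stream `word/document.xml`, keep only the text runs) and is not docx2txt's actual code. `extract_docx_text` and `timed_extract` are hypothetical names used only for illustration.

```python
import time
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the plain text of a .docx, given a path or a binary file object."""
    paragraphs = []
    with zipfile.ZipFile(source) as archive:
        with archive.open("word/document.xml") as xml_file:
            current = []
            # Stream the XML element by element instead of building a full DOM.
            for _event, elem in ET.iterparse(xml_file, events=("end",)):
                if elem.tag == W_NS + "t":        # a run of text
                    current.append(elem.text or "")
                elif elem.tag == W_NS + "tab":    # an explicit tab character
                    current.append("\t")
                elif elem.tag == W_NS + "p":      # end of a paragraph
                    paragraphs.append("".join(current))
                    current = []
                    elem.clear()                  # drop the paragraph's children
    return "\n".join(paragraphs)


def timed_extract(source):
    """Return (text, elapsed_seconds) for a single extraction."""
    start = time.perf_counter()
    text = extract_docx_text(source)
    return text, time.perf_counter() - start
```

Calling `timed_extract("voina-i-mir.docx")` on a local copy of the example file gives a rough baseline to compare against the library's timing. The sketch ignores styles, footnotes, headers, and nested structures such as text boxes, so it only indicates how cheap the plain-text path can be.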

I also want to mention that parsing a PDF file with the same content as this docx file takes less time (around 4 seconds), even though the PDF is larger (30MB vs 4MB).
