New LaTeX OCR model; block visualizer; better links/references
Improved LaTeX OCR
We trained a new LaTeX OCR model that works much better overall. It reliably outputs KaTeX-compatible math and handles longer sequences than the previous model.
The original document is on the left and the rendered output is on the right:
![image](https://private-user-images.githubusercontent.com/913340/407822592-a3158fd5-a027-4798-a58e-bf8e30af8d42.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxMzYyMDgsIm5iZiI6MTczOTEzNTkwOCwicGF0aCI6Ii85MTMzNDAvNDA3ODIyNTkyLWEzMTU4ZmQ1LWEwMjctNDc5OC1hNThlLWJmOGUzMGFmOGQ0Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwOVQyMTE4MjhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iN2M5MGM5ZDg1MDg0ZmI3ZDI1MzZiZDVjN2Y4MzBjYWExNDRhMGM4YTVjNDgyMGU2OWU3ZTkwYTQxN2Y4M2IzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.oSPeV62cFjeU-nRvYeVSYHj3nHfPzE1VfUZzeZqdtzQ)
Block visualization
You can now visualize blocks in the Streamlit app, thanks to @jazzido. Select JSON output and check "show blocks" to get a visualization of how marker parsed the page. Clicking a block shows its HTML.
![image](https://private-user-images.githubusercontent.com/913340/407823086-04c83792-a6a8-429b-b596-5124e8a6b9c1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxMzYyMDgsIm5iZiI6MTczOTEzNTkwOCwicGF0aCI6Ii85MTMzNDAvNDA3ODIzMDg2LTA0YzgzNzkyLWE2YTgtNDI5Yi1iNTk2LTUxMjRlOGE2YjljMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwOVQyMTE4MjhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04ZDg3ZGI2ZTIwODlkZGMyNDg4N2MyZjU1NTVlMDJlOWFmZWNkMTUxOWNjZGI3MTdkNDA1ZTQzMjQ1ZGZhNTA4JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.bwiypfjnVYrPCe61rOGfffzNd4kUC8H6MiRJkL35oko)
Links and references
We fixed a bug with links and references; they now render as one block. You can see the extracted references here:
![image](https://private-user-images.githubusercontent.com/913340/407824416-109a289d-5fd2-4cbb-bf7c-903a32581d51.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxMzYyMDgsIm5iZiI6MTczOTEzNTkwOCwicGF0aCI6Ii85MTMzNDAvNDA3ODI0NDE2LTEwOWEyODlkLTVmZDItNGNiYi1iZjdjLTkwM2EzMjU4MWQ1MS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwOVQyMTE4MjhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZjJlODI5ZGQzOWRiMGEyNTBhNjQ4NjZhZTg2ZjNkYjdkY2VkZjJhZDk4OTRiYmYzZTIwN2I2MDlkZTI5ZWFhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.Heu6Y9_KFn-sQou6RFdcJ-dwSAReQAFWVUX3Vz_6IQc)
Misc bugfixes
- Fixed some bugs with tables and row splitting
- Escaped `$` inside text and tables so that literal dollar signs are not accidentally rendered as equations
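
To illustrate the `$` escaping fix above, here is a minimal Python sketch of the general idea. It is not marker's actual implementation: the helper name `escape_dollars` and the regex are assumptions made for illustration, and the sketch ignores edge cases such as a `$` preceded by an escaped backslash.

```python
import re

# Match a dollar sign that is not already preceded by a backslash.
_UNESCAPED_DOLLAR = re.compile(r"(?<!\\)\$")


def escape_dollars(text: str) -> str:
    """Hypothetical helper: backslash-escape bare `$` characters so a
    Markdown/KaTeX renderer treats them as literal text, not math delimiters."""
    # In the replacement string, `\\` emits one literal backslash.
    return _UNESCAPED_DOLLAR.sub(r"\\$", text)


if __name__ == "__main__":
    print(escape_dollars("Costs $5 and $10"))  # prints: Costs \$5 and \$10
```

Without escaping, a renderer could treat the text between the two dollar signs (`5 and `) as inline math. Dollar signs that are already escaped are left alone, so applying the helper twice gives the same result as applying it once.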
What's Changed
- [streamlit_app] Visualize extracted blocks by @jazzido in #502
- Texify by @VikParuchuri in #513
- Update texify by @VikParuchuri in #514
New Contributors
Full Changelog: v1.3.2...v1.3.3