Using Docling Library to create Training Dataset #807
mehfuzh started this conversation in Show and tell
Hello,
First of all, good work on the library. I have migrated our existing document parser over to the IBM Docling library so that we use a uniform data structure to process DOCX, PDF, and other file types. I use this information to build a dataset that is then used both for RAG and for building an SLM.
You can take a look at the project here
Web:
https://smartloop.ai
Command Line Interface:
https://github.com/smartloop-ai/smartloop
Happy to show a demo. However, here is my question: alongside items like SectionItem, TableItem, etc., I want to parse images and either embed them in the processed output or save them to disk or blob storage, and then reference them as metadata in the vector DB and the training dataset. Is there a best practice that I should follow?
Here is a code snippet for the DocX parser:
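Separately from the DocX parser, and only to illustrate the image question, one possible pattern is to write each extracted image to disk under a content hash and keep just the path and hash in the chunk metadata that goes into the vector DB or training record. The sketch below is standard-library Python only and makes no Docling calls; how the image bytes are obtained from a Docling picture item is left out. The names `store_image` and `attach_image` and the metadata keys (`image_path`, `image_sha256`, `images`) are hypothetical.

```python
import hashlib
from pathlib import Path


def store_image(image_bytes: bytes, out_dir: Path, ext: str = "png") -> dict:
    """Write image bytes under a content-hash filename and return a metadata record."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.{ext}"
    if not path.exists():
        # Identical images (e.g. a logo repeated on every page) are stored once.
        path.write_bytes(image_bytes)
    return {
        "image_path": str(path),
        "image_sha256": digest,
        "image_size_bytes": len(image_bytes),
    }


def attach_image(chunk_metadata: dict, image_record: dict) -> dict:
    """Return a copy of the chunk metadata with one more image reference appended."""
    merged = dict(chunk_metadata)
    merged["images"] = list(chunk_metadata.get("images", [])) + [image_record]
    return merged
```

Keeping only a reference in the metadata keeps vector DB payloads small, and the hash-based filename makes re-processing the same document idempotent. Swapping the local directory for a blob store would only change the body of `store_image`.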
Regards
Mehfuz Hossain
Co-founder | smartloop.ai
We ❤ open source