Using Docling Library to create Training Dataset #807

mehfuzh · 2025-01-25T15:59:51Z

Hello-

First of all, good work on the library. I have migrated our existing document parser over to the IBM Docling library so that we use a uniform data structure to process DOCX, PDF and other file types. I use the parsed output to build a dataset that is then used both for RAG and for building an SLM.

You can take a look at the project here

Web:
https://smartloop.ai

Command Line Interface:
https://github.com/smartloop-ai/smartloop

Happy to show a demo. I also have a question: just as with SectionHeaderItem, TableItem, etc., I want to parse images and either embed them in the processed output or save them to disk or blob storage, and then use them as metadata in the vector DB and the training dataset. Is there a best practice that I should follow?
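To make the disk option concrete, here is a minimal sketch of what I have in mind, not a recommendation. All names in it (`save_image_with_metadata`, the metadata keys) are hypothetical. I am assuming the image bytes would come from Docling's picture items; that step is left out, so the sketch uses only the standard library.

```python
import hashlib
from pathlib import Path
from typing import Dict


def save_image_with_metadata(image_bytes: bytes, out_dir: Path,
                             source_file: str, page_no: int) -> Dict[str, object]:
    """Write one image to disk and return a metadata record for it.

    Hypothetical helper: how the bytes are obtained from the parsed
    document is an assumption and is not shown here.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    # Use the content hash as the filename so that a repeated image is
    # stored once and the name stays stable across parser re-runs.
    digest = hashlib.sha256(image_bytes).hexdigest()
    path = out_dir / f"{digest}.png"
    if not path.exists():
        path.write_bytes(image_bytes)

    # Same reference format as page_ref in the parser snippet.
    return {
        "type": "image",
        "image_path": str(path),
        "sha256": digest,
        "page_ref": f"document:{source_file}:page_no:{page_no}",
    }
```

The returned record could be attached to the chunk in the vector DB or written next to the text in the training dataset. For blob storage, the idea would be the same with the local path swapped for an object key.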

Here is a code snippet for the DocX parser:

    def process_items(self, document: DoclingDocument, items: List[TextItem]) -> List[Paragraph]:
        paragraphs = []

        for item in items:
            if item is not None:
                if isinstance(item, SectionHeaderItem):
                    page_no = item.prov[0].page_no if len(item.prov) > 0 else 0
                    page_ref = f"document:{document.origin.filename}:page_no:{page_no}"

                    # page reference, then the header as a markdown heading
                    header = f"{'#' * item.level} {item.text}"
                    paragraphs.append(Paragraph(f"{page_ref}\n{header}", True))

                    # children come back as Paragraph objects, so add them directly
                    if len(item.children) > 0:
                        children = self.get_text_items(document, item.children)
                        paragraphs.extend(self.process_items(document, children))
                # process table or text
                elif isinstance(item, TableItem):
                    # data.grid is row-major and accounts for row and column spans,
                    # unlike the flat table_cells list
                    grid = item.data.grid
                    rows = [[cell.text for cell in row] for row in grid]
                    has_header = bool(rows) and bool(rows[0]) and grid[0][0].column_header
                    table = tabulate(rows, headers='firstrow' if has_header else ())

                    paragraphs.append(Paragraph(table, True))
                elif isinstance(item, TextItem):
                    if len(item.children) > 0:
                        children = self.get_text_items(document, item.children)
                        # process_items already returns Paragraph objects; do not wrap them again
                        paragraphs.extend(self.process_items(document, children))
                    elif len(item.text) > 0:
                        paragraphs.append(Paragraph(item.text, False))

        return paragraphs

Regards
Mehfuz Hossain
Co-founder | smartloop.ai

We (L) open source
