A local LLM+RAG chatbot with structured PDF ingestion: Word converts each PDF to docx, then pandoc converts the docx to Markdown, which enables the use of the LangChain ParentDocumentRetriever with MarkdownTextSplitter.

fiammante/ParentMarkDownChatbot

ParentMarkDownChatbot

The PowerShell script uses MS Word to open each PDF file and save it in docx format; Word is an impressive converter that finds and converts tables. Pandoc then converts the docx file to Markdown. ParentDocumentRetriever uses a chain of splitters so that the chunks retrieved into the context are not too small.
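To illustrate the parent/child idea behind ParentDocumentRetriever, here is a minimal standard-library sketch. It is not LangChain's implementation: the function names are hypothetical, headings are assumed to be the parent boundaries, and simple keyword overlap stands in for the embedding similarity a vector store would compute.

```python
import re


def split_markdown_sections(text):
    """Split Markdown into parent chunks, one per heading-led section."""
    # Zero-width split just before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]


def split_children(parent, max_chars=200):
    """Split one parent chunk into small child chunks on paragraph boundaries."""
    children, current = [], ""
    for para in parent.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            children.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        children.append(current)
    return children


def build_index(text, max_chars=200):
    """Return (parents, children); each child is stored with its parent's id."""
    parents = dict(enumerate(split_markdown_sections(text)))
    children = [
        (pid, child)
        for pid, parent in parents.items()
        for child in split_children(parent, max_chars)
    ]
    return parents, children


def retrieve_parent(query, parents, children):
    """Match the query against small children, then return the whole parent."""
    query_words = set(query.lower().split())

    def score(item):
        # Assumption: keyword overlap replaces embedding similarity here.
        return len(query_words & set(re.findall(r"\w+", item[1].lower())))

    best_pid, _ = max(children, key=score)
    return parents[best_pid]
```

Searching over small children keeps the match precise, while returning the parent section gives the LLM enough surrounding context, such as a whole table together with its heading.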

Replace the paths in the PowerShell script and the notebook with your own paths. Run the script first to create the Markdown files from the PDFs.
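For readers who want to see the pandoc step in isolation, the sketch below builds the docx-to-Markdown command in Python. It is an illustration, not the repository's PowerShell script: the helper names and directory arguments are hypothetical, and the plain `markdown` output format is an assumption (the script may use another pandoc Markdown variant).

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, out_dir):
    """Build the pandoc argument list that converts one .docx file to Markdown."""
    docx_path = Path(docx_path)
    md_path = Path(out_dir) / (docx_path.stem + ".md")
    return ["pandoc", str(docx_path), "-f", "docx", "-t", "markdown", "-o", str(md_path)]


def convert_folder(docx_dir, out_dir):
    """Run pandoc on every .docx file in docx_dir (requires pandoc on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for docx_path in sorted(Path(docx_dir).glob("*.docx")):
        subprocess.run(pandoc_command(docx_path, out_dir), check=True)
```

Calling `convert_folder("docx", "markdown")` with your own folder names would write one `.md` file per `.docx` file, ready to be loaded by the notebook.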

Once the Chroma vector store is loaded with documents, start the Streamlit chatbot chatbotparentchild.py with the command "streamlit run chatbotparentchild.py".
