RAG0017: Clean Knowledge Graph Data #35

Open
3 tasks
tenzin3 opened this issue Sep 16, 2024 · 2 comments

tenzin3 commented Sep 16, 2024

Description

Knowledge graph triples are generated by prompting LLMs. Because of constraints such as context length and the need for better output quality, the unstructured text is processed in smaller chunks rather than all at once. This produces a large amount of fragmented graph data, so collating the fragments and deduplicating or merging similar relations and entities becomes crucial for accuracy and efficiency.

Image

Expected Output

A deduplicated and consolidated knowledge graph with unique entities and relations, ensuring clarity and eliminating redundancy.

Implementation Plan

  • Combine all graphs
  • Filter with string similarity
  • Filter with an embedding similarity check
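
The first two steps of the plan could be sketched roughly as follows. This is only an illustration, not the project's implementation: the triple format `(head, relation, tail)`, the helper names (`normalize`, `canonical`, `combine_graphs`), and the 0.9 threshold are all assumptions, and `difflib.SequenceMatcher` stands in for whichever string-similarity measure is finally chosen.

```python
from difflib import SequenceMatcher

# Assumed triple format: (head entity, relation, tail entity).
Triple = tuple[str, str, str]


def normalize(name: str) -> str:
    """Lowercase, treat '_' and '-' as spaces, and collapse whitespace."""
    cleaned = name.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())


def canonical(name: str, known: list[str], threshold: float) -> str:
    """Return the first already-seen name that is similar enough.

    If nothing matches, register `name` as a new canonical name.
    """
    for candidate in known:
        ratio = SequenceMatcher(None, normalize(name), normalize(candidate)).ratio()
        if ratio >= threshold:
            return candidate
    known.append(name)
    return name


def combine_graphs(chunk_graphs: list[list[Triple]], threshold: float = 0.9) -> list[Triple]:
    """Merge per-chunk triples into one list, folding near-duplicate names."""
    entities: list[str] = []   # canonical entity names seen so far
    relations: list[str] = []  # canonical relation names seen so far
    seen: set[Triple] = set()
    merged: list[Triple] = []
    for graph in chunk_graphs:
        for head, rel, tail in graph:
            triple = (
                canonical(head, entities, threshold),
                canonical(rel, relations, threshold),
                canonical(tail, entities, threshold),
            )
            if triple not in seen:  # drop exact duplicates after canonicalization
                seen.add(triple)
                merged.append(triple)
    return merged
```

The linear scan over known names is quadratic in the number of distinct names, which is fine for a sketch but would probably need blocking or indexing on a large graph.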
@tenzin3 tenzin3 self-assigned this Sep 16, 2024
@tenzin3 tenzin3 converted this from a draft issue Sep 16, 2024

tenzin3 commented Sep 16, 2024

Methods to clean the knowledge graph

  • Perform a string similarity check when collating nodes and relations into one giant knowledge graph.
  • Convert nodes (name: string) into embeddings using the fine-tuned embedding model, then perform a cosine similarity check to find similar nodes.
  • Check for overlapping relations and properties.
  • Keep a human in the loop for the final quality review.
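
A minimal sketch of the cosine similarity check, assuming the node embeddings have already been computed by the fine-tuned model and are available as a plain `dict` from node name to vector. The function names and the 0.9 threshold are hypothetical; the returned pairs are meant as merge candidates for the human review step, not as automatic merges.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_node_pairs(
    embeddings: dict[str, list[float]], threshold: float = 0.9
) -> list[tuple[str, str, float]]:
    """Return (node_a, node_b, score) for every pair at or above the threshold.

    `embeddings` is assumed to map node names to precomputed vectors.
    Pairs are sorted from most to least similar.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Comparing all pairs is quadratic in the number of nodes, so a large entity type would likely need an approximate nearest-neighbour index instead.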

tenzin3 commented Sep 16, 2024

Graph Schema (uncleaned):

Image

Observation:

  • The schema has only a few entity types, 15 in total, which I believe is nicely simplified.
  • Some entity types (Structure and Deity) have few nodes associated with them, while others (Location) have a huge number of nodes.
  • A lot of filtering and cleaning is needed for relations.
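
Observations like these can be checked with a quick count. The sketch below assumes nodes are available as `(name, entity_type)` pairs and edges as `(head, relation, tail)` triples; both formats and the function name are hypothetical.

```python
from collections import Counter


def schema_stats(
    nodes: list[tuple[str, str]],
    triples: list[tuple[str, str, str]],
) -> tuple[Counter, Counter]:
    """Count nodes per entity type and triples per relation name."""
    type_counts = Counter(entity_type for _, entity_type in nodes)
    relation_counts = Counter(relation for _, relation, _ in triples)
    return type_counts, relation_counts
```

Calling `type_counts.most_common()` lists the largest entity types first, and relation names that occur only once or twice in `relation_counts` are natural candidates for merging or removal.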

Projects
Status: IN PROGRESS