How to construct knowledge graphs from unstructured data sources.
- event: https://live.zoho.com/PBOB6fvr6c
- video: https://youtu.be/B6_NfvQL-BE
- slides: https://derwen.ai/s/2njz#1
Caveat: this repo provides the source code and notebooks which accompany an instructional tutorial; it is not intended as a package library or product.
Set up a virtual environment and install the dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt
```
The full demo app is in `demo.py`:

```bash
python3 demo.py
```
This demo scrapes text sources from articles about the linkage between dementia and regularly eating processed red meat, then produces a graph using NetworkX, a vector database of text chunk embeddings using LanceDB, and an entity embedding model using `gensim.Word2Vec`, where the results are:

- `data/kg.json` -- serialization of the NetworkX graph
- `data/lancedb` -- vector database tables
- `data/entity.w2v` -- entity embedding model
- `kg.html` -- interactive graph visualization in PyVis
A collection of Jupyter notebooks illustrates important steps within this workflow:

```bash
./venv/bin/jupyter-lab
```

- Part 1: `construct.ipynb` -- detailed KG construction using a lexical graph
- Part 2: `chunk.ipynb` -- simple example of how to scrape and chunk text
- Part 3: `vector.ipynb` -- query the LanceDB table for text chunk embeddings (after running `demo.py`); see the sketch after this list
- Part 4: `embed.ipynb` -- query the entity embedding model (after running `demo.py`)
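As a hedged illustration of what Part 3 covers, the sketch below connects to the LanceDB tables written by `demo.py` and runs a vector search. The embedding model shown (`all-MiniLM-L6-v2` via sentence-transformers) is an assumption for illustration; the query must be embedded with whichever model `demo.py` actually used, or the similarity scores will be meaningless:

```python
# Hypothetical vector-search sketch; the embedding model must match the one
# demo.py used to build the table, otherwise results won't be meaningful.
import lancedb
from sentence_transformers import SentenceTransformer

db = lancedb.connect("data/lancedb")
print(db.table_names())  # discover the table name rather than guessing it

table = db.open_table(db.table_names()[0])

# embed the query with an assumed model (swap in the one demo.py used)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = encoder.encode("processed red meat and dementia risk")

# nearest text chunks by vector similarity
print(table.search(query_vector).limit(3).to_pandas())
```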
Objective: Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
These steps define a generalized process, where this tutorial picks up at the lexical graph; a code sketch follows the list:
Semantic overlay:
- load any pre-defined controlled vocabularies directly into the KG
Data graph:
- load the structured data sources or updates into a data graph
- perform entity resolution (ER) on PII extracted from the data graph
- use ER results to generate a semantic overlay as a "backbone" for the KG
Lexical graph:
- parse the text chunks, using lemmatization to normalize token spans
- construct a lexical graph from parse trees, e.g., using a textgraph algorithm
- apply named entity recognition (NER) to extract candidate entities from noun phrase (NP) spans
- apply relation extraction (RE) to infer relations between pairwise entities
- perform entity linking (EL) leveraging the ER results
- promote the extracted entities and relations up to the semantic overlay
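To make the lexical-graph steps concrete, here is a minimal sketch using spaCy for parsing and lemmatization and GLiNER for zero-shot NER. The model names, entity labels, and graph schema are illustrative assumptions, not the exact choices made in `demo.py`:

```python
# Illustrative sketch of the lexical-graph steps; model names, labels, and
# the graph schema are assumptions, not the exact choices in demo.py.
import networkx as nx
import spacy
from gliner import GLiNER

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed
ner_model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = "Regularly eating processed red meat may raise the risk of dementia."
doc = nlp(text)

# build a lexical graph: lemma nodes, edges from dependency-parse arcs
lex_graph = nx.Graph()

for token in doc:
    if not (token.is_punct or token.is_stop):
        lex_graph.add_node(token.lemma_, pos=token.pos_)

for token in doc:
    if token.head is not token and token.lemma_ in lex_graph and token.head.lemma_ in lex_graph:
        lex_graph.add_edge(token.lemma_, token.head.lemma_, dep=token.dep_)

# zero-shot NER pass to surface candidate entities (labels are illustrative)
labels = ["food", "medical condition"]

for entity in ner_model.predict_entities(text, labels):
    print(entity["text"], entity["label"], round(entity["score"], 3))
```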
This approach contrasts with using a large language model (LLM) as a one-size-fits-all "black box" to generate the entire graph automagically. Black-box approaches don't work well for KG practices in regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Better yet, review the intermediate results after each inference step to collect human feedback for curating the KG components, e.g., using Argilla.
KGs used in mission-critical apps such as investigations generally rely on updates, not a one-step construction process. By producing a KG based on the steps above, updates can be handled more effectively. Downstream apps such as Graph RAG for grounding the LLM results will also benefit from improved data quality.
Open source libraries used in this tutorial:

- spaCy: https://spacy.io/
- GLiNER: https://github.com/urchade/GLiNER
- GLiREL: https://github.com/jackboyla/GLiREL
- OpenNRE: https://github.com/thunlp/OpenNRE
- NetworkX: https://networkx.org/
- PyVis: https://github.com/WestHealth/pyvis
- LanceDB: https://github.com/lancedb/lancedb
- gensim: https://github.com/piskvorky/gensim
- pandas: https://pandas.pydata.org/
- Pydantic: https://github.com/pydantic/pydantic
- Pyinstrument: https://github.com/joerick/pyinstrument
Note: you must use the `nre.sh` script to load the OpenNRE pre-trained models before running the `opennre.ipynb` notebook.
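For reference, the sketch below follows the inference pattern documented in the OpenNRE project README. The sentence, the head/tail entity spans, and the `wiki80_cnn_softmax` checkpoint are illustrative assumptions; adjust them to whichever pre-trained models `nre.sh` actually downloads:

```python
# OpenNRE inference sketch, following the upstream project's documented API;
# assumes nre.sh has already downloaded the wiki80_cnn_softmax checkpoint.
import opennre

model = opennre.get_model("wiki80_cnn_softmax")

# head/tail character offsets mark the two candidate entities in the text
result = model.infer({
    "text": "Processed red meat consumption has been linked to dementia.",
    "h": {"pos": (0, 18)},   # "Processed red meat"
    "t": {"pos": (50, 58)},  # "dementia"
})
print(result)  # (predicted_relation, confidence)
```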