The idea of this project is to use agents to improve the RAG pipeline over Python repositories. Why Python repositories? Because I'm going to use the rag-pychunk library to obtain better relationships between the chunks. This project also uses PostgreSQL to store the vectors and documents, but you can implement the same logic in any database. The important part is the parsing from rag-pychunk Nodes to your own model, keeping the important information (as we'll see in section 2, we'll try to get the node by matching some values in the database).
In order to use agents, I've classified the possible queries into different categories:
- Simple vs Complex.
- If Simple: Particular vs General.
What decides which category the query falls into is the number of subjects. What is a subject?
To classify the query, I've developed two different types of agents, which return different tools depending on the task. Both agents have been "trained" through few-shot prompt engineering, since the task is "straightforward", although we'll see that it can get tricky depending on the query. The agents are:
- Simple vs Complex Agent. This agent will classify the query of the user into one of these categories:
Simple: Zero or one subject present in the question. Example: How does the function x work?
Complex: More than one subject. Example: What are the differences between Class A and Class B?
Tool returned:
When classifying between simple and complex, two kinds of inconsistency can occur:
- One or zero subjects are detected, yet the agent classifies the question as complex.
- Multiple subjects are detected, but the agent classifies the question as simple.
In both of these cases, a retry will be applied. We want to get a coherent answer, which means:
- Simple $\implies$ 0 or 1 subject.
- Complex $\implies$ multiple subjects.
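This is controlled with pydantic validators:

```python
from typing import Any, Dict, Optional, Set

from pydantic import BaseModel


class Output(BaseModel):
    query: str
    reasoning: str
    question_type: str
    subject: Set[str | None]
    valid: Optional[bool] = True

    # Model-level pydantic validator (the decorator is not shown here).
    def coherence_between_question_type_and_length_of_subjects(cls, values: Dict[str, Any]) -> Dict[str, Any]:
        subject, question_type = values.get('subject'), values.get('question_type')
        # Unknown question types are not checked for coherence.
        if question_type not in ('simple', 'complex'):
            return values
        # Mark the output as invalid when the label and the number of subjects disagree.
        if (len(subject) <= 1 and question_type != 'simple') or (len(subject) > 1 and question_type != 'complex'):
            values['valid'] = False
        return values
```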
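The coherence rule and the retry can be sketched without any dependency. This is only an illustration under stated assumptions: `ClassifierOutput`, `classify_with_retry`, the `classify` callable and `max_retries` are hypothetical names, and a plain dataclass stands in for the pydantic model used in the project.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class ClassifierOutput:
    """Hypothetical stand-in for the agent output (no pydantic involved)."""
    query: str
    question_type: str  # expected to be 'simple' or 'complex'
    subject: Set[str] = field(default_factory=set)
    valid: bool = True

    def __post_init__(self) -> None:
        # Unknown question types are not checked for coherence.
        if self.question_type not in ("simple", "complex"):
            return
        # 0 or 1 subject must go with 'simple', several subjects with 'complex'.
        expected = "simple" if len(self.subject) <= 1 else "complex"
        self.valid = self.question_type == expected


def classify_with_retry(
    query: str,
    classify: Callable[[str], ClassifierOutput],
    max_retries: int = 3,
) -> Optional[ClassifierOutput]:
    """Call `classify` until the output is coherent or the retries run out."""
    for _ in range(max_retries):
        output = classify(query)
        if output.valid:
            return output
    return None
```

Returning `None` after the last retry is a design choice of this sketch; the caller decides how to handle a query that never gets a coherent classification.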
- General vs Particular Agent. This agent takes the output from the first agent only if the question type was simple, and classifies the question into one of two categories:
Particular: The question involves only the subject. Example: How does the function x work?
General: The question is formulated in such a way that it does not concern only the subject. Example: Will my changes break anything?
Tool returned: (for particular) or GeneralRetriever (for general)
When the agent pipeline returns a tool, I'll use that tool to obtain the nodes that will be fed to the LLM. Let's see the whole process.
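The routing performed by the two agents can be sketched as a small function. The tool names and the two classifier callables below are hypothetical placeholders for illustration; the only thing taken from the text is that complex and particular queries share one retriever while general queries use the general one.

```python
from typing import Callable, Set, Tuple

# Hypothetical tool names, used only for illustration.
SUBJECT_TOOL = "subject_retriever"
GENERAL_TOOL = "general_retriever"


def choose_tool(
    query: str,
    classify_complexity: Callable[[str], Tuple[str, Set[str]]],
    classify_scope: Callable[[str, Set[str]], str],
) -> str:
    """Return the name of the tool that should handle `query`."""
    # First agent: simple vs complex, plus the detected subjects.
    question_type, subjects = classify_complexity(query)
    if question_type == "complex":
        return SUBJECT_TOOL
    # Second agent: only consulted for simple questions.
    scope = classify_scope(query, subjects)
    return GENERAL_TOOL if scope == "general" else SUBJECT_TOOL
```

In practice both callables would wrap LLM calls; passing them in as parameters keeps the routing logic testable without a model.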
To chunk the different files I'll be using rag-pychunk, a library I've developed that leverages Python syntax to improve the chunking and the relationships between those chunks. You can learn more about it here: rag-pychunk
Node Model
```python
class Node(Base):
    __tablename__ = "node"

    id = Column(UUID(as_uuid=True), primary_key=True, index=True, default=uuid.uuid4)
    node_type = Column(Enum(NodeType), nullable=False)
    file_id = Column(UUID, ForeignKey("", ondelete='CASCADE'), nullable=False)
    parent_node_id = Column(UUID(as_uuid=True), ForeignKey(""), nullable=True)
    previous_node_id = Column(UUID(as_uuid=True), ForeignKey(""), nullable=True)
    next_node_id = Column(UUID(as_uuid=True), ForeignKey(""), nullable=True)
    text = Column(Text, nullable=False)
    embedding = Column(Vector(384))
    hash = Column(String, nullable=False, index=True)
    node_relationships = Column(JSONB)

    parent = relationship("Node", foreign_keys=[parent_node_id],
                          backref=backref("children", cascade="all, delete-orphan"))
    previous = relationship("Node", foreign_keys=[previous_node_id], remote_side=[id], backref=backref("next_node"), uselist=False)
    next = relationship("Node", foreign_keys=[next_node_id], remote_side=[id], backref=backref("previous_node"), uselist=False)
    file = relationship("File", back_populates="nodes", foreign_keys=[file_id])
    node_metadata = relationship("NodeMetadata", back_populates="node", cascade="all, delete-orphan")
```
NodeMetadata Model
```python
class NodeMetadata(Base):
    __tablename__ = "node_metadata"

    node_id = Column(UUID(as_uuid=True), ForeignKey("", ondelete="CASCADE"), primary_key=True)
    node_metadata = Column(JSONB)

    node = relationship("Node", foreign_keys=[node_id], back_populates="node_metadata")
```
In the node_metadata column of the NodeMetadata model, some key-value pairs will be stored for each node. For example:
{'method_name': <name of the method if the node is a method>}
{'class_name': <name of the class if the node is a class>}
{'function_name': <name of the function if the node is a function>}
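To make these key-value pairs concrete, here is a minimal sketch that derives them from source code with the standard `ast` module. It is not how rag-pychunk produces its metadata (that implementation is not shown here); `extract_name_metadata` is a hypothetical helper that only looks at top-level definitions and the methods of top-level classes.

```python
import ast
from typing import Dict, List


def extract_name_metadata(source: str) -> List[Dict[str, str]]:
    """Return one metadata dict per class, method or function found in `source`."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    metadata: List[Dict[str, str]] = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            metadata.append({"class_name": node.name})
            # Functions defined directly in the class body are methods.
            for item in node.body:
                if isinstance(item, function_types):
                    metadata.append({"method_name": item.name})
        elif isinstance(node, function_types):
            metadata.append({"function_name": node.name})
    return metadata
```

For a file that defines a class `A` with a method `run` and a top-level function `hello`, the helper yields the three dicts in source order.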
The node_relationships column of the Node model stores the relationships of each node in the form:
{node_id: [lines in which the node appears]}
where node_id is the id of a node that is referenced inside the current node. For example:
text_of_node_1 = "def hello(): ..."
text_of_node_2 = 'def bye(): hello(); print("bye!")'
relationships_of_node_2 = {'id of node 1': [line 1]}
These two columns will be leveraged to improve the retrieval, as we'll see in the next section.
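As an illustration of the `{node_id: [lines]}` shape, the sketch below scans the text of one node for calls to other known names. It is a simplified assumption of how such a mapping could be built (plain regular expressions, calls only); `find_relationships` and `names_to_ids` are hypothetical and do not describe rag-pychunk's internals.

```python
import re
from typing import Dict, List


def find_relationships(node_text: str, names_to_ids: Dict[str, str]) -> Dict[str, List[int]]:
    """Map the id of each referenced node to the 1-based lines where it is called."""
    relationships: Dict[str, List[int]] = {}
    for line_number, line in enumerate(node_text.splitlines(), start=1):
        for name, node_id in names_to_ids.items():
            # Match `name(` as a whole word, ignoring its own `def name(` line.
            if re.search(rf"(?<!def )\b{re.escape(name)}\s*\(", line):
                relationships.setdefault(node_id, []).append(line_number)
    return relationships
```

A real implementation would more likely walk the syntax tree, since a regular expression also matches names inside strings and comments.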
What's a tool?
This retriever will be used when the query is identified as particular or complex. Depending on each case, the retrieval will be done differently.
For each subject, we'll try to get the Node in which the subject appears directly from the database, since in the column node_metadata we are storing either the method, function or class name of the node:
for subject in subjects: # complex case --> multiple subjects
# we try to find the node in case it is a method, function or class by looking it up in the database directly
node_of_subject = self._db.query(Node).join(NodeMetadata, NodeMetadata.node_id ==\
NodeMetadata.node_metadata['additional_metadata']['function_name'].astext == subject,
NodeMetadata.node_metadata['additional_metadata']['method_name'].astext == subject,
NodeMetadata.node_metadata['additional_metadata']['class_name'].astext == subject
If this fails, we'll get the nodes via similarity search. Either way, we'll get the relationships of the retrieved nodes.
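The "exact lookup first, similarity search as a fallback" strategy can be sketched over in-memory data. Everything here is an assumption for illustration: nodes are plain dicts with a `metadata` key, and `similarity_search` is a hypothetical callable standing in for the vector search.

```python
from typing import Callable, Dict, List, Optional

# Metadata keys that may hold the name of the subject.
NAME_KEYS = ("function_name", "method_name", "class_name")


def find_node_by_name(nodes: List[Dict], subject: str) -> Optional[Dict]:
    """Return the first node whose metadata names `subject`, if any."""
    for node in nodes:
        metadata = node.get("metadata", {})
        if any(metadata.get(key) == subject for key in NAME_KEYS):
            return node
    return None


def retrieve_for_subject(
    nodes: List[Dict],
    subject: str,
    query: str,
    similarity_search: Callable[[str], List[Dict]],
) -> List[Dict]:
    """Prefer the exact metadata match; otherwise fall back to similarity search."""
    node = find_node_by_name(nodes, subject)
    if node is not None:
        return [node]
    return similarity_search(query)
```

The exact match is cheaper and unambiguous, which is why it is tried before any embedding is computed.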
In case we have multiple subjects and the node can't be obtained directly from the database we want to maximise the probability of getting the correct nodes via similarity search. In order to do so, we modify the query like this:
query_to_embed = (query.replace(subject, "") if subject in query else query.lower().replace(subject, "")) if len(subjects) > 1 else query
That way, only one subject name appears in each query: for each subject, the other subjects are removed from the query, which decreases the similarity with the nodes of those other subjects.
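A minimal sketch of this idea, building one query per subject with the other subjects stripped out. `queries_per_subject` is a hypothetical helper, and the case-insensitive removal and whitespace cleanup are choices made for this example rather than the project's exact behaviour.

```python
import re
from typing import Dict, Iterable


def queries_per_subject(query: str, subjects: Iterable[str]) -> Dict[str, str]:
    """For each subject, return the query with every other subject removed."""
    subjects = list(subjects)
    result: Dict[str, str] = {}
    for subject in subjects:
        text = query
        for other in subjects:
            if other != subject:
                # Remove the other subject regardless of its casing in the query.
                text = re.sub(re.escape(other), "", text, flags=re.IGNORECASE)
        # Collapse the whitespace left behind by the removals.
        result[subject] = " ".join(text.split())
    return result
```

Note that this simple approach misbehaves when one subject is a substring of another, so such cases would need extra care.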
With this tool we want to answer queries like: "Will changing X break something?", "What will happen if I change this parameter of the function X to...?", "Are there any errors in the repo?"
We need to get the proper node and everything that depends on it. Again, we'll try to get the node directly from the database based on the subject name and, if that does not succeed, via similarity search. Besides that, we also want to get all of the nodes whose node_relationships column contains the id of this node. Why? Because that means the retrieved node appears in those nodes in some way, so changing the retrieved node would affect them as well.
all_nodes_related_to_this_node = self._db.query(Node).filter(Node.node_relationships.has_key(str(
Where is the retrieved node.
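The same reverse lookup can be illustrated in memory: a node depends on the retrieved node when the retrieved node's id is a key of its relationships mapping. The dict-based node layout and the name `nodes_depending_on` are assumptions made for this sketch.

```python
from typing import Dict, List


def nodes_depending_on(nodes: List[Dict], node_id: str) -> List[Dict]:
    """Return the nodes whose relationships mention `node_id` as a key."""
    return [
        node
        for node in nodes
        # A missing or null relationships column means no dependencies.
        if node_id in (node.get("node_relationships") or {})
    ]
```

These are the nodes that could break if the retrieved node changes, so they are passed to the LLM together with it.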
If no subjects are detected, all of the nodes are retrieved and need to be parsed one by one (maybe with their relationships, to give the LLM more context) to answer the query for each and every one of them.
This retriever is not available for the agent to return. We'll need to use it as a postprocessor.
With this "retriever" the relationships will be filtered to keep only those similar to the retrieved nodes, given a threshold:
def filter_relationships(self, threshold: float) -> Dict[str, List[Node]] | Dict:
filtered_relationships = {}
for node in self._nodes:
for rel, rel_nodes in self._relationships.items():
filtered_nodes = []
for rel_node in rel_nodes:
similarity = 1 - np.cos(
if similarity > threshold:
filtered_relationships[rel] = filtered_nodes
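A self-contained sketch of threshold filtering by cosine similarity follows. It works on raw embedding vectors instead of Node objects and computes the similarity with the standard library; the function names, the data layout and the choice to drop relationships with no surviving node are assumptions of this example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_threshold(
    node_embedding: Sequence[float],
    relationships: Dict[str, List[Sequence[float]]],
    threshold: float,
) -> Dict[str, List[Sequence[float]]]:
    """Keep only the related embeddings that are similar enough to the node."""
    filtered: Dict[str, List[Sequence[float]]] = {}
    for rel, rel_embeddings in relationships.items():
        kept = [e for e in rel_embeddings if cosine_similarity(node_embedding, e) > threshold]
        if kept:  # relationships with no surviving node are dropped
            filtered[rel] = kept
    return filtered
```

A higher threshold gives the LLM less but more focused context; the value 0.25 that appears in the environment example would be a natural starting point.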
Does it improve the answer?
Query: How do the methods _check_common_parent_nodes, _check_relationships_of_retrieved_nodes, return_nodes_after_apply_threshold_filter and return_nodes_with_score_after_apply_threshold_filter work together to improve the result of the function query_vector_database?
Answer:
I'd be happy to help you understand how these methods work together!
To answer your question, let's focus on what these methods do:
1. _check_common_parent_nodes(): This method seems to analyze the retrieved nodes and calculates some frequency values
(parent_node_freq and file_node_freq) based on their relationships.
2. _check_relationships_of_retrieved_nodes(depth=3): This method likely checks the relationships between the retrieved nodes, possibly up to
a certain depth (in this case, 3).
3. return_nodes_after_apply_threshold_filter(): This method appears to filter the retrieved nodes based on some threshold criteria,
perhaps related to their scores.
4. return_nodes_with_score_after_apply_threshold_filter(): Similarly, this method filters the nodes with their corresponding scores, likely applying the
same threshold as above.
Now, let's consider how these methods work together:
The _check_common_parent_nodes() and _check_relationships_of_retrieved_nodes() methods seem to analyze the relationships between the retrieved nodes. The
results from these methods (parent_node_freq and file_node_freq) are then used in some way to improve
the result of the query_vector_database function.
The return_nodes_after_apply_threshold_filter() and return_nodes_with_score_after_apply_threshold_filter() methods, on the other hand, seem to refine the retrieved nodes
based on their scores. The results from these methods are also used to improve the
final result.
In summary, these methods work together by:
- Analyzing relationships between retrieved nodes
- Refining nodes based on their scores and threshold criteria
- Improving the overall result of the query_vector_database function
By combining these methods, the function can likely improve its accuracy in retrieving relevant nodes
for a given query.
Clearly! As we can see, without the relationships the LLM just guesses what the functions do, whereas when the relationships are used it actually has the definitions of the functions, so it does not have to guess anymore!
Another agent could be used here to decide if the relationship is relevant to the query or not.
: here you can find end-to-end explanations of how this multi-agent project is implemented.
0_subjects.ipynb: an example of a question with 0 subjects answered.
relationship_retriever_example.ipynb: to showcase how the relationships can be leveraged to improve the results.
- Retrievers: The implementation of all of the retrievers.
- Agents: LLMs utils, including the different agents.
- Prompts: The prompts used to "train" the agents and to answer the queries.
- You'll need llama3 running.
- In the terminal: uvicorn app.main:app --reload --host --port 3000 (it is not included in the Docker setup because of incompatibilities with rag-pychunk)
- docker compose up -d
- Go to: and upload your Python code in a zip file.
- Start chatting with your repository!
When you upload your code, all the files will be saved in a folder called user_code inside the folder of this cloned repository. The files keep the structure of your zip file (the operation is just an unzip).
If you want to update your code, you don't need to make the changes and upload the zip file again! Just make the changes inside the user_code folder, save them and click on Update your database.
The changed files will be rewritten in your database and new relationships will be computed!
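One way such an update could detect which files changed is to compare content hashes. This is purely an assumption for illustration (the project's actual update logic is not shown here): `changed_files` and the `stored_hashes` mapping of relative path to hash are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(root: Path, stored_hashes: Dict[str, str]) -> List[str]:
    """Return the Python files under `root` that are new or differ from the stored hash."""
    changed: List[str] = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root).as_posix()
        if stored_hashes.get(relative) != file_hash(path):
            changed.append(relative)
    return changed
```

Only the files returned by such a check would need to be re-chunked and have their relationships recomputed.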
Env example:
export DB_USER='<user>'
export DB_PASSWORD='<password>'
export DB_NAME='<db-name>'
export USER_CODE_DIRECTORY='./user_code'
export FILES_STRUCTURE_FOLDER='./files_structure'
export NAMES_FILE='./name.txt'
export RELATIONSHIPS_FILE='./node_relationships.txt'
export THRESHOLD='0.25'
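These variables are plain strings in the environment, so values such as THRESHOLD have to be converted before use. The sketch below shows one possible way to read a few of them; the `Settings` class, `load_settings` and the fallback defaults are hypothetical and not part of the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    db_user: str
    db_name: str
    user_code_directory: str
    threshold: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build the settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        db_user=env["DB_USER"],  # required: raises KeyError when missing
        db_name=env["DB_NAME"],
        user_code_directory=env.get("USER_CODE_DIRECTORY", "./user_code"),
        threshold=float(env.get("THRESHOLD", "0.25")),  # stored as text, used as float
    )
```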
If for some reason you can't install pgvector on the Docker image for Postgres, remove the pgvector part of the Dockerfile and create the extension manually.
FROM postgres:16-alpine
RUN apk update && apk add --no-cache \
build-base \
postgresql-dev \
wget \
unzip \
Then follow the pgvector instructions to create the extension on the database you want.