Add metadata to pgvector #912

Open, wants to merge 7 commits into main

@marcomd marcomd commented Feb 1, 2025

Description

I added a jsonb field called metadata to pgvector, so you can now attach metadata to a record in order to catalog it.

  • Added the metadata field to the db
  • Updated the pgvector class to:
    • Store metadata in the new db field
    • Enrich the embeddings with the metadata
    • Accept a metadata_filter parameter in similarity_search, which excludes results based on their metadata
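
The description does not show how the metadata is folded into the embedded text. One plausible approach, sketched here under that assumption, is to append the metadata as JSON to the text before it is sent to the embedding model. The helper name `text_with_metadata` is made up for this example and is not part of langchainrb; the actual change in this PR may combine the two differently.

```ruby
require "json"

# Hypothetical helper (not taken from this PR): append the metadata to the
# text so that the embedding model sees both the content and its metadata.
def text_with_metadata(text, metadata)
  return text if metadata.nil? || metadata.empty?

  "#{text}\n#{metadata.to_json}"
end

enriched = text_with_metadata(
  "Accountant Ugo Fantozzi is a humble and unfortunate employee of Megaditta.",
  {title: "Fantozzi", author: "Paolo Villaggio", year: 1975}
)
puts enriched
```

With this kind of enrichment a query such as "Author Paolo Villaggio" can match on the author name even when the content itself never mentions it.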

How to test

Let's create a list of documents to do some testing.

# I built this hash from JSON, so forgive the unidiomatic syntax (the quoted keys still become symbols)
documents = [
  {
    "content": "I am me and who the hell are you? The author of this phrase was a marquis, a very lazy person. His main pastime, which makes him famous throughout the city, consists of countless pranks and pranks, in which he is always accompanied by his faithful Ricciotto, and of which his aristocratic family, made up of extravagant characters closed to the outside world, often becomes the victim.",
    "metadata": {
      "title": "I am the only one who exists",
      "author": "Onofrio Del Grillo",
      "year": 1981
    }
  },
  {
    "content": "Accountant Ugo Fantozzi is a humble and unfortunate employee of Megaditta, servile towards his superiors and ignored by his colleagues, so much so that he was mistakenly walled up in the old company offices for eighteen days without anyone noticing.",
    "metadata": {
      "title": "Fantozzi",
      "author": "Paolo Villaggio",
      "year": 1975
    }
  },
  {
    "content": "The story of a woman who, after a long period of illness, decides to leave her husband and children to go to the sea.",
    "metadata": {
      "title": "The sea",
      "author": "John Banville",
      "year": 2005
    }
  },
  {
    "content": "A cat and a dog, who are best friends, decide to go on a journey together. They face many challenges along the way, but their friendship helps them overcome all obstacles.",
    "metadata": {
      "title": "The Journey",
      "author": "Mark Twain",
      "year": 1884
    }
  },
  {
    "content": "A young man named John travels to a distant land in search of a legendary treasure. Along the way, he encounters many dangers and makes new friends who help him on his quest.",
    "metadata": {
      "title": "The Treasure Hunt",
      "author": "Robert Louis Stevenson",
      "year": 1883
    }
  },
  {
    "content": "A group of friends go on a camping trip in the mountains. They have a great time hiking, fishing, and telling stories around the campfire. But when a storm hits, they must work together to survive.",
    "metadata": {
      "title": "The Camping Trip",
      "author": "Jack London",
      "year": 1903
    }
  }
]

require "json"
require "langchain"

model = "gemma2"
llm = Langchain::LLM::Ollama.new(
  url: "http://localhost:11434",
  default_options: {temperature: 0.1, chat_model: model, completion_model: model, embedding_model: model}
)

# Create a database in Postgres and replace the connection details below with your own
vector_client = Langchain::Vectorsearch::Pgvector.new(
  url: "postgres://username:password@localhost:5432/langchain_rag",
  index_name: "documents",
  llm: llm
)

# vector_client.destroy_default_schema
vector_client.create_default_schema

# Load the documents into pgvector
documents.each do |document|
  vector_client.add_texts(texts: [document[:content]], metadata: document[:metadata])
end

results = vector_client.similarity_search(query: "Author Paolo Villaggio", k: 2)

# Print each result followed by the author and year from its metadata
results.each do |result|
  metadata = JSON.parse(result.metadata)
  puts "#{result.content} \n- #{metadata['author']} (#{metadata['year']})"
end

I am me and who the hell are you? The author of this phrase was a marquis, a very lazy person. His main pastime, which makes him famous throughout the city, consists of countless pranks and pranks, in which he is always accompanied by his faithful Ricciotto, and of which his aristocratic family, made up of extravagant characters closed to the outside world, often becomes the victim. 
- Onofrio Del Grillo (1981)
Accountant Ugo Fantozzi is a humble and unfortunate employee of Megaditta, servile towards his superiors and ignored by his colleagues, so much so that he was mistakenly walled up in the old company offices for eighteen days without anyone noticing. 
- Paolo Villaggio (1975)

As you can see, the metadata is used to generate the embeddings, although the ranking is not what I expected 🤔
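
For context, a similarity search orders rows by how close their embedding is to the query embedding, so a name that appears only in the metadata is just one signal among everything else in the embedded text. The toy sketch below illustrates the ranking step with cosine similarity; the three-dimensional vectors and the `doc_a`/`doc_b` names are made up for illustration and are not real embeddings, and the distance operator actually used depends on how the vector store is configured.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# Toy vectors, invented for this example; real embeddings have far more dimensions.
query = [1.0, 0.0, 0.0]
rows = {
  "doc_a" => [0.9, 0.1, 0.0],
  "doc_b" => [0.5, 0.5, 0.5]
}

# Highest similarity first, which is the order a top-k search would return.
ranked = rows.sort_by { |_name, vector| -cosine_similarity(query, vector) }
ranked.each do |name, vector|
  puts "#{name}: #{cosine_similarity(query, vector).round(3)}"
end
```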

results = vector_client.similarity_search(query: "Year 2005", k: 2)

results.each do |result|
  metadata = JSON.parse(result.metadata)
  puts "#{result.content} \n- #{metadata['author']} (#{metadata['year']})"
end

A cat and a dog, who are best friends, decide to go on a journey together. They face many challenges along the way, but their friendship helps them overcome all obstacles. 
- Mark Twain (1884)
The story of a woman who, after a long period of illness, decides to leave her husband and children to go to the sea. 
- John Banville (2005)

I also added a new metadata_filter parameter, which keeps only the results whose metadata matches:

results = vector_client.similarity_search(query: "Year 2005", k: 2, metadata_filter: {year: 2005})

results.each do |result|
  metadata = JSON.parse(result.metadata)
  puts "#{result.content} \n- #{metadata['author']} (#{metadata['year']})"
end

The story of a woman who, after a long period of illness, decides to leave her husband and children to go to the sea. 
- John Banville (2005)
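
The filter semantics shown above can be sketched in plain Ruby, assuming that a row is kept only when its metadata contains every key/value pair of the filter. In Postgres the same idea is commonly expressed with the jsonb containment operator, for example `metadata @> '{"year": 2005}'::jsonb`, though this PR may build its query differently. The `Row` struct, the `filter_by_metadata` helper and the abbreviated contents are hypothetical stand-ins for this example.

```ruby
require "json"

# Hypothetical in-memory stand-in for rows stored in the table;
# metadata is kept as a JSON string, as in the examples above.
Row = Struct.new(:content, :metadata)

# Keep only rows whose metadata contains every key/value pair of the filter.
# Keys are compared as strings because JSON object keys are always strings.
def filter_by_metadata(rows, metadata_filter)
  wanted = metadata_filter.transform_keys(&:to_s)
  rows.select do |row|
    stored = JSON.parse(row.metadata)
    wanted.all? { |key, value| stored.key?(key) && stored[key] == value }
  end
end

rows = [
  Row.new("The story of a woman who decides to go to the sea.", {title: "The sea", year: 2005}.to_json),
  Row.new("A cat and a dog decide to go on a journey together.", {title: "The Journey", year: 1884}.to_json)
]

filter_by_metadata(rows, {year: 2005}).each { |row| puts row.content }
```

Filtering before ranking means that k limits the rows that already satisfy the filter, which is why the last query returns a single result even with k: 2.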
