-
David, I think this sounds great. Thanks for laying out your thinking on this. You filled in a few pieces that I was wondering about. I've started thinking about this and have some ideas that could inform a start. First, to react to a few things you said: I really like the idea of being able to find something with only a vague memory of it, like your example of "What was that ChatGPT clone built on Rails that I found on reddit?" I also agree we should do hybrid search. That jibes with the advice I've read: good keyword search has some key advantages, vector search has others, and you get the best results when you combine the two. I like your idea for creating feeds from different sources. I'm inclined to get the core models & search capabilities built out first, and then focus on how to effectively populate the information. But I think that's a promising approach. Let me break up some additional thoughts across a few posts, just to help facilitate threading.
-
A few high-level architecture points: I think we should get this working for OpenAI first before worrying about other LLMs. I think we should use Postgres as the vector database rather than adding another dependency. I think the vector search should be done using langchainrb_rails, and the keyword/text search should use pg_search. I've not implemented either of these so I'm not certain they're the best, but based on some asking around over the past few months I've heard them recommended a few times, and I did some reading on each and they look solid.

And I think, from the beginning, we should design this to work with multiple types of things that can be remembered. There are three that I have in mind:

1. Bookmarked content. I'm inclined to jam everything you listed into a model such as Posts because they all share a similar schema (title, some metadata like date, and a big body of text).
2. Conversations that you're having with the AI. "A few weeks ago we were talking about an issue I had configuring redis. What was the solution we came to?"
3. Emails. One of the main things I want to add to the app soon is Emails (the AI will help me manage my inbox, everything coming in, and will help compose/reply to emails).

Let's focus on Post from the start, but let's architect it with an eye towards expanding to Conversations soon after and eventually Email. Keeping multiple models in mind will help us architect this thing in a more generic way. I think this can be as simple as creating a Post model with a "search" concern, so we're set up to include that same concern in Conversations and Email later.
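To make that concrete, here's a rough sketch of what the concern could look like, going off the documented usage of pg_search and langchainrb_rails. I haven't run this; the scope name and the title/body columns are placeholders, not decisions:

```ruby
# app/models/concerns/searchable.rb -- a sketch, not a final implementation
module Searchable
  extend ActiveSupport::Concern
  include PgSearch::Model

  included do
    # Keyword/full-text search via pg_search (assumes title and body columns)
    pg_search_scope :keyword_search,
      against: { title: "A", body: "B" },
      using: { tsearch: { prefix: true } }

    # Vector search via langchainrb_rails; needs its vectorsearch setup/migration
    # and an embedding provider (OpenAI to start)
    vectorsearch
    after_save :upsert_to_vectorsearch
  end
end

class Post < ApplicationRecord
  include Searchable
end
```

If that shape holds up, Conversation and Email can simply include the same concern later.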
-
Is there one core model or are there multiple? I think we can have a single one that subsumes bookmarks from Reddit, and maybe Facebook, Instagram, and Twitter, too. I also bookmark regularly on TikTok and YouTube. Those are probably a bit different since they're video content and may need extra considerations like extracting the transcript and capturing visual information ("it was an animated video about ..."). I don't think we should design for videos yet, but maybe the initial model could eventually include them, or maybe not. You also mentioned Evernote / OneNote (I use Apple Notes, but it has no API, so I may switch to Obsidian or something else). "Post" doesn't feel like quite the right name for that, but the schema feels basically identical: title, body, created, etc. We could push for a name one degree more abstract like "Content", although that feels too vague and I'd expect it to subsume Conversations and Email, which I don't think it should. "Entry" or "Record" are both even more abstract.
We need to figure out what the schema is. In order to do that, we should be clear on how things will be retrieved. I think we should consider making things retrievable through three different strategies:
Maybe the schema for Post:
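Just to make it concrete, a rough sketch of what the migration could look like; the column names here are guesses pulled from this thread, not a settled schema:

```ruby
# Rough sketch only -- columns (site, posted_at, etc.) are guesses, not decided
class CreatePosts < ActiveRecord::Migration[7.1]
  def change
    create_table :posts do |t|
      t.references :user, null: false, foreign_key: true
      t.string   :title
      t.string   :url        # original location; the subreddit is inherent here
      t.string   :site       # or "outlet"? e.g. "reddit.com", "wsj.com"
      t.text     :body       # the full extracted text we index for search
      t.datetime :posted_at  # when the original content was published
      t.timestamps           # when we captured it
    end
  end
end
```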
Where would the subreddit of a post be saved? It's inherent in the URL, so maybe that's enough. Or maybe we put it in "site", actually. There is probably a better name than "site"; maybe "outlet" is slightly more generic? WSJ and r/rails both feel like news outlets.
-
Assuming the thinking above ^ is roughly on track, the implementation plan may be:
I started stubbing out those first two tasks to add some thinking on implementation.
-
Came across this approach to using Postgres for RAG: https://github.com/postgresml/korvus
-
Here's an idea I've been pondering that could be an interesting & powerful feature for hostedgpt -- I'd love to help contribute and make this a reality!
Background: I'm browsing the web & social media and saving/bookmarking/interacting with blogs, articles, and other content that I find personally interesting & engaging.
User Story:
Sample Example: "What was that ChatGPT clone built on Rails that I found on reddit?"
I need to spend some time reviewing hostedgpt's design & architecture to see how this could fit in, but at a high level, here's what I'm thinking:
Then each Article is available for retrieval-augmented generation (RAG), ideally using something like what Microsoft calls hybrid search - using both vector + free-text search and merging the results. (I'm not very well-versed in this space yet so I don't have a lot of opinions / experience on best approach for retrieval & answer generation).
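Just to make the merging idea concrete (I'm not attached to this at all), something naive like Reciprocal Rank Fusion over the two result sets might be a starting point. `keyword_search` / `similarity_search` below are placeholders for whatever keyword and vector search methods we end up with:

```ruby
class Article < ApplicationRecord
  # Naive hybrid search via Reciprocal Rank Fusion (RRF); keyword_search and
  # similarity_search stand in for the actual keyword and vector searches.
  def self.hybrid_search(query, k: 60, limit: 10)
    keyword_ids = keyword_search(query).limit(50).pluck(:id)
    vector_ids  = similarity_search(query, k: 50).map(&:id)

    # Each result list contributes 1 / (k + rank) to a document's fused score
    scores = Hash.new(0.0)
    [keyword_ids, vector_ids].each do |ids|
      ids.each_with_index { |id, rank| scores[id] += 1.0 / (k + rank + 1) }
    end

    # Return the top articles in fused ranking order
    top_ids = scores.max_by(limit) { |_, score| score }.map(&:first)
    where(id: top_ids).index_by(&:id).values_at(*top_ids)
  end
end
```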
Questions for the collective:
If this sounds like a useful & interesting feature, I'll start digging deeper into design work and figuring out how to bite off small MVP chunks to test & validate the concept!