I really appreciated the clarity of presentation in your whitepaper, and I'm excited to explore this area because it seems abundantly clear that much of the practical value of LLMs is locked behind better context management.
I'd been thinking about this too, and the analogy to memory management in hardware, OS, and application architecture is a solid one! To squeeze the most performance and quality out of these models, we need to be much smarter about how we use context.
I wanted to continue a discussion I saw on Hacker News, which is now truncated since that site locks topics after just a few days. It's here: https://news.ycombinator.com/item?id=37899268, but I'll also copy the content below:
-- empath-nirvana 15 days ago
Is there any reason you're just doing everything within a single context window? I experimented with similar stuff months ago and basically parallelized everything into multiple requests to different agents in pre and post-processing steps. The main context window, for example, wasn't aware of memories being generated or retrieved. I had a post-processor just automatically generating memories and saving them, along with all the conversations being saved in a vector database, and a pre-processor that would automatically inject relevant memories and context based on the conversation, even re-writing the history so it would look to the main context window like the memory had always been there.
It saved a lot of space in the main context window for unnecessary system prompts and so on.
-- pacjam 15 days ago
These are all great points - who or what you ask to manage memory is a design decision and IMO there's two main ways to do it (in the context of chatbots):
1. implicit memory management, where the "main LLM" (or for chat, the "dialogue thread") is unaware that memory is being managed in the background (by a "memory LLM", a rule-based script, a small neural network, etc.), and
2. explicit memory management (MemGPT), where one LLM does everything
Prior research in multi-session / long-range chat is often implicit, with a designated memory creation process. If I had to guess, I'd say the vast majority of consumer chatbots that implement some type of memory store are also implicit. This is because getting explicit memory management to work requires a lot of complex instruction following, and in our experience this just isn't possible at the moment with most publicly available LLMs (we're actively looking into ways to fix this via eg fine-tuning open models).
The tradeoffs are as you mentioned: with implicit management, you don't have to stuff all the memory-management instructions into the LLM preprompt (in MemGPT, the total system message is ~1k tokens). On the other hand, explicit memory management (when the LLM works) makes the overall system a lot simpler - there's no need to manage multiple LLMs running on parallel threads, which can add a lot of overhead.
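To make the implicit style concrete (and the pre/post-processing pipeline empath-nirvana describes), here's a minimal sketch of what I picture. The `llm_chat` stub and the keyword-overlap recall are stand-ins I've invented, not anyone's actual implementation:

```python
# Minimal sketch of implicit memory management: a pre-processor injects
# recalled memories and a post-processor saves new ones, while the main
# LLM call never sees any memory-management instructions.
# `llm_chat` and the keyword-based recall are toy stand-ins.

memories: list[str] = []

def llm_chat(messages: list[dict]) -> str:
    return "(model reply)"  # placeholder for a real chat-completion call

def recall(query: str, k: int = 3) -> list[str]:
    # Stand-in for a vector-database similarity search.
    words = set(query.lower().split())
    scored = sorted(memories, key=lambda m: -len(words & set(m.lower().split())))
    return scored[:k]

def handle_turn(history: list[dict], user_msg: str) -> str:
    # Pre-process: splice memories in as if they had always been there.
    context = [{"role": "system",
                "content": "Relevant memories:\n" + "\n".join(recall(user_msg))}]
    context += history + [{"role": "user", "content": user_msg}]
    reply = llm_chat(context)
    # Post-process: record the turn; a cheap "memory LLM" could distill it first.
    memories.append(f"user: {user_msg} / assistant: {reply}")
    history += [{"role": "user", "content": user_msg},
                {"role": "assistant", "content": reply}]
    return reply
```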
I do appreciate the benefit of a simpler framework, but I wonder about one thing that hasn't been discussed: with separate agents, agent A can dedicate all of its context to the conversation, while agent B has access to agent A's context and is partly or wholly responsible for managing it. Both can execute at the same time given suitable hardware, which helps performance, and there should be few issues with pulling data out of the conversation for storage.
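Concretely, the overlap I'm imagining looks something like this: agent A keeps talking while agent B drains finished turns off a queue and writes memories in parallel. Both "models" here are stubs of my own invention; with real LLM calls the concurrency is what would actually pay off:

```python
# Sketch of the two-agent split: agent A owns the conversation context,
# agent B consumes completed turns and stores memories concurrently.
import asyncio

async def agent_a(turns: asyncio.Queue, user_msgs: list[str]) -> None:
    for msg in user_msgs:
        reply = f"reply to: {msg}"        # stand-in for the main LLM call
        await turns.put((msg, reply))     # hand off without blocking the chat
    await turns.put(None)                 # end-of-conversation sentinel

async def agent_b(turns: asyncio.Queue, store: list[str]) -> None:
    while (turn := await turns.get()) is not None:
        store.append(f"memory: {turn}")   # stand-in for a memory-writer LLM

async def main() -> None:
    queue, store = asyncio.Queue(), []
    await asyncio.gather(agent_a(queue, ["hi", "remember my cat is Max"]),
                         agent_b(queue, store))
    print(store)

asyncio.run(main())
```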
But what about fetching data?
My initial thought upon learning about MemGPT's approach was to look for ways to reduce the overhead of the huge system prompt, and I see another commenter already asked the same question. My other initial impression was that having an LLM iteratively slog through paginated chunks of haystack data hunting for its needle would be impractical in time and compute costs. But as long as the context is used somewhat efficiently, there's hardly a more efficient way to do it, at least while insisting on feeding that data through an LLM at all. I'd argue data retrieval shouldn't even be a deep-learning-class problem so much as a specialized database problem, so I'll handwave that one away for now. An LLM is certainly valuable for the last mile, massaging retrieved data for the end user or for itself.
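For reference, the iterative paging I'm picturing looks roughly like the sketch below; `ask_llm` is a stub I made up, and the early exit is the only thing keeping the cost sane:

```python
# Sketch of the paginated needle-in-a-haystack loop: show the model one
# page of archival text at a time and stop as soon as it finds the answer.
PAGE_SIZE = 2000  # characters per page; tune to the model's context budget

def ask_llm(prompt: str) -> str:
    return "NOT_FOUND"  # stub standing in for a real completion call

def paged_search(haystack: str, question: str, max_pages: int = 10) -> str | None:
    pages = [haystack[i:i + PAGE_SIZE] for i in range(0, len(haystack), PAGE_SIZE)]
    for page in pages[:max_pages]:
        answer = ask_llm(f"Question: {question}\n\nPage:\n{page}\n\n"
                         "If this page answers the question, reply with the "
                         "answer; otherwise reply NOT_FOUND.")
        if answer.strip() != "NOT_FOUND":
            return answer  # early exit: every extra page costs time and tokens
    return None  # worst case we paid for max_pages LLM calls and found nothing
```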
But anyway, I want to keep pulling on this thread... I suspect that having a separate agent (say, a 7B model fine-tuned for memory management) handle memory may not work well, because part of what makes MemGPT work is that the main model (which I gather was developed using GPT-4) is already in the driver's seat: it knows what its goals are and can inherently drive retrieval decisions in a sensible way.
So perhaps a hybrid approach would be more ideal: the main model can be freed from a heavy system prompt during active conversation and offload memory management to dumber components, but under other types of workloads (how to classify them is unclear) it would need to be increasingly involved and take on more of the system prompting...
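As a straw man for that routing, I'm imagining something like this, where the workload label (however we end up classifying it) decides how much memory-management instruction the main model carries. The prompts, labels, and function names here are all invented:

```python
# Straw-man sketch of the hybrid: casual chat gets a light prompt with
# memory handled in the background; memory-heavy workloads get the full
# self-management instructions.
LIGHT = "You are a helpful assistant."
HEAVY = (LIGHT + " You manage your own memory: emit memory_save(text) or "
                 "memory_search(query) calls whenever useful.")

def system_prompt(workload: str) -> str:
    # "chat" offloads memory to dumber background components;
    # anything else pulls the main model into the driver's seat.
    return LIGHT if workload == "chat" else HEAVY
```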
Come to think of it, even memory recording benefits from full intelligence, since a stronger model has a better grip on which aspects of a conversation matter. But if a memory system is actually good, it shouldn't be too detrimental for a dumber storage agent to record a few extra irrelevant facts in the database. You never know when they may become relevant later anyway.
Maybe another good approach would be to fine-tune (or train even deeper) the MemGPT system prompt into a model, so that the model itself naturally knows how to manage its own memory and we reclaim that context. But this hardly seems worthwhile when the same effort could go toward a much more flexible general-purpose model with a larger context size!
A second question I wanted to float: how does this compare to the RAG approaches I've been hearing about? Unfortunately I haven't found an easy-to-digest paper on the other work in this space, so I can't understand it well enough to compare. I suppose a big part of the reason is that MemGPT operates almost entirely at the upper levels of the system, which I understand, while papers that do anything under the hood leave me completely lost.
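From what I can tell, the contrast is roughly one-shot retrieval versus a retrieval loop the model itself steers. Here's a sketch of my (possibly wrong) mental model, with stubbed-out search and model calls and an invented SEARCH/ANSWER protocol:

```python
# Rough contrast between vanilla RAG and a MemGPT-flavored agentic loop.
# `vector_search` and `ask_llm` are hypothetical stand-ins.

def vector_search(query: str, k: int = 3) -> list[str]:
    return []  # stand-in for an embedding-similarity lookup

def ask_llm(prompt: str) -> str:
    return "ANSWER: (model reply)"  # stand-in for a real completion call

def rag_answer(question: str) -> str:
    # Vanilla RAG: retrieve once up front, answer with whatever came back.
    docs = vector_search(question, k=5)
    return ask_llm(f"Context:\n{docs}\n\nQuestion: {question}")

def agentic_answer(question: str, max_steps: int = 5) -> str:
    # Agentic: the model may keep issuing searches before committing to an answer.
    gathered: list[str] = []
    for _ in range(max_steps):
        out = ask_llm(f"Known so far:\n{gathered}\nQuestion: {question}\n"
                      "Reply either 'SEARCH: <query>' or 'ANSWER: <text>'.")
        if out.startswith("ANSWER:"):
            return out.removeprefix("ANSWER:").strip()
        gathered += vector_search(out.removeprefix("SEARCH:").strip())
    return "(gave up after max_steps searches)"
```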
Thanks!