Feature/caching #1922 (Open)

Wants to merge 11 commits into base: v2.6

Conversation

@hmoazam (Collaborator) commented Dec 10, 2024:

A single caching interface with two levels of cache: an in-memory LRU cache and a FanoutCache (on disk).
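A rough sketch of that two-level interface, assuming cachetools for the LRU level and diskcache for the on-disk level (class name, defaults, and the get/put surface here are illustrative, not the exact code in this PR):

from cachetools import LRUCache
from diskcache import FanoutCache

class Cache:
    """Two-level cache: an in-memory LRU cache in front of an on-disk FanoutCache."""

    def __init__(self, directory, mem_size_limit, disk_size_limit):
        self.memory_cache = LRUCache(maxsize=mem_size_limit)  # level 1: in memory
        self.fanout_cache = FanoutCache(
            shards=16, timeout=2, directory=directory, size_limit=disk_size_limit
        )  # level 2: on disk

    def get(self, key):
        value = self.memory_cache.get(key)
        if value is None:
            value = self.fanout_cache.get(key)  # fall back to disk
        return value

    def put(self, key, value):
        # Writes populate both levels so later reads can be served from memory first.
        self.memory_cache[key] = value
        self.fanout_cache.set(key, value)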

@hmoazam requested a review from @okhat on December 10, 2024 at 22:15
dspy/__init__.py Outdated
@@ -64,6 +66,13 @@
configure = settings.configure
context = settings.context

CACHE_DIR = os.environ.get("DSPY_CACHEDIR") or os.path.join(Path.home(), ".dspy_cache")
@okhat (Collaborator) commented Dec 10, 2024:

Note to self: .dspy_cache is already used for LiteLLM, so the paths in this PR should be swapped (it uses .dspy_litellm_cache for the existing cache, but that cache doesn't exist at that new path).

Also, the size limit should be 30 GB.
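A hedged sketch of the layout being asked for here (the new directory and constant names are assumptions, not what the PR ends up using): keep the existing LiteLLM cache at ~/.dspy_cache and give the new cache its own directory with a 30 GB size limit.

import os
from pathlib import Path

# Existing LiteLLM disk cache stays where it already lives.
LITELLM_CACHE_DIR = os.environ.get("DSPY_CACHEDIR") or os.path.join(Path.home(), ".dspy_cache")
# Hypothetical directory for the new cache (illustrative name only).
NEW_CACHE_DIR = os.path.join(Path.home(), ".dspy_fanout_cache")
DISK_CACHE_SIZE_LIMIT = 30 * 1024**3  # 30 GB, per the note above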

num_retries=num_retries,
)
from dspy import settings
@dspy_cache_decorator(settings.cache)
Collaborator:

Hmm, this is a strange pattern; is the inner function redefined every time? Does this mean the decorator... works? I guess it does, but it's probably a bit of overkill.

Collaborator (Author):

It does work, but it's definitely a weird pattern. Not 100% sure whether it has any negative side effects.
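For context, the pattern in question looks roughly like this (function and decorator names are taken from the diff; their bodies and signatures are assumed). The inner function is redefined and re-decorated on every outer call, so the cache decorator is applied at call time rather than import time:

from dspy import settings

def cached_litellm_text_completion(request, num_retries):
    # A fresh wrapper object is created on each call; dspy_cache_decorator and
    # litellm_text_completion are assumed to be in scope from the module in the diff.
    @dspy_cache_decorator(settings.cache)
    def cached_litellm_text_completion_inner(request, num_retries):
        return litellm_text_completion(request, num_retries=num_retries)

    return cached_litellm_text_completion_inner(request, num_retries)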

def cached_litellm_text_completion_inner(request: Dict[str, Any], num_retries: int):
return litellm_text_completion(
request,
cache={"no-cache": False, "no-store": False},
Collaborator:

This passes cache to the function below. But for some reason we removed the cache kwarg below? Why?

Collaborator (Author):

The default (both kwargs True) should be added back to the litellm_text_completion function (I forgot to fix this).

key = self.cache_key(request)
except Exception:
return None
with self.lock:
Collaborator:

Hmm, do we need to lock just to read?

Collaborator:

If we can remove this lock, we may want to do something like if value := self.memory_cache.get(key) to avoid double access.

Collaborator (Author):

I would think reads are thread-safe, but the docs aren't clear.
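A sketch of the suggested lock-free read path, assuming the surrounding get() method this hunk comes from (self.cache_key, self.memory_cache, and self.fanout_cache are names from the diff; the fallthrough to the disk cache is assumed):

def get(self, request):
    try:
        key = self.cache_key(request)
    except Exception:
        return None
    # The walrus operator touches the memory cache only once per lookup,
    # instead of a separate membership check followed by a second access.
    if (value := self.memory_cache.get(key)) is not None:
        return value
    return self.fanout_cache.get(key)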

@okhat (Collaborator) commented Dec 12, 2024:

Recording some thoughts here.

  1. We need to make sure that caching behavior makes sense if the function being cached raises an Exception (a rough sketch of one option follows below).
  2. We need to make sure that the caching behavior makes sense for LiteLLM. Specifically, LiteLLM's default caching is a bit nuanced: as far as I can tell, it does not cache api_* keys (which is important) and it does not cache metadata (like cost).
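On point 1, one hedged sketch of exception-safe behavior (cache.get/cache.put are placeholder methods here, not the final API): store a result only when the wrapped call succeeds, so exceptions are never cached and a later call retries the underlying function.

import functools

def dspy_cache_decorator(cache):  # name reused from the diff above; behavior is assumed
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(request, *args, **kwargs):
            cached = cache.get(request)
            if cached is not None:
                return cached
            result = fn(request, *args, **kwargs)  # exceptions propagate; nothing is stored
            cache.put(request, result)
            return result
        return wrapper
    return decorator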

@okhat changed the base branch from main to v2.6 on December 13, 2024 at 15:38
"""

self.memory_cache = LRUCache(maxsize=mem_size_limit)
self.fanout_cache = FanoutCache(shards=16, timeout=2, directory=directory, size_limit=disk_size_limit)
Collaborator:

@hmoazam As I understand it, part of the motivation for this PR is to enable users to dump the current state of the cache for a particular DSPy session.

Is the FanoutCache introduced for that purpose? If so, since FanoutCache shares the same directory as the litellm.cache, what extra utility is it providing?

Collaborator:

Separately, I think we should define the behavior for dumping / loading a cache.

For example, consider the case where I've already run "Program 1" as a Python script / notebook. Now, I'm running "Program 2" in a separate notebook / Python script invocation. I'm using the default DSPy cache directory for both program executions.

If I dump the cache after I run Program 2, do I also want to include cache contents from Program 1? Or do I only want to include cache contents from Program 2, since that's my "current session"? Should users have a choice?

Once we align on the desired semantics, we can figure out how to make adjustments to the implementation.

cc @okhat

Collaborator:

@dbczumar For disk caching, we will migrate from LiteLLM's caching to FanoutCache. The former has proved limiting when users need many read/write processes and threads in parallel.

For a short period of time, we might keep the LiteLLM cache as well. This means that highly active users that upgrade versions regularly will have an implicit seamless migration of their caches, because LiteLLM's requests will remain cached and will implicitly transfer to the FanoutCache. (This behavior is not required, and in principle there are other ways to more intentionally allow migration, but explicit migration is almost never worthwhile for caches.)

Cache dumping is fairly simple. It saves the current in-memory Python session, unless the user triggers a reset for the in-memory cache. (The current in-memory session draws on saved caches on disk, as usual.)

@dbczumar (Collaborator) commented Jan 8, 2025:

@okhat @hmoazam Apologies for the delay.

I took a look at this further and found that diskcache does not natively handle objects that can't be pickled (e.g. lambdas like LiteLLM's azure_ad_token_provider arg), but LiteLLM does. There's a nontrivial amount of logic that goes into this. It may be easier to add support for FanoutCache to LiteLLM, rather than implement our own diskcache wrapper. Are there other reasons that we need to move away from LiteLLM?
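A quick illustration of that pickling limitation (the request contents here are made up; only azure_ad_token_provider comes from the discussion above):

import pickle

request = {"model": "azure/gpt-4", "azure_ad_token_provider": lambda: "token"}
try:
    pickle.dumps(request)  # fails: functions defined inline can't be pickled
except Exception as exc:
    print(f"cannot pickle request: {exc}")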

Separately, it's not clear to me that dumping the in-memory Python session is what we want. If I understand correctly, the user's objective is to run a workload and store all of the cached values for that workload on disk. That way, another user can set DSPY_CACHEDIR to that disk location and automatically use the cached values. Dumping memory isn't quite the same thing because users might have memory limits set on their cache (we probably want to set a memory limit by default anyway), thereby unintentionally discarding cached values from larger workloads. Further, dumping memory to a separate location than the disk cache (as is done in this PR) complicates the experience of loading the cache (I can't just set DSPY_CACHEDIR to the desired location - I have to worry about loading this pickled memory state as well). To solve both of these problems, I'd propose an API like the following:

# Start recording (dual writing) cache results to the specified destination, which will be
# created if it does not exist
dspy.settings.cache.start_recording(destination="/path/to/cache/directory/that/I/want/to/share")
# Run workload
dspy.settings.cache.stop_recording() # Optional

When recording is enabled, results will be dual written to the main disk cache location (e.g. the one specified by DSPY_CACHEDIR) and the user-specified disk cache location. This way, recipients of the cache who want to reuse its values don't have to reason about loading stuff from disk vs memory - the snapshotted (recorded) cache looks just like any other DSPy disk cache (because it is :D).
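A hypothetical sketch of that dual-write behavior (the class and method names follow the proposal above; everything else is assumed): while recording is active, every write goes to both the primary disk cache and the recording destination, so the recorded directory is itself an ordinary DSPy disk cache.

from diskcache import FanoutCache

class Cache:
    def __init__(self, directory):
        self.fanout_cache = FanoutCache(directory=directory)
        self.recording_cache = None

    def start_recording(self, destination):
        # diskcache creates the destination directory if it does not exist.
        self.recording_cache = FanoutCache(directory=destination)

    def stop_recording(self):
        self.recording_cache = None

    def put(self, key, value):
        self.fanout_cache.set(key, value)
        if self.recording_cache is not None:
            self.recording_cache.set(key, value)  # dual write while recording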

@hmoazam (Author) commented Jan 8, 2025:

Thanks @dbczumar! I've been looking into the failing tests and was still figuring out the part with pickling. I'll let Omar comment on any other reasons for moving away from litellm's cache, but the main motivations were for us to have more options e.g. use fanoutcache, and offer additional methods like saving and loading.

We originally envisioned the user journey for saving/loading as the following:

  1. User executes a program; assuming caching is turned on, calls are cached both in memory and on disk.
  2. User decides they want to port some subset of the cache, so they run a new session of the notebook, executing only the calls they wish to export. Those calls end up in the in-memory cache and can then be dumped out with the save method. This means they do not have to reset the disk cache or create another one.
  3. User calls the load method with the path to the saved cache, which loads it into memory (a rough sketch of this save/load flow follows at the end of this comment).

I do think that we should keep the two levels of caching, and I do see we have a gap with point 3: we should also populate the disk cache with those values explicitly. Your suggestion does help us achieve a similar experience for users, but my main concern is whether we would run into issues simply because of the pickling limitations. I haven't gone through all the relevant litellm code yet, but with your proposed approach we still face that limitation from diskcache itself.
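For reference, the save/load flow sketched in the numbered steps above might look roughly like this (method names and the pickle format are assumptions, not the PR's final API): save dumps the in-memory cache to a file, and load reads it back into memory.

import pickle

def save(self, path):
    with open(path, "wb") as f:
        pickle.dump(dict(self.memory_cache), f)

def load(self, path):
    with open(path, "rb") as f:
        for key, value in pickle.load(f).items():
            self.memory_cache[key] = value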

@dbczumar (Collaborator) commented Jan 8, 2025:

@hmoazam We can also add Fanout to LiteLLM directly. My motivation here is that LiteLLM handles pickling on top of diskcache directly. The problem with building our own wrapper over diskcache is that diskcache does not handle pickling directly - i.e. we won't be able to save many valid LiteLLM request objects (e.g. requests with lambdas) to disk unless we reuse LiteLLM's pickle compatibility logic (which is private / internal) or roll our own. This is a problem even without memory pickling (two separate issues).

Regarding save / load, I think this is what the memory / dumping logic in your PR (and my proposed start_recording() API) is attempting to handle. We can still wrap LiteLLM to achieve this; I don't think save / load is sufficient motivation for rewriting the wrapper around diskcache to achieve this (why not wrap litellm.caching.Cache, which saves us all the pickle + disk headaches?)

> User decides they want to port some subset of the cache, so they run a new session of the notebook, executing only the calls they wish to export, which end up in the in-memory cache, and can then be dumped out with the save method. This means they do not have to reset the disk cache or create another one.

  • By definition porting means creating another cache (let me know if I'm missing something)

  • From a solution standpoint, "dump memory" is an indirect way of achieving portability. It relies on the assumption that all of the data in the new session is cached in memory and that, therefore, dumping memory will create the desired "port" (subset copy) of the cache. However, this may not be the case if the user has set memory limits on the cache (in which case some of the data from the session is only on disk and never gets ported), and it also relies on the user clearing memory before running the session to ensure a "clean port" (which can be inefficient if most of the data is already in memory, and only a small subset needs to be excluded from the port). "start_recording()" is more explicit, doesn't require clearing a memory cache, and doesn't risk data loss from the cache port due to configured memory limits.

