Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed cache strategy #5

Open
AlexMikhalev opened this issue Mar 15, 2016 · 7 comments
Open

Distributed cache strategy #5

AlexMikhalev opened this issue Mar 15, 2016 · 7 comments

Comments

@AlexMikhalev
Copy link

Hello Fellows,
I am looking at using dask for distributed computation and I was wondering what is your strategy to make cachey into horizontally scalable cache and common calculation space?

Regards,
Alex

@mrocklin
Copy link
Member

Are you already familiar with dask's distributed scheduler at http://distributed.readthedocs.org/en/latest/ ?

@AlexMikhalev
Copy link
Author

Hello Matt. Yes, I have seen distributed. What I am after is some sort of key value storage for storing results and available on multiple nodes locally. Is it something inside distributed?
A

@mrocklin
Copy link
Member

The distributed scheduler maintains distributed data indexed by key in the active memory of multiple nodes. It doesn't persist to disk but easily could by running some function against the data. I would not recommend using distributed as a persistent key-value store or as a database but instead as a staging ground for distributed computations. For persistence I would generally recommend some sort of parallel storage solution like HDFS, Ceph, or a more traditional Posix file system. It tends to be fairly easy to write ingest functions to read from these with distributed in a data local way.

In [1]: from distributed import Executor

In [2]: e = Executor('127.0.0.1:8786')

In [3]: futures = e.scatter(range(10))

In [4]: futures
Out[4]: 
[<Future: status: finished, key: f1b37646b1331806c57001fca979724e>,
 <Future: status: finished, key: 8068024105615db5ba4cbd263d42a6db>,
 <Future: status: finished, key: d1d6feb3b70239a6081e64b7272d18c8>,
 <Future: status: finished, key: 613f9ec13ceecaed260a81ade5f37311>,
 <Future: status: finished, key: 205604040f767e399cb8b3c9322abdaa>,
 <Future: status: finished, key: 7b4b5c735acdd7ee3b2fafd41bc5be78>,
 <Future: status: finished, key: 3ab29504c46cb29e851081c84b5e4249>,
 <Future: status: finished, key: f7aa9090f9d4af3b10066e98993388de>,
 <Future: status: finished, key: 25e70ec51f367dbbff88d67a9684a5a6>,
 <Future: status: finished, key: 42cb136fea0b4175909749d31c9a6c84>]

In [5]: e.has_what()
Out[5]: 
{'192.168.1.141:33088': ['d1d6feb3b70239a6081e64b7272d18c8',
  '25e70ec51f367dbbff88d67a9684a5a6',
  '8068024105615db5ba4cbd263d42a6db',
  '613f9ec13ceecaed260a81ade5f37311',
  'f7aa9090f9d4af3b10066e98993388de',
  '205604040f767e399cb8b3c9322abdaa',
  '7b4b5c735acdd7ee3b2fafd41bc5be78',
  '3ab29504c46cb29e851081c84b5e4249'],
 '192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
  '42cb136fea0b4175909749d31c9a6c84']}

In [6]: def inc(x):
    return x + 1
   ...: 

In [7]: futures2 = e.map(inc, futures)

In [8]: e.has_what()
Out[8]: 
{'192.168.1.141:33088': ['inc-6a9bbed6c5c042aac23ac3739b3080ab',
  'inc-80a8293859f4e85484b3fa9f5ae37e66',
  '3ab29504c46cb29e851081c84b5e4249',
  'inc-a684480530c8117e48025ba71b62a8fb',
  '25e70ec51f367dbbff88d67a9684a5a6',
  'd1d6feb3b70239a6081e64b7272d18c8',
  'inc-9824ed6eb573348151189779d5fa084a',
  '8068024105615db5ba4cbd263d42a6db',
  '613f9ec13ceecaed260a81ade5f37311',
  'inc-585e542b4025afe01b51bbab1e738cf6',
  'f7aa9090f9d4af3b10066e98993388de',
  'inc-b7ea7bc3d40dfc9dbaaab71d5a9f7a0a',
  'inc-7348f0bb324a145903cbe7c249e062e4',
  '205604040f767e399cb8b3c9322abdaa',
  '7b4b5c735acdd7ee3b2fafd41bc5be78',
  'inc-4ade3c6e95e45a1afc31f887a57733f5'],
 '192.168.1.141:52787': ['f1b37646b1331806c57001fca979724e',
  '42cb136fea0b4175909749d31c9a6c84',
  'inc-b9caf9d500d82c4b712e016a1263da32',
  'inc-80d3fd1f19423d863391d5d468b29e54']}

@AlexMikhalev
Copy link
Author

Thank you. I was thinking about plugging something like tarantool or Riak.

@mrocklin
Copy link
Member

That could be a very fun combination

@AlexMikhalev
Copy link
Author

Matt, I am interested in your opinition, I was inspired by http://mxnet.readthedocs.org/en/latest/distributed_training.html#how-to-write-a-distributed-program-on-mxnet
and I think Tarantool is an interesting storage for high cost values.

@jakirkham
Copy link
Member

Sorry to necro this old thread, but figured this might be worth mentioning. Zarr recently added a Redis store ( zarr-developers/zarr-python#372 ), which is an in-memory key-value store. It implements the MutableMapping API. So should be a nice thing to use with Cachey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants