Persistent storage of matrices that enables quick indexed lookup #9

dhimmel opened this issue Jul 25, 2016 · 4 comments

dhimmel commented Jul 25, 2016

Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term, interoperable storage. However, we'd like a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at cognoma/cognoma#17 (comment). We want a persistent storage format (i.e. a file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
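
For concreteness, here is a minimal sketch of what such an indexed read could look like, assuming we converted the TSV into an HDF5 file via pandas' HDFStore "table" format (requires the PyTables package). The file name, store key, and sample/gene identifiers are hypothetical, and very wide matrices may need extra care with this format:

```python
import pandas as pd

# One-time conversion: read the compressed TSV and store it as an
# indexed, queryable HDF5 table (hypothetical file name and key).
df = pd.read_csv("expression.tsv.bz2", sep="\t", index_col=0)
with pd.HDFStore("expression.h5", mode="w") as store:
    store.put("expression", df, format="table")

# Later: query only the requested rows and columns
# (hypothetical sample and gene identifiers).
with pd.HDFStore("expression.h5", mode="r") as store:
    subset = store.select(
        "expression",
        where='index in ["TCGA-02-0001", "TCGA-02-0003"]',
        columns=["TP53", "KRAS"],
    )
```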

A primary benchmark for judging implementations is how much time you save over reading the entire bzipped TSV into Python via pandas, across a variety of setups.
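
A rough sketch of how that benchmark could be run, assuming the hypothetical expression.h5 file from the snippet above:

```python
import time

import pandas as pd

# Baseline: read the entire bzipped TSV into a dataframe.
start = time.time()
full = pd.read_csv("expression.tsv.bz2", sep="\t", index_col=0)
print("full bzipped TSV read: {:.1f} s".format(time.time() - start))

# Candidate: indexed read of a small subset from the HDF5 table.
start = time.time()
with pd.HDFStore("expression.h5", mode="r") as store:
    subset = store.select(
        "expression",
        where='index in ["TCGA-02-0001"]',
        columns=["TP53"],
    )
print("indexed HDF5 read: {:.1f} s".format(time.time() - start))
```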

dhimmel added the task label Jul 25, 2016
dhimmel changed the title from "Persistent storage that enables quick indexed lookup" to "Persistent storage of matrices that enables quick indexed lookup" Jul 25, 2016
@clairemcleod

Questions from the group at the Tuesday night discussion: Do you anticipate complete randomness in the subselection (i.e. totally user-selected), or is there some structure that governs what might be asked for? In other words, is chunking an option?

--> Perhaps a cached or database format might be more appropriate? Or a microservice? An advantage of a microservice would be the ability to scale in response to demand.
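
For reference, a minimal sketch of what row-wise chunking could look like with pandas; note that it still scans the whole file, so it mainly bounds memory rather than time (file name and row labels are hypothetical):

```python
import pandas as pd

wanted_rows = {"TCGA-02-0001", "TCGA-02-0003"}  # hypothetical sample IDs
pieces = []
# Stream the compressed TSV in blocks of 1,000 rows, keeping only matches.
for chunk in pd.read_csv("expression.tsv.bz2", sep="\t", index_col=0,
                         chunksize=1000):
    pieces.append(chunk[chunk.index.isin(wanted_rows)])
subset = pd.concat(pieces)
```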


dhimmel commented Jul 27, 2016

> Do you anticipate complete randomness in the subselection (i.e. totally user selected)

Yes, we should be prepared to serve any combination of rows.

> Perhaps a cached or database format might be more appropriate? Or a microservice? An advantage of a microservice would be the ability to respond to demand.

I like solutions that don't require any running services. Life is so much easier when all you need is a single file. Another option is feather, which is a binary format for storing dataframes. While it doesn't support indexed reading (reading only a subset of the overall dataset), it's supposedly really quick.
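
For illustration, a minimal sketch of the feather option using the feather-format Python package (`pip install feather-format`); the file and index names are hypothetical:

```python
import feather
import pandas as pd

df = pd.read_csv("expression.tsv.bz2", sep="\t", index_col=0)
df.index.name = "sample_id"  # hypothetical; feather stores columns only

# Write and read back the whole dataframe; fast, but no indexed/subset reads.
feather.write_dataframe(df.reset_index(), "expression.feather")
restored = feather.read_dataframe("expression.feather").set_index("sample_id")
```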

Currently, it's not too slow to read the full files, so this may be premature optimization... we could stick with TSV until it becomes a bottleneck?

@clairemcleod

Tagging @stephenshank and @mike19106, who I think were both interested in this topic.


awm33 commented Aug 12, 2016

We may be running a single job per worker instance at a time, with multiple jobs running concurrently via multiple instances. I like that approach mostly because isolated jobs are less likely to interfere with each other.

What makes that relevant to this discussion and to cognoma/cognoma#17 is that we can dedicate a decent amount of memory per job, so in-memory caching becomes more feasible.
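
A minimal sketch of what per-worker in-memory caching could look like, assuming one job at a time per worker process (the loader function and file path are hypothetical):

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=1)
def load_expression():
    # Read the full matrix once per worker process and keep it in memory.
    return pd.read_csv("expression.tsv.bz2", sep="\t", index_col=0)

def get_subset(sample_ids, gene_ids):
    # Subsequent jobs on the same worker reuse the cached dataframe.
    return load_expression().loc[list(sample_ids), list(gene_ids)]
```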
