Add option to list details of available datasets, rename =help to =list #24

Merged (1 commit) on Feb 15, 2024

daverigby
Collaborator

Add a new value for --pinecone-datasets: 'list-details'. This lists
all available datasets along with every metadata field available for
them. That is probably more detail than users typically need, but it
is useful for development and debugging.

Example output from a subset of the datasets:

name                                          created_at                      documents    queries  source                                                                                                                                                      bucket                                                           task               dense_model                                                                                      sparse_model                                                            license    description    tags    args
--------------------------------------------  ----------------------------  -----------  ---------  ----------------------------------------------------------------------------------------------------------------------------------------------------------  ---------------------------------------------------------------  -----------------  -----------------------------------------------------------------------------------------------  ----------------------------------------------------------------------  ---------  -------------  ------  ------
ANN_DEEP1B_d96_angular                        2023-03-10 14:17:01.481785        9990000      10000  https://github.com/erikbern/ann-benchmarks                                                                                                                  gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular                ANN                {'name': 'ANN benchmark dense model', 'tokenizer': None, 'dimension': 96}                        {'name': None, 'tokenizer': None}
langchain-python-docs-text-embedding-ada-002  2023-06-27                           3476          0  https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27                                                                                                                                                                              {'name': 'text-embedding-ada-002', 'tokenizer': None, 'dimension': 1536}
quora_all-MiniLM-L6-bm25-100K                 2023-06-25 10:00:00.000000         100000      15000  https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs                                                                                      gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25              similar questions  {'name': 'sentence-transformers/msmarco-MiniLM-L6-cos-v5', 'tokenizer': None, 'dimension': 384}  {'name': 'naver/splade-cocondenser-ensembledistil', 'tokenizer': None}
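
For readers unfamiliar with the two modes, the following is a minimal, self-contained sketch of how a summary listing ('list') and a full listing ('list-details') could differ. It is not the code from this PR: the `CATALOG` records, `SUMMARY_COLUMNS`, `render_table`, `list_datasets`, and the use of `argparse` are all hypothetical stand-ins, and the choice of which columns 'list' prints is an assumption. Only the option name and its two values come from the description above.

```python
import argparse

# Hypothetical in-memory catalog; the values are copied from the example
# output above, but the real tool obtains its catalog elsewhere.
CATALOG = [
    {"name": "ANN_DEEP1B_d96_angular", "documents": 9990000,
     "queries": 10000, "task": "ANN"},
    {"name": "quora_all-MiniLM-L6-bm25-100K", "documents": 100000,
     "queries": 15000, "task": "similar questions"},
]

# Assumption: 'list' shows only a few summary columns.
SUMMARY_COLUMNS = ["name", "documents", "queries"]


def render_table(rows, columns):
    """Render rows as left-aligned, space-padded columns under a dashed rule."""
    cells = [[str(row.get(col, "")) for col in columns] for row in rows]
    widths = [
        max([len(col)] + [len(line[i]) for line in cells])
        for i, col in enumerate(columns)
    ]

    def fmt(values):
        return "  ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    lines = [fmt(columns), "  ".join("-" * w for w in widths)]
    lines.extend(fmt(line) for line in cells)
    return "\n".join(lines)


def list_datasets(mode, catalog=CATALOG):
    """Return the table for 'list' (summary) or 'list-details' (all fields)."""
    if mode == "list":
        columns = SUMMARY_COLUMNS
    elif mode == "list-details":
        # Every key seen in any record, in first-seen order.
        columns = list(dict.fromkeys(key for row in catalog for key in row))
    else:
        raise ValueError(f"unsupported listing mode: {mode!r}")
    return render_table(catalog, columns)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--pinecone-datasets", metavar="NAME|list|list-details")
    args = parser.parse_args(argv)
    if args.pinecone_datasets in ("list", "list-details"):
        print(list_datasets(args.pinecone_datasets))


# Example invocation with an explicit argument list.
main(["--pinecone-datasets=list"])
```

Deriving the detail columns from the records themselves means a newly added metadata field shows up in 'list-details' without further changes, which suits a debugging-oriented view. Unlike the example output above, this sketch left-aligns numeric columns.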

Type of Change

  • New feature (non-breaking change which adds functionality)

Test Plan

Unit tests updated to cover new functionality.

daverigby marked this pull request as ready for review on February 14, 2024, 11:54
Rename --pinecone-datasets=help to --pinecone-datasets=list.

daverigby merged commit b484f90 into main on Feb 15, 2024
7 checks passed
daverigby deleted the datasets_list_details branch on February 15, 2024, 14:42