Add option to list details of available datasets, rename =help to =list #24

daverigby · 2024-02-14T11:52:55Z

Add a new value for --pinecone-datasets: 'list-details'. This lists
all available datasets, including all information available about
each one. This is probably more detail than users would typically need,
but it can be useful for development and debugging.
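
As a rough illustration of how one option can carry both listing modes, here is a minimal Python sketch. It is not the code from this PR: `CATALOGUE`, `handle_datasets_option` and `main` are hypothetical names, the catalogue entries are trimmed copies of rows from the example output, and treating any other value as a dataset name is an assumption.

```python
import argparse

# Hypothetical stand-in for the real dataset catalogue (assumption, not project code).
CATALOGUE = [
    {"name": "ANN_DEEP1B_d96_angular", "documents": 9990000, "queries": 10000},
    {"name": "quora_all-MiniLM-L6-bm25-100K", "documents": 100000, "queries": 15000},
]


def handle_datasets_option(value: str) -> list[str]:
    """Return the lines to print for a given --pinecone-datasets value."""
    if value == "list":
        # Short form: dataset names only.
        return [d["name"] for d in CATALOGUE]
    if value == "list-details":
        # Long form: every field known for each dataset.
        return [", ".join(f"{k}={v}" for k, v in d.items()) for d in CATALOGUE]
    # Assumption: any other value names a dataset to use.
    return [f"selected dataset: {value}"]


def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--pinecone-datasets", metavar="NAME|list|list-details")
    args = parser.parse_args(argv)
    if args.pinecone_datasets is not None:
        for line in handle_datasets_option(args.pinecone_datasets):
            print(line)
```

Calling `main(["--pinecone-datasets=list"])` prints one dataset name per line, while `main(["--pinecone-datasets=list-details"])` prints every field of each entry.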

Example output from a subset of the datasets:

name                                          created_at                      documents    queries  source                                                                                                                                                      bucket                                                           task               dense_model                                                                                      sparse_model                                                            license    description    tags    args
--------------------------------------------  ----------------------------  -----------  ---------  ----------------------------------------------------------------------------------------------------------------------------------------------------------  ---------------------------------------------------------------  -----------------  -----------------------------------------------------------------------------------------------  ----------------------------------------------------------------------  ---------  -------------  ------  ------
ANN_DEEP1B_d96_angular                        2023-03-10 14:17:01.481785        9990000      10000  https://github.com/erikbern/ann-benchmarks                                                                                                                  gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular                ANN                {'name': 'ANN benchmark dense model', 'tokenizer': None, 'dimension': 96}                        {'name': None, 'tokenizer': None}
langchain-python-docs-text-embedding-ada-002  2023-06-27                           3476          0  https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27                                                                                                                                                                              {'name': 'text-embedding-ada-002', 'tokenizer': None, 'dimension': 1536}
quora_all-MiniLM-L6-bm25-100K                 2023-06-25 10:00:00.000000         100000      15000  https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs                                                                                      gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25              similar questions  {'name': 'sentence-transformers/msmarco-MiniLM-L6-cos-v5', 'tokenizer': None, 'dimension': 384}  {'name': 'naver/splade-cocondenser-ensembledistil', 'tokenizer': None}
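
The layout above (a header row, a dashed rule, then one aligned row per dataset with blank cells for missing fields) can be approximated with the standard library alone. The following is a sketch under that assumption and does not show how the PR renders its table. `render_table` is a hypothetical helper, and unlike the output above it left-aligns every column, including numeric ones.

```python
def render_table(rows: list[dict]) -> str:
    """Render dicts as a left-aligned text table with a dashed header rule."""
    # Collect column names in first-seen order so sparse rows still line up.
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    # Missing or None values become empty cells.
    cells = [
        ["" if row.get(c) is None else str(row[c]) for c in columns]
        for row in rows
    ]

    # Each column is as wide as its longest cell or its header.
    widths = [
        max([len(c)] + [len(r[i]) for r in cells])
        for i, c in enumerate(columns)
    ]

    def fmt(values: list[str]) -> str:
        # Pad each cell to its column width; drop trailing padding.
        return "  ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    lines = [fmt(columns), "  ".join("-" * w for w in widths)]
    lines += [fmt(r) for r in cells]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table([
        {"name": "ANN_DEEP1B_d96_angular", "documents": 9990000, "task": "ANN"},
        {"name": "langchain-python-docs-text-embedding-ada-002", "documents": 3476},
    ]))
```

Rows that lack a field, such as the second one in the usage example, simply leave that cell empty, which matches the blank bucket and task cells in the output above.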

Type of Change

New feature (non-breaking change which adds functionality)

Test Plan

Unit tests updated to cover new functionality.

Rename --pinecone-datasets=help to --pinecone-datasets=list.

daverigby force-pushed the datasets_list_details branch from b9e686c to a5aeb9f Compare February 14, 2024 11:53

daverigby marked this pull request as ready for review February 14, 2024 11:54

daverigby force-pushed the datasets_list_details branch from a5aeb9f to 555613d Compare February 15, 2024 09:09

daverigby merged commit b484f90 into main Feb 15, 2024
7 checks passed

daverigby deleted the datasets_list_details branch February 15, 2024 14:42
