Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build_faiss command #601

Open
wants to merge 16 commits into
base: dev
Choose a base branch
from
126 changes: 126 additions & 0 deletions docs/analysis/build_faiss.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
## Overview

The `build_faiss` command creates a trained Faiss index for performing nearest neighbor searches on a given set of vectors. This is intended to be used for graph embeddings, so we accept the output formats of the `graph_embeddings` command as input and set default parameter values to those suitable for a graph with ~50M nodes.

## Input format

There are three supported formats: glove, w2v, and kgtk.

### glove format
When using this format, the input is a .tsv file where each line is the embedding for a node. Each line is represented by a single node followed respectively by the components of its embedding, each in a different column, all separated by tabs. For example:
```
"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
```

### w2v format (default)
When using this format, the input is a .tsv file which is almost the same as the glove format. The only difference is that the word2vec format has a first line which indicates the shape of the embeddings (e.g., "9 4" for 9 entities with 4 dimensions). Each column of the first line is separated by tabs. For example:
```
16213 100
"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
"home" -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ...
```
Here 16231 represents the number of nodes, 100 represents the dimension of each node's embedding.

### kgtk format
When using this format, the input is a .tsv file where each line contains three columns describing the embedding for a node. In each line, the first column contains a node, the second contains a label (which we ignore), and the third contains the node's embedding as a comma-separated string of floats. The first row of the file is assumed to contain a header unless specified otherwise by the `--no_input_header` option. For example:
```
node1 label node2
Q5 graph_embeddings 014022544,-0.062030070,-0.012535412,0.038317516
Q6 graph_embeddings 014022544,-0.062030070,-0.012535412,0.038317516
```

## Output

This command saves the created Faiss index in a `.idx` file and optionally saves a second file that contains a mapping from each node's index in the Faiss index to itself.

## Usage

```bash
usage: kgtk build-faiss [-h] -i EMBEDDINGS_FILE -o INDEX_FILE_OUT [-id2n INDEX_TO_QNODE_FILE_OUT] [-ef kgtk|w2v|glove]
[--no_input_header [True|False]] [-te MAX_TRAIN_EXAMPLES] [-w WORKERS] [-is INDEX_STRING]
[-m Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon] [-ma METRIC_ARG]
[-v [optional True|False]]

Train and populate a faiss index that can compute nearest neighbors of given embeddings.

optional arguments:
-h, --help show this help message and exit
-i EMBEDDINGS_FILE, --input_file EMBEDDINGS_FILE, --embeddings_file EMBEDDINGS_FILE
Input file containing the embeddings for which a Faiss index will be created.
-o INDEX_FILE_OUT, --output_file INDEX_FILE_OUT, --index_file_out INDEX_FILE_OUT
Output .idx file where the index fill be saved.
-id2n INDEX_TO_NODE_FILE_OUT, --index_to_node_file_out INDEX_TO_NODE_FILE_OUT
Output Kgtk-format file containing index --> node.
-ef kgtk|w2v|glove, --embeddings_format kgtk|w2v|glove
Format of the input embeddings [Default: w2v] Choice: kgtk | w2v | glove
--no_input_header [True|False]
If your input embeddings file is in KGTK format, this allows you to specify if it has a header
line or not.
-te MAX_TRAIN_EXAMPLES, --max_train_examples MAX_TRAIN_EXAMPLES
The maximum number of embeddings that will be used to train the index.
-w WORKERS, --workers WORKERS
The number of worker processes for training. If not given, set to CPU count.
-is INDEX_STRING, --index_string INDEX_STRING
A string denoting the type of index to be used. This will be passed to faiss.index_factory()
-m Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon, --metric_type Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon
A string denoting the Faiss metric to be used. This will be passed to faiss.index_factory().
-ma METRIC_ARG, --metric_arg METRIC_ARG
If you choose Lp as your metric_type, this parameter should be used to specify the value of p
to use.

-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
```

## Index design considerations

There are several parameters that affect the index's performance in terms of accuracy, space, query time, and training time.

### Index_string

The `index_string` is passed to faiss.index_factory() and controls many options for how the index will be built. For further details on how to format this string, please refer to [the official Faiss documentation](https://github.com/facebookresearch/faiss/wiki/The-index-factory).

When deciding on an index to use, please refer to [this documentation](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index). One important decision to be made is the number of centroids to learn. This decision will impact training-time and query-time. A higher number of centroids will lead to longer training time and shorter query time. A heuristic to follow here is to choose a function of `sqrt(N)` where `N` is the number of nodes to index. As mentioned previously, the default parameter values have been chosen to handle an input size of ~50M nodes. The Faiss documentation linked above advises choosing 262,144 (2^18) as the number of centroids for this input size. However, we instead choose 8,192 (2^13) as this is ~1*sqrt(50M), and this significantly decreases training time while keeping query time reasonable. If query time is too slow for your use-case, increase the number of centroids accordingly.

### Max_train_examples

This parameter limits the number of vectors that will be used to train the index. Having more vectors to train on will make the index more accurate, however there are diminishing returns. Based on [this discussion](https://github.com/facebookresearch/faiss/issues/126), having k*1000 training examples where k is the number of centroids is safely sufficient. Therefore, we set a default value of 10M based on the default number of centroids chosen for the `index_string` parameter.

## Notes on memory usage

Faiss stores all vectors in memory, so running this command (as well as later loading the index it creates) will require that you can fit all vectors in memory. For reference, running this command with the default settings for ~50M vectors of dimension 100 creates a .idx file that is ~21GB. If you have stricter memory requirements, you can choose to compress your vectors (relevant Faiss documentation [here](https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint) which will have the tradeoff of causing the distance calculations to be estimates. Opting for vector compression can be done via the `index_string`.

Another option for reducing RAM requirements is to [shard the index](https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM). This command does not support index sharding at this time.


## Examples

### Example 1
Default settings for w2v-format embeddings
```
kgtk build_faiss -i w2v_embeddings.tsv -o index_file_out.idx
```

The index will be saved in `index_file_out.idx`


### Example 2
Running with more specific parameters (glove format input embeddings, saving the index-->node mapping to a file, 65,536 clusters for the index to learn, 25M vectors to train on, and using inner product as the distance metric):
```
kgtk build_faiss -i glove_embeddings.tsv -o index_file_out.idx
--embedding_format glove \
--index_to_node_file_out id2n_mapping.tsv \
--index_string IVF65536_HNSW32,Flat \
--max_train_examples 25000000 \
--metric_type Inner_product
```

The index will be saved in `index_file_out.idx` and the index-->node mapping file will be saved in `id2n_mapping.tsv`.
`id2n_mapping.tsv` may look like
```
node1 label node2
0 index_to_node Q30
1 index_to_node Q5
...
```

111 changes: 111 additions & 0 deletions kgtk/cli/build_faiss.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
"""
Train and populate a faiss index that can compute nearest neighbors of given embeddings.
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename the command and the file in cli to build-faiss-index`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

from argparse import Namespace
from kgtk.cli_argparse import KGTKArgumentParser


def parser():
return {
'help': 'Train and populate a faiss index that can compute nearest neighbors of given embeddings.',
'description': 'Train and populate a faiss index that can compute nearest neighbors of given embeddings.'
}


def add_arguments_extended(parser: KGTKArgumentParser, parsed_shared_args: Namespace):
"""
Parse arguments
Args:
parser (kgtk.cli_argparse.KGTKArgumentParser)
"""

# import modules locally
from kgtk.io.kgtkreader import KgtkReader, KgtkReaderOptions, KgtkReaderMode
from kgtk.utils.argparsehelpers import optional_bool # not used yet
from kgtk.value.kgtkvalueoptions import KgtkValueOptions

_expert: bool = parsed_shared_args._expert

# REQUIRED #
# Related to input file
parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a add_input_file function, please use it
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L45

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

required=True, metavar="EMBEDDINGS_FILE",
help='Input file containing the embeddings for which a Faiss index will be created.')

# Related to output
parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

required=True, help="Output .idx file where the index fill be saved.",
metavar="INDEX_FILE_OUT")

parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the secondary options should only have -, instead of underscore.

--index_to_node_file_out should be --index-to-node-file-out.

Also as a general rule, do not use stop words like to in the parameter name, This review comment applies to all the subsequent parameters.

The dest parameter is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason to avoid stop words simply for brevity? I am struggling to think of a name for this that omits 'to' and isn't ambiguous. This parameter specifies the path where a kgtk file will be saved. The kgtk file contains a mapping of index to corresponding node. Do you have a suggestion for this parameter name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain a bit more as to what exactly this file contains? mapping of what index ?how is this different from the output file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you use a faiss index to search for nearest neighbors, it returns distances and corresponding indexes/IDs of the nearest neighbors, rather than the names or embeddings of the nearest neighbors. This file would allow you to look up the entity name that corresponds to the index/ID. Here's an example of what the file would look like:

Input embedding file:

q1 <embedding>
q2 <embedding>
q3 <embedding>

output index_to_node file:
node1 label node2
0 index_to_node q1
1 index_to_node q2
2 index_to_node q3

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I could use the word ID rather than index here to avoid confusion. Then I could call this 'node_id_file_out' to avoid 'to'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, using the stop word to makes sense here. I would call this --faiss-id-to-node-mapping-file As you can, brevity is not the issue, meaningful name is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

required=False, help="Output Kgtk-format file containing index --> node.",
default=None, metavar="INDEX_TO_NODE_FILE_OUT")

# OPTIONAL #
# Related to input file
parser.add_argument('-ef', '--embeddings_format', action='store', type=str,
dest='embeddings_format', required=False,
help='Format of the input embeddings [Default: w2v] Choice: kgtk | w2v | glove',
default="w2v", choices=["kgtk", "w2v", "glove"], metavar="kgtk|w2v|glove")

parser.add_argument('--no_input_header', action='store', type=optional_bool, dest="no_input_header",
required=False, help='If your input embeddings file is in KGTK format, this ' +
'allows you to specify if it has a header line or not.',
const=True, nargs='?', default=False, metavar='True|False')

# Related to index building
parser.add_argument('-te', '--max_train_examples', action='store', type=int,
dest='max_train_examples', required=False, default=10000000,
help="The maximum number of embeddings that will be used to train the index.")

parser.add_argument('-w', '--workers', action='store', type=int, dest='workers', required=False,
help="The number of worker processes for training. If not given, set to CPU count.",
default=None)

parser.add_argument('-is', '--index_string', action='store', type=str, dest='index_string', required=False,
help="A string denoting the type of index to be used. This will be passed to " +
"faiss.index_factory()",
default="IVF8192_HNSW32,Flat")

parser.add_argument('-m', '--metric_type', action='store', type=str, dest='metric_type', required=False,
help="A string denoting the Faiss metric to be used. This will be passed to " +
"faiss.index_factory().", default="L2",
choices=["Inner_product", "L2", "L1", "Linf", "Lp", "Canberra", "BrayCurtis", "JensenShannon"],
metavar="Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon")

parser.add_argument('-ma', '--metric_arg', action='store', type=float, dest='metric_arg', required=False,
help="If you choose Lp as your metric_type, this parameter should be used to " +
"specify the value of p to use.",
default=None)

KgtkReader.add_debug_arguments(parser, expert=_expert)
KgtkReaderOptions.add_arguments(parser,
mode_options=True,
default_mode=KgtkReaderMode[parsed_shared_args._mode],
expert=_expert)
KgtkValueOptions.add_arguments(parser)


def run(**kwargs):
from kgtk.exceptions import KGTKException
from kgtk.graph_embeddings.build_faiss import build_faiss
try:
build_faiss(kwargs['embeddings_file'],
kwargs['embeddings_format'],
kwargs['no_input_header'],
kwargs['index_file_out'],
kwargs['index_to_node_file_out'],
kwargs['max_train_examples'],
kwargs['workers'],
kwargs['index_string'],
kwargs['metric_type'],
kwargs['metric_arg'],
kwargs['verbose'])

except SystemExit as e:
raise KGTKException("Exit requested")
except KGTKException as e:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise the caught exception

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

except Exception as e:
raise KGTKException(str(e))
Loading