usc-isi-i2 · nicklein · Dec 8, 2021 · Dec 18, 2021 · Jan 4, 2022 · Jan 11, 2022
diff --git a/docs/analysis/build_faiss.md b/docs/analysis/build_faiss.md
@@ -0,0 +1,126 @@
+## Overview
+
+The `build_faiss` command creates a trained Faiss index for performing nearest neighbor searches on a given set of vectors. This is intended to be used for graph embeddings, so we accept the output formats of the `graph_embeddings` command as input and set default parameter values to those suitable for a graph with ~50M nodes. 
+
+## Input format
+
+There are three supported formats: glove, w2v, and kgtk.
+
+### glove format
+When using this format, the input is a .tsv file where each line is the embedding for a node. Each line consists of a node identifier followed by the components of its embedding, one per column, all separated by tabs. For example:
+```
+"work"    -0.014022544    -0.062030070    -0.012535412    -0.023111001    -0.038317516 ...
+```
+
+### w2v format (default)
+When using this format, the input is a .tsv file which is almost the same as the glove format. The only difference is that the word2vec format has a first line which indicates the shape of the embeddings (e.g., "9 4" for 9 entities with 4 dimensions). Each column of the first line is separated by tabs. For example:
+```
+16213    100 
+"work"    -0.014022544    -0.062030070    -0.012535412    -0.023111001    -0.038317516 ...
+"home"    -0.014021411    -0.090830070    -0.012534120    -0.073111301    -0.068317516 ...
+```
+Here 16213 is the number of nodes and 100 is the dimension of each node's embedding.
+
+### kgtk format
+When using this format, the input is a .tsv file where each line contains three columns describing the embedding for a node. In each line, the first column contains a node, the second contains a label (which we ignore), and the third contains the node's embedding as a comma-separated string of floats. The first row of the file is assumed to contain a header unless specified otherwise by the `--no_input_header` option. For example: 
+```
+node1   label   node2
+Q5    graph_embeddings   -0.014022544,-0.062030070,-0.012535412,0.038317516
+Q6    graph_embeddings   -0.014022544,-0.062030070,-0.012535412,0.038317516
+```
+
+## Output
+
+This command saves the trained Faiss index in a `.idx` file and optionally saves a second file that maps each vector's position in the Faiss index to the node it represents.
+
+## Usage
+
+```bash
+usage: kgtk build-faiss [-h] -i EMBEDDINGS_FILE -o INDEX_FILE_OUT [-id2n INDEX_TO_NODE_FILE_OUT] [-ef kgtk|w2v|glove]
+                        [--no_input_header [True|False]] [-te MAX_TRAIN_EXAMPLES] [-w WORKERS] [-is INDEX_STRING]
+                        [-m Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon] [-ma METRIC_ARG]
+                        [-v [optional True|False]]
+
+Train and populate a faiss index that can compute nearest neighbors of given embeddings.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -i EMBEDDINGS_FILE, --input_file EMBEDDINGS_FILE, --embeddings_file EMBEDDINGS_FILE
+                        Input file containing the embeddings for which a Faiss index will be created.
+  -o INDEX_FILE_OUT, --output_file INDEX_FILE_OUT, --index_file_out INDEX_FILE_OUT
+                        Output .idx file where the index will be saved.
+  -id2n INDEX_TO_NODE_FILE_OUT, --index_to_node_file_out INDEX_TO_NODE_FILE_OUT
+                        Output Kgtk-format file containing index --> node.
+  -ef kgtk|w2v|glove, --embeddings_format kgtk|w2v|glove
+                        Format of the input embeddings [Default: w2v] Choice: kgtk | w2v | glove
+  --no_input_header [True|False]
+                        If your input embeddings file is in KGTK format, this allows you to specify if it has a header
+                        line or not.
+  -te MAX_TRAIN_EXAMPLES, --max_train_examples MAX_TRAIN_EXAMPLES
+                        The maximum number of embeddings that will be used to train the index.
+  -w WORKERS, --workers WORKERS
+                        The number of worker processes for training. If not given, set to CPU count.
+  -is INDEX_STRING, --index_string INDEX_STRING
+                        A string denoting the type of index to be used. This will be passed to faiss.index_factory()
+  -m Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon, --metric_type Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon
+                        A string denoting the Faiss metric to be used. This will be passed to faiss.index_factory().
+  -ma METRIC_ARG, --metric_arg METRIC_ARG
+                        If you choose Lp as your metric_type, this parameter should be used to specify the value of p
+                        to use.
+
+  -v [optional True|False], --verbose [optional True|False]
+                        Print additional progress messages (default=False).
+```
+
+## Index design considerations
+
+There are several parameters that affect the index's performance in terms of accuracy, space, query time, and training time.
+
+### Index_string
+
+The `index_string` is passed to faiss.index_factory() and controls many options for how the index will be built. For further details on how to format this string, please refer to [the official Faiss documentation](https://github.com/facebookresearch/faiss/wiki/The-index-factory).
+
+When deciding on an index to use, please refer to [the Faiss guidelines for choosing an index](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index). One important decision is the number of centroids to learn, which affects both training time and query time: more centroids lead to longer training and faster queries. A common heuristic is to choose a multiple of `sqrt(N)`, where `N` is the number of nodes to index. As mentioned previously, the default parameter values have been chosen to handle an input size of ~50M nodes. The Faiss guidelines linked above advise 262,144 (2^18) centroids for this input size. However, we instead choose 8,192 (2^13), which is roughly `sqrt(50M)` (about 7,071), because this significantly decreases training time while keeping query time reasonable. If queries are too slow for your use case, increase the number of centroids accordingly.
+
+### Max_train_examples
+
+This parameter limits the number of vectors that will be used to train the index. Training on more vectors makes the index more accurate, but with diminishing returns. Based on [this discussion](https://github.com/facebookresearch/faiss/issues/126), `1000 * k` training examples, where `k` is the number of centroids, is safely sufficient. With the default of 8,192 centroids chosen for the `index_string` parameter this comes to about 8.2M, so we set the default value to 10M.
+
+## Notes on memory usage
+
+Faiss stores all vectors in memory, so running this command (as well as later loading the index it creates) requires enough RAM to hold all vectors. For reference, running this command with the default settings for ~50M vectors of dimension 100 creates a .idx file that is ~21GB. If you have stricter memory requirements, you can compress your vectors (see the [relevant Faiss documentation](https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint)), with the tradeoff that distance calculations become estimates. Vector compression is selected via the `index_string`.
+
+Another option for reducing RAM requirements is to [shard the index](https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM). This command does not support index sharding at this time.
+
+
+## Examples
+
+### Example 1
+Default settings for w2v-format embeddings
+```
+kgtk build_faiss -i w2v_embeddings.tsv -o index_file_out.idx
+```
+
+The index will be saved in `index_file_out.idx`.
+
+
+### Example 2
+Running with more specific parameters (glove format input embeddings, saving the index-->node mapping to a file, 65,536 clusters for the index to learn, 25M vectors to train on, and using inner product as the distance metric):
+```
+kgtk build_faiss -i glove_embeddings.tsv -o index_file_out.idx \
+    --embeddings_format glove \
+    --index_to_node_file_out id2n_mapping.tsv \
+    --index_string IVF65536_HNSW32,Flat \
+    --max_train_examples 25000000 \
+    --metric_type Inner_product
+```
+
+The index will be saved in `index_file_out.idx` and the index-->node mapping file will be saved in `id2n_mapping.tsv`.
+`id2n_mapping.tsv` may look like this:
+```
+node1   label   node2
+0   index_to_node   Q30
+1   index_to_node   Q5
+...
+```
+
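The sizing reasoning in `docs/analysis/build_faiss.md` (centroid count, training-set size, memory footprint) can be checked with a few lines of arithmetic. The sketch below is illustrative only. The function names are hypothetical, rounding `sqrt(N)` up to the next power of two is an assumption that happens to reproduce the documented default of 8,192, and the memory figure counts only the raw float32 vectors, not any index overhead.

```python
import math


def suggest_num_centroids(num_vectors: int) -> int:
    """Round sqrt(num_vectors) up to the next power of two (assumed rule)."""
    root = math.sqrt(num_vectors)
    return 2 ** math.ceil(math.log2(root))


def training_examples_needed(num_centroids: int, per_centroid: int = 1000) -> int:
    """Apply the 1000-examples-per-centroid rule of thumb quoted in the docs."""
    return num_centroids * per_centroid


def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Size of the uncompressed float32 vectors only, excluding index overhead."""
    return num_vectors * dim * bytes_per_float


n = 50_000_000
k = suggest_num_centroids(n)
print(k)                                # 8192
print(training_examples_needed(k))      # 8192000
print(raw_vector_bytes(n, 100) / 1e9)   # 20.0
```

For 50M vectors of dimension 100 the raw vectors alone come to 20 GB, which is in line with the ~21GB `.idx` file size reported in the docs.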
diff --git a/kgtk/cli/build_faiss.py b/kgtk/cli/build_faiss.py
@@ -0,0 +1,111 @@
+"""
+Train and populate a faiss index that can compute nearest neighbors of given embeddings.
+"""
+
+from argparse import Namespace
+from kgtk.cli_argparse import KGTKArgumentParser
+
+
+def parser():
+    return {
+        'help': 'Train and populate a faiss index that can compute nearest neighbors of given embeddings.',
+        'description': 'Train and populate a faiss index that can compute nearest neighbors of given embeddings.'
+    }
+
+
+def add_arguments_extended(parser: KGTKArgumentParser, parsed_shared_args: Namespace):
+    """
+    Add the build_faiss command-line arguments to the given parser.
+    Args:
+        parser (kgtk.cli_argparse.KGTKArgumentParser)
+    """
+
+    # import modules locally
+    from kgtk.io.kgtkreader import KgtkReader, KgtkReaderOptions, KgtkReaderMode
+    from kgtk.utils.argparsehelpers import optional_bool
+    from kgtk.value.kgtkvalueoptions import KgtkValueOptions
+
+    _expert: bool = parsed_shared_args._expert
+
+    # REQUIRED #
+    # Related to input file
+    parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file',
+                        required=True, metavar="EMBEDDINGS_FILE",
+                        help='Input file containing the embeddings for which a Faiss index will be created.')
+
+    # Related to output
+    parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out',
+                        required=True, help="Output .idx file where the index will be saved.",
+                        metavar="INDEX_FILE_OUT")
+
+    parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out',
+                        required=False, help="Output Kgtk-format file containing index --> node.",
+                        default=None, metavar="INDEX_TO_NODE_FILE_OUT")
+
+    # OPTIONAL #
+    # Related to input file
+    parser.add_argument('-ef', '--embeddings_format', action='store', type=str,
+                        dest='embeddings_format', required=False,
+                        help='Format of the input embeddings [Default: w2v] Choice: kgtk | w2v | glove',
+                        default="w2v", choices=["kgtk", "w2v", "glove"], metavar="kgtk|w2v|glove")
+
+    parser.add_argument('--no_input_header', action='store', type=optional_bool, dest="no_input_header",
+                        required=False, help='If your input embeddings file is in KGTK format, this ' +
+                        'allows you to specify if it has a header line or not.',
+                        const=True, nargs='?', default=False, metavar='True|False')
+
+    # Related to index building
+    parser.add_argument('-te', '--max_train_examples', action='store', type=int,
+                        dest='max_train_examples', required=False, default=10000000,
+                        help="The maximum number of embeddings that will be used to train the index.")
+
+    parser.add_argument('-w', '--workers', action='store', type=int, dest='workers', required=False,
+                        help="The number of worker processes for training. If not given, set to CPU count.",
+                        default=None)
+
+    parser.add_argument('-is', '--index_string', action='store', type=str, dest='index_string', required=False,
+                        help="A string denoting the type of index to be used. This will be passed to " +
+                        "faiss.index_factory()",
+                        default="IVF8192_HNSW32,Flat")
+
+    parser.add_argument('-m', '--metric_type', action='store', type=str, dest='metric_type', required=False,
+                        help="A string denoting the Faiss metric to be used. This will be passed to " +
+                        "faiss.index_factory().", default="L2",
+                        choices=["Inner_product", "L2", "L1", "Linf", "Lp", "Canberra", "BrayCurtis", "JensenShannon"],
+                        metavar="Inner_product|L2|L1|Linf|Lp|Canberra|BrayCurtis|JensenShannon")
+
+    parser.add_argument('-ma', '--metric_arg', action='store', type=float, dest='metric_arg', required=False,
+                        help="If you choose Lp as your metric_type, this parameter should be used to " +
+                        "specify the value of p to use.",
+                        default=None)
+
+    KgtkReader.add_debug_arguments(parser, expert=_expert)
+    KgtkReaderOptions.add_arguments(parser,
+                                    mode_options=True,
+                                    default_mode=KgtkReaderMode[parsed_shared_args._mode],
+                                    expert=_expert)
+    KgtkValueOptions.add_arguments(parser)
+
+
+def run(**kwargs):
+    from kgtk.exceptions import KGTKException
+    from kgtk.graph_embeddings.build_faiss import build_faiss
+    try:
+        build_faiss(kwargs['embeddings_file'],
+                    kwargs['embeddings_format'],
+                    kwargs['no_input_header'],
+                    kwargs['index_file_out'],
+                    kwargs['index_to_node_file_out'],
+                    kwargs['max_train_examples'],
+                    kwargs['workers'],
+                    kwargs['index_string'],
+                    kwargs['metric_type'],
+                    kwargs['metric_arg'],
+                    kwargs['verbose'])
+
+    except SystemExit as e:
+        raise KGTKException("Exit requested")
+    except KGTKException as e:
+        raise
+    except Exception as e:
+        raise KGTKException(str(e))
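The `--no_input_header` option above combines `nargs='?'` with `const=True`, so it can be given as a bare flag or with an explicit value. The standalone sketch below shows that behavior with plain `argparse`. The `optional_bool` defined here is a hypothetical stand-in written for this example and is not claimed to match `kgtk.utils.argparsehelpers.optional_bool`; the `demo` program name is likewise made up.

```python
import argparse


def optional_bool(value: str) -> bool:
    """Hypothetical stand-in: convert a command-line string to a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--no_input_header", type=optional_bool, nargs="?",
                    const=True, default=False, metavar="True|False")

# Flag absent: the default is used.
print(parser.parse_args([]).no_input_header)                              # False
# Bare flag: the const value is used.
print(parser.parse_args(["--no_input_header"]).no_input_header)           # True
# Explicit value: the string is converted by optional_bool.
print(parser.parse_args(["--no_input_header", "False"]).no_input_header)  # False
```

Under these assumptions, omitting the flag keeps the header expectation, while passing it without a value is equivalent to `--no_input_header True`.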