KGTK Notebooks has a rich set of notebooks to demonstrate KGTK commands and use cases.
The KGTK notebooks often have similar setup cells at the start, for example setting up the SQLite cache DB, environment variables, a TEMP directory and so on. Configure KGTK Notebooks is written to make the notebook setup easy and consistent.
We will go through the steps to initialize this class and discuss the parameters and functions in detail.
Import the class:
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
Create the class object:
ck = ConfigureKGTK(file_list: List[str],
kgtk_path: str = None,
input_files_url: str = None)
The class constructor has 3 parameters. Let's discuss each of them in detail.
file_list is a required parameter and is a list of file names. KGTK has standard names for the common KGTK Edge files, for example:
- alias: aliases.en.tsv.gz
- monolingualtext: claims.monolingualtext.tsv.gz
- label: labels.en.tsv.gz
- datatypes: metadata.property.datatypes.tsv.gz and so on...
There are a total of 93 file names; the full list is here.
file_list is a list of files in the input path, which will be used in the current KGTK notebook.
kgtk_path is the absolute path to the cloned kgtk folder. This parameter is optional. It is required when the current KGTK notebook will call another KGTK notebook as part of its execution.
If not specified, kgtk_path will be set to the parent of the current directory.
input_files_url is an optional parameter that specifies the URL to download the files in file_list from, in case the input path is not specified.
If not specified, this parameter will be set to https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold (which contains the files for the ISWC tutorial).
Call the function configure_kgtk. Its signature:
ck.configure_kgtk(input_graph_path: str = None,
project_name: str = "kgtk",
output_path: str = None,
graph_cache_path: str = None,
json_config_file: str = None,
additional_files: dict = None,
debug=False)
The parameters are all optional with reasonable defaults. Details:
input_graph_path is an absolute path to a folder where the files specified by file_list should be present.
If not specified, the default value of this parameter will be USER_HOME/isi-kgtk-tutorial.
project_name is the name of the folder where the output files will be created. Defaults to kgtk.
output_path is the absolute path where the output files will be created. If not specified, the default value will be set to USER_HOME/isi-kgtk-tutorial/kgtk.
graph_cache_path points to the SQLite graph cache used in kypher.
If not specified, the default value will be calculated as {output_path}/{project_name}/temp.{project_name}/wikidata.sqlite3.db.
For example, if output_path = /data/amandeep/wikidata-20211027-dwd-v3 and project_name = useful-files, then graph_cache_path = /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
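The default can be reproduced with a little path arithmetic. The sketch below is illustrative only: default_graph_cache_path is a hypothetical helper, not a KGTK function, and it simply applies the pattern {output_path}/{project_name}/temp.{project_name}/wikidata.sqlite3.db stated above.

```python
from pathlib import PurePosixPath


def default_graph_cache_path(output_path: str, project_name: str) -> str:
    """Hypothetical helper (not part of KGTK): apply the documented pattern
    {output_path}/{project_name}/temp.{project_name}/wikidata.sqlite3.db."""
    base = PurePosixPath(output_path) / project_name
    return str(base / f"temp.{project_name}" / "wikidata.sqlite3.db")


cache = default_graph_cache_path(
    "/data/amandeep/wikidata-20211027-dwd-v3", "useful-files"
)
print(cache)
```

The result should agree with the STORE and KGTK_GRAPH_CACHE values that print_env_variables reports for the same output_path and project_name.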
json_config_file is the absolute path to a JSON file that maps additional names to file names.
If there are additional files you want to use that are outside the official KGTK file list, you can create a JSON file and pass its path as json_config_file.
For example, the file can look like:
{
"augmented_datatypes": "augmented_datatypes.tsv.gz",
"augmented_wikianchor": "wikitables.anchor.tsv.gz"
}
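One way to produce such a file is to write it from Python, since json.dumps emits the double-quoted strings that JSON requires. This is a minimal sketch; the file name extra_kgtk_files.json and the temporary directory are arbitrary choices made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Extra nickname -> file name entries, taken from the example above.
extra_files = {
    "augmented_datatypes": "augmented_datatypes.tsv.gz",
    "augmented_wikianchor": "wikitables.anchor.tsv.gz",
}

# Write the mapping to a JSON file (hypothetical location and file name).
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "extra_kgtk_files.json"
config_path.write_text(json.dumps(extra_files, indent=2), encoding="utf-8")

# Read it back to confirm the file is valid JSON with the expected content.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(config_path, loaded)
```

The string str(config_path) could then be supplied as the json_config_file argument.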
additional_files is a dictionary in the same format as json_config_file. It can also be used to specify additional files to be used while processing the current notebook.
For example, the following dictionary can be supplied:
{
'augmented_datatypes': 'augmented_datatypes.tsv.gz',
'augmented_wikianchor': 'wikitables.anchor.tsv.gz'
}
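To make the idea of a name-to-file mapping concrete, here is a small sketch of how standard and additional entries might be combined and turned into full paths under the input folder. resolve_files and STANDARD_FILES are hypothetical names, and the rule that additional entries win on a name clash is an assumption, not documented KGTK behavior.

```python
import posixpath
from typing import Dict, Optional

# A few of the standard nicknames listed earlier in this document.
STANDARD_FILES: Dict[str, str] = {
    "alias": "aliases.en.tsv.gz",
    "label": "labels.en.tsv.gz",
}


def resolve_files(input_graph_path: str,
                  additional_files: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: merge the mappings, then prefix each file name
    with the input folder.

    Assumption: entries in additional_files override standard names on a clash.
    """
    merged = {**STANDARD_FILES, **(additional_files or {})}
    return {name: posixpath.join(input_graph_path, fname)
            for name, fname in merged.items()}


paths = resolve_files(
    "/data/amandeep/wikidata-20211027-dwd-v3",
    {"augmented_datatypes": "augmented_datatypes.tsv.gz"},
)
for nickname, path in paths.items():
    print(f"{nickname}: {path}")
```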
debug adds the option --debug to all the kgtk commands used in the current notebook.
You can call the function print_env_variables,
ck.print_env_variables()
to print out all the environment variables set up by the ConfigureKGTK class.
In addition to all the files you specify using the file_list parameter, this class will set up a few additional environment variables.
For example, if you have the following files,
['claims', 'label_all', 'alias_all', 'description_all']
the output of ck.print_env_variables() can look like:
KGTK_OPTION_DEBUG: false
TEMP: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files
EXAMPLES_DIR: /Users/amandeep/github/kgtk/examples
KGTK_LABEL_FILE: /data/amandeep/wikidata-20211027-dwd-v3/labels.en.tsv.gz
STORE: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
kgtk: kgtk
KGTK_GRAPH_CACHE: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
GRAPH: /data/amandeep/wikidata-20211027-dwd-v3
USE_CASES_DIR: /Users/amandeep/github/kgtk/use-cases
OUT: /data/amandeep/wikidata-20211027-dwd-v3/useful-files
kypher: kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
claims: /data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz
label_all: /data/amandeep/wikidata-20211027-dwd-v3/labels.tsv.gz
alias_all: /data/amandeep/wikidata-20211027-dwd-v3/aliases.tsv.gz
description_all: /data/amandeep/wikidata-20211027-dwd-v3/descriptions.tsv.gz
A few notable mentions:
- STORE = KGTK_GRAPH_CACHE: the path to the kypher graph cache.
- GRAPH: the path where all input files should be present.
- OUT: the output path.
- TEMP: a folder inside the OUT path to keep temporary files.
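These variables are meant to be referenced from commands in later notebook cells. The sketch below shows how the four values relate and how a command template that mentions them expands. It uses a plain dictionary and string.Template instead of the real process environment, and the output file name sample.tsv is made up for illustration.

```python
from string import Template

# Example values copied from the printed output above; in a real notebook
# these would be environment variables set by ConfigureKGTK.
env = {"GRAPH": "/data/amandeep/wikidata-20211027-dwd-v3"}
env["OUT"] = env["GRAPH"] + "/useful-files"
env["TEMP"] = env["OUT"] + "/temp.useful-files"       # TEMP lives inside OUT
env["STORE"] = env["TEMP"] + "/wikidata.sqlite3.db"   # same value as KGTK_GRAPH_CACHE

# Expand $NAME references in a command template, much as a shell would.
template = Template(
    "kgtk query --graph-cache $STORE -i $GRAPH/claims.tsv.gz "
    "-o $OUT/sample.tsv --limit 3"
)
command = template.substitute(env)
print(command)
```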
The next step is to load the files in file_list into the graph cache.
ck.load_files_into_cache()
This function call will load all the files in file_list, plus any additional files specified by json_config_file and additional_files, into the cache.
The file names will be used as aliases for the files. For example, continuing from the previous step:
kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db \
-i "/data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz" --as claims \
-i "/data/amandeep/wikidata-20211027-dwd-v3/labels.tsv.gz" --as label_all \
-i "/data/amandeep/wikidata-20211027-dwd-v3/aliases.tsv.gz" --as alias_all \
-i "/data/amandeep/wikidata-20211027-dwd-v3/descriptions.tsv.gz" --as description_all --limit 3
id | node1 | label | node2 | rank | node2;wikidatatype |
---|---|---|---|---|---|
P10-P1628-32b85d-7927ece6-0 | P10 | P1628 | http://www.w3.org/2006/vcard/ns#Video | normal | url |
P10-P1628-acf60d-b8950832-0 | P10 | P1628 | https://schema.org/video | normal | url |
P10-P1629-Q34508-bcc39400-0 | P10 | P1629 | Q34508 | normal | wikibase-item |
Notice the files are loaded with an alias, for example "/data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz" --as claims.
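The shape of that command, one -i PATH --as ALIAS pair per file, can be sketched as follows. build_load_command is a hypothetical helper written for this illustration; it is not claimed to be how load_files_into_cache is implemented.

```python
import shlex
from typing import Dict


def build_load_command(graph_cache: str, files: Dict[str, str], limit: int = 3) -> str:
    """Hypothetical helper: assemble a kgtk query command that loads each
    file into the graph cache under its alias."""
    parts = ["kgtk", "query", "--graph-cache", graph_cache]
    for alias, path in files.items():
        parts += ["-i", path, "--as", alias]
    parts += ["--limit", str(limit)]
    # shlex.join quotes any argument that needs it, so the string is shell-safe.
    return shlex.join(parts)


load_command = build_load_command(
    "/data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db",
    {
        "claims": "/data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz",
        "label_all": "/data/amandeep/wikidata-20211027-dwd-v3/labels.tsv.gz",
    },
)
print(load_command)
```

Once a file has been loaded under an alias, later queries against the same graph cache can refer to it by that alias instead of the full path.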
At this point, the notebook setup is complete and the environment variables are set.
Call print_kgtk_file_names to print a list of all file nicknames and their corresponding files, or only a subset of nicknames.
ck.print_kgtk_file_names(file_names=None)
The variable file_names can be a str or a List[str]. If it is None, the function will print all known file nicknames and the corresponding KGTK files.
ck.print_kgtk_file_names(file_names=['isa', 'p279star', 'claims'])
isa: derived.isa.tsv.gz
p279star: derived.P279star.tsv.gz
claims: claims.tsv.gz
Note: This function only prints file names; the files are not required to be in the input folder.
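The three accepted forms of file_names (None, a single str, or a list) can be illustrated with a small stand-in. lookup_file_names and KNOWN_FILES are hypothetical and cover only a few nicknames quoted in this document; the real class knows the full list.

```python
from typing import Dict, List, Optional, Union

# Nicknames and file names taken from examples in this document.
KNOWN_FILES: Dict[str, str] = {
    "isa": "derived.isa.tsv.gz",
    "p279star": "derived.P279star.tsv.gz",
    "claims": "claims.tsv.gz",
    "label": "labels.en.tsv.gz",
}


def lookup_file_names(
    file_names: Optional[Union[str, List[str]]] = None,
) -> Dict[str, str]:
    """Hypothetical stand-in: None selects every nickname, a str selects one,
    and a list selects several. Unknown nicknames raise KeyError here."""
    if file_names is None:
        return dict(KNOWN_FILES)
    if isinstance(file_names, str):
        file_names = [file_names]
    return {name: KNOWN_FILES[name] for name in file_names}


for nickname, file_name in lookup_file_names(["isa", "p279star", "claims"]).items():
    print(f"{nickname}: {file_name}")
```

Like the function it imitates, this lookup only consults the name mapping and never touches the file system.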