This is the repository for the text2GQL generator implementation. Awesome-Text2GQL aims to generate cyphers/GQLs and corresponding prompts as a training corpus for fine-tuning large language models (LLMs). Based on TuGraph-DB, the training corpus helps train Text2GQL and Text2Cypher models suited to the capabilities of the TuGraph-DB query engine.
On Linux, it is recommended to use miniconda to manage your Python environment, though other tools may also work.
conda create --name text2gql python=3.10
conda activate text2gql
git clone https://github.com/TuGraph-family/Awesome-Text2GQL
cd Awesome-Text2GQL
mkdir output
Install the required Python dependency packages:
pip install .
To run the LLM-based question generation and generalization functions, apply for an API-KEY before running the whole flow.
- Apply for an API-KEY
We build the corpus generalization module on the Qwen inference service hosted by Aliyun; refer to the Aliyun documentation to apply for an API-KEY.
- Set API-KEY via environment variables (recommended)
# replace YOUR_DASHSCOPE_API_KEY with your API-KEY
echo "export DASHSCOPE_API_KEY='YOUR_DASHSCOPE_API_KEY'" >> ~/.bashrc
source ~/.bashrc
echo $DASHSCOPE_API_KEY
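The generalization module reads the key from the environment at runtime. A minimal sketch for checking that the key is visible from Python (the helper name `check_api_key` is illustrative, not part of this repo):

```python
import os

def check_api_key(env=os.environ):
    """Return True if DASHSCOPE_API_KEY is present and non-empty."""
    return bool(env.get("DASHSCOPE_API_KEY"))

if __name__ == "__main__":
    if check_api_key():
        print("DASHSCOPE_API_KEY is set")
    else:
        print("DASHSCOPE_API_KEY is not set; revisit the steps above")
```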
Make sure you have completed the preparations above. To experience the recommended whole flow, run:
sh ./scripts/run_the_whole_flow.sh
The following steps will be executed in sequence:
- generate cyphers with the generation module (based on Antlr4), using templates as input.
- generate questions with the LLM-based generalization module, using the cyphers generated in the previous step as input.
- generalize the questions generated in the previous step with the LLM-based generalization module.
- transform the corpus generated above into the model training format.
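Each step above pairs a generated query with natural-language questions. A minimal sketch of what one corpus record might look like (the field names and layout here are hypothetical; the actual training format is produced by the transformation step):

```python
import json

# Hypothetical record layout; the real field names used by the
# transformation step may differ.
record = {
    "db_id": "movie",
    "question": "Which movies did Tom Hanks act in?",
    "query": "MATCH (p:person {name: 'Tom Hanks'})-[:acted_in]->(m:movie) RETURN m.title",
}
print(json.dumps(record, ensure_ascii=False))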
sh ./scripts/gen_query.sh
The corpus generation module can run in two modes: generating queries with the instantiator, or generating questions with the translator.
Set GEN_QUERY=true
to generate queries from templates in batch.
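Instantiation can be pictured as filling a query template's placeholders with concrete schema values. A minimal sketch (the template syntax and `instantiate` helper are illustrative; the real instantiator is Antlr4-based and more involved):

```python
# Minimal illustration of template instantiation: substitute placeholder
# tokens in a Cypher template with concrete labels and property values.
def instantiate(template: str, bindings: dict) -> str:
    query = template
    for placeholder, value in bindings.items():
        query = query.replace(f"${{{placeholder}}}", value)
    return query

template = "MATCH (n:${label}) WHERE n.${prop} = '${value}' RETURN n"
print(instantiate(template, {"label": "person", "prop": "name", "value": "Alice"}))
```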
- generate questions based on LLMs, with templates as additional input (recommended)
sh ./scripts/gen_question_with_template_llm.sh
- generate questions based on LLMs without templates as input. This helps generate questions that initially have no corresponding template.
sh ./scripts/gen_question_directly_llm.sh
- generate questions based on Antlr4 (deprecated)
Set GEN_QUERY=false
to generate questions using the generation module's Antlr4-based translator.
sh ./scripts/gen_question.sh
- generalize the corpus with query and question as input (recommended)
sh ./scripts/generalize_llm.sh
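Generalization asks the LLM to paraphrase a (query, question) pair into additional phrasings. The prompt might be assembled along these lines (the wording and `build_generalize_prompt` helper are illustrative, not the module's actual prompt):

```python
def build_generalize_prompt(query: str, question: str, n: int = 3) -> str:
    # Illustrative prompt only; the module's real prompt differs.
    return (
        f"Given the graph query:\n{query}\n"
        f"and the question it answers:\n{question}\n"
        f"Rewrite the question in {n} different natural phrasings "
        "without changing its meaning."
    )

print(build_generalize_prompt(
    "MATCH (n:person) RETURN n.name",
    "List the names of all persons.",
))
```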
- generalize questions without queries as input (deprecated)
sh ./scripts/general_questions_directly_llm.sh
Transform the corpus generated above into the model training format:
sh ./scripts/generate_dataset.sh
This project is still under development; suggestions and issues are welcome.