GLAN uses a systematic taxonomy of human knowledge and capabilities to generate large-scale synthetic instruction data across many fields. It generates data from scratch, without relying on seed examples or existing datasets.
GLAN mimics the systematic structure of human education, producing instruction data that spans a wide range of disciplines and skills. It is not limited to a specific domain and can cover tasks across all fields. The method performs well on mathematical reasoning, coding, academic exams, and logical reasoning, and is effective even without task-specific training data.
This open-source implementation is based on the paper: https://arxiv.org/pdf/2402.13064.
glan_instruction_generation()
: GLAN pipeline
generate_taxonomy()
: Generate a taxonomy of human knowledge and capabilities. Disciplines derived from the taxonomy are used to create subjects. You can have GPT create the disciplines automatically (in this case, human verification is required), or use the disciplines we provide (disciplines.txt).
generate_subjects()
: Generate a list of subjects for a given discipline. Please refer to Section 2.2 of the paper.
generate_syllabus()
: Generate a syllabus for a given subject at a specific level. Please refer to Section 2.3 of the paper.
sample_class_sessions_and_key_concepts()
: Sample class sessions and key concepts to generate questions of varying difficulty.
generate_questions()
: Generate questions based on class sessions and key concepts using a LangChain pipeline. Please refer to Section 2.4 of the paper.
generate_answers()
: Generate answers to the questions using a LangChain pipeline. Please refer to Section 2.4 of the paper.
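The stages above chain together in a straightforward way: disciplines yield subjects, subjects yield syllabi of class sessions, and sessions yield question–answer pairs. The sketch below shows that flow only; the function bodies are hypothetical stubs (the real implementations call an LLM via LangChain), and the return shapes are assumptions, not the repository's actual signatures.

```python
# Hypothetical sketch of the GLAN stage chain. Bodies are stubs; the real
# functions prompt an LLM at each step (Sections 2.2-2.4 of the paper).

def generate_subjects(discipline):
    # Would prompt an LLM for subjects within the discipline (Section 2.2).
    return [f"{discipline} subject {i}" for i in range(2)]

def generate_syllabus(subject):
    # Would prompt for a syllabus of class sessions and key concepts (Section 2.3).
    return {"sessions": [f"{subject} session {i}" for i in range(3)]}

def generate_questions(session):
    # Would prompt for questions about the session's key concepts (Section 2.4).
    return [f"Question about {session}"]

def generate_answers(questions):
    # Would prompt for an answer to each question (Section 2.4).
    return {q: f"Answer to: {q}" for q in questions}

def glan_instruction_generation(disciplines):
    # Top-level pipeline: fan out from disciplines to (question, answer) pairs.
    data = []
    for discipline in disciplines:
        for subject in generate_subjects(discipline):
            for session in generate_syllabus(subject)["sessions"]:
                questions = generate_questions(session)
                data.extend(generate_answers(questions).items())
    return data

pairs = glan_instruction_generation(["Mathematics"])
```

With the stub fan-out above (1 discipline × 2 subjects × 3 sessions × 1 question), this yields 6 question–answer pairs; the real pipeline's counts are controlled by the `max_number_of_*` parameters described below.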
Example datasets are provided in this folder. Try the minimal example first, then configure your own dataset by adjusting the tunable parameters.
Debug run for testing (you need to change the parameters first)
chmod +x run_debug.sh
./run_debug.sh
Generate large amounts of data for each discipline (Recommended)
chmod +x run_each_discipline.sh
./run_each_discipline.sh
Generate large amounts of data
chmod +x run.sh
./run.sh
See generate.py for details.
parser.add_argument("--generate_disciplines", type=bool, default=False)
parser.add_argument("--generate_question_only", type=bool, default=False)
parser.add_argument("--disciplines_filepath", type=str, default="disciplines_sample.txt")
parser.add_argument("--language", type=str, default="Korean")
parser.add_argument("--model_name", type=str, default="gpt-4o")
parser.add_argument("--model_name_for_answer", type=str, default="gpt-4o")
parser.add_argument("--max_number_of_fields", type=int, default=1)
parser.add_argument("--max_number_of_subjects", type=int, default=2)
parser.add_argument("--max_number_of_subtopics", type=int, default=5)
parser.add_argument("--max_number_of_session_name", type=int, default=3)
parser.add_argument("--num_iterations", type=int, default=2)
parser.add_argument("--num_questions_per_iteration", type=int, default=5)
parser.add_argument("--question_max_tokens", type=int, default=768)
parser.add_argument("--question_batch_size", type=int, default=5)
parser.add_argument("--answer_max_tokens", type=int, default=2048)
parser.add_argument("--answer_batch_size", type=int, default=5)
parser.add_argument("--output_dir", type=str, default="outputs")
parser.add_argument("--logfile_name", type=str, default="logfile.log")
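One caution when overriding the two boolean flags above from the command line: argparse's `type=bool` does not parse the string `"False"` as `False`, because `bool()` on any non-empty string is truthy. The snippet below demonstrates the pitfall with one of the flags; in practice, either omit the flag to keep the default or be aware that any value you pass makes it `True`.

```python
import argparse

# Caution: type=bool does NOT parse "False" as False.
# bool("False") is True, since any non-empty string is truthy.
parser = argparse.ArgumentParser()
parser.add_argument("--generate_disciplines", type=bool, default=False)

args = parser.parse_args(["--generate_disciplines", "False"])
print(args.generate_disciplines)  # prints True, not False
```

The idiomatic alternative in argparse is `action="store_true"`, which keeps the default `False` and sets the flag to `True` only when it is present.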
See generate_answer_only.py for details.
parser.add_argument("--questions_filepath", type=str, default="[YOUR JSONL]")
parser.add_argument("--model_name_for_answer", type=str, default="gpt-4o")
parser.add_argument("--answer_max_tokens", type=int, default=2048)
parser.add_argument("--answer_batch_size", type=int, default=5)
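`--questions_filepath` expects a JSONL file of previously generated questions (e.g. the output of a `--generate_question_only` run). The snippet below is a hypothetical example of preparing such a file; the field names (`subject`, `question`) and the filename are assumptions for illustration, so check the question files the pipeline actually emits for the exact schema.

```python
import json

# Hypothetical questions file for generate_answer_only.py.
# Field names are assumptions; match them to the pipeline's real output.
questions = [
    {"subject": "Linear Algebra", "question": "What is an eigenvalue?"},
    {"subject": "Linear Algebra", "question": "When is a matrix diagonalizable?"},
]

# JSON Lines format: one JSON object per line.
with open("questions_sample.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")
```

You would then point the script at the file, e.g. `--questions_filepath questions_sample.jsonl`.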