
Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning"


thu-ml/STAIR


STAIR: Improving Safety Alignment with Introspective Reasoning


🚧 Codebase, datasets & models are coming soon!
⭐ Star to stay updated on our release progress!

Official implementation of STAIR, the framework presented in our paper "Improving Safety Alignment with Introspective Reasoning". STAIR improves LLM safety by incorporating step-by-step analysis of potential risks, providing more robust alignment while better preserving model capabilities.

TODO List

  • Official implementation of STAIR, including Safety-Informed MCTS (SI-MCTS), test-time scaling, etc.
  • 🤗 SFT dataset for structured CoT format alignment
  • 🤗 Model weights of LLMs (Llama-3.1-8B-Instruct, Qwen2-7B-Instruct) aligned with STAIR
