
Option to configure layers independently #168

Open · wants to merge 10 commits into main
Conversation

@jlamypoirier (Collaborator) commented on Mar 3, 2025

✨ Description

Fixes: #154, #155.

This PR proposes a simple way to obtain layer-dependent configuration by leveraging Fast-LLM's existing config update mechanism. It works by providing a "default" layer configuration (same as before), and optional overrides for specified layer ranges.
See tests/test_transformer.py for examples.
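For illustration only, a layer-dependent config might look roughly like the sketch below. The `layers/default` location is taken from this PR's description, but the other field names (`updates`, `layer_ranges`, `update`) and all values are assumptions made for this example; the actual schema is in tests/test_transformer.py.

```python
# Illustrative sketch only: field names other than "layers" and "default" are
# assumptions, not the actual Fast-LLM API.
base_model_config = {
    "layers": {
        # The former `transformer` section, now the default applied to every layer.
        "default": {
            "num_attention_heads": 32,
            "normalization": {"epsilon": 1e-5},
        },
        # Hypothetical overrides: each entry names a layer range and a config
        # update applied on top of the default for the layers in that range.
        "updates": [
            {
                "layer_ranges": [{"begin": 0, "end": 4}],
                "update": {"window_size": 4096},
            },
        ],
    },
}
```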

The feature works, but it is admittedly far from perfect, and I have some concerns about user-friendliness:

  • The update syntax supports both single-key and full-dict overrides, e.g. "normalization/epsilon": 1 overrides only the normalization epsilon, while "normalization": {"epsilon": 1} overrides the entire dict, i.e., everything other than epsilon reverts to its default value. This could be confusing and needs to be well documented (see the sketch after this list).
  • If a layer index is covered by multiple update ranges, only the first update is applied. This could be confusing to some users. (Another option would be to apply them all in order; I'm not sure which is better.)
  • Simple cases should be easy to understand, but this feature is extremely powerful and enables users to do all kinds of crazy things that may be confusing or lead to unexpected behaviour.
  • I moved the transformer config from transformer to layers/default, which adds a small amount of complexity when not using the feature. (We could probably revert that change though.)
  • For some parameters (num_layers, hidden_size, full_precision_residual), overriding doesn't really make sense. I left them as-is and added assertions, but we may want to think about moving them away from the layer config.
  • I disabled layer-dependent rotary embeddings until we add support for it in the rotary preprocessor.
  • TensorSpace wasn't designed for this kind of thing. I made a quick fix using a hierarchy of tensor spaces, but I'm not sure about its long-term viability.
  • The feature makes conversion more complicated, so I had to explicitly prevent conversion for any kind of layer-dependent configuration. We'll need to address this to use the feature in practice.
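To make the first two points above concrete, here is a minimal sketch of the two override forms, written as plain Python dicts (the surrounding config structure is the same hypothetical one as in the example above):

```python
# Key-path form: only `epsilon` changes; every other normalization field keeps
# its previously configured or default value.
update_epsilon_only = {"normalization/epsilon": 1}

# Full-dict form: the entire `normalization` section is replaced, so any field
# not listed here reverts to its default value.
update_full_dict = {"normalization": {"epsilon": 1}}

# Per the current behaviour described above: if a layer index falls within
# several override ranges, only the first matching update is applied.
```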

This feature removes the need for max_window_layers, but I kept it for now because of the likely conversion issues. @bigximik I also added back support for backup windowed attention and fixed the layer range by shifting the layer index (see comments in #157).

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier changed the title from "[Prototype] Option to configure layers independently" to "Option to configure layers independently" on Mar 5, 2025
@jlamypoirier marked this pull request as ready for review on Mar 6, 2025
Successfully merging this pull request may close these issues: Configuration override mechanism.