Option to configure layers independently #168
Open
✨ Description
Fixes: #154, #155.
This PR proposes a simple way to obtain layer-dependent configuration by leveraging Fast-LLM's existing config update mechanism. It works by providing a "default" layer configuration (same as before), and optional overrides for specified layer ranges.
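For illustration, a rough sketch of what such a config update could look like; the field names (`layers`, `default`, the override/range keys) are assumptions for illustration, not the exact schema in this PR:

```python
# Hypothetical layout (names are illustrative assumptions, not the exact schema):
# one default layer config, plus overrides applied to selected layer ranges
# through the existing config update mechanism.
config_updates = {
    "layers": {
        # Plays the role of the previous single `transformer` section.
        "default": {
            "num_attention_heads": 32,
            "window_size": None,
        },
        # Assumed override syntax: each entry targets a range of layer indices
        # and is merged on top of `default`.
        "overrides": [
            {"layer_range": [0, 4], "update": {"window_size": 512}},
        ],
    },
}
```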
See `tests/test_transformer.py` for examples.

It works, but is admittedly far from perfect, and I do have some concerns about user-friendliness:
"normalization/epsilon": 1
overrides only normalization epsilon, while"normalization" : {"epsilon": 1}
overrides the entire dict, i.e., everything other thanepsilon
reverts to its default value. This could be confusing and needs to be well documented.transformer
tolayers/default
, which adds a small amount of complexity when not using the feature. (We could probably revert that change though.)num_layers
,hidden_size
,full_precision_residual
) overriding doesn't really make sense to override. I left them as-is and added assertions, but we may want to think about moving them away from the layer config.TensorSpace
wasn't designed for that kind of thing. I made a quick fix using a hierarchy of tensor spaces, but not sure about long-term viability.This feature removes the need for
max_window_layers
, but I kept it for now because of the likely conversion issues. @bigximik I also added back support for backup windowed attention and fixed the layer range by shifting the layer index, see comments in #157)🔍 Type of change
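To make the first point above concrete, here is a toy model of the two override styles (not Fast-LLM's actual implementation):

```python
def apply_override(defaults: dict, override: dict) -> dict:
    """Toy model of the override semantics described above (not Fast-LLM's code)."""
    result = dict(defaults)
    for key, value in override.items():
        if "/" in key:
            # Flat "section/leaf" key: update only that leaf, keeping the rest
            # of the section at its default values.
            section, leaf = key.split("/", 1)
            result[section] = {**defaults.get(section, {}), leaf: value}
        else:
            # Whole-section override: the sub-dict is replaced, so any field not
            # listed here reverts to its default value.
            result[key] = value
    return result


defaults = {"normalization": {"type": "rms_norm", "epsilon": 1e-5}}
print(apply_override(defaults, {"normalization/epsilon": 1}))
# {'normalization': {'type': 'rms_norm', 'epsilon': 1}}
print(apply_override(defaults, {"normalization": {"epsilon": 1}}))
# {'normalization': {'epsilon': 1}} -- 'type' falls back to its default
```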
🔍 Type of change

Select all that apply: