Proposed implementation of (IBM) Granite 3.1 Patch for Transformers #17
This is a proposed patch for IBM Granite (https://huggingface.co/ibm-granite/granite-3.1-8b-instruct). It follows the Llama patch very closely and is largely the same, apart from one small alteration: Granite models use logit scaling (https://huggingface.co/ibm-granite/granite-3.1-8b-instruct/blob/main/config.json#L15).
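For context, here is a rough sketch of where the patch diverges from the Llama one. The function name `granite_cce_loss` is illustrative, not the literal patch code; the point is that, since CCE never materializes the logits, Granite's division by `logits_scaling` can be folded into the embeddings before the fused loss call:

```python
import torch
from cut_cross_entropy import linear_cross_entropy

def granite_cce_loss(hidden_states, classifier_weight, labels, logits_scaling):
    # Granite computes logits = (hidden_states @ W.T) / logits_scaling.
    # Pre-scaling the embeddings is mathematically equivalent, since
    # (h / s) @ W.T == (h @ W.T) / s, and keeps the logits unmaterialized.
    return linear_cross_entropy(
        hidden_states / logits_scaling,
        classifier_weight,
        labels,
    )
```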
There's a hiccup, at least when training with TRL: cce_backward.py requires a mandatory change on lines 257/258 to ensure the dtypes are float16.
That felt too hacky for my taste, so I've omitted it from the patch, and I'm not sure how to handle it at the moment: TRL seems to want float32 for the classifier and embedding weights, and I couldn't find a way to fix that on the TRL end. I'd appreciate your takes on a proper way to set this up.
For reference, if someone wants to use this patch now, the required alteration is simply the change to cce_backward_kernel in cce_backward.py described above, pending a better solution.
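As an untested alternative to editing the kernel, one could try casting the offending weights down on the user side before training starts. This is only a sketch of the idea (whether TRL re-upcasts them afterwards is exactly the open question above), and bfloat16 may be the better target depending on the rest of the run:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.1-8b-instruct", torch_dtype=torch.float16
)

# TRL appears to keep the classifier and embedding weights in float32;
# cast them back to float16 so they match what cce_backward_kernel
# expects, instead of patching the kernel itself.
model.get_input_embeddings().to(torch.float16)
model.get_output_embeddings().to(torch.float16)
```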
I was able to get a good loss curve and proper training; the patch appears to be working, barring a few questions I have.