Introduce modular files for speech models #35902
    """
    for assignment, node in assignments.items():
        should_keep = any(re.search(pattern, assignment) for pattern in ASSIGNMENTS_REGEX_TO_KEEP)

        # If it's a DOCSTRING var and is assigned to None, the parent's docstring is kept.
I had to add this because many of the models I've used had a fairly custom docstring (e.g. it contained a link to the original paper). So instead of just copying the docstring from the modular file, I figured it would be best to adopt this hybrid approach.
If you agree with the change, I should also update the modular docs: https://github.com/huggingface/transformers/blob/main/docs/source/en/modular_transformers.md
Hmm, I don't really get this. Having the docstring fall back to the parent's when it's None is already the current behavior.
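For readers following the thread, the rule under discussion can be sketched as below. This is a minimal illustration, not the converter's code: it uses plain dictionaries instead of syntax-tree nodes, and `merge_assignments`, `MODEL_START_DOCSTRING`, and `SOME_CONSTANT` are hypothetical names chosen for the example.

```python
def merge_assignments(parent: dict, modular: dict) -> dict:
    """Hypothetical helper: merge module-level assignments by variable name."""
    merged = dict(parent)
    for name, value in modular.items():
        # A DOCSTRING variable assigned to None means "keep the parent's docstring".
        if "DOCSTRING" in name and value is None and name in parent:
            continue
        # Any other assignment in the modular file overrides the parent's.
        merged[name] = value
    return merged


# Placeholder values, assumed for illustration only.
parent = {"MODEL_START_DOCSTRING": "parent docstring text", "SOME_CONSTANT": 2}
modular = {"MODEL_START_DOCSTRING": None, "SOME_CONSTANT": 1}
merged = merge_assignments(parent, modular)
```

Here `merged` keeps the parent's docstring text while taking the new value of `SOME_CONSTANT` from the modular side.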
new_node = node.with_changes(body=node.body.with_changes(body=new_statements))
imports_to_keep.append(new_node)
existing_protected_statements.update({str(stmt) for stmt in new_statements})
import_statements = [
I added this because the previous code behaved incorrectly for "safe" imports that had multiple other statements inside them, e.g. L381:395 in modeling_wav2vec2.py:
if is_deepspeed_zero3_enabled():
    import deepspeed
    with deepspeed.zero.GatheredParameters(self.conv.weight, modifier_rank=0):
        ...
The whole block after the import statement would be moved to the top of the new modeling script (among the import statements).
Yes, it's one of the current limitations. However, removing everything else does not seem like a good solution either. I could not come up with a nice rule for this. For now, the best option may be to patch the original modeling file to separate the safe import from the other logic. Would that require a lot of changes?
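To make the limitation concrete, here is a rough sketch of how the statements inside a guarded block could be split into imports and everything else. It uses the standard-library `ast` module as a stand-in for the converter's own syntax-tree API, and `split_guarded_block` is a hypothetical helper, not part of the converter. `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Simplified version of the guarded block quoted above (assumed for illustration).
SOURCE = """
if is_deepspeed_zero3_enabled():
    import deepspeed
    with deepspeed.zero.GatheredParameters(weight, modifier_rank=0):
        pass
"""


def split_guarded_block(source: str):
    """Return (imports, others) as source strings for the first top-level `if` block."""
    tree = ast.parse(source)
    guard = next(node for node in tree.body if isinstance(node, ast.If))
    import_types = (ast.Import, ast.ImportFrom)
    imports = [ast.unparse(stmt) for stmt in guard.body if isinstance(stmt, import_types)]
    others = [ast.unparse(stmt) for stmt in guard.body if not isinstance(stmt, import_types)]
    return imports, others


imports, others = split_guarded_block(SOURCE)
```

Only the entries in `imports` are safe to hoist to the top of a generated file. The entries in `others` depend on local state and have to stay where they were, which is why separating the two in the original modeling file is suggested.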
Hey! Thanks for the contribution! I just looked at the modular part, let me know if something is unclear!! 🤗
# Exclude names to prevent edge cases where we want to keep a name that may
# exist in the mapping, e.g. `Wav2Vec2BaseModelOutput` where `Wav2Vec2` is
# a "base" model identifier but we want the type to pass as is in the produced modeling file
EXCLUDE_NAMES = ["Wav2Vec2BaseModelOutput"]


def preserve_case_replace(text, patterns: dict, default_name: str):
    # Create a regex pattern to match all variations
    regex_pattern = "|".join(re.escape(key) for key in patterns.keys())
    compiled_regex = re.compile(f"(?<![a-z0-9])({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)

    # Create exclude pattern
    exclude_pattern = "|".join(re.escape(key) for key in EXCLUDE_NAMES)
    compiled_regex = re.compile(f"(?<![a-z0-9])(?!{exclude_pattern})({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)
Definitely not a fan of having exclusions here. And the regex is already way too complicated 🥲 Moreover, I don't think we actually want an output type from another model, do we?
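As a small illustration of what the negative lookahead in the diff does, the sketch below compares the two patterns from the hunk on a sample string. The `patterns` list and the `text` value are assumptions made up for the example; only the two regex shapes and `Wav2Vec2BaseModelOutput` come from the diff.

```python
import re

patterns = ["wav2vec2"]  # assumed model identifier to be renamed
exclude_names = ["Wav2Vec2BaseModelOutput"]

regex_pattern = "|".join(re.escape(key) for key in patterns)
exclude_pattern = "|".join(re.escape(key) for key in exclude_names)
flags = re.IGNORECASE | re.DOTALL

# Pattern without exclusions: matches every occurrence of the identifier.
baseline = re.compile(f"(?<![a-z0-9])({regex_pattern})(.|$)", flags)
# Pattern with the negative lookahead: skips positions where an excluded name starts.
with_exclusion = re.compile(f"(?<![a-z0-9])(?!{exclude_pattern})({regex_pattern})(.|$)", flags)

text = "Wav2Vec2Model returns Wav2Vec2BaseModelOutput"
baseline_matches = [m.group(1) for m in baseline.finditer(text)]
excluded_matches = [m.group(1) for m in with_exclusion.finditer(text)]
```

The baseline pattern finds the identifier twice, while the pattern with the lookahead finds it only in `Wav2Vec2Model`, so `Wav2Vec2BaseModelOutput` would pass through unchanged. Every excluded name lengthens the pattern, which is the complexity concern raised in the comment above.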
# Keep return annotation in `modular_xxx.py` if any, else original return annotation
new_return_annotation = updated_methods[name].returns if updated_methods[name].returns else func.returns

if not re.match(
    r"\ndef .*\(.*\):\n raise.*Error\(.*",
    mapper.python_module.code_for_node(updated_methods[name]),
):
    func = func.with_changes(body=updated_methods[name].body, params=new_params, decorators=new_decorators)
    func = func.with_changes(
        body=updated_methods[name].body,
        params=new_params,
        decorators=new_decorators,
        returns=new_return_annotation,
    )
Love this one! Nice!
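The return-annotation fallback in the hunk above can be illustrated in isolation as follows. This is a sketch under assumptions: it uses the standard-library `ast` module rather than the converter's node API, and the helper names and the sample `forward` signatures are hypothetical. `ast.unparse` requires Python 3.9 or later.

```python
import ast
from typing import Optional


def return_annotation(func_src: str) -> Optional[str]:
    """Return the source text of a function's return annotation, or None if absent."""
    func = ast.parse(func_src).body[0]
    return ast.unparse(func.returns) if func.returns is not None else None


def merged_return_annotation(parent_src: str, modular_src: str) -> Optional[str]:
    # Prefer the annotation written in the modular file, else keep the parent's.
    return return_annotation(modular_src) or return_annotation(parent_src)


# Sample definitions, assumed for illustration only.
parent = "def forward(self, x) -> Tuple:\n    return x"
annotated = "def forward(self, x) -> BaseModelOutput:\n    return x"
unannotated = "def forward(self, x):\n    return x"

kept_from_modular = merged_return_annotation(parent, annotated)
kept_from_parent = merged_return_annotation(parent, unannotated)
```

When the modular definition carries its own annotation it wins, and when it has none the parent's annotation is kept.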
What does this PR do?
Who can review?
@ArthurZucker @Cyrilvallez
Additional details
modeling_wav2vec2.py: Hubert, WavLM, Data2VecAudio, Wav2Vec2Conformer, Wav2Vec2Bert, UniSpeech, UniSpeechSat