Introduce modular files for speech models #35902

nikosanto13 · 2025-01-27T08:35:10Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez

Additional details

Added modular files for models that heavily duplicate classes from modeling_wav2vec2.py: Hubert, WavLM, Data2VecAudio, Wav2Vec2Conformer, Wav2Vec2Bert, UniSpeech, UniSpeechSat.
Modified the modular converter script to address issues that came up while writing the modular files above (see the inline comments for justification).

nikosanto13 · 2025-01-27T08:38:17Z

utils/modular_model_converter.py

        """
        for assignment, node in assignments.items():
            should_keep = any(re.search(pattern, assignment) for pattern in ASSIGNMENTS_REGEX_TO_KEEP)

+            # If it's a DOCSTRING var and is assigned to None, the parent's docstring is kept.


I had to add this because many of the models I worked on have a fairly custom docstring (e.g. one containing a link to the original paper). So instead of just copying the docstring from the modular file, I figured it would be best to adopt this hybrid approach.

If you agree with the change, I should also update the modular docs: https://github.com/huggingface/transformers/blob/main/docs/source/en/modular_transformers.md

nikosanto13 · 2025-01-27T08:42:08Z

utils/modular_model_converter.py

-                new_node = node.with_changes(body=node.body.with_changes(body=new_statements))
-                imports_to_keep.append(new_node)
-                existing_protected_statements.update({str(stmt) for stmt in new_statements})
+            import_statements = [


I added this because the previous code misbehaved for "safe" imports whose block contains other statements besides the import, e.g. L381:395 in modeling_wav2vec2.py:

    if is_deepspeed_zero3_enabled():
        import deepspeed
        with deepspeed.zero.GatheredParameters(self.conv.weight, modifier_rank=0):
            ...

The whole block after the import statement would be moved to the top of the new modeling script (among the import statements).
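A minimal sketch of the distinction at stake here, separating the import statements inside a guarded block from the other statements that share it. It uses the standard-library `ast` module as a stand-in so that it runs without extra dependencies; it is not the converter's code, and `split_guarded_block`, `weight`, and `setup` are hypothetical names used only for illustration.

```python
import ast

# A guarded "safe" import that shares its block with other logic.
SOURCE = '''
if is_deepspeed_zero3_enabled():
    import deepspeed
    with deepspeed.zero.GatheredParameters(weight, modifier_rank=0):
        setup(weight)
'''


def split_guarded_block(source: str):
    """Split the body of the first `if` into import statements and everything else."""
    tree = ast.parse(source)
    guard = next(node for node in tree.body if isinstance(node, ast.If))
    imports, other = [], []
    for stmt in guard.body:
        is_import = isinstance(stmt, (ast.Import, ast.ImportFrom))
        (imports if is_import else other).append(ast.unparse(stmt))
    return imports, other


imports, other = split_guarded_block(SOURCE)
print(imports)      # the statements that belong with the file-level imports
print(len(other))   # the statements that should stay where they were written
```

Only the first list is a candidate for hoisting to the top of a generated file; what to do with the second list is exactly the open question in this thread.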

Rocketknight1 · 2025-01-27T17:31:07Z

cc @ArthurZucker @qubvel

Cyrilvallez

Hey! Thanks for the contribution! I just looked at the modular part, let me know if something is unclear!! 🤗

Cyrilvallez · 2025-01-28T17:38:33Z

utils/modular_model_converter.py

+# Exclude names to prevent edge cases where we want to keep a name that may
+# exist in the mapping, e.g. `Wav2Vec2BaseModelOutput` where `Wav2Vec2` is
+# a "base" model identifier but we want the type to pass as is in the produced modeling file
+EXCLUDE_NAMES = ["Wav2Vec2BaseModelOutput"]
+
+
 def preserve_case_replace(text, patterns: dict, default_name: str):
    # Create a regex pattern to match all variations
    regex_pattern = "|".join(re.escape(key) for key in patterns.keys())
-    compiled_regex = re.compile(f"(?<![a-z0-9])({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)
+
+    # Create exclude pattern
+    exclude_pattern = "|".join(re.escape(key) for key in EXCLUDE_NAMES)
+    compiled_regex = re.compile(f"(?<![a-z0-9])(?!{exclude_pattern})({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)


Definitely not a fan of having exclusions here. And the regex is already way too complicated 🥲 Moreover, I don't think we actually want an output type from another model, do we?
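To make the effect of the proposed negative lookahead concrete, here is a small standalone sketch that builds the regex from the diff above with and without the exclusion. The `patterns` mapping, the sample text, and the `build_regex` helper are assumptions made for illustration; only the two regex templates are taken from the diff.

```python
import re

EXCLUDE_NAMES = ["Wav2Vec2BaseModelOutput"]
# Hypothetical mapping: the real converter derives this from the model names.
patterns = {"Wav2Vec2": "Hubert"}


def build_regex(patterns: dict, exclude_names=None):
    """Compile the name-matching regex, optionally with the exclusion lookahead."""
    regex_pattern = "|".join(re.escape(key) for key in patterns)
    if exclude_names:
        exclude_pattern = "|".join(re.escape(key) for key in exclude_names)
        template = f"(?<![a-z0-9])(?!{exclude_pattern})({regex_pattern})(.|$)"
    else:
        template = f"(?<![a-z0-9])({regex_pattern})(.|$)"
    return re.compile(template, re.IGNORECASE | re.DOTALL)


text = "Wav2Vec2Encoder returns Wav2Vec2BaseModelOutput"

without_exclusion = [m.group(1) for m in build_regex(patterns).finditer(text)]
with_exclusion = [m.group(1) for m in build_regex(patterns, EXCLUDE_NAMES).finditer(text)]

print(without_exclusion)  # both occurrences of the base name are candidates
print(with_exclusion)     # the occurrence inside the excluded name is skipped
```

Note that the lookahead only checks a prefix and inherits `re.IGNORECASE`, so any identifier that starts with an excluded name, in any casing, is skipped as well.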

Cyrilvallez · 2025-01-28T17:41:44Z

utils/modular_model_converter.py

        """
        for assignment, node in assignments.items():
            should_keep = any(re.search(pattern, assignment) for pattern in ASSIGNMENTS_REGEX_TO_KEEP)

+            # If it's a DOCSTRING var and is assigned to None, the parent's docstring is kept.


Hmm, I don't really get this one. Using the parent's docstring when the variable is assigned None is already the current behavior.

Cyrilvallez · 2025-01-28T17:44:29Z

utils/modular_model_converter.py

+
+            # Keep return annotation in `modular_xxx.py` if any, else original return annotation
+            new_return_annotation = updated_methods[name].returns if updated_methods[name].returns else func.returns
+
            if not re.match(
                r"\ndef .*\(.*\):\n    raise.*Error\(.*",
                mapper.python_module.code_for_node(updated_methods[name]),
            ):
-                func = func.with_changes(body=updated_methods[name].body, params=new_params, decorators=new_decorators)
+                func = func.with_changes(
+                    body=updated_methods[name].body,
+                    params=new_params,
+                    decorators=new_decorators,
+                    returns=new_return_annotation,
+                )


Love this one! Nice!
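The fallback rule in this hunk (use the return annotation from the modular file if there is one, otherwise keep the original) can be sketched in isolation. The example below uses the standard-library `ast` module rather than the converter's own node types, and `merge_return_annotation` along with the sample functions are hypothetical.

```python
import ast


def merge_return_annotation(parent_src: str, override_src: str) -> str:
    """Return the override's source, inheriting the parent's return annotation if it has none."""
    parent = ast.parse(parent_src).body[0]
    override = ast.parse(override_src).body[0]
    # Keep the override's annotation if present, else fall back to the parent's.
    if override.returns is None:
        override.returns = parent.returns
    return ast.unparse(override)


parent_src = "def forward(self, x) -> BaseOutput:\n    return x"

# The override drops the annotation, so the parent's one is reused.
inherited = merge_return_annotation(parent_src, "def forward(self, x, mask=None):\n    return x")
# The override has its own annotation, so it wins.
overridden = merge_return_annotation(parent_src, "def forward(self, x) -> tuple:\n    return (x,)")

print(inherited)
print(overridden)
```

The explicit `is None` check is deliberate: an annotation node is always truthy, so the comparison states the intent more directly than relying on truthiness.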

Cyrilvallez · 2025-01-28T17:55:21Z

utils/modular_model_converter.py

-                new_node = node.with_changes(body=node.body.with_changes(body=new_statements))
-                imports_to_keep.append(new_node)
-                existing_protected_statements.update({str(stmt) for stmt in new_statements})
+            import_statements = [


Yes, it's one of the current limitations. However, removing everything else does not seem like a good solution either, and I could not come up with a clean rule for this. For now, the best option may be to patch the original modeling file to separate the safe import from the other logic. Would that require a lot of changes?

nikosanto13 added 4 commits January 27, 2025 08:26

WAV_2_VEC_2 to WAV2VEC2

c194873

added modular files for hubert, wavlm, wav2vec2_bert, data2vec_audio

6386f29

remove unnessary definitions in modulars

b9e0b4d

added modular files for UniSpeech, UniSpeechSat, Wav2Vec2Conformer

323e2db

nikosanto13 commented Jan 27, 2025

docstring fix for UniSpeechForCTC

e919608

Cyrilvallez reviewed Jan 28, 2025
