feat!: improved GPU job support #173
base: main
Walkthrough
The changes enhance the SLURM job submission process within the Snakemake executor plugin by improving the handling of GPU resources and Generic Resource (GRES) specifications. A new method,
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
docs/further.md (2)
121-121: Enhance the GRES resource documentation with more details and examples
While the current description is accurate, it would be helpful to expand it with:
- Common GRES types (e.g., gpu, mps, fpga)
- Multiple resource syntax (e.g., `gpu:tesla:2`)
- Resource property specifications
- Links to SLURM's GRES documentation
Consider updating the table entry to something like:
-| `--gres` | `gres` | generic resource specification, e.g. `gres=gpu:1`, allows for arbitray strings after the `=` |
+| `--gres` | `gres` | Generic Resource Scheduling specification. Examples: `gres=gpu:1` (1 GPU), `gres=gpu:tesla:2` (2 Tesla GPUs), `gres=mps:100` (100% MPS). See [SLURM GRES Guide](https://slurm.schedmd.com/gres.html) for details. |
Line range hint 123-130: Add a dedicated GRES example for AI/ML workloads
Given the PR's focus on GRES support for AI applications, it would be valuable to add a specific example showing how to request GPU resources.
Consider adding this example after the current generic example:
 rule:
     input: ...
     output: ...
     resources:
         partition=<partition name>
         runtime=<some number>
+
+# Example: GPU-accelerated deep learning training
+rule train_model:
+    input: "data/training_set.h5"
+    output: "models/trained_model.pt"
+    resources:
+        # Request 2 NVIDIA A100 GPUs
+        gres="gpu:a100:2",
+        # Ensure sufficient CPU cores and memory for data preprocessing
+        cpus_per_task=8,
+        mem_mb=32000
snakemake_executor_plugin_slurm/__init__.py (1)
Line range hint 313-315: Consider making status_attempts configurable
The number of status check attempts is hardcoded. Consider making it configurable through the ExecutorSettings class to allow users to adjust based on their cluster's characteristics.
+    status_check_attempts: int = field(
+        default=5,
+        metadata={
+            "help": "Number of attempts to check job status before giving up",
+            "env_var": False,
+            "required": False,
+        },
+    )
-        status_attempts = 5
+        status_attempts = self.workflow.executor_settings.status_check_attempts
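To make the suggestion above concrete, here is a minimal sketch of a settings field that drives a retry loop. The `ExecutorSettings` class below is a simplified stand-in, and `check_status` and its `query` argument are hypothetical names chosen for illustration; none of this is the plugin's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorSettings:
    """Simplified stand-in for the plugin's settings class (assumption)."""

    # Number of times the job status is queried before giving up.
    status_check_attempts: int = field(
        default=5,
        metadata={"help": "Number of attempts to check job status before giving up"},
    )


def check_status(query, settings: ExecutorSettings):
    """Call `query` up to `settings.status_check_attempts` times.

    `query` is a hypothetical callable that returns a status string or
    raises RuntimeError on a transient failure.
    """
    last_error = None
    for _ in range(settings.status_check_attempts):
        try:
            return query()
        except RuntimeError as err:
            last_error = err
    raise RuntimeError("status query failed") from last_error
```

With such a field, a cluster with a slow accounting database could raise the number of attempts without touching the executor code.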
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
docs/further.md (1 hunks)
snakemake_executor_plugin_slurm/__init__.py (1 hunks)
🔇 Additional comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)
Line range hint 238-251: LGTM: Robust error handling implementation
The error handling for job submission is comprehensive, covering both return code errors and submission failure messages, with clear error reporting.
Line range hint 174-198: LGTM: Comprehensive resource management
The implementation handles various SLURM resources effectively with appropriate warnings and fallback mechanisms. The code properly manages:
- Runtime constraints
- Memory allocation (both per-job and per-CPU)
- Constraint specifications
@CodeRabbit no, this PR only adds
@cmeesters: Understood, thank you for the clarification. I acknowledge that this PR only adds
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
161-170: LGTM! Well-implemented GRES support.
The implementation correctly validates the GRES format and provides clear error messages. The regex pattern accurately matches SLURM's GRES syntax requirements.
Consider adding code comments with common GRES examples to help users:
 if job.resources.get("gres"):
+    # Common GRES examples:
+    # gpu:1 - Request 1 GPU of any type
+    # gpu:tesla:2 - Request 2 Tesla GPUs
+    # mps:100 - Request 100% of MPS (CUDA Multi-Process Service)
     # Validate GRES format (e.g., "gpu:1", "gpu:tesla:2")
     gres = job.resources.gres
Sounds reasonable. In addition, however, I would suggest supporting two further resources, `gpu: int` and `gpu_model: str`, which can be translated internally into a gres parametrization (while an explicit gres specification always takes precedence). This way, we are consistent with an emerging common practice, which is currently also being implemented for Kubernetes here: snakemake/snakemake-executor-plugin-kubernetes#31
@johanneskoester SLURM's generic approach is to write I'm fine with providing two more resources for the sake of consistency. However, we need to keep an eye on ease, too.
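The translation proposed above can be sketched as follows. This is only an illustration of the idea under discussion, not the plugin's code: `gres_from_resources` is a hypothetical helper, and resources are modelled as a plain dict.

```python
from typing import Optional


def gres_from_resources(resources: dict) -> Optional[str]:
    """Return a GRES string such as 'gpu:tesla:2', or None if no GPU is requested.

    Hypothetical helper: the key names follow the discussion above,
    the function itself is not part of the plugin.
    """
    # An explicit gres specification always takes precedence.
    if resources.get("gres"):
        return resources["gres"]
    count = resources.get("gpu")
    if not count:
        return None
    model = resources.get("gpu_model")
    # With a model: gpu:<model>:<count>; without: gpu:<count>.
    return f"gpu:{model}:{count}" if model else f"gpu:{count}"
```

Letting `gres` win silently is one possible design; rejecting the combination with an error is the stricter alternative and avoids surprises when both are set by accident.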
Actionable comments posted: 0
🧹 Nitpick comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)
161-169: Enhance GRES format validation and resource specification flexibility
The current regex enforces `[a-zA-Z0-9]+:([a-zA-Z0-9]+:)?\d+`, disallowing underscores or hyphens in resource names. SLURM setups sometimes include underscores (e.g., `gpu:tesla_v100:2`). Consider extending the pattern to capture these valid resource naming conventions. Also, use caution if resource values could come from untrusted sources; adding shell quoting enhances safety.
-    if not re.match(r"^[a-zA-Z0-9]+:([a-zA-Z0-9]+:)?\d+$", gres):
+    if not re.match(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+){0,1}:\d+$", gres):
         raise WorkflowError(
             f"Invalid GRES format: {gres}. Expected format: "
             "'<name>:<number>' or '<name>:<type>:<number>'"
         )
170-183: Consider supporting GPU without specifying a model
Right now, if `gpu` is set but `gpu_model` is missing, this block does nothing. You might allow a simpler `gpu:<number>` usage for users who don’t need a specific GPU model. If that’s not intended, it might help to clarify via error or docstring that both fields are required.
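A small sketch of the validation plus shell quoting mentioned in the first comment above. The pattern is the one from the suggested diff; `gres_flag` is a hypothetical helper, and `ValueError` stands in for the plugin's `WorkflowError` so that the snippet runs without Snakemake installed.

```python
import re
import shlex

# Pattern from the suggestion above: underscores allowed, optional type part.
GRES_RE = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+){0,1}:\d+$")


def gres_flag(gres: str) -> str:
    """Validate `gres` and return a shell-quoted --gres argument."""
    if not GRES_RE.match(gres):
        raise ValueError(
            f"Invalid GRES format: {gres}. Expected format: "
            "'<name>:<number>' or '<name>:<type>:<number>'"
        )
    # Quoting is redundant for strings that pass the pattern, but it keeps
    # the call safe if the pattern is relaxed later.
    return f"--gres={shlex.quote(gres)}"
```

Note that this pattern still rejects hyphens (e.g. `gpu:tesla-v100:2`); a site that uses them in type names would need `-` added to the character classes.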
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_slurm/__init__.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
📓 Common learnings
Learnt from: cmeesters
PR: snakemake/snakemake-executor-plugin-slurm#0
File: :0-0
Timestamp: 2024-11-25T21:10:22.280Z
Learning: PR #173 adds `gres` as a possible resource specification in the Snakemake executor plugin for SLURM; other features are addressed in separate PRs.
@coderabbitai jolly good remark in your last comment. I hope the last commit is an improvement. Now, the code base is growing too big, IMO, and needs to be outsourced to a different module.
@cmeesters: Thank you for the update! Your latest commit does appear to address the prior remark, and it’s great to see improvement. Regarding the code base growing too large, would you like to open a separate PR or GitHub issue to track refactoring to a dedicated module? I’m happy to assist as needed. Let me know how you’d like to proceed.
Actionable comments posted: 0
🧹 Nitpick comments (3)
snakemake_executor_plugin_slurm/__init__.py (3)
86-87: Consider using a more strict regex pattern for GRES validation.
The current GRES regex pattern allows consecutive underscores and potentially invalid combinations. Consider using a more restrictive pattern:
-    gres_re = re.compile(r"^[a-zA-Z0-9_]+:([a-zA-Z0-9_]+:)?\d+$")
+    gres_re = re.compile(r"^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)*:(?:[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)*:)?\d+$")
173-184: Consider improving error handling and variable scope.
The implementation is correct but could be improved:
- Initialize `gres_string` at the beginning of the method to avoid potential undefined variable issues
- Add more context to the GPU integer validation error message
+        gres_string = ""  # Initialize at the beginning
         if job.resources.get("gpu"):
             if job.resources.get("gres"):
                 raise WorkflowError(
                     "GRES and GPU are set. Please only set one of them."
                 )
             if not isinstance(job.resources.gpu, int):
                 raise WorkflowError(
-                    "The 'gpu' resource must be an integer. "
-                    f"Got: {job.resources.gpu} ({type(job.resources.gpu)})."
+                    f"The 'gpu' resource must be an integer specifying the number of GPUs. "
+                    f"Got: {job.resources.gpu} of type {type(job.resources.gpu).__name__}."
                 )
186-198: Enhance GPU model validation error message.
The GPU model validation is correct, but the error message could be more helpful.
         if not Executor.gpu_model_re.match(job.resources.gpu_model):
             raise WorkflowError(
-                f"Invalid GPU model format: {job.resources.gpu_model}. "
-                "Expected format: '<name>'"
+                f"Invalid GPU model format: {job.resources.gpu_model}. "
+                "GPU model must contain only alphanumeric characters and underscores. "
+                "Example: 'tesla' or 'a100'"
             )
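To see what the stricter GRES pattern from the first comment changes in practice, the two patterns can be compared on a few inputs. The patterns are copied from the diff above; the sample strings are made up for illustration and do not come from a real cluster.

```python
import re

# Pattern currently in the PR (permissive) and the suggested stricter one.
permissive = re.compile(r"^[a-zA-Z0-9_]+:([a-zA-Z0-9_]+:)?\d+$")
strict = re.compile(
    r"^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)*:(?:[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)*:)?\d+$"
)

# Illustrative inputs only.
samples = ["gpu:1", "gpu:tesla_v100:2", "gpu:tesla__v100:2", "_gpu:1", "gpu:"]
results = {s: (bool(permissive.match(s)), bool(strict.match(s))) for s in samples}

for sample, (loose_ok, strict_ok) in results.items():
    print(f"{sample!r}: permissive={loose_ok} strict={strict_ok}")
```

The two patterns differ only on names with leading or doubled underscores. Whether rejecting those is worth the longer expression depends on how GRES names are configured at a given site.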
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_slurm/__init__.py (2 hunks)
🔇 Additional comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)
164-172: LGTM! Well-implemented GRES validation.
The GRES validation is thorough and provides clear error messages.
199-201: LGTM! Clean conditional assignment of GRES string.
The final GRES string assignment is well-structured and handles all cases correctly.
@johanneskoester the code ought to be ok-ish. As mentioned, I would rather refactor in the future to avoid an extended single file. That aside, please review. I would prefer to merge main into this PR after PR #159 is done, because the various changes in the docs and the help strings will only lead to merge conflicts if done twice. (And I have plenty of other things on my mind, too.)
…-gpu - needs adjustment in jobstep executor code, too.
Actionable comments posted: 0
🧹 Nitpick comments (3)
snakemake_executor_plugin_slurm/__init__.py (3)
117-118: LGTM! Consider adding pattern documentation.
The regex patterns are well-defined and efficiently compiled at class level.
Consider adding docstrings to explain the expected formats:
 class Executor(RemoteExecutor):
+    # Validates GRES format (e.g., "gpu:1", "gpu:tesla:2")
     gres_re = re.compile(r"^[a-zA-Z0-9_]+:([a-zA-Z0-9_]+:)?\d+$")
+    # Validates GPU model names (e.g., "tesla", "a100")
     gpu_model_re = re.compile(r"^[a-zA-Z0-9_]+$")
234-262: Fix variable naming inconsistency.
The code uses both `gpus` and `gpu` in different contexts. On line 294, it checks for `gpus` but the resource is defined as `gpu`.
Apply this change for consistency:
-        gpu_job = job.resources.get("gpus") or "gpu" in job.resources.get("gres", "")
+        gpu_job = job.resources.get("gpu") or "gpu" in job.resources.get("gres", "")
309-311: Enhance comment clarity for CPU allocation.
The current comment could be more descriptive about the CPU allocation strategy.
-        # we need to set cpus-per-task OR cpus-per-gpu, the function
-        # will return a string with the corresponding value
+        # Determine CPU allocation strategy:
+        # - For GPU jobs: sets --cpus-per-gpu based on job resources
+        # - For non-GPU jobs: sets --cpus-per-task based on job resources
+        # The get_cpus_per_task function returns the appropriate SLURM parameter
         call += f" {get_cpus_per_task(job, gpu_job)}"
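The CPU allocation strategy described in the last comment can be sketched as below. This is not the plugin's `get_cpus_per_task`; `is_gpu_job` and `cpu_flag` are hypothetical helpers, resources are modelled as a plain dict, and the fallback value of 1 is an assumption made for the sketch.

```python
def is_gpu_job(resources: dict) -> bool:
    """True if a GPU count is set or the GRES string mentions a GPU."""
    return bool(resources.get("gpu")) or "gpu" in resources.get("gres", "")


def cpu_flag(resources: dict) -> str:
    """Return the CPU-related sbatch flag for a job (hypothetical helper).

    The resource keys `cpus_per_gpu` and `cpus_per_task` and the default
    of 1 are assumptions made for this sketch.
    """
    if is_gpu_job(resources):
        # GPU jobs: CPUs are requested relative to each allocated GPU.
        return f"--cpus-per-gpu={resources.get('cpus_per_gpu', 1)}"
    # Non-GPU jobs: CPUs are requested per task.
    return f"--cpus-per-task={resources.get('cpus_per_task', 1)}"
```

Keeping the GPU detection in one place also avoids the `gpus`/`gpu` naming mismatch flagged above, since every caller uses the same key.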
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_slurm/__init__.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: formatting
- GitHub Check: testing
🔇 Additional comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
225-233: LGTM! Robust GRES validation implementation.
The GRES validation logic is thorough with clear error messages and follows SLURM's specifications.
@johanneskoester and @coderabbitai - this PR now depends on snakemake/snakemake-executor-plugin-slurm-jobstep#28. A change in the cpu-settings function of the jobstep-Executor module was required, too.
@cmeesters: Thank you for informing about the dependency on PR #28 in the snakemake-executor-plugin-slurm-jobstep repository and the required changes in the cpu-settings function. I've noted this information for future reference.
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
snakemake_executor_plugin_slurm/__init__.py (4 hunks)
snakemake_executor_plugin_slurm/utils.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake_executor_plugin_slurm/__init__.py
🧰 Additional context used
🪛 Ruff (0.8.2)
snakemake_executor_plugin_slurm/utils.py
67-67: Undefined name WorkflowError (F821)
76-76: Undefined name WorkflowError (F821)
79-79: Undefined name WorkflowError (F821)
87-87: Undefined name WorkflowError (F821)
90-90: Undefined name WorkflowError (F821)
99-99: Undefined name WorkflowError (F821)
106-106: Undefined name WorkflowError (F821)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: testing
- GitHub Check: formatting
Actionable comments posted: 1
🧹 Nitpick comments (3)
snakemake_executor_plugin_slurm/utils.py (1)
51-63: Enhance function documentation with format examples
While the regular expressions are well-defined, the docstring could be more helpful by including example formats and explaining the relationship between gres, gpu, and gpu_model parameters.
Add examples to the docstring:
 def set_gres_string(job: JobExecutorInterface) -> str:
     """
     Function to set the gres string for the SLURM job
     based on the resources requested in the job.
+
+    Examples:
+    - gres="gpu:1"  # Request 1 GPU
+    - gres="gpu:tesla:2"  # Request 2 Tesla GPUs
+    - gpu=2  # Request 2 GPUs
+    - gpu=1, gpu_model="tesla"  # Request 1 Tesla GPU
     """
docs/further.md (2)
93-94: Fix grammatical issue in documentation
The sentence structure needs improvement.
Apply this diff:
-SLURM allows to specify GPU request with the `--gres` or `--gpus` flags, Snakemake takes a similar approach. Resources can be asked for with
+SLURM allows you to specify GPU requests using the `--gres` or `--gpus` flags, and Snakemake takes a similar approach. Resources can be requested using:
🧰 Tools
🪛 LanguageTool
[grammar] ~93-~93: Did you mean “specifying”? Or maybe you should add a pronoun? In active voice, ‘allow’ + ‘to’ takes an object, usually a pronoun. Context: ...tribution. ### GPU Jobs SLURM allows to specify GPU request with the `--gres` or `--gpu`... (ALLOW_TO)
102-114: Enhance documentation examples
While the current examples are good, they could be more comprehensive to cover all supported scenarios.
Consider adding these additional examples:
```YAML
 set-resources:
     gres_request_rule:
         gres: "gpu:1"

     multi_gpu_rule:
         gpus: 2
         cpus_per_gpu: 4

     no_cpu_gpu_rule:
         gpus: 1
         cpus_per_gpu: -1
+
+    gpu_model_rule:
+        gpu_model: "tesla"
+        gpu: 2
+        cpus_per_gpu: 2
+
+    gres_advanced_rule:
+        gres: "gpu:tesla:4"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 0e1873603d3b3bce48e9ba09ee7579fe4737191b and df62a6f8d243999ff8e80f4c9ce1ab95767bab05.
📒 Files selected for processing (2)
docs/further.md (1 hunks)
snakemake_executor_plugin_slurm/utils.py (2 hunks)
🧰 Additional context used
📓 Learnings (2)
Learnt from: cmeesters
PR: snakemake/snakemake-executor-plugin-slurm#0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
🪛 LanguageTool
docs/further.md
[uncategorized] ~95-~95: Loose punctuation mark. Context: ...esources can be asked for with - `gres`, e.g. `gres=gpu:1` or `gres=gpu:tesla:2`... (UNLIKELY_OPENING_PUNCTUATION)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (2)
snakemake_executor_plugin_slurm/utils.py (2)
4-10: LGTM! Imports are well-organized and complete.
All necessary imports are present, including the previously missing WorkflowError import.
64-74: ⚠️ Potential issue: Fix potential control flow issue with gres_string
The gres_string is set when job.resources.get("gres") is True, but there's no else clause. This could lead to undefined variable errors in subsequent code.
Apply this diff to fix the issue:
```diff
     if job.resources.get("gres"):
         # Validate GRES format (e.g., "gpu:1", "gpu:tesla:2")
         gres = job.resources.gres
         if not gres_re.match(gres):
             raise WorkflowError(
                 f"Invalid GRES format: {gres}. Expected format: "
                 "'<name>:<number>' or '<name>:<type>:<number>' "
                 "(e.g., 'gpu:1' or 'gpu:tesla:2')"
             )
         gres_string = f" --gres={job.resources.gres}"
+    else:
+        gres_string = ""
```
Likely invalid or redundant comment.
    if job.resources.get("gpus"):
        # ensure that gres is not set, if gpu and gpu_model are set
        if job.resources.get("gres"):
            raise WorkflowError("GRES and GPU are set. Please only set one" " of them.")
        # validate GPU format
        if not gpus_re.match(str(job.resources.gpu)):
            raise WorkflowError(
                f"Invalid GPU format: {job.resources.gpu}. "
                "Expected format: '<number>' or '<name>:<number>' "
                "(e.g., '1' or 'tesla:2')"
            )
        gres_string = f" --gpus={job.resources.gpu}"
    elif job.resources.get("gpu_model") and job.resources.get("gpu"):
        # validate GPU model format
        if not gpu_model_re.match(job.resources.gpu_model):
            raise WorkflowError(
                f"Invalid GPU model format: {job.resources.gpu_model}. "
                "Expected format: '<name>' "
                "(e.g., 'tesla')"
            )
        gres_string = f" --gpus:{job.resources.gpu_model}:{job.resources.gpu}"
    elif job.resources.get("gpu_model") and not job.resources.get("gpu"):
        raise WorkflowError(
            "GPU model is set, but no GPU number is given. " "Please set 'gpu' as well."
        )
    return gres_string if job.resources.get("gres") or job.resources.get("gpu") else ""
🛠️ Refactor suggestion
Simplify GPU resource handling logic
The current implementation has overlapping conditions and could be simplified using a more structured approach.
Consider this refactored version:
-    if job.resources.get("gpus"):
-        # ensure that gres is not set, if gpu and gpu_model are set
-        if job.resources.get("gres"):
-            raise WorkflowError("GRES and GPU are set. Please only set one" " of them.")
-        # validate GPU format
-        if not gpus_re.match(str(job.resources.gpu)):
-            raise WorkflowError(
-                f"Invalid GPU format: {job.resources.gpu}. "
-                "Expected format: '<number>' or '<name>:<number>' "
-                "(e.g., '1' or 'tesla:2')"
-            )
-        gres_string = f" --gpus={job.resources.gpu}"
-    elif job.resources.get("gpu_model") and job.resources.get("gpu"):
-        # validate GPU model format
-        if not gpu_model_re.match(job.resources.gpu_model):
-            raise WorkflowError(
-                f"Invalid GPU model format: {job.resources.gpu_model}. "
-                "Expected format: '<name>' "
-                "(e.g., 'tesla')"
-            )
-        gres_string = f" --gpus:{job.resources.gpu_model}:{job.resources.gpu}"
-    elif job.resources.get("gpu_model") and not job.resources.get("gpu"):
-        raise WorkflowError(
-            "GPU model is set, but no GPU number is given. " "Please set 'gpu' as well."
-        )
+    if not job.resources.get("gres"):
+        if job.resources.get("gpu_model"):
+            if not job.resources.get("gpu"):
+                raise WorkflowError(
+                    "GPU model is set, but no GPU number is given. Please set 'gpu' as well."
+                )
+            if not gpu_model_re.match(job.resources.gpu_model):
+                raise WorkflowError(
+                    f"Invalid GPU model format: {job.resources.gpu_model}. "
+                    "Expected format: '<name>' (e.g., 'tesla')"
+                )
+            gres_string = f" --gpus:{job.resources.gpu_model}:{job.resources.gpu}"
+        elif job.resources.get("gpu"):
+            if not gpus_re.match(str(job.resources.gpu)):
+                raise WorkflowError(
+                    f"Invalid GPU format: {job.resources.gpu}. "
+                    "Expected format: '<number>' or '<name>:<number>' "
+                    "(e.g., '1' or 'tesla:2')"
+                )
+            gres_string = f" --gpus={job.resources.gpu}"
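For reference, here is a self-contained sketch of the same decision logic that runs without Snakemake. It is not the PR's `set_gres_string`: `gpu_flags` is a hypothetical name, resources are a plain dict, `WorkflowError` is a local stand-in, and `gpu` is assumed to be an integer count. The sketch writes the model variant as `--gpus=<model>:<count>`, the form `sbatch` documents for `--gpus`, whereas the snippets above build `--gpus:<model>:<count>`.

```python
import re

gres_re = re.compile(r"^[a-zA-Z0-9_]+:([a-zA-Z0-9_]+:)?\d+$")
gpu_model_re = re.compile(r"^[a-zA-Z0-9_]+$")


class WorkflowError(Exception):
    """Stand-in for snakemake's WorkflowError so the sketch runs on its own."""


def gpu_flags(resources: dict) -> str:
    """Return the GPU-related sbatch flags, or "" if none are requested."""
    gres = resources.get("gres")
    gpu = resources.get("gpu")
    model = resources.get("gpu_model")

    if gres and gpu:
        raise WorkflowError("GRES and GPU are set. Please only set one of them.")
    if gres:
        if not gres_re.match(gres):
            raise WorkflowError(f"Invalid GRES format: {gres}.")
        return f" --gres={gres}"
    if model and not gpu:
        raise WorkflowError(
            "GPU model is set, but no GPU number is given. Please set 'gpu' as well."
        )
    if not gpu:
        return ""
    # Assumption: gpu is a plain integer count (bool is excluded explicitly).
    if isinstance(gpu, bool) or not isinstance(gpu, int):
        raise WorkflowError(f"The 'gpu' resource must be an integer. Got: {gpu!r}.")
    if model:
        if not gpu_model_re.match(model):
            raise WorkflowError(f"Invalid GPU model format: {model}.")
        return f" --gpus={model}:{gpu}"
    return f" --gpus={gpu}"
```

Because every branch returns early, there is no `gres_string` variable that could be left undefined, which is the concern raised in the earlier comments.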
In light of more and more accelerator applications (AI, base mapping, ...), the fall-back onto `slurm_extra` becomes a bit tedious to use. Hence, the resource support for `gres`.
Addresses issue #52 (and to a minor extent: #18 and #104). Supersedes PR #172.