Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GPU blocks property #2253

Merged
merged 8 commits into from
Feb 6, 2025
Merged

Implement GPU blocks property #2253

merged 8 commits into from
Feb 6, 2025

Conversation

un-def
Copy link
Collaborator

@un-def un-def commented Jan 31, 2025

  • Add blocks property to cloud and SSH fleet
  • Reverse instance->job relation, switch from 1-to-1 to many-to-1 (many jobs to one instance)
  • Add InstanceModel.shared_info structure to hold total and currenty used number of blocks
  • Generate virtual shared offers (fraction of a real offer) on the fly
  • Update CLI and API to display shared resources and instances
  • Keep track of volumes used by jobs

Part-of: #1780

* Add `blocks` property to cloud and SSH fleet
* Reverse instance->job relation, switch from 1-to-1 to many-to-1 (many
  jobs to one instance)
* Add InstanceModel.shared_info structure to hold total and currenty
  used number of blocks
* Generate virtual shared offers (fraction of a real offer) on the fly
* Update CLI and API to display shared resources and instances
* Keep track of volumes used by jobs

Part-of: #1780
@un-def un-def requested review from jvstme and r4victor January 31, 2025 14:23
Union[Literal["auto"], int],
Field(
description=(
"The amount of blocks to split the instance into, a number or `auto` (e.g., `4`)."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe elaborate what auto means?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -110,6 +110,11 @@ def get_public_keys(self) -> List[str]:
return [ssh_key.public.strip() for ssh_key in self.ssh_keys]


class InstanceSharedInfo(CoreModel):
total_blocks: Union[Literal["auto"], int]
busy_blocks: int = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about storing busy_blocks in json vs in a separate column. Any particular reasons to do so? I'm generally inclined to keep static data in json and changing fields like this as separate columns.

Maybe using json is more flexible if we take a complete different approach instead of relying on blocks, Idk.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with two integer columns 8b555f7

return shareable_offers


def generate_shared_offer(offer: InstanceOfferWithAvailability, blocks: int, total_blocks: int):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing return type

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -416,7 +421,7 @@ def _error(cls, values) -> Dict:

class JobPlan(CoreModel):
job_spec: JobSpec
offers: List[InstanceOfferWithAvailability]
offers: List[Union[InstanceSharedOffer, InstanceOfferWithAvailability]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to introduce different types of offers to the API? Can't every InstanceOfferWithAvailability be InstanceSharedOffer with total_blocks/blocks = 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 402 to 403
if isinstance(blocks, int) and blocks < 2:
blocks = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting shared_info to None when blocks: 1 is very confusing since you need to know that total_blocks cannot be 1 (e.g. when shared offers are generated). I'd reverse it: if blocks is None, then assume blocks: 1.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, with my solution, it would be more appropriate to normalize shared_info to keep total/used blocks in separate columns so that it's easier to migrate.

Copy link
Collaborator Author

@un-def un-def Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored as part of 8b555f7

# List of volumes used by the job
volume_names: Optional[list[str]] = None # None for backward compalibility
# Virtual shared offer. None if the instance is not shared.
offer: Optional[InstanceSharedOffer] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's more useful to always store the offer used by the job, not just when it was shared?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 8b555f7

Comment on lines 240 to 246
# None if data is not yet available (on vm-based backends and ssh instances)
# or not applicable (container-based backends)
ports: Optional[dict[int, int]] = None
# List of volumes used by the job
volume_names: Optional[list[str]] = None # None for backward compalibility
# Virtual shared offer. None if the instance is not shared.
offer: Optional[InstanceSharedOffer] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's crucial now to add description of JobRuntimeData and what should go there so that it does not become a catch-all struct for everything. Probably makes sense to add a similar description to JobProvisioningData to differentiate it with JobRuntimeData.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 592 to 602
if shared_info is not None:
if blocks == "auto":
blocks = len(instance_offer.instance.resources.gpus)
if blocks > 1:
shared_info.total_blocks = blocks
else:
shared_info = None
if shared_info is not None:
instance.shared_info = shared_info.json()
else:
instance.shared_info = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared_info updating is convoluted. I believe it can be created with total_blocks: "auto" and then it's overridden. I don't see much value in that. I'd keep user's configured blocks as a static property somewhere and wound't touch it. total_blocks can then be resolved when instance resources are known. No need to allow for auto.

Copy link
Collaborator Author

@un-def un-def Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with NULL -> calculated auto blocks in 8b555f7

NULL as "info is not yet available" seems to fit there

Comment on lines 107 to 122
for backend, offer in offers:
resources = offer.instance.resources
gpus = resources.gpus
if not gpus:
continue
if gpus[0].vendor == gpuhunt.AcceleratorVendor.GOOGLE:
# TPU instances cannot be shared
continue
if blocks == "auto":
divisor = len(gpus)
else:
divisor = blocks
if len(gpus) % divisor or resources.cpus % divisor:
# gpus or cpus are not divisible by blocks
continue
shareable_offers.append((backend, offer))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code can be nicely refactored into a separate func.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored in 8b555f7

shared_offer.availability = InstanceAvailability.BUSY
if shared_offer.availability == InstanceAvailability.IDLE or not idle_only:
instances_with_offers.append((instance, shared_offer))
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we show only the smallest matching offer? Is it intended. Asking because the doc showed all possible offers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IMO there is no reason to show all virual offers as they won't be picked anyway

Copy link
Collaborator

@r4victor r4victor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with cloud fleets. Everything works as expected.

I'd reconsider how shared_info is stored/updated (mainly to make the code more comprehendible) and if InstanceSharedOffer/InstanceOfferWithAvailability should be differentiated in the API. Otherwise, looks good.


def downgrade() -> None:
with op.batch_alter_table("instances", schema=None) as batch_op:
batch_op.add_column(sa.Column("job_id", sa.UUID(), autoincrement=False, nullable=True))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sa.UUID() fails with sqlite

change to:

batch_op.add_column(sa.Column("job_id", sqlalchemy_utils.types.uuid.UUIDType(binary=False), nullable=True))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@un-def un-def force-pushed the issue_1780_gpu_blocks branch from 0767ad7 to a6dcde2 Compare February 5, 2025 13:32
@un-def un-def merged commit c1206c5 into master Feb 6, 2025
24 checks passed
@un-def un-def deleted the issue_1780_gpu_blocks branch February 6, 2025 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants