-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GPU blocks
property
#2253
Conversation
* Add `blocks` property to cloud and SSH fleet * Reverse instance->job relation, switch from 1-to-1 to many-to-1 (many jobs to one instance) * Add InstanceModel.shared_info structure to hold total and currenty used number of blocks * Generate virtual shared offers (fraction of a real offer) on the fly * Update CLI and API to display shared resources and instances * Keep track of volumes used by jobs Part-of: #1780
Union[Literal["auto"], int], | ||
Field( | ||
description=( | ||
"The amount of blocks to split the instance into, a number or `auto` (e.g., `4`)." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe elaborate what auto
means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -110,6 +110,11 @@ def get_public_keys(self) -> List[str]: | |||
return [ssh_key.public.strip() for ssh_key in self.ssh_keys] | |||
|
|||
|
|||
class InstanceSharedInfo(CoreModel): | |||
total_blocks: Union[Literal["auto"], int] | |||
busy_blocks: int = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about storing busy_blocks
in json vs in a separate column. Any particular reasons to do so? I'm generally inclined to keep static data in json and changing fields like this as separate columns.
Maybe using json is more flexible if we take a complete different approach instead of relying on blocks
, Idk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced with two integer columns 8b555f7
return shareable_offers | ||
|
||
|
||
def generate_shared_offer(offer: InstanceOfferWithAvailability, blocks: int, total_blocks: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -416,7 +421,7 @@ def _error(cls, values) -> Dict: | |||
|
|||
class JobPlan(CoreModel): | |||
job_spec: JobSpec | |||
offers: List[InstanceOfferWithAvailability] | |||
offers: List[Union[InstanceSharedOffer, InstanceOfferWithAvailability]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to introduce different types of offers to the API? Can't every InstanceOfferWithAvailability be InstanceSharedOffer with total_blocks/blocks = 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(blocks, int) and blocks < 2: | ||
blocks = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting shared_info to None when blocks: 1
is very confusing since you need to know that total_blocks cannot be 1 (e.g. when shared offers are generated). I'd reverse it: if blocks is None, then assume blocks: 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, with my solution, it would be more appropriate to normalize shared_info to keep total/used blocks in separate columns so that it's easier to migrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored as part of 8b555f7
# List of volumes used by the job | ||
volume_names: Optional[list[str]] = None # None for backward compalibility | ||
# Virtual shared offer. None if the instance is not shared. | ||
offer: Optional[InstanceSharedOffer] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's more useful to always store the offer used by the job, not just when it was shared?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 8b555f7
# None if data is not yet available (on vm-based backends and ssh instances) | ||
# or not applicable (container-based backends) | ||
ports: Optional[dict[int, int]] = None | ||
# List of volumes used by the job | ||
volume_names: Optional[list[str]] = None # None for backward compalibility | ||
# Virtual shared offer. None if the instance is not shared. | ||
offer: Optional[InstanceSharedOffer] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's crucial now to add description of JobRuntimeData and what should go there so that it does not become a catch-all struct for everything. Probably makes sense to add a similar description to JobProvisioningData to differentiate it with JobRuntimeData.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if shared_info is not None: | ||
if blocks == "auto": | ||
blocks = len(instance_offer.instance.resources.gpus) | ||
if blocks > 1: | ||
shared_info.total_blocks = blocks | ||
else: | ||
shared_info = None | ||
if shared_info is not None: | ||
instance.shared_info = shared_info.json() | ||
else: | ||
instance.shared_info = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shared_info
updating is convoluted. I believe it can be created with total_blocks: "auto"
and then it's overridden. I don't see much value in that. I'd keep user's configured blocks
as a static property somewhere and wound't touch it. total_blocks
can then be resolved when instance resources are known. No need to allow for auto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced with NULL -> calculated auto blocks in 8b555f7
NULL as "info is not yet available" seems to fit there
for backend, offer in offers: | ||
resources = offer.instance.resources | ||
gpus = resources.gpus | ||
if not gpus: | ||
continue | ||
if gpus[0].vendor == gpuhunt.AcceleratorVendor.GOOGLE: | ||
# TPU instances cannot be shared | ||
continue | ||
if blocks == "auto": | ||
divisor = len(gpus) | ||
else: | ||
divisor = blocks | ||
if len(gpus) % divisor or resources.cpus % divisor: | ||
# gpus or cpus are not divisible by blocks | ||
continue | ||
shareable_offers.append((backend, offer)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code can be nicely refactored into a separate func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored in 8b555f7
shared_offer.availability = InstanceAvailability.BUSY | ||
if shared_offer.availability == InstanceAvailability.IDLE or not idle_only: | ||
instances_with_offers.append((instance, shared_offer)) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we show only the smallest matching offer? Is it intended. Asking because the doc showed all possible offers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, IMO there is no reason to show all virual offers as they won't be picked anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with cloud fleets. Everything works as expected.
I'd reconsider how shared_info
is stored/updated (mainly to make the code more comprehendible) and if InstanceSharedOffer/InstanceOfferWithAvailability should be differentiated in the API. Otherwise, looks good.
|
||
def downgrade() -> None: | ||
with op.batch_alter_table("instances", schema=None) as batch_op: | ||
batch_op.add_column(sa.Column("job_id", sa.UUID(), autoincrement=False, nullable=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sa.UUID() fails with sqlite
change to:
batch_op.add_column(sa.Column("job_id", sqlalchemy_utils.types.uuid.UUIDType(binary=False), nullable=True))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0767ad7
to
a6dcde2
Compare
Suitable for single-node setups
blocks
property to cloud and SSH fleetPart-of: #1780