sim4life.io - WP4: Computational backend #950

Open
81 of 90 tasks
Tracked by #878
mguidon opened this issue May 9, 2023 · 12 comments
Assignees
Labels
PO issue Created by Product owners s4l:web sim4life product in osparc.io

Comments

@mguidon
Member

mguidon commented May 9, 2023

Description

With the latest version of sim4life.io, we are introducing an improved computational backend that provides reliable and efficient job scheduling. Moving forward, all solver jobs will be scheduled via this backend, enabling users to choose the hardware on which their jobs run and to inspect and operate on the job queue (subject to sufficient permissions).

This robust backend will be capable of handling hundreds of concurrent jobs, so that even the busiest periods do not disrupt the service.

Furthermore, the backend functionality will also be made available through the API, allowing for integration with external systems (e.g. the sim4life desktop application) and further expanding the possibilities for users.

Tasks

  1. a:director-v2 a:osparc-gateway-server
    sanderegg
  2. sanderegg
  3. a:dask-service
    sanderegg
  4. PO issue s4l:web
    sanderegg
  5. 2 of 3
    Feedback type:bug
    sanderegg
  6. 1 of 2
    bisgaard-itis
  7. a:dask-service t:enhancement t:maintenance
    sanderegg
  8. a:clusters-keeper a:dask-service a:director-v2
    sanderegg
  9. 0 of 3
    a:director-v2
    GitHK sanderegg
  10. a:clusters-keeper a:director-v2 computational clusters
    sanderegg
  11. 1 of 3
    a:autoscaling a:frontend a:webserver computational clusters
    sanderegg
  12. sanderegg
  13. a:autoscaling a:dask-service computational clusters t:enhancement
    sanderegg
  14. a:apiserver a:dask-service a:director-v2
    sanderegg
  15. a:clusters-keeper t:enhancement
    sanderegg
  16. a:autoscaling t:enhancement
    sanderegg
  17. a:apiserver a:dask-service a:director-v2
    mguidon
  18. a:dask-service bug
    sanderegg

Enchanted Odyssey

  1. 7 of 7
    a:clusters-keeper a:infra+ops
    mrnicegyu11 sanderegg

Schoggilebe

  1. 4 of 4
    a:autoscaling a:clusters-keeper
    YuryHrytsuk sanderegg
  2. a:autoscaling bug
    sanderegg
  3. a:autoscaling a:infra+ops
    sanderegg
  4. a:autoscaling a:director-v2
    sanderegg
  5. 2 of 3
    High Priority
    sanderegg
  6. Feedback type:bug
    sanderegg
  7. a:autoscaling a:clusters-keeper
    sanderegg

This is Sparta!

  1. a:dask-service
    sanderegg
  2. a:dask-service
    sanderegg
  3. type:bug
    sanderegg
  4. type:bug
    sanderegg
  5. a:clusters-keeper a:infra+ops
    sanderegg
  6. a:autoscaling
    sanderegg
  7. bug
    pcrespov sanderegg
  8. a:dask-service e2e
    sanderegg
  9. a:dask-service
    sanderegg
  10. bug
    matusdrobuliak66 sanderegg
  11. a:director-v2
    sanderegg
  12. a:clusters-keeper
    sanderegg

Kobayashi Maru

  1. a:autoscaling a:clusters-keeper
    sanderegg
  2. Feedback type:bug
    bisgaard-itis sanderegg
  3. a:autoscaling bug
    sanderegg
  4. a:frontend t:enhancement
    ignapas
  5. a:director-v2
    sanderegg
  6. a:autoscaling
    sanderegg
  7. 3 of 3
    a:autoscaling a:clusters-keeper
    sanderegg
  8. a:storage
    sanderegg
  9. sanderegg
  10. a:clusters-keeper
    sanderegg
  11. a:autoscaling a:clusters-keeper
    sanderegg
  12. a:autoscaling a:clusters-keeper t:maintenance
    sanderegg
  13. a:infra+ops t:maintenance
    sanderegg
  14. a:clusters-keeper t:maintenance
    sanderegg
  15. a:director-v2 t:maintenance
    sanderegg
  16. a:infra+ops t:maintenance
    sanderegg
  17. a:autoscaling t:maintenance
    sanderegg
  18. a:director-v2
    sanderegg
  19. a:infra+ops t:maintenance
    sanderegg
  20. a:clusters-keeper a:director-v2
    sanderegg
  21. sanderegg

7Peaks

  1. 2 of 2
    a:dask-service a:director-v2 a:frontend a:webserver
    mguidon sanderegg
  2. 1 of 1
    a:dask-service a:director-v2 a:dynamic-sidecar a:webserver
    GitHK matusdrobuliak66
    sanderegg
  3. a:autoscaling
    sanderegg
  4. 2 of 2
    PO issue
    bisgaard-itis mguidon
  5. bisgaard-itis
  6. High Priority bug
    sanderegg
  7. a:autoscaling
    sanderegg
  8. a:autoscaling bug
    sanderegg
  9. a:clusters-keeper
    sanderegg
  10. a:director-v2
    sanderegg
  11. a:director-v2
    sanderegg
  12. a:dask-service
    sanderegg
  13. a:director-v2
    sanderegg
  14. a:autoscaling
    sanderegg
  15. a:autoscaling a:aws-library a:clusters-keeper
    sanderegg
  16. a:autoscaling bug
    sanderegg
  17. a:clusters-keeper
    sanderegg
  18. a:clusters-keeper
    sanderegg

Microhistory

  1. GitHK matusdrobuliak66
    sanderegg
  2. a:autoscaling a:clusters-keeper
    sanderegg
  3. a:director-v2
    sanderegg
  4. 1 of 1
    a:dask-service bug
    sanderegg

Quilmes

  1. a:director-v2 bug
    sanderegg
  2. Feedback a:dask-service bug
    sanderegg
  3. Feedback type:enhancement
    sanderegg
  4. a:dask-service bug
    bisgaard-itis sanderegg

Sundae

  1. a:dask-service changelog:🎨enhancement
    sanderegg
  2. a:dask-service bug
    sanderegg
  3. a:director-v2
    matusdrobuliak66 sanderegg

Baklava

  1. a:director-v2 bug
    sanderegg
@mguidon mguidon added PO issue Created by Product owners s4l:web sim4life product in osparc.io labels May 9, 2023
@mguidon mguidon mentioned this issue May 9, 2023
@mguidon mguidon changed the title sim4life.io - Computational backend sim4life.io - WP4: Computational backend May 9, 2023
@pcrespov pcrespov added this to the Pastel de Nata milestone May 12, 2023
@sanderegg
Member

sanderegg commented May 14, 2023

Goal for sprint Pastel de Nata

  • progress on AppTeam Std Simulations, ideally run CF use-case
  • refactoring on computational backend, progress on separating PublicAPI calls from webserver load, return solver progress
  • progress on Public API missing entrypoints, and bug fixes
  • if possible progress on personalized resource limits

@mguidon
Member Author

mguidon commented Jul 6, 2023

Update Watermelon

Done:

Ongoing:

  • Robustness improvements/refactoring

@sanderegg
Member

sanderegg commented Aug 9, 2023

Update Sundae

Done:

  • bugfixes #4153
  • connection of computational backend to resource usage tracking service #4523
  • new clusters keeper service to automatically create computational clusters in AWS #4591

Ongoing:

@sanderegg
Member

sanderegg commented Sep 6, 2023

The diagram below shows the overall architecture of the on-demand clusters.
Some important points:

  • the computational clusters are created per user/wallet
  • in case of maintenance in simcore, these clusters shall be able to continue running independently

Image
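The "one cluster per user/wallet" rule can be illustrated with a minimal sketch. This is not the clusters-keeper implementation: `ClusterRegistry`, `get_or_create` and the generated cluster names are hypothetical, and the in-memory dictionary merely stands in for machines that the real service creates in AWS.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterRegistry:
    """Hypothetical in-memory registry keyed by (user_id, wallet_id)."""

    _clusters: dict[tuple[int, int], str] = field(default_factory=dict)

    def get_or_create(self, user_id: int, wallet_id: int) -> str:
        key = (user_id, wallet_id)
        if key not in self._clusters:
            # Placeholder for "start a primary machine in AWS" (assumption).
            self._clusters[key] = f"cluster-user{user_id}-wallet{wallet_id}"
        return self._clusters[key]


registry = ClusterRegistry()
a = registry.get_or_create(user_id=1, wallet_id=10)
b = registry.get_or_create(user_id=1, wallet_id=10)  # same pair: cluster is reused
c = registry.get_or_create(user_id=1, wallet_id=11)  # other wallet: new cluster
```

Keying on the pair rather than on the user alone means that the same user gets separate clusters for separate wallets.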

@sanderegg sanderegg modified the milestones: Baklava, the nameless Sep 18, 2023
@sanderegg
Member

sanderegg commented Oct 31, 2023

Update Microhistory

Done and working

  • A separate cluster is created on demand in AWS for each user/wallet combination,
  • The cluster starts with a primary machine (t2.micro) on which a stack containing the dask-scheduler, autoscaling, redis and dask-sidecar services is started; dask-sidecar only runs on worker machines,
  • The autoscaling service creates 1 worker machine (g4dn.xlarge),
  • Only computational services that use a pricing unit defined as a g4dn.xlarge machine can run,
  • A computational service uses all the resources provided by the machine (a bit less than 16 GB / 4 CPUs)

--> Running a computational service should work for one service at a time, provided it is set up to use the g4dn.xlarge machine type. Machines are not scaled up yet, so parallel jobs have to wait in line (if multiple isolve jobs are sent, they are executed one after the other).
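The queueing behaviour described above can be pictured with a toy model. It is only a sketch and not how dask schedules tasks: `run_order` is a hypothetical helper, resources are reduced to a CPU count, and the machine size of 4 CPUs is a simplification of the "a bit less than 16 GB / 4 CPUs" figure.

```python
from collections import deque


def run_order(jobs: list[tuple[str, int]], machine_cpus: int = 4) -> list[list[str]]:
    """Group jobs into consecutive rounds that fit on a single machine.

    Each job is (name, cpus_requested). Jobs are taken strictly in FIFO
    order; a new round starts when the next job no longer fits.
    """
    queue = deque(jobs)
    rounds: list[list[str]] = []
    while queue:
        free = machine_cpus
        current: list[str] = []
        while queue and queue[0][1] <= free:
            name, cpus = queue.popleft()
            free -= cpus
            current.append(name)
        if not current:
            raise ValueError(f"job {queue[0][0]!r} does not fit on the machine")
        rounds.append(current)
    return rounds


# Three isolve jobs that each claim the whole machine run one after the other.
print(run_order([("isolve-1", 4), ("isolve-2", 4), ("isolve-3", 4)]))
# [['isolve-1'], ['isolve-2'], ['isolve-3']]
```

Because each job claims the whole single worker, every round holds exactly one job; parallel execution would require either smaller resource requests or additional worker machines.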

should work in 3 weeks

  • Cluster shall create correct machine based on plan (not just g4dn.xlarge), so potentially better machine fit/performance,
  • identify computational child jobs (for example started from s4l) and show them in UI
  • maybe upscaling of separate cluster (needs discussions on how to do it, it has influence on costs, etc)

should not be available in 3 weeks

  • upscaling?
  • optimisations

@sanderegg
Member

sanderegg commented Nov 28, 2023

Update 7peaks

Summary

It is now possible to run computational services on their required AWS instance types. In addition, the logs of child computational jobs now show up in the logs of the parent service (e.g. sim4life/jupyterlab starting a computational job).
Upscaling is still not implemented.

Done

Ongoing

  • bugfixing
  • improvements on user feedback (cluster status, number of machines, etc...)

@sanderegg
Member

sanderegg commented Jan 7, 2024

Update Kobayashi Maru

Summary

  • bugfixes:
    • handling of on-demand computational clusters (timeouts, reported states)
    • concurrent computing of tasks
  • monitoring & manual interventions:
    • CLI tool to monitor on-demand computational clusters and dynamic service machines
    • partially clear jobs in a specific cluster
    • allow tracing of created machines via tags on EC2 instances

Done ✅

  • various fixes for GPU-based computational services on multi-GPU machines
  • migration of sleepers test to Playwright framework to have more reliable and more flexible E2E testing and compatibility with on-demand computational clusters
  • various fixes regarding invalid state reported by the computational clusters
  • added a timeout for clusters that do not respond for more than 10 minutes
  • improvement of response time when retrieving the computational clusters state via Public API
  • new CLI-based monitoring tool to check current state of auto-scaled EC2 instances and their running states
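The 10-minute cluster timeout from the list above amounts to a simple elapsed-time check. The sketch below is illustrative only: `is_unresponsive` and `CLUSTER_TIMEOUT` are hypothetical names, and how the backend actually records the last response from a cluster is not described in this issue.

```python
from datetime import datetime, timedelta, timezone

CLUSTER_TIMEOUT = timedelta(minutes=10)  # value taken from the update above


def is_unresponsive(
    last_response: datetime,
    now: datetime,
    timeout: timedelta = CLUSTER_TIMEOUT,
) -> bool:
    """Return True once the cluster has been silent for longer than `timeout`."""
    return now - last_response > timeout


now = datetime(2024, 1, 7, 12, 0, tzinfo=timezone.utc)
print(is_unresponsive(now - timedelta(minutes=9), now))   # False
print(is_unresponsive(now - timedelta(minutes=11), now))  # True
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check deterministic and easy to test.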

Problematic issues (being worked on) 🚧

Open Features 🚧

@sanderegg sanderegg modified the milestones: Kobayashi Maru, This is Sparta! Jan 11, 2024
@sanderegg
Member

sanderegg commented Jan 30, 2024

@bisgaard-itis bisgaard-itis removed this from the This is Sparta! milestone Feb 19, 2024

9 participants