
EOX: Headless execution for UC3 #70

Open
BachirNILU opened this issue Jul 15, 2024 · 26 comments
Labels: help wanted (Extra attention is needed)
@BachirNILU

BachirNILU commented Jul 15, 2024

Hi,

In UC3, we want to test the headless execution.
@Schpidi provided a comprehensive step-by-step guide on running a notebook headlessly in the following issue: How to?: Headless execution.
I understand there are two options for UC3:

  1. Use the UC2 headless server link; in that case, we need the credentials to access it.
  2. Set up a new headless server for UC3.

Can you help us with this?

Thanks in advance.

Best regards,

-Bachir.

@eox-cs1

eox-cs1 commented Aug 20, 2024

Headless access for UC3 has now been set up and is ready for use:

curl -X POST -v https://headless-fairicubeuc3.hub.eox.at/processes/execute-notebook/jobs \
    -u USERNAME:PASSWORD \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
  "notebook": "s3/NOTEBOOK_IN_BUCKET.ipynb",
  "cpu_requests": "1",
  "cpu_limit": "1",
  "mem_requests": "4G",
  "mem_limit": "4G",
  "node_purpose": "user",
  "kernel": "conda-env-eurodatacube8-torch-py",
  "parameters_json": {"a": "b", "c": 123}
}}'

Instructions are the same as for UC2, see: https://github.com/FAIRiCUBE/flux-config/issues/1#issuecomment-1689488002
Username and password are provided via personal communication.
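For anyone scripting against the endpoint, the same request can be sent from Python using only the standard library. This is a sketch of the curl call above; the notebook path and the credentials are placeholders, exactly as in the curl example:

```python
import base64
import json
import urllib.request

# Same job definition as the curl call above; NOTEBOOK_IN_BUCKET.ipynb is a
# placeholder to be replaced with a real notebook path in the bucket.
payload = {
    "inputs": {
        "notebook": "s3/NOTEBOOK_IN_BUCKET.ipynb",
        "cpu_requests": "1",
        "cpu_limit": "1",
        "mem_requests": "4G",
        "mem_limit": "4G",
        "node_purpose": "user",
        "kernel": "conda-env-eurodatacube8-torch-py",
        "parameters_json": {"a": "b", "c": 123},
    }
}

def submit_job(username: str, password: str) -> bytes:
    """POST the job to the UC3 headless endpoint with basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(
        "https://headless-fairicubeuc3.hub.eox.at/processes/execute-notebook/jobs",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()
```

Building the body with `json.dumps` from a Python dict also avoids hand-writing JSON inside a quoted shell string, which becomes relevant further down in this thread.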

@BachirNILU

Thanks @eox-cs1!
I have tested it and it works, thanks!
I only have an issue regarding the kernel. I understand "conda-env-eurodatacube8-torch-py" is the only available kernel for now, right? A library that we use is missing; how can I install the library into this kernel?

Thanks in advance.

@eox-cs1

eox-cs1 commented Aug 27, 2024

Yes, "conda-env-eurodatacube8-torch-py" is the only kernel for your headless usage.
What libraries are you missing?

@BachirNILU

BachirNILU commented Aug 28, 2024

@eox-cs1 it is "codecarbon".
I was able to run our code, but I need to install the same libraries for each run (by dedicating the first notebook cell to installing the missing libraries).

@BachirNILU

Before closing this issue, I came across a minor problem.
In our code, we use the S3 common bucket environment variables to read/write.
Headless running of the code did not give us access to these variables.
I had to copy/paste the bucket name, access key, and secret into the script (which is obviously not safe).
Is there a way to fix this? (Maybe by passing the environment variables through parameters directly in the curl command?)
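Until the variables are exposed server-side, one workaround along these lines avoids hard-coding the secrets in the notebook: read them from the environment when available, and fall back to values passed via `parameters_json`. This is a hypothetical sketch; the variable and parameter names are assumptions, not the hub's actual names:

```python
import os

# Hypothetical helper: prefer S3 credentials from the environment (as in a
# JupyterLab session) and fall back to values injected through the
# "parameters_json" field of the headless request. All names are assumptions.
def s3_credentials(params: dict) -> dict:
    return {
        "bucket": os.environ.get("S3_BUCKET", params.get("s3_bucket")),
        "key": os.environ.get("AWS_ACCESS_KEY_ID", params.get("aws_access_key_id")),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY", params.get("aws_secret_access_key")),
    }
```

Note that passing secrets through `parameters_json` still exposes them in the request body, so having the environment variables available in the headless pod remains the preferable route.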

@eox-cs1

eox-cs1 commented Sep 2, 2024

Hey @BachirNILU
we have now upgraded the kernels for headless (conda-env-eurodatacube8-torch-py) and JupyterLab (fairicubeuc3-torch):

  • adding codecarbon to both

  • the availability of the environment variables in headless execution should now be fixed as well

@BachirNILU

I have tested it and it works.
I close the issue here.
Thanks for your help.

@BachirNILU

Hi @eox-cs1

I am reopening this issue to request headless execution for the remaining UCs (UC1, UC4 and UC5), with priority on UC4, where we need GPUs.

Thanks in advance,

Best regards,

-Bachir.

@BachirNILU BachirNILU reopened this Sep 23, 2024
@eox-cs1

eox-cs1 commented Sep 23, 2024

@BachirNILU should they all execute only on GPU, or does only UC4 require GPU?

@BachirNILU

@eox-cs1 it is only UC4 for now.

@eox-cs1

eox-cs1 commented Oct 4, 2024

The conda kernel from eurodatacube8 (called torch) has been replicated to eurodatacube17 and eurodatacube18. This was necessary to avoid conflicts between running jobs.

These are now the new access URLs for the UCs:
https://headless-fairicubeuc2.hub.eox.at/: using kernel -> eurodatacube8/torch
https://headless-fairicubeuc3.hub.eox.at/: using kernel -> eurodatacube17/torch
https://headless-fairicubeuc4.hub.eox.at/: using kernel -> eurodatacube18/torch

The credentials are available at: https://nilu365.sharepoint.com/sites/Horizon2021_CUBE/_layouts/15/Doc.aspx?sourcedoc={235313bb-424e-4a1e-b1d6-92296d28fbfc}&action=edit&wd=target%28technical%20library.one%7C18ca003a-ff29-4de7-925e-1f11804605c2%2FEOxHub%20headless%20execution%7C6d55dc09-f667-4dac-99d1-6d67687afc59%2F%29&wdorigin=703

@BachirNILU

Thanks @eox-cs1
I have tested headless execution for UC4, but I am still getting an error (500 Internal Server Error).
You can find the example I am using at the end of: https://nilu365.sharepoint.com/sites/Horizon2021_CUBE/_layouts/15/Doc.aspx?sourcedoc={235313bb-424e-4a1e-b1d6-92296d28fbfc}&action=edit&wd=target%28technical%20library.one%7C18ca003a-ff29-4de7-925e-1f11804605c2%2FEOxHub%20headless%20execution%7C6d55dc09-f667-4dac-99d1-6d67687afc59%2F%29&wdorigin=703

Please let me know if I am doing something wrong.

@eox-cs1

eox-cs1 commented Oct 5, 2024

I know that you followed the example provided above; try it with the request below.
It's still not working, but the error looks more like a script issue than a server error (though I'm not sure):
{ "code":"InvalidParameterValue", "description":"invalid request data"}

Reformatted Request:

curl -X POST -v https://headless-fairicubeuc4.hub.eox.at/processes/execute-notebook/jobs \
    -u <username:password> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": [
      {"id": "notebook", "value": "s3/scripts/Roof_height_ML.ipynb"},
      {"id": "cpu_requests", "value": "1"},
      {"id": "cpu_limit", "value": "1"},
      {"id": "mem_requests", "value": "4G},
      {"id": "mem_limit", "value": "4G"},
      {"id": "node_purpose", "value": "userg1"},
      {"id": "kernel", "value": "conda-env-eurodatacube18-torch-py"}
    }}'

see also https://github.com/FAIRiCUBE/flux-config/issues/1#issuecomment-1689488002

@BachirNILU

Thanks @eox-cs1; yes, I now get a similar error.
Note that the script/request I shared worked for UC3 a few weeks ago.

Best,

-Bachir.

@eox-cs1

eox-cs1 commented Oct 7, 2024

  • Sysadmins restarted pygeoapi now and triggered again -> can see pygeoapi-job-ee64cd3e-847b-11ef-9e15-6e556aa22337-5qr6h job pending now as it is starting up the GPU node -> to be checked later....

pygeoapi-job-ee64cd3e-847b-11ef-9e15-6e556aa22337 in eurodatacube18 (UC4) indicates "succeeded", please check

@BachirNILU

Thanks!
I am still having the error above.
In summary:
1- The following runs successfully under UC3:

curl -X POST -v https://headless-fairicubeuc3.hub.eox.at/processes/execute-notebook/jobs \
    -u user:psw \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/Slicing using Headless Execution/Slicing_Headless.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "node_purpose": "userg1",
      "kernel": "conda-env-eurodatacube8-torch-py"
    }}'

2- The following under UC4 has error 500 Internal Server Error:

curl -X POST -v https://headless-fairicubeuc4.hub.eox.at/processes/execute-notebook/jobs \
     -u user:psw \
      --header 'Content-Type: application/json' \
      --data-raw '{"inputs": {
       "notebook": "s3/scripts/Roof_height_ML.ipynb",
       "cpu_requests": "2",
       "cpu_limit": "2",
       "mem_requests": "8G",
       "mem_limit": "8G",
       "node_purpose": "userg1",
       "kernel": "conda-env-eurodatacube18-torch-py"
     }}'

3- The following under UC4 has error "code":"InvalidParameterValue", "description":"invalid request data":

curl -X POST -v https://headless-fairicubeuc4.hub.eox.at/processes/execute-notebook/jobs \
    -u user:psw \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": [
      {"id": "notebook", "value": "s3/scripts/Roof_height_ML.ipynb"},
      {"id": "cpu_requests", "value": "1"},
      {"id": "cpu_limit", "value": "1"},
      {"id": "mem_requests", "value": "4G},
      {"id": "mem_limit", "value": "4G"},
      {"id": "node_purpose", "value": "userg1"},
      {"id": "kernel", "value": "conda-env-eurodatacube18-torch-py"}
    }}'
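A likely culprit for the "invalid request data" in request 3 is the payload itself: the mem_requests value is missing its closing quote ("4G}), and the inputs array opened with [ is closed with }}. Running the body through Python's json module before sending catches exactly this kind of error, and matches the "Invalid control character" message that pygeoapi logs later in this thread (payload abbreviated here):

```python
import json

# Abbreviated body of request 3 above, reproduced verbatim: the missing
# closing quote after "4G" makes the following newline part of the string,
# which json rejects as a control character inside a string.
body = '''{"inputs": [
  {"id": "notebook", "value": "s3/scripts/Roof_height_ML.ipynb"},
  {"id": "mem_requests", "value": "4G},
  {"id": "mem_limit", "value": "4G"}
]}'''

try:
    json.loads(body)
    print("payload is valid JSON")
except json.JSONDecodeError as err:
    print(f"payload is broken: {err}")
```

Validating the body locally (or building it with `json.dumps` from a dict) is a cheap check before blaming the server.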

@eox-cs1

eox-cs1 commented Oct 17, 2024

The following updates have been performed on the FAIRiCUBE Hub:

These are the new HEADLESS KERNELS for the UCs (now providing torch, openjdk and a new cdsapi) and the NAMESPACES:

eurodatacube8 -> headless-fairicubeuc2.hub.eox.at using bucket s3://hub-fairicubeuc2
eurodatacube17 -> headless-fairicubeuc3.hub.eox.at using bucket s3://hub-fairicubeuc3
eurodatacube18 -> headless-fairicubeuc4.hub.eox.at using bucket s3://hub-fairicubeuc4
eurodatacube19 -> headless-fairicubeuc1.hub.eox.at using bucket s3://hub-fairicube0 <-- legacy
eurodatacube20 -> headless-fairicubeuc5.hub.eox.at using bucket s3://hub-fairicubeuc5

The corresponding JUPYTERLAB KERNELS (now providing torch, openjdk, and a new cdsapi) are:

fairicubeuc1/torch_openjdk
fairicubeuc2/torch
fairicubeuc3/torch
fairicubeuc3/torch_openjdk
fairicubeuc5/torch_openjdk

All 5 endpoints have a unique basic-auth configured --> TEAMS (https://nilu365.sharepoint.com/sites/Horizon2021_CUBE/_layouts/15/Doc.aspx?sourcedoc={235313bb-424e-4a1e-b1d6-92296d28fbfc}&action=edit&wd=target%28technical%20library.one%7C18ca003a-ff29-4de7-925e-1f11804605c2%2FEOxHub%20headless%20execution%7C6d55dc09-f667-4dac-99d1-6d67687afc59%2F%29&wdorigin=703)

All the headless endpoints can be started with either:

  • node_purpose "user" (for regular CPU) or
  • node_purpose "userg1" (for single GPU).

If node_purpose is not passed, then "userg1" (GPU) is the default!

In addition, the smallest multi-GPU VM available in eu-central-1, g4dn.12xlarge (4 x NVIDIA T4, 16 GiB), at $4.89 per hour, has been configured for "userg2".
Only UC3 (eurodatacube17) and UC4 (eurodatacube18) are whitelisted for using "userg2"!

The following calls for headless execution have been tested and work.
Please also note the syntax change in the URL (jobs -> execution):

#--------------
UC1 - with CPU:
#--------------
curl -X POST -v https://headless-fairicubeuc1.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 111},
      "kernel": "conda-env-eurodatacube19-torch-py",
      "node_purpose": "user"
    }
}'

#--------------
UC2 - with CPU:
#--------------
curl -X POST -v https://headless-fairicubeuc2.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/common-code/pytorch-verification/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 222},
      "kernel": "conda-env-eurodatacube8-torch-py",
      "node_purpose": "user"
    }
}'

#--------------
UC2 - with GPU:
#--------------
curl -X POST -v https://headless-fairicubeuc2.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/common-code/pytorch-verification/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 222},
      "kernel": "conda-env-eurodatacube8-torch-py",
      "node_purpose": "userg1"
    }
}'

#--------------
UC3 - with CPU:
#--------------
curl -X POST -v https://headless-fairicubeuc3.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/Test Headless Execution/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 333},
      "kernel": "conda-env-eurodatacube17-torch-py",
      "node_purpose": "user"
    }
}'

#--------------
UC4 - with CPU:
#--------------
curl -X POST -v https://headless-fairicubeuc4.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 444},
      "kernel": "conda-env-eurodatacube18-torch-py",
      "node_purpose": "user"
    }
}'

#--------------
UC5 - with CPU:
#--------------
curl -X POST -v https://headless-fairicubeuc5.hub.eox.at/processes/execute-notebook/execution \
    -u <USERNAME>:<PASSWORD> \
     --header 'Content-Type: application/json' \
     --data-raw '{"inputs": {
      "notebook": "s3/pytorch_verification.ipynb",
      "cpu_requests": "1",
      "cpu_limit": "1",
      "mem_requests": "4G",
      "mem_limit": "4G",
      "parameters_json": {"a": "b", "c": 555},
      "kernel": "conda-env-eurodatacube20-torch-py",
      "node_purpose": "user"
    }
}'
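After one of these requests is accepted, the job's progress can be polled. Assuming the endpoints follow OGC API - Processes as implemented by pygeoapi (an assumption; the exact job-status path is not confirmed in this thread), a minimal polling sketch looks like this:

```python
import base64
import json
import urllib.request

# Base URL for one of the headless endpoints listed above.
BASE = "https://headless-fairicubeuc4.hub.eox.at"

def job_status(job_id: str, username: str, password: str) -> dict:
    """Fetch a job's status document, assuming the standard
    OGC API - Processes /jobs/<jobID> endpoint that pygeoapi exposes
    (an assumption here, not verified against this deployment)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{BASE}/jobs/{job_id}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

The pod names quoted later in the thread (pygeoapi-job-...) suggest each submission maps to one such job, so polling by job ID should be enough to see "pending", "running", or "succeeded" without asking the sysadmins to check the cluster.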

@BachirNILU

Thank you for the update @eox-cs1
I have tested it for UC4 and got the following error:

<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
* Connection #0 to host headless-fairicubeuc4.hub.eox.at left intact

@eox-cs1

eox-cs1 commented Oct 21, 2024

@BachirNILU
Logs showed the following:
OSError: [Errno 107] Transport endpoint is not connected: '/home/jovyan/s3/headless/2024-10-21/20241021-101340-349402-Roof_height_ML'
[2024-10-21T10:14:36Z] {/pygeoapi/pygeoapi/api.py:3339} ERROR - Invalid control character at: line 5 column 44 (char 211)
[2024-10-21T10:14:36Z] {/pygeoapi/pygeoapi/api.py:3854} ERROR - invalid request data

pod was restarted
Test request was successful afterwards.

@BachirNILU

@eox-cs1 I am not sure why, but I am getting <title>500 Internal Server Error</title> now!

@eox-cs1

eox-cs1 commented Oct 23, 2024

Sorry, @BachirNILU, there is a problem with the mounting of the S3 bucket: it sometimes fails, and then the pod needs a restart. This is currently done manually, since we haven't quite figured out yet how to check for the failure.
It works again - I tested it myself.

@Schpidi

Schpidi commented Jan 29, 2025

@BachirNILU I took a look into the issue and saw that there are over 44,000 objects in the folder s3/data/Vienna/Image_Name\=Height/. Judging from the name, the whole folder seems to be there by mistake. Could you please check whether deleting this folder helps?

Note that it is not recommended to have a large number of objects in a bucket mounted via s3fs (see also the FAIRiCUBE documentation, https://fairicube.readthedocs.io/en/latest/user_guide/storage/#object-storage: "Though possible, it is not recommended to add this directory permanently (eg. via symbolic linking) to the JupyterLab session, since this could slow down the session's performance considerably if a high number of files is stored on the bucket."). Let me know if you want us to remove the automatic mounting of the shared bucket for your use case, or if we should mount it under a hidden directory, i.e. one starting with a ".".

@Schpidi

Schpidi commented Jan 29, 2025

Also, I just successfully executed some test jobs, both in UC3 and UC4. On our end this was fixed on October 30 last year. Please let us know if you still experience issues, in which case we will have to investigate again; otherwise we will close this issue. Thanks.

@BachirNILU

@Schpidi, thank you for your answer.
I think there was a miscommunication!
The last comment was on 23/10; we received an email on 24/10 requesting us to stop using the headless execution until the issue was resolved.
I did not test it (because I was asked not to) and did not receive any update until I contacted you.
Anyway, under UC4 we created that folder to have a clear overview of the images we use for training a computer vision model; you can remove it, and sorry for the inconvenience. There is also a folder (s3/data/Vienna/modelling_results/masked_orthophoto) containing multiple images, but I need it to train the image regression model using a GPU (it can be removed once the execution is successful). I have tested the headless execution and it is working! Thanks!
However, I am not sure whether I can install libraries in the kernel from my side. I ran 'Roof_height_ML.ipynb' and noticed that the library 'tensorflow' is not installed. I added a cell to the notebook to install it, but it looks like I do not have permission.
Thank you again for your help.

@Schpidi

Schpidi commented Jan 31, 2025

@BachirNILU sorry again for the miscommunication 😞

As we fixed the underlying issue, there is no immediate need to remove the files. Just note that with a higher number of objects the system simply gets slower; please let us know if you experience issues again, in which case we might need to increase the resources again.

In headless mode it is really not advised to install libraries, because of side effects. Required libraries should always be installed in the provided kernel. I have now installed tensorflow in the torch kernel for UC3, i.e. eurodatacube17, as explained by Christian above. Please let me know if that works for you and if you need other libraries, or the kernels for other UCs updated as well.

@BachirNILU

Thank you again for solving the issue. However, I got a failure when running:

#!/bin/bash

curl -X POST -v https://headless-fairicubeuc4.hub.eox.at/processes/execute-notebook/execution \
     -u USER:PASSWORD \
      --header 'Content-Type: application/json' \
      --data-raw '{"inputs": {
       "notebook": "s3/scripts/Roof_height_ML.ipynb",
       "cpu_requests": "1",
       "cpu_limit": "1",
       "mem_requests": "4G",
       "mem_limit": "4G",
       "kernel": "conda-env-eurodatacube17-torch-py",
       "node_purpose": "userg1"
     }}'

Please let me know what I am doing wrong.
It starts running but then it stops.
I am guessing this is an out-of-memory issue.
Do you know how I can check this?
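One way to get a first indication from inside the notebook itself is to log the kernel process's peak memory with the standard library and compare it against the job's mem_limit (4G in the request above). A minimal sketch; note that on Linux ru_maxrss is reported in KiB:

```python
import resource

def peak_memory_mib() -> float:
    """Peak resident set size of the current process, in MiB.
    On Linux, ru_maxrss is in KiB; macOS reports bytes instead."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Called from the last cell (or right after the heavy training step),
# this shows how close the run gets to the job's mem_limit:
print(f"peak memory: {peak_memory_mib():.1f} MiB")
```

If the kernel dies before the final cell ever runs, that by itself is consistent with the pod hitting its memory limit; on the cluster side the admins could confirm an OOM kill from the pod status, which only they can check.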
