Auto-download files from the staging directory to output #500

andrii-i · 2024-03-14T21:27:09Z

Adds auto-download functionality.

For packaging the input folder, this PR was superseded by #510.

bhadrip · 2024-03-20T21:09:50Z

Can we support the following use case:

jupyter-root-dir
  dep-dir
     init.py
  project1-dir
     analysis.ipynb (depends on dep-dir/init.py)
  project2-dir
     report-generation.ipynb (depends on dep-dir/init.py)

2 separate notebook jobs could depend on the same python file:
Job1 - analysis.ipynb
Job2 - report-generation.ipynb

With the current approach the customer has to copy the dep-dir to every project folder.

JasonWeill · 2024-04-22T17:31:01Z

Upon first view, the term "Package input folder" doesn't mean anything to me. I don't know why, as a user, I would want to do that, or whether I would need to. This change also doesn't modify the documentation at all.

I would recommend:

Modifying the documentation to describe what "package input folder" does
Adding some kind of "info" ℹ️ button or expander near the new checkbox to explain what the option does, briefly

src/components/files-directory-link.tsx

src/mainviews/detail-view/job-definition.tsx

src/mainviews/detail-view/job-detail.tsx

andrii-i · 2024-04-23T08:01:43Z

Kicking CI

dlqqq

Thanks Andrii! I left a few comments regarding the auto-download capability being introduced in this branch, but didn't complete a full code review.

My main concern with this PR is that 1) the new default auto-download capability being proposed here is a potential breaking change, and 2) the changes to the download logic are not required to implement the 'include parent dir' capability, which was the original intent of this PR.

Let's explore the possibility of opening another PR with just the 'include parent dir' capability so that it can be reviewed, merged, and released separately.

dlqqq · 2024-04-23T20:33:14Z

jupyter_scheduler/download_manager.py

+            session.commit()
+
+
+class DownloadManager:


Can we merge this class with DownloadRecordManager above? I don't see the benefit of splitting the logic here into two separate classes if they are only used together anyways.

dlqqq · 2024-04-23T21:06:27Z

jupyter_scheduler/extension.py

+        # Forces new processes to not be forked on Linux.
+        # This is necessary because `asyncio.get_event_loop()` is bugged in
+        # forked processes in Python versions below 3.12. This method is
+        # called by `jupyter_core` by `nbconvert` in the default executor.
+
+        # See: https://github.com/python/cpython/issues/66285
+        # See also: https://github.com/jupyter/jupyter_core/pull/362
+        multiprocessing.set_start_method("spawn", force=True)
+


The problem with this is that this line affects the multiprocessing behavior globally for everything running on this main thread, i.e. the server and all server extensions running on a JupyterLab instance. We should really avoid doing this just to pass our GitHub workflows. Consider what happens if:

Another extension is also calling multiprocessing.set_start_method(..., force=True), or

Some part of the server / server extension breaks when the start method is changed during its lifetime by our extension's initialize_settings() method.

I don't have a solution for how this bug can be fixed. However, the error message is pretty specific about why an exception is being raised, so my intuition is that this bug can be fixed. I'm leaving some references here for us to review in the future.

My write-up on the first encounter of this bug: Fix "event loop is already running" bug on Linux #450

The exception being raised in the E2E test workflow: https://github.com/jupyter-server/jupyter-scheduler/actions/runs/8730209153/job/23953600827#step:9:161
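
One direction worth checking, offered as a sketch and not as a verified fix for this branch: `multiprocessing.get_context("spawn")` returns a context object whose `Process` and `Queue` use the spawn start method without changing the process-wide default, so other server extensions would be unaffected. The helper names below (`SPAWN_CTX`, `make_download_process`, `make_queue`) are hypothetical and do not come from the PR.

```python
import multiprocessing

# A context object carries its own start method. Creating one does not
# change the process-wide default that the server and other extensions use.
SPAWN_CTX = multiprocessing.get_context("spawn")


def make_download_process(target, *args):
    """Build (but do not start) a process that will be spawned, not forked."""
    return SPAWN_CTX.Process(target=target, args=args)


def make_queue():
    """Create a queue from the same context as the processes that share it."""
    return SPAWN_CTX.Queue()
```

Creating every multiprocessing object from one context is also what the SemLock error message in the E2E logs asks for, but whether this resolves the failure here is untested.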

dlqqq · 2024-04-23T21:14:36Z

jupyter_scheduler/download_runner.py

+    async def process_download_queue(self):
+        while not self.download_manager.queue.empty():
+            download = self.download_manager.queue.get()
+            download_record = self.download_manager.record_manager.get(download.download_id)
+            if not download_record:
+                continue
+            await self.job_files_manager.copy_from_staging(download.job_id, download.redownload)
+            self.download_manager.delete_download(download.download_id)


I think we can avoid using multiprocessing.Queue if we are already writing pending downloads to a DB. Can we read directly from self.download_manager.record_manager instead?

I believe that this may fix the process bug previously raised on the E2E tests in this branch. This is the corresponding error message:

RuntimeError: A SemLock created in a fork context is being shared with a process in a spawn context. This is not supported. Please use the same context to create multiprocessing objects and Process.

If we remove the need for multiprocessing objects, we may be able to fix this bug without relying on multiprocessing.set_start_method(). Can you give this a try?
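
A minimal sketch of what reading pending downloads straight from the database could look like. It uses an in-memory `sqlite3` table in place of the PR's record manager, so the table name, column names, and the functions `create_store`, `add_download`, and `process_pending_downloads` are all assumptions made for illustration; only the `download_id`, `job_id`, and `redownload` fields mirror the diff above.

```python
import asyncio
import sqlite3


def create_store():
    """Hypothetical stand-in for the download record table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE downloads ("
        "download_id TEXT PRIMARY KEY, "
        "job_id TEXT NOT NULL, "
        "redownload INTEGER NOT NULL)"
    )
    return conn


def add_download(conn, download_id, job_id, redownload=False):
    """Record a pending download; the row itself acts as the queue entry."""
    conn.execute(
        "INSERT INTO downloads VALUES (?, ?, ?)",
        (download_id, job_id, int(redownload)),
    )
    conn.commit()


async def process_pending_downloads(conn, copy_from_staging):
    """Process every pending record in insertion order, then delete it."""
    rows = conn.execute(
        "SELECT download_id, job_id, redownload FROM downloads ORDER BY rowid"
    ).fetchall()
    for download_id, job_id, redownload in rows:
        await copy_from_staging(job_id, bool(redownload))
        conn.execute("DELETE FROM downloads WHERE download_id = ?", (download_id,))
        conn.commit()
    return len(rows)
```

Because the pending work lives only in the table, no `multiprocessing.Queue` (and therefore no SemLock) has to cross a process boundary in this sketch.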

sravyasdh · 2024-04-23T22:29:16Z

Can we support the following use case:
jupyter-root-dir
  dep-dir
     init.py
  project1-dir
     analysis.ipynb (depends on dep-dir/init.py)
  project2-dir
     report-generation.ipynb (depends on dep-dir/init.py)
2 separate notebook jobs could depend on the same python file:
Job1 - analysis.ipynb
Job2 - report-generation.ipynb

With the current approach the customer has to copy the dep-dir to every project folder.

@andrii-i can you confirm whether this use case can be supported? Can we support changing the input folder path in the UI? The rest of this LGTM.

andrii-i · 2024-04-23T22:51:32Z

@sravyasdh thank you for looking into this PR. The initial version will only support packaging the input file's folder via a checkbox control, so the use case in question will not be supported. The reason is that it's not clear whether users would prefer a more complex UI/UX that allows packaging the folder one level above, packaging an arbitrary folder, or entering a folder path in the UI. So it makes sense to start with a simpler implementation, let users try it, collect their feedback, and update the implementation based on that feedback if needed.

andrii-i · 2024-04-25T04:37:42Z

For packaging the input folder, this PR is superseded by #510.
For auto-download, either this PR will be reopened or a new PR will be created based on the relevant commits.

andrii-i added the enhancement New feature or request label Mar 14, 2024

andrii-i force-pushed the package-input-folder branch from c75a717 to eb8e18f Compare March 14, 2024 21:34

andrii-i force-pushed the package-input-folder branch from a28c881 to 383a37e Compare March 27, 2024 21:16

andrii-i force-pushed the package-input-folder branch 4 times, most recently from cfc4886 to 7bbed76 Compare April 10, 2024 18:03

andrii-i force-pushed the package-input-folder branch 4 times, most recently from bfb68dd to 92b9088 Compare April 19, 2024 00:11

JasonWeill reviewed Apr 22, 2024

src/components/files-directory-link.tsx (outdated)

JasonWeill reviewed Apr 22, 2024

src/mainviews/detail-view/job-definition.tsx (outdated)

src/mainviews/detail-view/job-detail.tsx (outdated)

andrii-i force-pushed the package-input-folder branch from e868446 to ee0fdb8 Compare April 23, 2024 05:27

andrii-i closed this Apr 23, 2024

andrii-i reopened this Apr 23, 2024

andrii-i closed this Apr 23, 2024

andrii-i reopened this Apr 23, 2024

dlqqq requested changes Apr 23, 2024

andrii-i mentioned this pull request Apr 24, 2024

Input files packaging #510

Merged

andrii-i marked this pull request as draft April 24, 2024 19:01

andrii-i changed the title ~~Add an option to package input folder~~ Add an option to package input folder, auto-download files from the staging directory to output Apr 24, 2024

andrii-i changed the title ~~Add an option to package input folder, auto-download files from the staging directory to output~~ Add an option to package input folder; auto-download files from the staging directory to output Apr 24, 2024

andrii-i closed this Apr 25, 2024

andrii-i reopened this May 8, 2024

andrii-i added 14 commits May 8, 2024 17:49

use download queue

25e17a5

rename Download tables, use mp.quque directly without a wrapper

3fc2cc0

Remove DownloadTask data class, use DescribeDownload for both queue a…

7abdf27

…nd db records

propagate redownload, use download_id in get download

56f095d

def initiate_download_standalonefunction and use it in DownloadManage…

ba44f7c

…r and ExecutionManager

add traitlets-configurable downloads_poll_interval

381dcae

refactor output files creation logic into a separate function for cla…

a723337

…rity

remove download records associated with job on job delete

92a4574

clarify comments

bc4bc88

fix existing pytests

551561e

Force Process spawn, not fork in in JobFilesManager

1cb9c84

force multiprocessing start method to be "spawn" at the start of exte…

4c2b558

…nsion,

test that Process is called only once

a6b75ed

fix pytest tests

9d41103

andrii-i force-pushed the package-input-folder branch from 166ed05 to 9d41103 Compare May 9, 2024 22:04

[pre-commit.ci] auto fixes from pre-commit.com hooks

88e7b0b

for more information, see https://pre-commit.ci

andrii-i changed the title ~~Add an option to package input folder; auto-download files from the staging directory to output~~ Auto-download files from the staging directory to output May 9, 2024
