Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

start: allow users to call job start command to start stopped jobs #24150

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

martisah
Copy link
Contributor

@martisah martisah commented Oct 8, 2024

This PR implements the job start command, which allows for users to call "job start [options] " with a valid job that has been previously stopped and starts the most recent running and stable version up.

Fixes: #23852

@martisah martisah self-assigned this Oct 8, 2024
command/job_start.go Outdated Show resolved Hide resolved
@martisah martisah marked this pull request as ready for review October 24, 2024 16:17
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @martisah. I've left some comments on implementation details but the overall design is solid.

  • Don't forget to run make cl to add a changelog item.
  • Given how tricky the reverse iteration and selection of versions is, it'd probably be a good idea to have a test that covers the more complex selection scenarios like having job that's stopped and started multiple times.
  • It looks like GitHub was "helpful" and hid some of the comments, be sure to expand them as you work thru the review.

command/commands.go Outdated Show resolved Hide resolved
command/job_start_test.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
command/job_start.go Outdated Show resolved Hide resolved
var chosenVersion uint64
versionAvailable := false
for i := range versions {
if !*versions[i].Stop {
Copy link
Contributor Author

@martisah martisah Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of some bug fixing associated to selecting the correct version, I realized that we should actually be selecting the previously non-stopped version. Also I re-evaluated whether we should be specifically looking for a running version, as often the previously non stopped version tends to be pending (in testing). Is it the expected behavior to always revert to only running versions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as often the previously non stopped version tends to be pending (in testing). Is it the expected behavior to always revert to only running versions?

Oh that's a good call. Yeah we shouldn't revert to pending versions because we don't know that they're stable. That'll complicate testing a bit because you'll need to make sure the job can mark itself running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know of a method I can use to ensure the jobs are running in my tests? I've been trying to wait for the evaluation to succeed after running each command, but that still hasn't proved to work in terms of getting them to be marked running in tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job should be marked running as soon as at lease one allocation is placed. So as long as the allocations can start you should be able to poll api/Jobs.Info for the running status. But if the allocations exit then the job will be dead, so you need to make sure they stay running or they could exit before you can poll and you'll get weird test flakes.

The testJob uses the mock driver, so you may want to look at the configuration of the test job to make sure it'll still be up and running.

Comment on lines +27 to +31
Start an existing stopped job. This command is used to start a previously stopped job's
most recent version. Upon successful start, an interactive
monitor session will start to display log lines as the job starts its
allocations based on its most recent version. It is safe to exit the monitor
early using ctrl+c.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: the output of these commands are what get printed literally to the terminal. There's no automatic reflowing in mitchellh/cli. So we should try to manually reflow these to 80 cols.


func (c *JobStartCommand) Help() string {
helpText := `
Usage: nomad job start [options] <job>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we don't have any clue here to the user that they can use more than one job ID, or that if they do they all need to be in the same namespace.

// Find the most recent version for this job that has not been stopped
var chosenVersion uint64
versionAvailable := false
for i := range versions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do for _, version := range versions here and then not have to dereference by index.

var chosenVersion uint64
versionAvailable := false
for i := range versions {
if !*versions[i].Stop {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as often the previously non stopped version tends to be pending (in testing). Is it the expected behavior to always revert to only running versions?

Oh that's a good call. Yeah we shouldn't revert to pending versions because we don't know that they're stable. That'll complicate testing a bit because you'll need to make sure the job can mark itself running.

}

}
c.versionSelected = chosenVersion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially have multiple jobs we're going to revert, so we'd be overwriting this field from multiple goroutines, which is a race condition. The production code doesn't ever read this value, only tests, so we probably want to get rid of it if we can.

I guess there's no way in the API to tell after the fact which version we reverted?

One way to solve this would be to refactor the body of the goroutine into a method that returns the chosen version, and then have the tests call that method directly. But for this it'd be nice to have the tests exercise the goroutines. Maybe this could be a buffered channel that the goroutine writes into and then the test can pull out the value? That'd be safe from data races.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh I see, I'll give that a shot!

Comment on lines +184 to +190
if consulToken == "" {
consulToken = os.Getenv("CONSUL_HTTP_TOKEN")
}

if vaultToken == "" {
vaultToken = os.Getenv("VAULT_TOKEN")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: this is not going to change between jobs, so we can lift these two assignments above the loop.

must.Sprintf("job start stdout: %s", ui.OutputWriter.String()),
must.Sprintf("job start stderr: %s", ui.ErrorWriter.String()),
)
must.Eq(t, expectedVersions[i], startCmd.versionSelected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you've still got a test failure here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I keep passing this test when I run it locally, but it seems to keep failing when I push, could this be because of race conditions I haven't accounted for (similarly to what you mentioned above)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nomad job start feature would be really useful
2 participants