
Bug in tests when using PBS Pro & job submission not working with KRB auth #635

Closed

sync-by-unito bot opened this issue Dec 27, 2021 · 11 comments
sync-by-unito bot commented Dec 27, 2021

Hi,
I am implementing Open OnDemand on our infrastructure and I've encountered 2 problems.

  1. We are using PBS Pro. When I try to run the tests as shown on the web page using

sudo su $USER -c 'scl enable ondemand -- bin/rake test:jobs:cluster1 RAILS_ENV=production'

I get an error about "invalid option -o for qmgr". This makes no sense at first, as -o is a valid option. However, we found the cause: somewhere in the tests you create a log file which is then passed as this -o option. The log file name contains colons (something like log_file_10092019_11:23:59.log), which causes the problem: in qmgr, colons have special meaning, so they shouldn't be used in names. This can be patched locally in the tests, but if files like this are created in other places in the code, it would be better if you could fix it upstream.
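For illustration, a minimal Ruby sketch of the kind of fix described above (the helper name is hypothetical, not OnDemand's actual code): strip the colons out of the timestamp before using it in the file name.

```ruby
require "time"

# Hypothetical helper: build a log file name from an ISO 8601 timestamp with
# the colons replaced, so qmgr (where ":" has special meaning) accepts the
# resulting -o argument. Names are illustrative, not OnDemand code.
def safe_log_name(prefix, time = Time.now)
  # ISO 8601 timestamps contain ":" (e.g. 11:23:59); swap them for dashes.
  "#{prefix}_#{time.iso8601.tr(':', '-')}.log"
end

safe_log_name("log_file", Time.utc(2019, 10, 9, 11, 23, 59))
# => "log_file_2019-10-09T11-23-59Z.log"
```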

I couldn't verify whether the local fix worked, as another problem occurred immediately.

  2. For authentication we are using Kerberos, not LDAP. Authentication works, but we are unable to get the path to the cached Kerberos ticket from the environment variables. This prevents users from submitting jobs (qmgr really needs to verify that a Kerberos ticket exists with the right user ownership and 0600 permissions). We are using mod_auth_kerb, the Apache module for Kerberos. We know the ticket is created, as it shows up in /tmp, and using a CGI script we can print the value of KRB5CCNAME from the environment. In the OOD YAML config we set KrbSaveCredentials to On, and we use a Ruby regex parser to extract the user name. At this point it would be nice to have this KRB5 environment variable available. However, no such variable is present among the environment variables at all (that is, among all the ENV vars visible to that script, which runs as an Apache child process). It might have something to do with all that Lua stuff, because it seems as if nearly every environment variable is somehow hidden right after authentication (maybe?).

We've come up with some solutions to this problem, but I think they're too complicated, and you might propose a better answer to this question: how do we enable Kerberos auth in Open OnDemand? (Precisely: authentication already works, so what remains is submitting jobs using the Kerberos ticket that is created when the user authenticates.)

Thanks for a reply!

┆Issue is synchronized with this Asana task by Unito

sync-by-unito bot added the bug label Dec 27, 2021

➤ Eric Franz commented:

On the first, https://github.com/OSC/ondemand/blob/26a265a3ebd5c63d376ccd0679602c64d9929a0f/apps/dashboard/lib/tasks/test.rake#L18 is the offending code:

output_path = WORKDIR.join("output_#{cluster.id}_#{Time.now.iso8601}.log")

Are dashes okay?

irb(main):002:0> cluster_id = "owens"
=> "owens"
irb(main):011:0> "output_#{cluster_id}_#{Time.now.iso8601}.log"
=> "output_owens_2019-12-10T14:11:02-05:00.log"
irb(main):008:0> "output_#{cluster_id}_#{Time.now.iso8601}.log".parameterize
=> "output_owens_2019-12-10t14-10-39-05-00-log"
irb(main):010:0> "output_#{cluster_id}_#{Time.now.iso8601}.log".parameterize.underscore
=> "output_owens_2019_12_10t14_10_48_05_00_log"


➤ viktoriaas commented:

Underscores solved the first problem.
For the second problem, we came up with a C binary that saves the Kerberos auth ticket, but it would be nice if you could provide native support for Kerberos authentication.

Also, are these things possible?

  • listing my failed jobs among all jobs (or just the failed ones)?
  • specifying the queue that should be used?


➤ Jeff Ohrstrom commented:

On the Kerberos support:

You may be able to pass the KRB5CCNAME env variable down to the PUN. See these docs on custom env variables: https://osc.github.io/ood-documentation/master/customization_overview.html?highlight=pun_custom_env

It may look something like this:

pun_custom_env_declarations:
  - 'KRB5CCNAME'

For queues, we have these docs: https://osc.github.io/ood-documentation/master/app-development/tutorials-interactive-apps/add-custom-queue.html

On failed jobs: we're somewhat at the mercy of the scheduler; we only report completed jobs, without any indication of success or failure. I'm guessing we could add a feature to inspect the exit code to indicate success or failure, but I believe that's a feature we'd have to add to our roadmap.
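For what it's worth, the exit-code idea could be sketched roughly like this in Ruby; the "Exit_status" attribute name follows what PBS's qstat -f prints for finished jobs and should be treated as an assumption about the site's scheduler, not OnDemand code.

```ruby
# Hypothetical sketch: derive a completed/failed state from scheduler output
# that includes an "Exit_status = <n>" attribute (as PBS qstat -f prints for
# finished jobs). Names are illustrative, not part of ood_core.
def status_from_qstat(qstat_output)
  code = qstat_output[/^\s*Exit_status\s*=\s*(-?\d+)/, 1]
  return :completed unless code          # scheduler reported no exit status
  code.to_i.zero? ? :completed : :failed
end

status_from_qstat("Job Id: 12\n    Exit_status = 271\n")
# => :failed
```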

Hope that helps!


➤ viktoriaas commented:

Hi, adding - KRB5CCNAME to the nginx YAML did not help; when I cat /proc//environ, the variable is not visible. Actually, that is the second problem, as the first problem is that the Kerberos credentials are created under the apache user, and this user can never create a file owned by somebody else. As I wrote in the first post, in ood.yaml we have:

auth:
  - 'AuthType Kerberos'
    ...
  - 'KrbSaveCredentials On'
  - 'Require valid-user'

From http://modauthkerb.sourceforge.net/configure.html#saving:

> If you turn on KrbSaveCredentials, the tickets will be retrieved into a ticket file or credential cache that will be available for the request handler. The ticket file will be removed after the request is handled.

So the tickets should be present while the user is logged in, and the KRB5CCNAME env variable should contain the path to that cached ticket (with the right ownership, where the owner is the user and the group is the user's group, and the right permissions, 0600).
Another problem is that information about the ticket is somehow deleted from the environment, so basically I cannot get it.
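The check described above can be sketched in Ruby roughly as follows (a hypothetical helper, not OnDemand or PBS code): the credential cache named by KRB5CCNAME must exist, be owned by the requesting user, and have 0600 permissions.

```ruby
# Minimal sketch of the ticket-cache check described above. Hypothetical
# helper; the "FILE:" prefix handling reflects the common cache naming
# convention ("FILE:/tmp/krb5cc_<uid>_...").
def krb_cache_usable?(env = ENV)
  cc = env["KRB5CCNAME"]
  return false unless cc
  path = cc.sub(/\AFILE:/, "")           # strip the cache-type prefix if present
  return false unless File.file?(path)
  stat = File.stat(path)
  # Owned by the current user and mode 0600, as the scheduler requires.
  stat.uid == Process.uid && (stat.mode & 0o777) == 0o600
end
```

When the environment variable is missing entirely, as in the situation described here, the check fails before the file is ever inspected.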

I tried playing with queues and attributes, and somehow I got to Interactive Apps - Desktop. I am not entirely sure I've understood this feature correctly; I have to play with it a little longer. However, I would like to ask whether it is possible to add an option for selecting the queue when submitting a job in the Job Composer. Is it possible to add a drop-down menu somewhere there? I was thinking about adding a new option in the "Job options" menu, "Select queue type", and using the provided value as the qsub -q parameter. Would that be possible?

Reporting failed jobs would be super useful! I might help you with that, but right now I am a bit lost in the code, so I do not know which part of the codebase takes care of reporting states (queued/running/success). Let me know and I might come up with something.

Thanks for the help!


➤ Eric Franz commented:

@viktoriaas, the issue with the rake task will be fixed in OSC/ondemand#355, so included in a future 1.7.X release.

For adding ability to specify the queue to use in the Job Composer, this is a common request and is captured here: OSC/ondemand#154. We postponed this work for 2 reasons. First, if we are going to add the ability to customize the web form for Job Composer, the support should be similar to how interactive apps work. However, interactive apps don't yet support specifying multiple clusters, so we wanted to fix that first so there was one appropriate way to scope a set of form elements to a particular selected cluster. Seems like we should prioritize this, however.

You can modify the interactive app plugins to specify queue, and Jeff has provided that link to the documentation.

For failed jobs, do you mean failed jobs in the Job Composer, or failed jobs in the Active Jobs app? One of the issues here is that typically the scheduler does not hold onto information about completed jobs for a long period of time, meaning that until we extend OnDemand with the ability to query another source that knows about completed jobs, we won't know what the exit status, completion time, etc. of a job was. There is the ability to inspect files in the working directory of a job that was submitted through the Job Composer, and that may contain information for flagging a job as "failed". But this is very specific to the type of job and scheduler, so to make this a generic feature of the Job Composer that would work across schedulers would require a little design. If you are interested in that I can open a separate issue to start the discussion.

As for the original request for native Kerberos Authentication support, would you be willing to open a Discourse topic on this, and perhaps copy and paste the relevant discussion points from this issue, in the feature discussion category https://discourse.osc.edu/c/open-ondemand/feature-requests-and-roadmap-discussion/48? Other people in the community have voiced interest in this and may have some insight. Also, from our end we will have to do some investigation that may involve some experiments, so we might not be able to report any progress on this till after the holiday break.


➤ MorganRodgers commented:

I encountered a similar problem recently with Grid Engine. The version of GE I was using did not like "/" in job names. That's a problem for OnDemand because Batch Connect plugins often use forward slash in their names.


➤ viktoriaas commented:

ericfranz
Thanks for fixing the issue with the rake task; however, I've found another bug. In /var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb there is a private method job_name that tries to create the job name for the interactive desktop. However, again, it uses slashes, and therefore qsub complains about an invalid -N option.

[
  ENV["OOD_PORTAL"],                                     # the OOD portal id
  ENV["RAILS_RELATIVE_URL_ROOT"].sub(/^\/[^\/]+\//, ""), # the OOD app token
  # ... the Batch Connect app
].reject(&:blank?).join("-").gsub!("/", "-")

Adding the gsub! solves the problem; however, I think it should be fixed across the whole app.

I have already modified the interactive desktop app with specifying the queue. However I might do some coding in that Job Composer part on my own as it is quite interesting and I am not sure when you can get to this matter (but prioritizing would be nice :) )

By failed jobs I mean both kinds of jobs. I think it could be useful for every user to see whether their job has actually failed; some users might not notice, and they will wait because they see the job somewhere as running or completed when it actually failed. I also might work on this feature; it is quite important to us. To start, I will try to also list the user's failed jobs.

For Kerberos: sure, I am willing to open a discussion topic. Hearing from you after the holiday break is absolutely fine with me; I am also going on a break, and it works for us now, so it is not that urgent. I will open the discussion topic in the next few days.


➤ Eric Franz commented:

Using the slashes in the job name creates these nice job names, if the scheduler handles slashes in job names:

Screenshot 2020-01-02 at 11:36:03 AM: https://user-images.githubusercontent.com/512333/71678857-25a08100-2d54-11ea-81bb-65f9f75b5caa.png

At OSC we have some scripts to generate reports that use the job name (and thus these slashes) so just arbitrarily changing all of the job names would affect us, and may affect other sites as well.

Maybe the solution should instead be to sanitize the job name in the Grid Engine adapter itself, and only when a configuration flag in the job adapter says to. The default behavior would remain the same, but a site could do this in the cluster config:

v2:
  job:
    adapter: "sge"
    cluster: "my_cluster"
    bin: "/usr/lib/gridengine"
    conf: "/etc/default/gridengine"
    sge_root: "/var/lib/gridengine"
    libdrmaa_path: "/var/lib/gridengine/drmaa/libdrmaa.so"
    sanitize_job_name: true

and change

args += ['-N', script.job_name] unless script.job_name.nil?

to

if script.job_name
  if sanitize_job_name
    args += ['-N', script.job_name.parameterize.underscore]
  else
    args += ['-N', script.job_name]
  end
end

  • Though for the actual implementation it would be more appropriate to modify the script object before calling batch_submit_args, so we avoid control coupling in this method.
  • parameterize and underscore are not part of ood_core either, so either we import implementations of these into ood_core, or active_support becomes a requirement of ood_core (and we would want it to support a wide range of versions: >= 4 and <= 6 perhaps? Is there a set of methods that exists in 4, 5, and 6 that we could limit ourselves to? Writing an automated test for that is difficult.)
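As a sketch of the first option (importing a small implementation into ood_core rather than depending on active_support), a rough, dependency-free stand-in for the parameterize-then-underscore combination might look like this; the helper name is illustrative, not ood_core's API.

```ruby
# Rough stand-in (an assumption/sketch, not ood_core code) for the
# ActiveSupport parameterize-then-underscore combination discussed above:
# downcase, collapse runs of non-alphanumerics to "_", and trim the ends.
def sanitize_job_name(name)
  name.downcase
      .gsub(/[^a-z0-9]+/, "_")  # slashes, colons, dots, spaces all collapse
      .gsub(/\A_+|_+\z/, "")    # trim leading/trailing separators
end

sanitize_job_name("ondemand/sys/dashboard")
# => "ondemand_sys_dashboard"
```

On the sample file name from earlier in this thread, it produces the same result as .parameterize.underscore: "output_owens_2019_12_10t14_10_48_05_00_log".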


➤ Eric Franz commented:

> By failed jobs I mean both kinds of jobs - I think it could be useful for every user to see whether his job has actually failed. Some users might not notice and they will wait until they see the job somewhere as running or completed but it actually failed. I also might work on this feature, it is quite important to us. For the start, I will try to list also user's failed jobs.

The Job Composer currently displays only jobs submitted by the user through the Job Composer. In this case it already has a reference to the job id of the submitted job, along with the path to the working directory, so a method could be written to either ask a service (an accounting database) about the error state of the job, or grep the stdout and stderr files for lines indicating specific errors (walltime exceeded, the string "Error", etc.) and use that to update the error state. The Job Composer code actually has a hook where this code could be invoked, though that may change in future versions as we upgrade our codebase. I'm happy to share if you are interested.
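To make the grep idea concrete, here is a rough Ruby sketch; the marker patterns and the "*.o*/*.e*" output-file naming are assumptions about the scheduler, not actual Job Composer code.

```ruby
# Hypothetical sketch of the grep-based failure check described above: scan a
# job's stdout/stderr files in its working directory for error markers. Both
# the marker list and the output-file glob are assumptions.
ERROR_MARKERS = [/walltime exceeded/i, /\bError\b/].freeze

def job_looks_failed?(workdir)
  Dir.glob(File.join(workdir, "*.{o,e}*")).any? do |file|
    File.foreach(file).any? { |line| ERROR_MARKERS.any? { |re| re.match?(line) } }
  end
end
```

As the comment above notes, this is very scheduler- and job-specific, which is why a generic version would need some design work.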

Since Active Jobs displays a list of all the running jobs on the system, that is one natural place to add completed jobs (and thus failed jobs). That app doesn't do any proper pagination, though: it just pulls all the data down and displays it in a searchable table, so some thought would be needed to determine how many completed jobs to display in that table.

We are working on adding a completed jobs app (which might later be integrated into the Active Jobs app) to show only the user's recently completed jobs. This would pull completed job data from XDMoD, with the idea that if you have XDMoD and OnDemand installed and configured to work together, a user will have access to recently completed jobs. Since the exit status of a job might be available there, we could display it.

You could also take a stab at building a custom Passenger app for your needs. This tutorial might help: https://osc.github.io/ood-documentation/master/app-development/tutorials-passenger-apps.html

There is also an example app, https://github.com/OSC/ood-example-ps-nodejs, if you prefer NodeJS. I don't have a Python example published, but we could put together a Flask app example if you are interested.


➤ Eric Franz commented:

viktoriaas I opened a separate issue #172 to track fixing the problem with sanitizing the job name of jobs submitted through OnDemand. If the suggested approach is acceptable to you, we will go ahead with an implementation to be included in 1.7, and I'll close this issue.


➤ viktoriaas commented:

ericfranz Hi, sorry for not answering for so long. I am still a student, and I have just finished my exam period and am back to work. I have left a comment regarding sanitizing job names in the appropriate discussion.

As I am back, I will now open the Kerberos discussion topic on the site you posted a few comments ago.

> Maybe instead the solution should be to sanitize the job name in the grid engine adapter itself, and do so if a configuration flag is specified in the job adapter to do so.

I think that this is a bit of overkill, and there is no need to do it in such a complicated way.

> The Job Composer code actually has a hook where this code could be invoked, though that may change in future versions as we upgrade our codebase. I'm happy to share if you are interested.

I am interested so please, feel free to share.

I will also look at building custom Passenger apps and will think of a way to use them for my purposes.

@ghost closed this as completed Dec 30, 2021