Bug in tests when using pbspro & job submission not working with KRB AUTH #635
➤ Eric Franz commented: On the first, https://github.com/OSC/ondemand/blob/26a265a3ebd5c63d376ccd0679602c64d9929a0f/apps/dashboard/lib/tasks/test.rake#L18 is the offending code:

```ruby
output_path = WORKDIR.join("output_#{cluster.id}_#{Time.now.iso8601}.log")
```

Are dashes okay?

```
irb(main):002:0> cluster_id = "owens"
```
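For illustration, a quick sketch of why the iso8601 timestamp trips up PBS, and one colon-free alternative (the strftime format below is an assumption, not necessarily the fix that shipped):

```ruby
require "time"  # Time#iso8601 lives in the time stdlib

# ISO 8601 timestamps contain colons, which PBS treats specially in paths:
Time.now.iso8601                     # e.g. "2019-12-10T11:23:59+01:00"

# a colon-free timestamp via strftime:
Time.now.strftime("%Y%m%dT%H%M%S")   # e.g. "20191210T112359"
```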
➤ viktoriaas commented: Underscores solved the first problem. Also, are these things possible?
➤ Jeff Ohrstrom commented: On the Kerberos support: you may be able to pass the KRB5CCNAME env variable down to the PUN. See these docs on custom env variables: https://osc.github.io/ood-documentation/master/customization_overview.html?highlight=pun_custom_env. It may look something like this:
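A minimal sketch of that declaration, assuming it lives in /etc/ood/config/ood_portal.yml as the linked docs describe:

```yaml
# /etc/ood/config/ood_portal.yml (sketch): declare KRB5CCNAME so the
# per-user nginx (PUN) inherits it from the Apache environment
pun_custom_env_declarations:
  - "KRB5CCNAME"
```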
For queues, we have these docs: https://osc.github.io/ood-documentation/master/app-development/tutorials-interactive-apps/add-custom-queue.html (a sketch follows below). On failed jobs: we're kind of at the mercy of the scheduler; we only report completed jobs without any indication of success or failure. I'm guessing we could add a feature to inspect the exit code and indicate success or failure, but I believe that'd be a feature we'd have to add to our roadmap. Hope that helps!
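Following the pattern in that tutorial, a queue picker for an interactive app might look roughly like this (the `custom_queue` attribute name and the queue values are illustrative placeholders, not from this thread):

```yaml
# form.yml (sketch): expose a queue drop-down to the user
attributes:
  custom_queue:
    widget: "select"
    label: "Queue"
    options:
      - ["Short", "short"]
      - ["Long", "long"]
form:
  - custom_queue
```

```yaml
# submit.yml.erb (sketch): pass the selected value to the scheduler
script:
  queue_name: "<%= custom_queue %>"
```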
➤ viktoriaas commented: Hi, adding `- KRB5CCNAME` to the nginx YAML did not help; when I cat /proc/<pid>/environ, the variable is not visible. Actually, that is the second problem, as the first problem is that the Kerberos credentials are created under the apache user, and this user can never create a file owned by somebody else. As I have written in the first post, in the ood.yaml we have: `auth:`
So the tickets should be present while the user is logged in, and the KRB5CCNAME env variable should contain the path to that cached ticket (with the right ownership - owner is the user, group is the user's group - and the right permissions, 0600).

I tried playing with queues and attributes and somehow I got to Interactive Apps - Desktop. I am not entirely sure I've understood this feature correctly; I have to play with it a little longer. However, I would like to ask if it is possible to add an option for selecting the queue when submitting a job in the Job Composer. Is it possible to add a drop-down menu somewhere there? I was thinking about adding a new option in the "Job options" menu - "Select queue type" - and using the provided value as the `qsub -q` param. Would it be possible?

Reporting failed jobs would be super useful! I might help you with that, but right now I am a bit lost in the code, so I do not know which part of it takes care of reporting states (queued/running/success). You can let me know and I might come up with something. Thanks for the help!
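For reference, ood_core already accepts a queue on the job script, so a selected value could reach `qsub -q` roughly like this (a sketch; `selected_queue` is a hypothetical form value and `cluster` an OodCore::Cluster - the Job Composer wiring itself does not exist yet):

```ruby
require "ood_core"

# sketch: Script#queue_name maps to `qsub -q <queue>` in the PBS adapter;
# selected_queue stands in for the hypothetical drop-down value
script = OodCore::Job::Script.new(
  content: File.read("job.sh"),
  queue_name: selected_queue
)
cluster.job_adapter.submit(script)
```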
➤ Eric Franz commented: @viktoriaas, the issue with the rake task will be fixed in OSC/ondemand#355, so it will be included in a future 1.7.x release.

For adding the ability to specify the queue in the Job Composer: this is a common request and is captured in OSC/ondemand#154. We postponed this work for two reasons. First, if we are going to add the ability to customize the web form for the Job Composer, the support should be similar to how interactive apps work. However, interactive apps don't yet support specifying multiple clusters, so we wanted to fix that first so there was one appropriate way to scope a set of form elements to a particular selected cluster. Seems like we should prioritize this, however. You can modify the interactive app plugins to specify the queue, and Jeff has provided the link to the documentation.

For failed jobs, do you mean failed jobs in the Job Composer, or failed jobs in the Active Jobs app? One of the issues here is that typically the scheduler does not hold onto information about completed jobs for a long period of time, meaning that until we extend OnDemand with the ability to query another source that knows about completed jobs, we won't know what the exit status, completion time, etc. of a job was. There is the ability to inspect files in the working directory of a job that was submitted through the Job Composer, and those may contain information for flagging a job as "failed". But this is very specific to the type of job and scheduler, so making this a generic feature of the Job Composer that works across schedulers would require a little design. If you are interested in that, I can open a separate issue to start the discussion.

As for the original request for native Kerberos authentication support, would you be willing to open a Discourse topic on this, and perhaps copy and paste the relevant discussion points from this issue, in the feature discussion category https://discourse.osc.edu/c/open-ondemand/feature-requests-and-roadmap-discussion/48? Other people in the community have voiced interest in this and may have some insight. Also, from our end we will have to do some investigation that may involve some experiments, so we might not be able to report any progress on this until after the holiday break.
➤ MorganRodgers commented: I encountered a similar problem recently with Grid Engine. The version of GE I was using did not like "/" in job names. That's a problem for OnDemand because Batch Connect plugins often use forward slash in their names. |
➤ viktoriaas commented: @ericfranz

```ruby
[
  ENV["OOD_PORTAL"],                                      # the OOD portal id
  ENV["RAILS_RELATIVE_URL_ROOT"].sub(/^\/[^\/]+\//, ""),  # the OOD app
  token                                                   # the Batch Connect app
].reject(&:blank?).join("-").gsub!("/", "-")
```

Adding the gsub! solves the problem; however, I think it should be fixed across the whole app.

I have already modified the Interactive Apps Desktop with specifying the queue. However, I might do some coding in that Job Composer part on my own, as it is quite interesting and I am not sure when you can get to this matter (but prioritizing would be nice :) )

By failed jobs I mean both kinds of jobs - I think it could be useful for every user to see whether their job has actually failed. Some users might not notice and will wait until they see the job somewhere as running or completed, but it actually failed. I also might work on this feature; it is quite important to us. For a start, I will try to also list the user's failed jobs.

For Kerberos - sure, I am willing to open a discussion topic. Hearing from you after the holiday break is absolutely fine with me; I am also going on a break, and it works for us now, so it is not that urgent. I will open the discussion topic in the next few days.
➤ Eric Franz commented: Using the slashes in the job name creates these nice job names - if the scheduler handles slashes in job names: [screenshot: https://user-images.githubusercontent.com/512333/71678857-25a08100-2d54-11ea-81bb-65f9f75b5caa.png]

At OSC we have some scripts that generate reports using the job name (and thus these slashes), so just arbitrarily changing all of the job names would affect us, and may affect other sites as well. Maybe instead the solution should be to sanitize the job name in the Grid Engine adapter itself, and only when a configuration flag is specified in the job adapter. The default functionality would remain the same, but a site could opt in through the cluster config (`v2:`), changing

```ruby
args += ['-N', script.job_name] unless script.job_name.nil?
```

to `if script.job_name ...`
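A hedged sketch of how that opt-in might look - the `sanitize_job_name` flag and the replacement character below are hypothetical, not a shipped ood_core option:

```yaml
# cluster config sketch (the sanitize_job_name flag is hypothetical)
v2:
  job:
    adapter: "sge"
    sanitize_job_name: true
```

```ruby
# Grid Engine adapter sketch: rewrite the name only when the site opted in
# (sanitize_job_name is the hypothetical flag from the config above)
if script.job_name
  name = script.job_name
  name = name.tr("/", "-") if sanitize_job_name
  args += ['-N', name]
end
```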
➤ Eric Franz commented:

> By failed jobs I mean both kinds of jobs - I think it could be useful for every user to see whether their job has actually failed. Some users might not notice and will wait until they see the job somewhere as running or completed, but it actually failed. I also might work on this feature; it is quite important to us. For a start, I will try to also list the user's failed jobs.

The Job Composer currently displays only jobs submitted by the user through the Job Composer. In this case, it already has a reference to the job id of the submitted job, along with the path to the working directory, so a method could be written to either ask a service (an accounting database) about the error state of the job, or to grep the stdout and stderr files for lines indicating specific errors (walltime exceeded, the string "Error", etc.) and use that to update the error state. The Job Composer code actually has a hook where this code could be invoked, though that may change in future versions as we upgrade our codebase. I'm happy to share if you are interested.

Since Active Jobs displays a list of all the running jobs on the system, that is one natural place to add completed jobs (and thus failed jobs). But that app doesn't do any proper pagination - it just pulls all the data down and displays it in a searchable table - so some thought would be needed to determine how many completed jobs to display in this table.

We are working on adding a completed jobs app (which might later be integrated into the Active Jobs app) to show only the user's recently completed jobs. This would pull completed job data from XDMoD, the idea being that if you have XDMoD and OnDemand installed and configured to work together, a user will have access to recently completed jobs. Since the exit status of a job might be available there, we could display that.

You could also take a stab at building a custom Passenger app for your needs. This tutorial might help: https://osc.github.io/ood-documentation/master/app-development/tutorials-passenger-apps.html. There is also an example app, https://github.com/OSC/ood-example-ps-nodejs, if you prefer NodeJS. I don't have a Python example published, but we could put together a Flask app example if you are interested.
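The grep-based idea above might look something like this rough sketch (the output-file glob and the error patterns are assumptions, and the Job Composer hook it would plug into is not shown):

```ruby
# rough sketch: flag a Job Composer job as failed by scanning its
# output file for known error strings (file layout and patterns assumed)
def job_looks_failed?(workdir, job_id)
  output = Dir.glob(File.join(workdir, "*.o#{job_id}")).first
  return false unless output && File.readable?(output)

  File.foreach(output).any? do |line|
    line.match?(/walltime exceeded|\bError\b/i)
  end
end
```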
➤ Eric Franz commented: @viktoriaas I opened a separate issue, #172, to track fixing the problem with sanitizing the names of jobs submitted through OnDemand. If the suggested approach is acceptable to you, we will go ahead with an implementation to be included in 1.7, and I'll close this issue.
➤ viktoriaas commented: @ericfranz Hi, sorry for not answering for a long time. I am still a student and I have just finished my exam period, so I'm back to work. I have left a comment regarding sanitizing job names in the appropriate discussion. As I am back, I will now open the Kerberos discussion topic on the site you posted a few comments before.

I think that this is a bit overkill and there is no need to do it in such a complicated way.

I am interested, so please feel free to share. I will also look at building custom Passenger apps and will think of a way to use them for my purposes.
➤ viktoriaas (original issue description): Hi,

I am implementing Open OnDemand on our infrastructure and I've encountered two problems.
When I run the tests with:

```
sudo su $USER -c 'scl enable ondemand -- bin/rake test:jobs:cluster1 RAILS_ENV=production'
```
I get an error regarding "invalid option -o for qmgr". This is nonsense, as -o is a valid option. However, we found the problem: somewhere in the tests, a log file is created which is used as this -o option. The name of the log file contains colons (something like log_file_10092019_11:23:59.log), which causes the problem - colons have a special meaning in qmgr, so they shouldn't be used in names. This can be repaired locally in the tests, but if files are created like that in more places in the code, it would be better if you could patch it.

I couldn't verify whether the local repair worked, as another problem occurred immediately.

We've come up with some solutions to this problem, but I think they're too complicated, and you might propose a better solution to this question: how do we enable Kerberos auth in Open OnDemand? (To be precise, authentication itself works; the issue is just submitting a job using the Kerberos ticket which is created when the user authenticates.)
Thanks in advance for a reply!