v2 sr_audit starts instance 0 when instances >= 100 #1183

Open
petersilva opened this issue Aug 23, 2024 · 6 comments
Labels

  • bug: Something isn't working
  • v2only: only affects v2 branches.
  • wontfix: This will not be worked on (outside development scope)
  • work-around: a work-around is provided, mitigating the issue.

Comments

@petersilva
Contributor

petersilva commented Aug 23, 2024

In v2, running a configuration in the foreground creates a .pid file for instance 0, as well as for the other instances (>= 1). sr_audit looks at all the .pid files and, since the foreground instance is "missing", restarts it. Instance 0 should never be restarted.

Work-around:

  • rm the pid file for instance 0.
  • kill the instance 0 process.
  • when stopping foreground processes, check for a left-over pid file and remove it to prevent recurrence.
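
The pid file clean-up in the work-around above could be scripted roughly as follows. This is a minimal sketch, not part of sr_audit: the function name, the `state_dir` argument, and the assumption that instance 0 pid files end in `_00.pid` are all hypothetical, so check the actual file names on disk before deleting anything. It only lists matches unless `dry_run=False` is passed, and it does not kill any process.

```python
from pathlib import Path


def remove_instance_zero_pid_files(state_dir, suffix="_00.pid", dry_run=True):
    """Return instance-0 pid files under state_dir, deleting them unless dry_run.

    Assumption (not confirmed by the thread): instance-0 pid files end
    with `suffix`. Adjust it to match what is actually on disk.
    """
    matches = sorted(
        p for p in Path(state_dir).rglob("*" + suffix) if p.is_file()
    )
    if not dry_run:
        for p in matches:
            p.unlink()  # remove the left-over pid file
    return matches
```

Running it once with the default `dry_run=True` and reviewing the returned paths before deleting is the safer order of operations.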
petersilva added the bug, wontfix, and v2only labels Aug 23, 2024
petersilva added the work-around label Sep 17, 2024
@petersilva
Contributor Author

On further investigation, it was determined that instance 0 is being started because:

  • instance numbers are two digits, and
  • they overflow when the config file has instances 100 in it.

Workaround: reduce instances to a lower number, e.g. instances 75
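
The thread does not show the v2 code involved, so the following is only an illustration of how a two-digit assumption can make instance 100 look like instance 0. The function names and the `prefix_NN.pid` file name pattern are hypothetical, and the fixed-width parse is one plausible mechanism, not a confirmed description of sr_audit.

```python
def pid_file_name(prefix, instance):
    # Two-digit, zero-padded formatting. Note that %02d is a minimum
    # width: 100 is written out in full as three digits.
    return "%s_%02d.pid" % (prefix, instance)


def parse_two_digit_instance(file_name):
    # Fixed-width parse: assumes exactly two digits precede ".pid".
    ext = ".pid"
    return int(file_name[-len(ext) - 2:-len(ext)])


def parse_instance(file_name):
    # Variable-width parse: everything after the last underscore.
    ext = ".pid"
    return int(file_name[:-len(ext)].rsplit("_", 1)[-1])
```

With these definitions, instance 100 is written as `sr_x_100.pid`; the fixed-width parse reads it back as 0, while the variable-width parse returns 100.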

@petersilva
Contributor Author

Tried to reproduce this with sr3, and it works fine; the issue is not present at all.
In sr3:

  • three-digit instance files are created.
  • sr3 status does not report any missing or strays.
  • sr3 sanity does not destroy or start any instances.

Tested at 120 and it seems fine.

@petersilva
Contributor Author

OK, I did find some weird behaviour in sr3 after killing instance 100, though not the same as in v2. Made a patch to correct it.

petersilva changed the title from "v2 sr_audit improperly starts instance 0." to "sr3 sanity gets confused by instance numbers over 100" Sep 17, 2024
petersilva changed the title from "sr3 sanity gets confused by instance numbers over 100" to "v2 sr_audit starts instance 0 when instances >= 100" Sep 17, 2024
@reidsunderland
Member

reidsunderland commented Sep 24, 2024

The fix in #1226 caused a problem :(

 Traceback (most recent call last):
   File "/local/home/sarra/.local/bin/sr3", line 11, in <module>
     load_entry_point('metpx-sr3', 'console_scripts', 'sr3')()
   File "/local/home/sarra/sr3/sarracenia/sr.py", line 3074, in main
     gs = sr_GlobalState(cfg, cfg.configurations)
   File "/local/home/sarra/sr3/sarracenia/sr.py", line 1323, in __init__
     self._read_states()
   File "/local/home/sarra/sr3/sarracenia/sr.py", line 556, in _read_states
     self._read_state_dir()
   File "/local/home/sarra/sr3/sarracenia/sr.py", line 493, in _read_state_dir
     i = int(pathname[0:-4].split('_')[-1])

Edit: I realized this issue is specific to v2, I'm talking about sr3 here.

@reidsunderland
Member

reidsunderland commented Sep 24, 2024

It's getting stuck on a cpost:

/local/home/sarra/.cache/sr3/cpost/config_name/i01.pid

Is it normal for cpost pid files to be named iXX.pid?

Edit: I realized this issue is specific to v2, I'm talking about sr3 here.
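
To see why the expression in the last traceback frame trips over a file named i01.pid, here is a small stand-alone sketch. The first function copies that expression verbatim; the second is a hypothetical, more tolerant alternative and is not the fix that was actually applied, which the thread does not show. The name `subscribe_config_01.pid` used for comparison is an assumed example of an underscore-separated pid file name.

```python
import re


def instance_from_traceback_expression(pathname):
    # Same expression as the last frame of the traceback: strip ".pid",
    # then convert whatever follows the final underscore to an int.
    return int(pathname[0:-4].split('_')[-1])


_INSTANCE_RE = re.compile(r"(\d+)\.pid$")


def instance_tolerant(pathname):
    # Hypothetical alternative: take the digits immediately before ".pid",
    # whatever precedes them; return None when there are none.
    m = _INSTANCE_RE.search(pathname)
    return int(m.group(1)) if m else None


cpost_pid = "/local/home/sarra/.cache/sr3/cpost/config_name/i01.pid"
try:
    instance_from_traceback_expression(cpost_pid)
except ValueError as err:
    # The piece after the last underscore is "name/i01", which is not an int.
    print("cannot parse:", err)
```

For a name such as `subscribe_config_01.pid` the original expression yields 1, so it only fails when the text after the last underscore is not purely numeric, as with the cpost file above.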

@petersilva
Contributor Author

uh... yeah... but I fixed it in sarrac also: MetPX/sarrac#161
