Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making SSH Nodes work both on Linux and Windows based systems #25291

Draft
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

dmatej
Copy link
Contributor

@dmatej dmatej commented Dec 17, 2024

TBD: Update description. Please try to review "per commit" - it is a lot of changes, tested on Kubuntu host and Windows Server 2025 in VirtualBox. I know it works, however I could make some mistakes.

When I implemented these tests I had to face quite a lot of issues even on Linux. When I finish, I will write some blog about it, it is another painful episode. However I have to make notes and it will be useful also for users and reviewers. In short:

Assumptions of the test

  • The private key doesn't have any passphrase
  • DAS uses create-node-ssh --install to "clone" GlassFish installation to another machine.
  • Test uses Temurin docker images as a base.

Node side: Environment

  • jar is not on PATH everywhere
    • however unzip might be already there - fix: add some autodetection code.

Node side: SSHD Configuration

  • SFTP has to be explicitly enabled on some systems
  • public key authentication can be disabled. Same applies to other types.
  • User interaction can be forbidden.
  • Port can be blocked by firewall or antivirus, etc.
  • Port can be different.
  • ...

DAS side: Environment

  • When using scripts, no TTY might be available. Scripts can fail or wait for user input forever.
    • Always set timeout when such risk exists
    • For ssh command is useful sshpass command. Test uses it, however GlassFish uses Jsch library instead.

DAS side: SSH Command Configuration

  • Applies to the test and user experiments, not to GlassFish
  • Empty passphrase - be careful, when you set it via execInContainer: "''" means apostrophes, not empty passphrase.
  • User interaction can be forbidden just like on the server (sshd)
  • StrictHostKeyChecking=accept-new should not be used on production; however would be useful if we would not enforce user to run the ssh command on his own and manually accept the server's key. That is TODO.

DAS side: JSCH SSH Implementation

  • "raw" and "resolved" password and passphrase. I doubt the implementation is correct.
  • logging - fixes will be pushed later
    • race condition - fix: use Jsch logger name.
    • impossible to guess which logger level I should set, overengineered antipattern of "set logger for Jsch once for asadmin, then for kernel, then for cluster, then just jsch, ..." - then you set 4 logger levels and it still doesn't print anything. - fix: always use Jsch logger name.

@dmatej dmatej self-assigned this Dec 17, 2024
@dmatej dmatej added bug Something isn't working build and test improvement labels Dec 17, 2024
@dmatej dmatej added this to the 7.0.21 milestone Dec 17, 2024
@dmatej dmatej changed the title Making SSH Nodes working reliably both on Linux and Windows based systems Making SSH Nodes work both on Linux and Windows based systems Dec 17, 2024
@arjantijms arjantijms requested a review from pzygielo December 24, 2024 13:01
@arjantijms arjantijms modified the milestones: 7.0.21, 7.0.22 Jan 3, 2025
@dmatej
Copy link
Contributor Author

dmatej commented Jan 3, 2025

Local status: Large refactoring targeting File vs Path vs String vs SFTP vs Unix vs Windows. Local manual testing of DAS on Kubuntu and Node on Windows Server 2025 - was able to "replicate" glassfish installation, unpack, create node and instance. Instance did not start yet as I did not install Java to Windows yet :-)

TODO

  • Proper session management (AutoCloseable, separate session and SSHLauncher)
  • Data object with some autodetection of Node's operating system and features
  • SFTP rmdir does not work on windows
  • Documentation, example, ...

Jenkinsfile Outdated Show resolved Hide resolved
@dmatej dmatej force-pushed the windows-ssh branch 2 times, most recently from d30a997 to e88181e Compare January 17, 2025 20:07
@dmatej
Copy link
Contributor Author

dmatej commented Jan 17, 2025

TODO:

  • Broken encoding in SSH stdout (hardcoded UTF8 vs Windows-1250)
  • Missing autodeploy/bundles (I am often removing those empty directories with .empty file inside, this time it seems Felix cannot autocreate it)
  • restart-domain now does not replace WALL_CLOCK_START system property, then GF computes incorrect startup time including CLI (now rather "time since CLI started this process"; on restart-domain the server restarts itself after it was asked to do so by restart-domain).
  • SSH adapter should throw the new SSHException, which would always collect exec output. We are losing remote logs sometimes, details about errors. I have fixed it here and there where I needed it, but this should be systematic.

dmatej added 12 commits January 19, 2025 01:30
- prefer unzip, if not detected, use jar

Signed-off-by: David Matějček <[email protected]>
- Because Eclipse Jenkins CI does not support TestContainers while GitHub
  Actions do.
- Synced Ubuntu and Windows workflow

Signed-off-by: David Matějček <[email protected]>

# Conflicts:
#	.github/workflows/build-windows.yml
- The server-config used "nointeractive", while default used
  "--noshutdown -c noop=true" options

Signed-off-by: David Matějček <[email protected]>
- SFTP uses "linux-like" paths, while Windows uses different separators and
  all systems may use also different root for SFTP.

Signed-off-by: David Matějček <[email protected]>
- Added PATH and JAVA_HOME to user environment
- Disabled AS_TRACE
- UsePAM set to yes to enable /etc/environment
- INFO ssh server log level

Signed-off-by: David Matějček <[email protected]>
- Windows is killing everything started in the session
- The only workaround found after 2 weeks of experiments, googling, consultation
  with GitHub CoPilot is this - to create two scripts executed by the Windows
  Scheduler just once.
- Added also nohup for Linux nodes.

Signed-off-by: David Matějček <[email protected]>
- DAS host can have multiple network endpoints but not all accessible from
  the node host. Once we are able to establish the connection, we can also
  use the socket's local address as the host.
- Note: This can be much more complicated, this is rather a quick fix or
  simple workaround. Proper solution is to have the DAS host used by the node
  configurable.

Signed-off-by: David Matějček <[email protected]>
… system

- SSHLauncher is stateless and NOT injectable; collects information about
  the target system.
- new SSHSession - wraps the jsch session
- SFTPClient - responsible for file system changes
- chmod is not supported on windows
- all stateful objects are closeable -> no need to keep them in SSHUtil

Signed-off-by: David Matějček <[email protected]>
- If the host name is not good enough, we will prefer chosen IPv4 address.
- Would be better to let it on the user which hostname or IP should represent
  the node and DAS, however that goes over this pull request.

Signed-off-by: David Matějček <[email protected]>
- The gfstart.bat must be there from the last start, so we don't need
  to generate the command again.
- Permissions to use the scheduler depend on the windows session type.
- Killing of started programs also depends on the windows session type
- Closing streams doesn't resolve issues, it seems it causes some.

- I have doubts about all solutions related to Windows OS, so I tried to
  minimize it as much as possible. Also I tried to learn as much as possible.

Signed-off-by: David Matějček <[email protected]>
- Tested via SSH and locally
- Tested on domains and instances
- Very probably can be simplified later, especially problematic is hard coded
  timeout in the generated ps1 script.

Signed-off-by: David Matějček <[email protected]>
- The bat script reads from a temp file given

Signed-off-by: David Matějček <[email protected]>
- We don't need to wait for the end of the server in startup scripts.

Signed-off-by: David Matějček <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working build and test improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants