Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exec onramp/offramp #20

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Add exec onramp/offramp #20

wants to merge 1 commit into from

Conversation

Licenser
Copy link
Member

No description provided.

@Licenser Licenser marked this pull request as draft April 17, 2020 14:17
does not make sense.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed in the meeting earlier today: let's define how stdout and stderr output from the exec commands are treated by the onramp. Could be useful to forward them as distinct outputs from onramp, such that users can connect them differently for their pipeline needs.

Also, would be good to set the command executed as part of the event's origin uri. Users can access that metadata from tremor-script and use it as needed: https://docs.tremor.rs/tremor-script/functions/origin/.

@darach darach added active The RFC is active and the implementation is a work in progress enhancement New feature or request labels Apr 28, 2020
@Licenser Licenser changed the base branch from master to main August 10, 2020 09:50
@darach darach changed the title Add exec onramp/offramp RFC Add exec onramp/offramp Jan 29, 2021
@marioortizmanero
Copy link

marioortizmanero commented Mar 30, 2021

I've come up with a few possible ways to run a given command from the config for the offramp implementation.

The current RFC doesn't specify if stuff like pipes and redirections are going to be supported, so I'll try to take them into account and see if it's possible. The solutions are actually more complicated because it depends on the Operating System it's running on, but I'll talk about how it would work on Unix systems:

  • By running the command with sh -c "<command>" and later replacing {} with the event. For example, sh -c "curl -d status=\"{}\" api.twitter.com/status". This means that tremor would have to escape certain sequences like ", `, $ or '. But it's far from trivial because these escape sequences depend on what's wrapping the {}:
    • If you're using "{}", you wouldn't have to escape '.
    • If you're using '{}', you wouldn't have to escape ".
    • If you're using an unquoted {}, you have to escape lots of other characters, like the space itself, |, >...
  • Writing the data into a file or pipe and forcing the user to use that instead of {}. For example you'd need cat /tmp/tremor instead of echo "{}". But this may introduce some overhead and I'm not sure how it could work for concurrent streams.
  • By splitting the string by whitespace and using std::process::Command step by step to run the command. This makes things like pipes and redirections complicated because at this point we're reimplementing the shell. But the good part is that we don't really need to escape anything.

I've been taking a look at how others do this:

  • https://github.com/uutils/findutils doesn't even escape the -exec echo \{\} \;-type commands because it uses std::process::Command when splitting them up.
  • https://github.com/google/rust-shell is a library that does literally what we want to do. It's a very hand-crafted solution with more complexity than I expected, as it spans a few files and implements a custom parser. Unfortunately, it only works for Unix-like systems and it's been abandoned for 3 years. I've also seen some bugs and missing features, like ~ not being expanded to $HOME. Not to mention that pipes and redirects cause a panic. This one seems to run the command split up with std::process::Command

In conclusion, if we want to support stuff like pipes the only viable way to do it is with sh -c. But even that is really complicated to implement, so we should only support simple commands. And in that case the best way is to split up the command by whitespace and use std::process::Command. We should assume that if the user wants more features like environment variables, redirects or pipes, they can just create a script and run that instead. But our command execution logic should be as simple as possible because otherwise it's a mess and not portable at all.

@marioortizmanero
Copy link

I'm bringing this issue up because I would like to try implementing it to warm up for the GSoC. After trying to find out how the command could be ran, another thing to take into account is whether we should use OsStr/OsString or str/String. The incoming data is in Vec<u8> after running the post-processing, so we'd have to parse it into some kind of string std::process::Command can handle.

The more correct way would be using OsStr/OsString -- they exist for a reason -- but that means operations like replace aren't supported and it could be more complicated.

Can we assume all commands input by the user will parse correctly with str/String?

@ernadh
Copy link
Member

ernadh commented Mar 30, 2021

Take a look at this PR for an implementation of an exec onramp that utilizes rust-subprocess crate. The same approach was meant to be used for the offramp. You'll find there some answers to the questions related to redirection and whether data needs to be stringified.

As far as escaping input goes, I believe the Twitter example was just sending empty status for each event rather than embedding an event in the actual command. A useful implementation that carries the similar spirit can be found in Telegraf's continuous exec output and example configuration here. The non-continuous version can be found here, though I believe we didn't mean to differentiate between the two logically but via configuration in Tremor.

@marioortizmanero
Copy link

Take a look at this PR for an implementation of an exec onramp that utilizes rust-subprocess crate. The same approach was meant to be used for the offramp. You'll find there some answers to the questions related to redirection and whether data needs to be stringified.

Huh I like how you think. So you just create a shell script file with the command and execute that with whatever shell is configured. Though I don't see that much of a necessity for rust-subprocess here because the offramp is simpler in the sense that you don't have to poll the command. You just execute it and wait for its termination, which AFAIK can be done with std::subprocess. Anyhow, I could use it for consistency with the onramp.

In that PR the UTF-8 part isn't really discussed so I would still like to get some opinions on that.

As far as escaping input goes, I believe the Twitter example was just sending empty status for each event rather than embedding an event in the actual command.

Oh I completely misunderstood it then. Using stdin sounds like a better idea to pass the event; I thought the proposal meant that {} would be replaced.

@ernadh
Copy link
Member

ernadh commented Mar 30, 2021

Data should really be an octet stream on which codec and processors can be applied. That PR is unfinished but the latest iteration we were working on is not forcing UTF-8 data.

rust-subprocess was used due to different types of execution like continuous or periodical for which std::process was not sufficient. There's a case to be made for an offramp that's expensive to start on every event that one should be able to run continuously but that's likely not rev0 of this offramp. cc: @tremor-rs/tremor-core

@marioortizmanero
Copy link

marioortizmanero commented Mar 31, 2021

And now that I know no replacing/escaping is needed, why create a file instead of just using ["sh/bash/zsh", "-c", command]? The PR seems to have an args field for ExecStream, but I don't understand why you'd need that for an already configurable command (it's always empty in the code), maybe I'm missing another feature?.

Also, I like the idea of a continuous script as well. As you said, I wouldn't implement it in the first attempt but I'll definitely take it into account. And I'll give rust-subprocess a try.

@ernadh
Copy link
Member

ernadh commented Mar 31, 2021

Those are good questions -- needless to say, take my answers as reflection of the type of thinking behind the work in that PR rather than conclusive in terms of the final solution for your work.

A file allows you to support executing scripts that are multi-line, have functions defined, loops, supporting multiple commands in one stream, etc. This allows for having Tremor execute rather complex scripts (say you're hitting the API, passing a complex script and asking for it to be executed once). Nothing prevents you from passing in one command to be executed in such an onramp but designing solely for std::process::Command would require additional orchestration outside of Tremor for execution of a complex script that does not fit into std::process::Command interface.

For example, a common scenario with Telegraf's exec is wrapping multiple commands (think gathering and processing telemetry on a Linux system with eBPF) in a file and registering that as one input. The scenario implies that prior to starting our collection agent (Telegraf) we need to ensure that this particular file is present on the system, perhaps using some configuration management tool like Puppet or by manually placing it in there. We can avoid that type of work by designing Tremor's exec onramp in such a way that it "feels" closer to the underlying OS than what command execution package in Go or Rust give us.

Args indeed aren't exposed as top-level configuration option in that PR because they apply for execution of the shell command, rather than the user-provided script. Args aren't empty -- they will contain the filename and on Windows (not reflexted in that PR) they will have an additional item "/C".

@marioortizmanero
Copy link

marioortizmanero commented Mar 31, 2021

A file allows you to support executing scripts that are multi-line, have functions defined, loops, supporting multiple commands in one stream, etc.

So does sh -c, sorry, I think I explained myself incorrectly. I'm just talking about simplicity, because you don't need to create a file beforehand. But this is probably not that important anyway.

@ernadh
Copy link
Member

ernadh commented Apr 1, 2021

I see what you mean. That totally makes sense for the example you provided.

There is a case, though, where you do want to have the script as an executable file. Take a script like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
active The RFC is active and the implementation is a work in progress enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants