Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filtering variable sets #52

Open
brainchild0 opened this issue Feb 1, 2022 · 10 comments
Open

filtering variable sets #52

brainchild0 opened this issue Feb 1, 2022 · 10 comments

Comments

@brainchild0
Copy link
Contributor

One of the current limitations is that templates are always evaluated against the full variable set of shell variables. Because these variables are set from a variety of sources, unpredictable and even unsafe results may emerge from invoking templates on them. A secondary concern is the difference between variable names used in some specific template versus those chosen for use the surrounding shell logic.

It would be advantageous to support an invocation case in which the template variables are filtered representation of the shell variables. Filtering may take in general two useful forms. The first is selection of which variables to include in template evaluation, with other variables excluded. The second is a name mapping, that is, support for an optional alternative name for any variable selected for inclusion during template processing.

An interface supporting such functionality might be an added parameter allowing specification of a white space-separated list of variables for inclusion, each one optionally followed by a colon and then an alternative name.

@fidian
Copy link
Contributor

fidian commented Feb 1, 2022

This would certainly slow down execution, though I can clearly see the benefit. Because there's no real scoping, we'd have to run all user-defined functions and evaluate all variables in a subshell and deal with cleaning/preserving some environment variables. Besides the penalty from the subshell, we'd also have the filtering loop-within-a-loop causing more slowdowns. That's really a huge concern of mine, especially since I know there are places that generate hundreds or thousands of output files from templates and data, so doubling the processing time for those users would be bad.

This sounds like a change with a high impact. I'm worried that you are taking data, functions, and templates from untrusted sources and running them. mo isn't really a secure product and adding security measures like this are only partially addressing the holes that exist - that might lead to a false sense of security for new users. This doesn't mean I should not do it, but security vulnerabilities will still remain and it's important that this feature isn't touted as making anything secure.

So, I suppose the task list would look something like this:

  1. Create a function to start a subshell in a clean environment
  2. Pattern match existing variables (well, ones that used to exist?) and copy them into the subshell
  3. Alter moCallFunction to use this new clean subshell
  4. When running functions, we check them against a whitelist
  5. When checking variables, we also make sure they match the whitelist

I don't know why I would support name mapping. That should be done when feeding data into mo and is already in your realm. Do you have an instance of where this would be useful?

@brainchild0
Copy link
Contributor Author

brainchild0 commented Feb 2, 2022

Because there's no real scoping, we'd have to run all user-defined functions and evaluate all variables in a subshell and deal with cleaning/preserving some environment variables.

Such is the approach I originally considered, but a more lightweight alternative is simply filtering at the stage in which the template processor resolves the template value, by inserting a query to an intermediary mapping table.

Besides the penalty from the subshell, we'd also have the filtering loop-within-a-loop causing more slowdowns.

Why? Would use of associative arrays help avoid the scenario you envision requiring a nested loop?

That's really a huge concern of mine, especially since I know there are places that generate hundreds or thousands of output files from templates and data, so doubling the processing time for those users would be bad.

The process could be optional, with the original performance being preserved for the case of not invoking the new behavior. The specific scenarios you describe, however, are unfamiliar to my experience. You seem to describe a case in which a Bash script delivers acceptable performance under conditions of high demand, but even so the performance would degrade unacceptably due to use of subshells.

This sounds like a change with a high impact. I'm worried that you are taking data, functions, and templates from untrusted sources and running them.

A template operation applies a specific template to a specific data set to produce output. It should have no effects other than producing the output. Trust is not the important issue in my view because the effects of a template operation are limited to the contents of the output.

The issue for me is robustness and predictability. Suppose a data set is given as { a=foo, b=bar, c=baz }. The output should not include the user name of the user running the template process, even if the template refers to a variable called USER. The template processor should consider the variable unset, because it is not a member of the data set. It should consider similarly unset any other variables that might be included for the other reasons in the environment (e.g. PYTHONPATH, XAUTHORITY, etc.).

Trust of sources is not a concern that I understand as important to the discussion. Keeping the output predictable based on the template and data set is the central issue I am raising.

@brainchild0
Copy link
Contributor Author

It would be helpful to see at least a simple implementation of the essential functionality, possibly to be augmented later.

A rough outline of how it might work is a follows. A new parameter would be added to the main processor. The form of the parameter value would be a list of words, each including a variable name followed optionally by a colon and a mapped name.

For example, suppose the following two shell variables are defined:

FOO_VARIABLE=Hello
bar=World

The processor may be passed a value for the parameter as FOO_VARIABLE:foo bar. The variables exposed to the template processor would be such that the template {{foo}} {{bar}} would evaluate to Hello World. Other shell variables would be inaccessible in the template.

The modified view of the variable set may be achieved simply by adding an additional intermediary step when resolving the variable values during template processing, requiring minimal overall modification to the code.

@brainchild0
Copy link
Contributor Author

brainchild0 commented Feb 21, 2022

I wonder whether you have had a chance to consider further. Would you be open to reviewing a PR, assuming I could offer one, at least for a first design iteration?

@Swivelgames
Copy link
Contributor

I apologize if I don't fully understand the issue here, but if the concern is that it might pollute the environment of the current shell, why not spin up an isolated shell for execution instead?

env -i "$BASH" -c './mo ...args'

This wouldn't require changing mo at all to achieve.

@fidian
Copy link
Contributor

fidian commented Apr 3, 2023

Currently, mo will expose any variables to the template. This includes environment variables that were in-place when mo starts and internal variables used for tracking content, tags, loop counters, etc.

I believe the goal is to eliminate any variables that don't match. For security, we would also want to mask away any values from internal functions. The user should be able to exclude sensitive information like $PATH. Also maybe they want to use $moContent. Perhaps the allowed variables are in a whitelist.

When implementing this, I want to make sure performance doesn't drastically suffer. There's a minimum of subshells created and content is returned indirectly for speed. When I mentioned loops within a loop, I'm specifically talking about how moParse loops over tags and then has to fetch data from variables. While fetching data from variables, I will need a way to rapidly detect if the variable is part of the whitelisted variables.

A last hurdle is keeping a copy of all variables when the program starts, then referencing that list when resolving the values. Unfortunately, this would also impact all user-defined functions because they will have to find a way to get access to these values as well. I am uncertain how this last goal could be achieved.

Sorry for not responding over a year ago. I thought I had. I would like to see an approach detailed before a pull request, if possible, and still make it able to run on Bash 3.2 for Apple support.

@Swivelgames
Copy link
Contributor

Swivelgames commented Apr 3, 2023

Forgive me, but I think that's the whole premise behind env's -i, --ignore-environment 🙂

Apple running Bash 3.2:
image

This is a bit of me being the pot calling the kettle black (given my latest, monstrous PR 😅 ), but I feel like we may be over-complicating things a bit.

See man env:

-i, --ignore-environment
start with an empty environment

The easiest way to deal with this is using env, which is specifically the purpose of it, including the ability to give specific environment variables to pass through if this is desired.

At most exec could be used to short-circuit execution to ensure a safe and clean execution environment if a flag is passed, if that's something that's desired:

#!/usr/bin/env bash

# If the environment isn't clean, make it so!
[ ! -z "$HOME"] && exec env -i "$BASH" -c "$0 $@"

Or something similar.

Reinventing the wheel by implementing an entire subshell functionality seems like a heavy endeavor, rather than just starting off the parsing inside a sanitized environment to begin with using env -i.

If neither of you get to it first, I can see what I can do to demonstrate the viability of using env to make things safe with Apple support in a PR once I'm done splitting up my other PR. 🙂👍

@fidian
Copy link
Contributor

fidian commented Apr 4, 2023

How would you propose removing a template's ability to access $PATH? Imagine a crazy world where users submit templates, perhaps through a web browser, and these templates are automatically used to generate scripts or web pages or emails.

Running env -i $(which bash) -c "set" still produces 32 lines of output with lots of details in there that a security-minded professional would not like to expose to random users.

@Swivelgames
Copy link
Contributor

Swivelgames commented Apr 4, 2023

Fair point 😂

I might play around with something. Avoiding loops would be ideal; to your point, it would certainly degrade performance. But, ultimately, it would be good to only use vars that are defined after execution.

@Swivelgames
Copy link
Contributor

@fidian @brainchild0 I've implemented a cursory solution that appears to work pretty well. Mind the test, of course. It's a little hairy.

The limitation of this solution is that it only works if you use --source, but I think that's a fair limitation, given the ask here.

Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants