Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use literal-string for Document::fromString() $markup #32

Closed
wants to merge 1 commit into from

Conversation

craigfrancis
Copy link

Just taking a quick look at the Templado Documentation, for Engine 4, it looks like you used DOMDocument::loadHTML() and DOMDocument::loadXML() directly.

With Engine 5, it looks like you might be adding an abstraction for this, with Document::fromString()?

To ensure developers create their HTML/XML for their templates/snippets correctly, they should not include untrusted user data (those values should be provided separately, so Templado can encode them correctly).

This means the HTML/XML should either come from files on the disk (i.e. files created by the developer), or from "developer defined strings", known as the literal-string type in Static Analysis tools PHPStan and Psalm (also supported by PhpStorm 2022.3).

The literal-string type can used with variables, and supports concatenation; e.g.

/**
 * @param  literal-string  $html
 * @return void
 */
function loadHTML($html) {
  // ...
}

if ($search) {
  $htmlString = 'Results for <strong>?</strong>';
} else {
  $htmlString = 'Missing Search';
}

$htmlString = '
  <p>' . $htmlString . '</p>
  <p><a href="?">Example Link</a></p>';

loadHTML($htmlString);

Try it with PHPStan or Psalm.

It simply allows developers to check (via static analysis) they haven't made a classic mistake like this:

$htmlString = 'Results for <strong>' . ($_GET['q'] ?? '') . '</strong>';

(note, hopefully senior developers won't make these mistakes, but juniors developers do).


The literal-string type works for all kinds of Injection vulnerabilities (SQL, HTML, CLI, etc), and why database abstractions like Nette, Propel, RedBean uses it - and it does find mistakes.

As an aside, Pradeep Srinivasan and Graham Bleaney (Facebook/Meta) introduced the LiteralString type in Python 3.11; and this technique can be used in other programming languages.

@theseer
Copy link
Member

theseer commented Aug 13, 2023

Using the exploit example kind of kills the very idea using a templating engine to begin with - nothing templado could be doing here for the rescue... ;)

That being said, I do agree with the notion of having using literal-string.

Thanks!

@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (2df11d1) 100.00% compared to head (a12b964) 100.00%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@             Coverage Diff             @@
##                main       #32   +/-   ##
===========================================
  Coverage     100.00%   100.00%           
  Complexity       269       269           
===========================================
  Files             24        24           
  Lines            941       941           
===========================================
  Hits             941       941           
Files Changed Coverage Δ
src/document/Document.php 100.00% <ø> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@craigfrancis
Copy link
Author

Thanks @theseer.

Using the exploit example kind of kills the very idea using a templating engine to begin with - nothing templado could be doing here for the rescue... ;)

What I usually find is a senior developer uses the templating engine (good), then several months/years later a junior developer is asked to make a change, and they will sometimes put it into the template string, and it's not caught, often because it's one tiny change amongst many... and yes, I know, small commits, code reviews, etc; but I'm often only with a team for a week or two, and can't expect the senior to always pick up these things - that's what static analysis tools are good at :-)

@theseer
Copy link
Member

theseer commented Aug 13, 2023

I'm totally understanding where you're coming from. Unfortunately, I'm no longer convinced I can merge the change.

Quoting the psalm documentation on literal-string

literal-string

literal-string denotes a string value that is entirely composed of strings in your application.

Examples:

  • "hello " . "world"
  • "hello " . Person::DEFAULT_NAME
  • implode(', ', ["one", "two"])
  • implode(', ', [1, 2, 3])
  • "hello " .

Strings that don't pass this type check:

  • file_get_contents("foo.txt")
  • $_GET["foo"]
  • "hello " . $_GET["foo"]

While the general intention makes perfect sense, the fact it won't allow file_get_contents poses a problem. I do see why it's not allowed but for a templating engine like Templado, that's not an option. Even if I were to provide any type of wrapper that would do the actual loading of the template, it still would originate from an external source.

@theseer
Copy link
Member

theseer commented Aug 13, 2023

I just gave it a try locally, despite expecting it to not work.

I do not see a way, for literal-string to work for Templado's Document::fromString. While I could add a method to explicitly load from a file within Document, it will only make people save a string to a temp file and load it from there ;)
I haven't tested it, but i'm fairly certain psalm and phpstan won't like loading the markup from Redis or Mysql, either. I cannot possibly add loaders for all potential means to get the template's markup.

As nice as the idea was, I don't see a means of supporting it.

I'll just reject the PR.

Feel free to add additional comments and thoughts though :)

@theseer theseer closed this Aug 13, 2023
@theseer
Copy link
Member

theseer commented Aug 13, 2023

Just taking a quick look at the Templado Documentation, for Engine 4, it looks like you used DOMDocument::loadHTML() and DOMDocument::loadXML() directly.

With Engine 5, it looks like you might be adding an abstraction for this, with Document::fromString()?

For the record: Templado 4 does not use loadHTML nor loadHTMLFile, as they are horribly broken (or rather libxml2 is), but only the XML variants. Templado 4 offers a convenience wrapper to load from a file or parse a string. The actual work object HTML (which, in Templado 5, became Document) gets a ready made DOMDocument. For Templado 5 I merely decided to skip on the file access as it's useless and only adds complexity where it's not needed.

@craigfrancis
Copy link
Author

Ref developers using Document::fromString(file_get_contents($path)), sorry, I thought there would have been a method like Document::fromFile($path). I just assumed most developers would be loading templates from files, and would prefer the "complexity" in there, rather than having to use file_get_contents() themselves.

For Redis or MySQL, you're right, those cannot be trusted, and Templado shouldn't add loaders for them (you cannot provide the security guarantees needed). Personally I would provide a method like fromUnsafeString() for those cases, simply because it makes auditing considerably easier (i.e. highlight it is dangerous, making it clear they need more time to check these few cases, and then use static analysis tools check everything else).

As to Templado 4 using DOMDocument::loadHTML(), I was just referring to the documentation, line 980.

Thanks for considering anyway, if I work with a team who is using Templado, I can always get them to make a wrapper for Templado, or create a stub, to add this security check.

@theseer
Copy link
Member

theseer commented Aug 14, 2023

Ref developers using Document::fromString(file_get_contents($path)), sorry, I thought there would have been a method like Document::fromFile($path). I just assumed most developers would be loading templates from files, and would prefer the "complexity" in there, rather than having to use file_get_contents() themselves.

Sorry :-) Maybe I'll re-add some helpers for that. But it's additional complexity for the main engine which should focus on, well, engine related things ;)

For Redis or MySQL, you're right, those cannot be trusted, and Templado shouldn't add loaders for them (you cannot provide the security guarantees needed). Personally I would provide a method like fromUnsafeString() for those cases, simply because it makes auditing considerably easier (i.e. highlight it is dangerous, making it clear they need more time to check these few cases, and then use static analysis tools check everything else).

I believe to understand your idea. But it just adds complexity to the engine component. And one, I cannot actually write tests for as PHP doesn't see the difference between string and literal-string. So, from PHP's perspective both methods would be identical. I don't like that. And given PHP does not have is_literal or is_tainted - as flawed as those are conceptually - there's not really much to do. A friend of mine once said: "Don't restrict me without reason. You'll make me become creative. You don't want me to be creative ;)".

As to Templado 4 using DOMDocument::loadHTML(), I was just referring to the documentation, line 980.

That fragement is an example on how to extract something to pass along. It's not what Templado is doing in general.

Thanks for considering anyway, if I work with a team who is using Templado, I can always get them to make a wrapper for Templado, or create a stub, to add this security check.

I see :)

@craigfrancis
Copy link
Author

Maybe I'll re-add some helpers for that. But it's additional complexity for the main engine which should focus on, well, engine related things ;)

Yep, the location of the complexity is important, although I would hope that a fromFile() method would be fairly simple.

I believe to understand your idea. But it just adds complexity to the engine component.

I suppose having 3 methods to load the template (from string, from file, or from an unsafe source) would be adding some kind of complexity.

So, from PHP's perspective both methods would be identical. I don't like that. And given PHP does not have is_literal or is_tainted - as flawed as those are conceptually - there's not really much to do.

I spent many years trying to get taint checking to work, specifically trying to get the taint PECL extension to understand context. But I gave up, it's far too complicated. I've given several talks on this subject, and how taint checking is flawed (see also the limitations section of the psalm manual).

But I learnt about the "developer defined string" idea from Christoph Kern and Mike Samuel, and how it solves the problem - as in, it's impossible to have an Injection Vulnerability when the string does not contain user data. Instead you either keep user values separate (parameterised queries), or use a library to do context aware escaping (consistently, correctly, well tested, with domain expert knowledge). That's why I created the is_literal() RFC, and while it failed the vote, the main feedback I got was the need to include Phase 2, which is to have a LiteralString type as well. I've written a draft RFC for that, but I'm just waiting for George (G.P.B), one of the PHP Internals developers, to have some free time to finish the implementation (I've been using Joe's is_literal() implementation for a few years now, and it works incredibly well).

@theseer
Copy link
Member

theseer commented Aug 15, 2023

Curious what's going to come from that.

I never understood the hype about taint modes. It's just a broken concept, as said before.

I'm also not sure I'm fully buying into the literal movement as of yet. For a simple reason: What does "defined in code" mean in a scripting language that parses files at runtime?

file_put_contents('test.php', sprintf('<?php return "%s"; ?>',  $_GET['foo']));
$var = require('test.php');

Is $var now a literal or not? What about data: uri? A custom stream? And why would a file_get_contents be considered bad while require is fine?

There is just no technical solution to a social problem.

@craigfrancis
Copy link
Author

I never understood the hype about taint modes. It's just a broken concept, as said before.

Agreed, I tried to make taint work, and I don't think it's possible either (why I'm using this technique).

file_put_contents('test.php', sprintf('<?php return "%s"; ?>',  $_GET['foo']));
$var = require('test.php');

May I suggest var_export(), which I included as an example in my RFC ;-)

Anyway, yes, a developer could do that, but it's not something I've ever typed by accident, or found in the wilds. And I don't think we should worry about developers being intentionally malicious; instead I'm focusing on the fact that developers make mistakes all the time, even if they have been taught how to do things "correctly" (why this issue remains in the OWASP Top 10).

There is just no technical solution to a social problem.

Don't see how a social change can stop these mistakes, unless you're thinking that we beat developers any time they make a mistake? so they are extra careful? personally I think this simple technical solution is better ;-)


And, because I want to make sure I'm not proposing a flawed solution, I'll also check your other questions:

  • The data: URI example, assuming that's coming from the user (not a literal-string), it won't be Injected into the template, this means that the templating engine (the only one to be trusted to understand HTML) can be responsible for encoding or rejecting that value based on context.
  • A stream would not return a literal-string either.
  • Normally file_get_contents() is for getting file contents (assume bad, as it often contains user data), whereas require() executes code (which should have been written by the developer; if that's not the case, then you probably have bigger things to worry about).

@theseer
Copy link
Member

theseer commented Aug 29, 2023

To me, the last part is the "killer argument": How do you load a template?

@craigfrancis
Copy link
Author

Templado simply provides 3 methods (+15 lines):

  1. fromFile(), probably the most used, allowing the developer to easily load files they have created. It's very unlikely a developer would create and use an untrusted file (e.g. user upload), in the same way I've only seen 1 developer use require() on an untrusted file (that was a weird day).

  2. fromString(), using a literal-string to ensure a developer defined string is used, so we can easily check that it contains no unsafe user values (where I find most XSS issues are introduced, typically a junior developer editing existing code).

  3. fromUnsafeString(), for those rare cases where the string is not a literal-string. This makes the developer think twice, and makes it easy for code review/auditing (e.g. I can focus on the few times this is used, and trust the other from* methods).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants