-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sanitize annotations #52
Conversation
I believe that I am experiencing this issue: ezyang/htmlpurifier#71 |
src/Annotations/AppKernel.php
Outdated
@@ -103,6 +104,7 @@ public function __construct(string $environment = 'dev') | |||
'authority' => '', | |||
'group' => '', | |||
], | |||
'html_purifier.cache_dir' => $config['html_purifier']['cache_dir'] ?? __DIR__.'/../../var/html_purifier/cache', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to create the /srv/annotations/var/html_purifier/cache
in all environments in order for this to pass.
src/Annotations/AppKernel.php
Outdated
@@ -103,6 +104,12 @@ public function __construct(string $environment = 'dev') | |||
'authority' => '', | |||
'group' => '', | |||
], | |||
'html_purifier' => ($config['html_purifier'] ?? []) + [ | |||
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to create this directory in all environments: /srv/annotations/var/cache/html_purifier
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the Directory /srv/annotations/vendor/ezyang/htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer not writable, please chmod to 777
error related to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's related to that. I will work on MarkdownSanitizerTest
so that it doesn't attempt to write to cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
17cd325
to
813c2d5
Compare
@@ -1,26 +0,0 @@ | |||
<?php |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice to see these classes simplified as function callbacks
$this->htmlPurifier = $htmlPurifier; | ||
} | ||
|
||
public function parse(string $input) : string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API conversion pipeline needs its step to be listed as documentation in the README:
1. Hypothesis API to `Model` objects
2. Denormalization
2.1 Conversion to HTML (uses library X)
2.2 HTML purification (uses library Y)
...
(ADRs are not a good format for that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The obvious questions that needs to be answered there is why we sanitize intermediate HTML and convert back to Markdown rather than sanitizing the final HTML
eedba3e
to
ec2af83
Compare
composer.json
Outdated
@@ -29,7 +29,6 @@ | |||
"guzzlehttp/guzzle": "^6.3", | |||
"knplabs/console-service-provider": "^2.1", | |||
"league/commonmark": "^0.17.0", | |||
"psr/container": "^1.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but it should be included since AppKernel
implements ContainerInterface
.
src/Annotations/AppKernel.php
Outdated
'allow_unsafe_links' => false, | ||
], | ||
'html_purifier' => ($config['html_purifier'] ?? []) + [ | ||
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we extract the base cache path as its own parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean var/cache/
or /srv/annotations/var/cache
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
annotations/src/Annotations/AppKernel.php
Line 81 in 83c91e5
'cache.path' => $config['cache']['path'] ?? __DIR__.'/../../var/cache', |
README.md
Outdated
## Journey of a request to the annotations api | ||
1. Query received via the `annotations` api. | ||
1. Retrieve annotation listing from `hypothes.is/api` based on query parameters. | ||
1. Denormalize the annotations into an object model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[HypothesisClient\Model]`?
function clean_paragraph($text) | ||
{ | ||
$allowed_tags = '<i><sub><sup><span><del><math><a><br><caption>'; | ||
|
||
return strip_tags($text, $allowed_tags); | ||
} | ||
|
||
function escape_math($text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should have types. (function escape_math(string $text) : string
.)
README.md
Outdated
1. Query received via the `annotations` api. | ||
1. Retrieve annotation listing from `hypothes.is/api` based on query parameters. | ||
1. Denormalize the annotations into an object model. | ||
1. Normalize the annotations to prepare the response which conforms to the eLife json schema. [HTMLPurifier](http://htmlpurifier.org/) is used to sanitise content values. We rely on [CommonMark](http://commonmark.thephpleague.com/) to step through the Markdown content and prepare the structure that we need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would invert the two sentences: first we step through, and while stepping through we can sanitise?
src/Annotations/AppKernel.php
Outdated
'html_purifier' => ($config['html_purifier'] ?? []) + [ | ||
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier', | ||
], | ||
'html_converter' => ($config['html_converter'] ?? []) + [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a leftover to be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove
function escape_math(string $text) : string | ||
{ | ||
// Escape MathML. | ||
$escaped = preg_replace_callback('~(?P<before><math[^>]*>)(?P<mathml>.*)(?P<after></math>)~s', function ($match) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the regex is greedy and what happens with multiple tags. I'll ensure there is a test for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named groups aren't needed? Could be simplified to:
function escape_math(string $text) : string
{
$escape = function (array $matches) {
return Xml::escape($matches[0]);
};
return preg_replace_callback_array([
'~(<math[^>]*>.*?</math>)~s' => $escape,
'~(\$\$.+?\$\$)~s' => $escape,
], $text);
}
}, $escaped); | ||
|
||
return $escaped; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, I'm not sure whether an "unescaping" is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unescaping is not necessary as LaTeX and MathML are being presented as escaped in the api response.
use League\CommonMark\Block\Element; | ||
use League\CommonMark\DocParser; | ||
use League\CommonMark\ElementRendererInterface; | ||
use Psr\Log\LoggerInterface; | ||
use Symfony\Component\Serializer\Normalizer\NormalizerAwareInterface; | ||
use Symfony\Component\Serializer\Normalizer\NormalizerAwareTrait; | ||
use Symfony\Component\Serializer\Normalizer\NormalizerInterface; | ||
use function eLife\Annotations\Serializer\CommonMark\escape_math; | ||
|
||
final class HypothesisClientAnnotationNormalizer implements NormalizerInterface, NormalizerAwareInterface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NormalizerAwareInterface
not needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove
|
||
private function purify(string $text) : string | ||
{ | ||
return $this->htmlPurifier->purify($text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we'll want to cache the results of these (IIRC the library doesn't do this).
This class is feeling a bit bloated now, knowing how to convert all the bits and also purify them. Can purification happen before/after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Purification happening after is a possibility. I implemented a solution where it was happening before but it wasn't optimal.
In order for it to happen after the HypothesisClientAnnotationNormalizer
would return an unsanitized response and then we would need a class so step through each item in the content
array. So, I would need 2 places where the structure of the response was known. Perhaps we can have a chat about this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a refactor here which means I can remove purify from this class. See what you think.
9bf390b
to
bd51de7
Compare
@@ -145,8 +143,8 @@ private function processListBlock(Element\ListBlock $block) | |||
foreach ($item->children() as $child) { | |||
if ($child instanceof Element\ListBlock) { | |||
$items[] = [$render($child)]; | |||
} else { | |||
$items[] = $this->htmlRenderer->renderBlock($child); | |||
} elseif ($item = $this->htmlRenderer->renderBlock($child)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not introduced by this PR, but processText()
has already rendered the whole list then thrown it away (in processText()
) in favour of doing each item here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The render of the list converts the whole list to HTML. We need to step through the list to separate out each item. I will see if I can improve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You just need to stop it rendering the whole thing if it's a list, it's wasted effort.
use League\CommonMark\Environment; | ||
use League\CommonMark\HtmlRenderer as CommonMarkHtmlRenderer; | ||
|
||
class HtmlRenderer extends CommonMarkHtmlRenderer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name doesn't describe what it's doing (HtmlPurifierRenderer
?). Better to use composition too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misses out renderInlines()
too? Edit: or are 'inlines' always included inside blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working under that assumption. I will verify it by introducing a test.
|
||
final class HypothesisClientAnnotationNormalizer implements NormalizerInterface, NormalizerAwareInterface | ||
final class HypothesisClientAnnotationNormalizer implements NormalizerInterface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trait needs to go too.
composer.json
Outdated
@@ -24,10 +24,12 @@ | |||
"elife/content-negotiator": "^1.0", | |||
"elife/logging-sdk": "^1.0@dev", | |||
"elife/ping": "^1.0", | |||
"ezyang/htmlpurifier": "dev-master", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we relying on commits that haven't been tagged yet?
@@ -80,7 +80,7 @@ public function normalize($object, $format = null, array $context = []) : array | |||
|
|||
private function processText(string $text) : array | |||
{ | |||
$blocks = $this->docParser->parse($text)->children(); | |||
$blocks = $this->docParser->parse(escape_math($text))->children(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we're escaping Math elements before, then purifying after? This could also compose HtmlRenderer
?
(Means more regexes, but we're going to be better off caching the whole thing anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Math elements are escaped while the content is in Markdown. We need to determine what CommonMark
Block
elements that we have before rendering the HTML.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we need to sanitise MathML we may need a different library then HTMLPurifier: http://htmlpurifier.org/phorum/read.php?5,8091,8091
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that CommonMark
does something with HTML element? Assumed it wouldn't...
function clean_paragraph($text) | ||
use League\CommonMark\Util\Xml; | ||
|
||
function clean_paragraph(string $text) : string | ||
{ | ||
$allowed_tags = '<i><sub><sup><span><del><math><a><br><caption>'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between <math>
here and escape_math()
?
Is it a requirement to drop MathML? Unlike LaTeX we can display it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a requirement. I think it will not be too much trouble do capture and display it particularly when it's in a block rather than inline. But I will take that on as a followup to this. <math>
was left in while I was deciding how to treat mathml. I will remove it for now while we are simply escaping MathML.
|
||
/** | ||
* @covers \eLife\Annotations\Serializer\HypothesisClientAnnotationNormalizer | ||
*/ | ||
final class HypothesisClientAnnotationNormalizerTest extends PHPUnit_Framework_TestCase | ||
final class HypothesisClientAnnotationNormalizerWebTest extends WebTestCase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebTest
? Should just extend ApplicationTestCase
if you want AppKernel
.
fdd675e
to
c007395
Compare
/** | ||
* @param Environment $environment | ||
*/ | ||
public function __construct(Environment $environment, HTMLPurifier $htmlPurifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant this should compose the original HtmlRender
rather than duplicate everything...
* Exposing problem with greedy regex * Make <math> regex ungreedy
3f0ffe3
to
f226152
Compare
src/Annotations/AppKernel.php
Outdated
return new HTMLPurifier($app['html_purifier']); | ||
}; | ||
|
||
$this->app['annotation.serializer.common_mark.math_escape_renderer'] = function () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer $this->app->extend()
to making separate services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thewilkybarkid Could you submit a PR for this or provide a code snippet? I see different uses for extend
, I'm not sure what the appropriate one is here.
@@ -17,6 +17,6 @@ public function render(AbstractBlock $block, ElementRendererInterface $htmlRende | |||
throw new \InvalidArgumentException('Incompatible block type: '.get_class($block)); | |||
} | |||
|
|||
return Xml::escape($block->getStringContent()); | |||
return Xml::escape(trim($block->getStringContent())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whitespace might be meaningful?
@nlisgo Are you rebasing this branch? Easier to do a merge (keeps the history in the PR intact). Since we're doing squash commits we don't need to worry about the extra commit created. (And you can use the update branch button here for reasonably trivial updates.) |
…anitize-annotations
src/Annotations/AppKernel.php
Outdated
return new HtmlRenderer($this->app['annotation.serializer.common_mark.environment']); | ||
}; | ||
|
||
$this->app['annotation.serializer.html_purifier'] = function (Application $app) { | ||
return new HTMLPurifier($app['html_purifier']); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using $this->app
elsewhere.
…anitize-annotations
d24e296
to
b548a3b
Compare
@@ -26,7 +26,7 @@ public function setup() | |||
$this->renderer = $this->getMockBuilder(ElementRendererInterface::class) | |||
->getMock(); | |||
$this->abstractBlock = $this->createMock(AbstractBlock::class); | |||
$this->htmlPurifierRenderer = new HtmlPurifierRenderer($this->renderer, new HTMLPurifier()); | |||
$this->htmlPurifierRenderer = new HtmlPurifierRenderer($this->renderer, new HTMLPurifier(['Cache.SerializerPath' => '/tmp'])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sys_get_temp_dir()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some stylistic stuff (eg final
), but .
No description provided.