Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize annotations #52

Merged
merged 36 commits into from
Jan 29, 2018
Merged

Sanitize annotations #52

merged 36 commits into from
Jan 29, 2018

Conversation

nlisgo
Copy link
Member

@nlisgo nlisgo commented Jan 23, 2018

No description provided.

@nlisgo
Copy link
Member Author

nlisgo commented Jan 23, 2018

I believe that I am experiencing this issue: ezyang/htmlpurifier#71

@@ -103,6 +104,7 @@ public function __construct(string $environment = 'dev')
'authority' => '',
'group' => '',
],
'html_purifier.cache_dir' => $config['html_purifier']['cache_dir'] ?? __DIR__.'/../../var/html_purifier/cache',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to create the /srv/annotations/var/html_purifier/cache in all environments in order for this to pass.

@@ -103,6 +104,12 @@ public function __construct(string $environment = 'dev')
'authority' => '',
'group' => '',
],
'html_purifier' => ($config['html_purifier'] ?? []) + [
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to create this directory in all environments: /srv/annotations/var/cache/html_purifier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Directory /srv/annotations/vendor/ezyang/htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer not writable, please chmod to 777 error related to this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's related to that. I will work on MarkdownSanitizerTest so that it doesn't attempt to write to cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#54

@nlisgo nlisgo force-pushed the sanitize-annotations branch from 17cd325 to 813c2d5 Compare January 24, 2018 09:28
@@ -1,26 +0,0 @@
<?php
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice to see these classes simplified as function callbacks

$this->htmlPurifier = $htmlPurifier;
}

public function parse(string $input) : string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API conversion pipeline needs its step to be listed as documentation in the README:

1. Hypothesis API to `Model` objects
2. Denormalization
2.1 Conversion to HTML (uses library X)
2.2 HTML purification (uses library Y)
...

(ADRs are not a good format for that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The obvious questions that needs to be answered there is why we sanitize intermediate HTML and convert back to Markdown rather than sanitizing the final HTML

@nlisgo nlisgo force-pushed the sanitize-annotations branch 3 times, most recently from eedba3e to ec2af83 Compare January 24, 2018 14:24
@nlisgo nlisgo requested a review from thewilkybarkid January 24, 2018 14:37
composer.json Outdated
@@ -29,7 +29,6 @@
"guzzlehttp/guzzle": "^6.3",
"knplabs/console-service-provider": "^2.1",
"league/commonmark": "^0.17.0",
"psr/container": "^1.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but it should be included since AppKernel implements ContainerInterface.

'allow_unsafe_links' => false,
],
'html_purifier' => ($config['html_purifier'] ?? []) + [
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier',
Copy link
Member

@thewilkybarkid thewilkybarkid Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we extract the base cache path as its own parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean var/cache/ or /srv/annotations/var/cache?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

'cache.path' => $config['cache']['path'] ?? __DIR__.'/../../var/cache',

README.md Outdated
## Journey of a request to the annotations api
1. Query received via the `annotations` api.
1. Retrieve annotation listing from `hypothes.is/api` based on query parameters.
1. Denormalize the annotations into an object model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[HypothesisClient\Model]`?

function clean_paragraph($text)
{
$allowed_tags = '<i><sub><sup><span><del><math><a><br><caption>';

return strip_tags($text, $allowed_tags);
}

function escape_math($text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should have types. (function escape_math(string $text) : string.)

README.md Outdated
1. Query received via the `annotations` api.
1. Retrieve annotation listing from `hypothes.is/api` based on query parameters.
1. Denormalize the annotations into an object model.
1. Normalize the annotations to prepare the response which conforms to the eLife json schema. [HTMLPurifier](http://htmlpurifier.org/) is used to sanitise content values. We rely on [CommonMark](http://commonmark.thephpleague.com/) to step through the Markdown content and prepare the structure that we need.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would invert the two sentences: first we step through, and while stepping through we can sanitise?

'html_purifier' => ($config['html_purifier'] ?? []) + [
'Cache.SerializerPath' => __DIR__.'/../../var/cache/html_purifier',
],
'html_converter' => ($config['html_converter'] ?? []) + [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a leftover to be removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove

function escape_math(string $text) : string
{
// Escape MathML.
$escaped = preg_replace_callback('~(?P<before><math[^>]*>)(?P<mathml>.*)(?P<after></math>)~s', function ($match) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the regex is greedy and what happens with multiple tags. I'll ensure there is a test for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Named groups aren't needed? Could be simplified to:

function escape_math(string $text) : string
{
    $escape = function (array $matches) {
        return Xml::escape($matches[0]);
    };

    return preg_replace_callback_array([
        '~(<math[^>]*>.*?</math>)~s' => $escape,
        '~(\$\$.+?\$\$)~s' => $escape,
    ], $text);
}

}, $escaped);

return $escaped;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I'm not sure whether an "unescaping" is necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unescaping is not necessary as LaTeX and MathML are being presented as escaped in the api response.

use League\CommonMark\Block\Element;
use League\CommonMark\DocParser;
use League\CommonMark\ElementRendererInterface;
use Psr\Log\LoggerInterface;
use Symfony\Component\Serializer\Normalizer\NormalizerAwareInterface;
use Symfony\Component\Serializer\Normalizer\NormalizerAwareTrait;
use Symfony\Component\Serializer\Normalizer\NormalizerInterface;
use function eLife\Annotations\Serializer\CommonMark\escape_math;

final class HypothesisClientAnnotationNormalizer implements NormalizerInterface, NormalizerAwareInterface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NormalizerAwareInterface not needed anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove


private function purify(string $text) : string
{
return $this->htmlPurifier->purify($text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we'll want to cache the results of these (IIRC the library doesn't do this).

This class is feeling a bit bloated now, knowing how to convert all the bits and also purify them. Can purification happen before/after?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purification happening after is a possibility. I implemented a solution where it was happening before but it wasn't optimal.

In order for it to happen after the HypothesisClientAnnotationNormalizer would return an unsanitized response and then we would need a class so step through each item in the content array. So, I would need 2 places where the structure of the response was known. Perhaps we can have a chat about this tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a refactor here which means I can remove purify from this class. See what you think.

@nlisgo nlisgo force-pushed the sanitize-annotations branch from 9bf390b to bd51de7 Compare January 24, 2018 15:27
@@ -145,8 +143,8 @@ private function processListBlock(Element\ListBlock $block)
foreach ($item->children() as $child) {
if ($child instanceof Element\ListBlock) {
$items[] = [$render($child)];
} else {
$items[] = $this->htmlRenderer->renderBlock($child);
} elseif ($item = $this->htmlRenderer->renderBlock($child)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introduced by this PR, but processText() has already rendered the whole list then thrown it away (in processText()) in favour of doing each item here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The render of the list converts the whole list to HTML. We need to step through the list to separate out each item. I will see if I can improve this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just need to stop it rendering the whole thing if it's a list, it's wasted effort.

use League\CommonMark\Environment;
use League\CommonMark\HtmlRenderer as CommonMarkHtmlRenderer;

class HtmlRenderer extends CommonMarkHtmlRenderer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name doesn't describe what it's doing (HtmlPurifierRenderer?). Better to use composition too.

Copy link
Member

@thewilkybarkid thewilkybarkid Jan 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misses out renderInlines() too? Edit: or are 'inlines' always included inside blocks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working under that assumption. I will verify it by introducing a test.


final class HypothesisClientAnnotationNormalizer implements NormalizerInterface, NormalizerAwareInterface
final class HypothesisClientAnnotationNormalizer implements NormalizerInterface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trait needs to go too.

composer.json Outdated
@@ -24,10 +24,12 @@
"elife/content-negotiator": "^1.0",
"elife/logging-sdk": "^1.0@dev",
"elife/ping": "^1.0",
"ezyang/htmlpurifier": "dev-master",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we relying on commits that haven't been tagged yet?

@@ -80,7 +80,7 @@ public function normalize($object, $format = null, array $context = []) : array

private function processText(string $text) : array
{
$blocks = $this->docParser->parse($text)->children();
$blocks = $this->docParser->parse(escape_math($text))->children();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we're escaping Math elements before, then purifying after? This could also compose HtmlRenderer?

(Means more regexes, but we're going to be better off caching the whole thing anyway.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Math elements are escaped while the content is in Markdown. We need to determine what CommonMark Block elements that we have before rendering the HTML.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we need to sanitise MathML we may need a different library then HTMLPurifier: http://htmlpurifier.org/phorum/read.php?5,8091,8091

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that CommonMark does something with HTML element? Assumed it wouldn't...

function clean_paragraph($text)
use League\CommonMark\Util\Xml;

function clean_paragraph(string $text) : string
{
$allowed_tags = '<i><sub><sup><span><del><math><a><br><caption>';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between <math> here and escape_math()?

Is it a requirement to drop MathML? Unlike LaTeX we can display it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a requirement. I think it will not be too much trouble do capture and display it particularly when it's in a block rather than inline. But I will take that on as a followup to this. <math> was left in while I was deciding how to treat mathml. I will remove it for now while we are simply escaping MathML.


/**
* @covers \eLife\Annotations\Serializer\HypothesisClientAnnotationNormalizer
*/
final class HypothesisClientAnnotationNormalizerTest extends PHPUnit_Framework_TestCase
final class HypothesisClientAnnotationNormalizerWebTest extends WebTestCase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WebTest? Should just extend ApplicationTestCase if you want AppKernel.

@nlisgo nlisgo force-pushed the sanitize-annotations branch from fdd675e to c007395 Compare January 25, 2018 13:04
/**
* @param Environment $environment
*/
public function __construct(Environment $environment, HTMLPurifier $htmlPurifier)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this should compose the original HtmlRender rather than duplicate everything...

@nlisgo nlisgo force-pushed the sanitize-annotations branch from 3f0ffe3 to f226152 Compare January 26, 2018 13:53
return new HTMLPurifier($app['html_purifier']);
};

$this->app['annotation.serializer.common_mark.math_escape_renderer'] = function () {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer $this->app->extend() to making separate services.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thewilkybarkid Could you submit a PR for this or provide a code snippet? I see different uses for extend, I'm not sure what the appropriate one is here.

@@ -17,6 +17,6 @@ public function render(AbstractBlock $block, ElementRendererInterface $htmlRende
throw new \InvalidArgumentException('Incompatible block type: '.get_class($block));
}

return Xml::escape($block->getStringContent());
return Xml::escape(trim($block->getStringContent()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace might be meaningful?

@thewilkybarkid
Copy link
Member

thewilkybarkid commented Jan 26, 2018

@nlisgo Are you rebasing this branch? Easier to do a merge (keeps the history in the PR intact). Since we're doing squash commits we don't need to worry about the extra commit created.

(And you can use the update branch button here for reasonably trivial updates.)

return new HtmlRenderer($this->app['annotation.serializer.common_mark.environment']);
};

$this->app['annotation.serializer.html_purifier'] = function (Application $app) {
return new HTMLPurifier($app['html_purifier']);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using $this->app elsewhere.

@nlisgo nlisgo force-pushed the sanitize-annotations branch from d24e296 to b548a3b Compare January 29, 2018 12:09
@@ -26,7 +26,7 @@ public function setup()
$this->renderer = $this->getMockBuilder(ElementRendererInterface::class)
->getMock();
$this->abstractBlock = $this->createMock(AbstractBlock::class);
$this->htmlPurifierRenderer = new HtmlPurifierRenderer($this->renderer, new HTMLPurifier());
$this->htmlPurifierRenderer = new HtmlPurifierRenderer($this->renderer, new HTMLPurifier(['Cache.SerializerPath' => '/tmp']));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sys_get_temp_dir()

Copy link
Member

@thewilkybarkid thewilkybarkid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some stylistic stuff (eg final), but :shipit:.

@nlisgo nlisgo merged commit b3a3305 into develop Jan 29, 2018
@thewilkybarkid thewilkybarkid deleted the sanitize-annotations branch January 30, 2018 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants