Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a validation bug #237

Merged
merged 3 commits into from
Mar 16, 2022
Merged

Fix a validation bug #237

merged 3 commits into from
Mar 16, 2022

Conversation

ollietreend
Copy link
Contributor

@ollietreend ollietreend commented Mar 15, 2022

What

This PR fixes a couple of bugs in the way generated HTML markup was being sanitised and validated.

1. Validation before post-processing

Validation is now performed before post-processors have acted upon the generated HTML. This means that disallowed HTML tags and attributes can now be inserted into a document by post-processors – since they are trusted code – but still cannot be inserted into the source Markdown by users.

In other words, the order of processing is now:

[ User input ] --> [ Validation ] --> [ Post-processing ]

This seems to be the most logical way for validation to be performed, although it wasn't actually being done that way in practice.

Previously, post-processing would happen before the validation step:

[ User input ] --> [ Post-processing ] --> [ Validation ]

Resulting in changes in post-processed elements – such as button elements – to be run through the validator.

2. Apply allowed_image_hosts consistently

The first change resulted in tests covering the allowed_image_hosts sanitiser option to fail.

It turns out the allowed_image_hosts wasn't being carried through to the Govspeak::Document object which performed the final HTML conversion & sanitisation, meaning this rule wasn't being applied consistently between the validation and final markup produced.

Why is this needed?

This was brought to our attention by users of Publishing apps who have recently become unable to save documents containing Start Button elements.

Whenever a document contained the following snippet of Govspeak-flavoured Markdown:

{button start}[Start Now](https://gov.uk/example-service){/button}

a validation error would be displayed, and the user's changes wouldn't save.

Screenshot 2022-03-15 at 16 29 24

Therefore, users have been unable to create/update any pages that happen to contain a start button.

Why did this happen?

This bug only appeared recently, and it wasn't caused by a change in this gem.

This gem is dependent on the govuk_publishing_components gem to render button elements. A recent change to that gem, which tweaked the HTML markup used for buttons, had the side effect of exposing a flaw in the way validation was being performed in this gem.

The new HTML markup for buttons was triggering a validation error due to it using disallowed HTML attributes – even though this wasn't HTML entered by the user, but was instead generated by the govuk_publishing_components gem.

Visual Changes

None.

Trello: https://trello.com/c/HQ85E1ux/1315-publishers-unable-to-save-drafts-when-there-is-a-start-button

The HTML Validator was sanitizing Markdown which had been converted to HTML and had post-processors applied. This meant that any unsafe HTML added by a post-processor was causing the source Markdown to be considered invalid.

Logically, it seems more sensible for HTML to be validated after conversion from Markdown, but **before** the post-processors have altered the HTML.

This way, we only validate user input rather than validating the output of our own post-processors, which we consider to be trusted.
This commit fixes an inconsistency in how the `allowed_image_hosts` option was used when sanitizing generated HTML.

The option was used by `Govspeak::HtmlValidator` class.

However, that option had no effect when sanitization was performed by `Govspeak::Document` when initialised with the option `sanitize: true`.

In practice, this means that generated HTML markup could still contain disallowed image hosts if the source Markdown wasn't validated with `HtmlValidator` first.
@ollietreend ollietreend marked this pull request as ready for review March 15, 2022 17:05
@ollietreend
Copy link
Contributor Author

Additional thoughts about this code

The interfaces of some of these classes – particularly the Govspeak::HtmlSanitizer class – feel quite messy to me.

HtmlSanitizer options

For example, some sanitisation options need to be passed in to the HtmlSanitizer constructor, where others are passed in to the HtmlSanitizer#sanitize instance method.

This leads to code such as:

HtmlSanitizer.new(kramdown_doc.to_html, allowed_image_hosts: @allowed_image_hosts)
             .sanitize(allowed_elements: @allowed_elements)

which needs to know that :allowed_image_hosts has to be given to the constructor, but :allowed_elements is given to the #sanitize method.

Whilst I'd be keen to refactor this, there's a risk that consumers of this gem have come to rely upon this interface. After all, it's been this way for 8 years at the time of writing.

sanitize_config

It'd also be nice to refactor the way configuration is passed into the third-party Sanitize gem.

Currently the configuration is defined in two different methods (#sanitize and #sanitize_config) and is then merged. It'd be nicer to have a single place for this config to live.

However, since both methods are public, it's another case where consumers could depend on these methods behaving as they currently do. In fact, this is exactly what alphagov/hmrc-manuals-api does – see here.

@ollietreend ollietreend changed the title Fix validation Fix a validation bug Mar 15, 2022
Copy link
Member

@brucebolt brucebolt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great investigation work @ollietreend! Code changes all look good to me.

There appears to be a lack of testing around the post processing. I wonder if increasing the test coverage there would have caught this problem and blocked the PR that bumped us to the newer version of govuk_publishing_components?

CHANGELOG.md Outdated Show resolved Hide resolved
@ollietreend ollietreend merged commit 4e6b82c into main Mar 16, 2022
@ollietreend ollietreend deleted the fix-validation branch March 16, 2022 10:50
@ollietreend ollietreend mentioned this pull request Mar 16, 2022
ollietreend added a commit to alphagov/whitehall that referenced this pull request Jan 19, 2023
The SafeHtmlValidator checks values for unsafe HTML.

For HTML attachments, this validator has been running both on the
Govspeak input as well as the generated HTML output.

However, it's possible (and perfectly valid) for generated HTML output
to contain HTML that would be considered unsafe if entered by a user.
When Govspeak is converted to HTML, it's possible for
govuk_publishing_components to be included – for example, when adding
a [button] to a document – or indeed any other HTML that might be added
in post-processing.

This commit changes the validation rule so it only applies to Govspeak
input provided by the user. Govspeak HTML output is no longer validated.
This is safe to do because Govspeak post-processing is trusted code. For
more information, see alphagov/govspeak#237 which established this as
safe behaviour.

[button]: https://github.com/alphagov/govspeak#button
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants