Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantically parse and deduplicate source expressions #498

Closed
wants to merge 31 commits into from

Conversation

lgarron
Copy link
Contributor

@lgarron lgarron commented Aug 17, 2022

Recently, we've had a spate of fixes for parsing directives and source expressions, stemming from the fact that the code doesn't understand the format of valid expressions, and makes local assumptions about what they look like — in particular, assuming a resemblance to URLs during deduplication, when handling a lot of possible values that are not URLs.

#490
#478

This PR is an attempt to 'bite the bullet" and parse source expressions so we can semantically deduplicate matching URLs. In the future, we could use this to add more validation.

All PRs:

  • Has tests
  • Documentation updated (N/A)

Adding a new header

expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "google.com;script-src *;.;" which will raise an error in future versions. It has been replaced with a blank space.))
expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://*;.;)).value).to eq("frame-ancestors google.com script-src * .")
expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "https://google.com;script-src https://localhost;example.com;" which will raise an error in future versions. It has been replaced with a blank space.))
expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://localhost;example.com;)).value).to eq("frame-ancestors google.com script-src localhost example.com")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the CSP spec/MDN docs, it looks like . is not a valid expression? https://www.w3.org/TR/CSP3/#framework-directive-source-list

This test seems to come from #418

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the original test fail on this PR? I hesitate to change the test if functionality under it didn't change.


module SecureHeaders
class ContentSecurityPolicy
class SourceExpression
Copy link
Contributor Author

@lgarron lgarron Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just call this a "expression" or "entry" even something more general, rather than a "source expression".

My goal was to support just "source lists", as defined in the spec and implied in content_security_policy.rb. But for now we still need parse things that don't technically belong in source lists, such as report paths and as unquoted directives like script-src (see code comments in other files).

@lgarron lgarron force-pushed the parse-source-expressions branch from f9ebbfb to 60dd23f Compare August 17, 2022 19:16
@lgarron lgarron marked this pull request as ready for review August 17, 2022 19:18
@lgarron lgarron requested review from vcsjones and JackMc August 17, 2022 20:04
@JackMc
Copy link
Contributor

JackMc commented Aug 18, 2022

@machisuji if you have time, your review would also be appreciated here having worked in this area very recently 🙇

@JackMc
Copy link
Contributor

JackMc commented Aug 18, 2022

Meta-concern: is this change a breaking one? I don’t think we have an easy way to differentiate between the quirks of this implementation and the other one. Those quirks weren’t documented, but they likely are now relied upon.

Copy link
Contributor

@JackMc JackMc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn’t make it through the entire PR, but some initial feedback. Will attempt to remember to come back and do more, but please ping me if I forget.

source_list.map do |expression|
if expression =~ /(\n|;)/
if !semicolon_warned_yet
Kernel.warn("#{directive} contains a #{$1} in #{source_list.join(" ").inspect} which will raise an error in future versions. It has been replaced with a blank space.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely the $1 in the warning here will be very confusing if it’s an actual new line. Could we do $1.inspect? I think that will show ”\n”

def clean_malformatted_sources(directive, source_list)
cleaned_source_list = []
semicolon_warned_yet = false
source_list.map do |expression|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn’t a map operation since it doesn’t have a meaningful array result. However we could do this using a reduce pattern, what do you think? reduce is a weird name for the operation in this context. Ruby also offers inject which is the same thing just aliased.

cleaned_source_list = source_list.inject([]) do |arr, source|
  # … add elements to `arr` as needed.
end

If we like the code as-is, we could just change this to each.

end
cleaned_source_list.select { |value| value != "" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can’t remember if this library only works in Rails, but if it does you can use compact_blank here.

Suggested change
cleaned_source_list.select { |value| value != "" }
cleaned_source_list.compact_blank

Otherwise, I would do something like:

Suggested change
cleaned_source_list.select { |value| value != "" }
cleaned_source_list.reject(&:blank?)

Additionally, with the suggestion above returning the array, you could call this on the return value and remove the need for cleaned_source_list.

@@ -156,17 +174,14 @@ def reject_all_values_if_none(source_list)
# e.g. *.github.com asdf.github.com becomes *.github.com
def dedup_source_list(sources)
sources = sources.uniq
wild_sources = sources.select { |source| source =~ STAR_REGEXP }
host_source_expressions = sources.map { |source| parse_source_expression(source) }
# TODO: Split by source expression type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO before the PR merges? If not we should likely document it in a ticket and remove this comment.

@@ -156,17 +174,14 @@ def reject_all_values_if_none(source_list)
# e.g. *.github.com asdf.github.com becomes *.github.com
def dedup_source_list(sources)
sources = sources.uniq
wild_sources = sources.select { |source| source =~ STAR_REGEXP }
host_source_expressions = sources.map { |source| parse_source_expression(source) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to not copy the array several times, you can use the in-place variants of these methods. Generally they end with an !.

Suggested change
host_source_expressions = sources.map { |source| parse_source_expression(source) }
sources.map! { |source| parse_source_expression(source) }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for uniq/uniq! and select/select!

@@ -214,5 +229,9 @@ def strip_source_schemes(source_list)
def symbol_to_hyphen_case(sym)
sym.to_s.tr("_", "-")
end

def source_scheme(source)
source.match(/^([A-Za-z0-9\-\+.]+):\/\//)&.values_at(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is doing this using a regex the most reliable way? I imagine we had a good reason for not using a URI parser.

expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "google.com;script-src *;.;" which will raise an error in future versions. It has been replaced with a blank space.))
expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://*;.;)).value).to eq("frame-ancestors google.com script-src * .")
expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "https://google.com;script-src https://localhost;example.com;" which will raise an error in future versions. It has been replaced with a blank space.))
expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://localhost;example.com;)).value).to eq("frame-ancestors google.com script-src localhost example.com")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the original test fail on this PR? I hesitate to change the test if functionality under it didn't change.

source_list.map do |expression|
if expression =~ /(\n|;)/
if !semicolon_warned_yet
Kernel.warn("#{directive} contains a #{$1} in #{source_list.join(" ").inspect} which will raise an error in future versions. It has been replaced with a blank space.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be out of scope for this PR but this kernel warning message and cleaning of ; has been present since Jan 2020. Maybe the version in which this change is released is the version that raises an error? If this method raised an error, it could reduce the logic we need to verify in test.

lgarron added a commit that referenced this pull request Oct 24, 2022
This PR removes `dedup_source_list` and replaces it with a simple
`.uniq` call. This resolves
#491, which is only the
latest in a series of ongoing issues with source expression
deduplication.

`secure_headers` has had this feature [since
2015](32bb3f5)
that [deduplicates redundant URL source
expressions](https://github.com/github/secure_headers/blob/494b75ff927464ed8d1c43e98e41fe4d15ce2bdf/lib/secure_headers/headers/content_security_policy.rb#L157-L170).
For example, if `*.github.com` is listed as a source expression for a
given
[directive](https://w3c.github.io/webappsec-csp/#framework-directives),
then the addition of `example.github.com` would have no effect, and so
the latter can be safely removed by `secure_headers` to save bytes.

Unfortunately, this implementation has had various bugs due to the use
of "impedance mismatched" APIs like
[`URI`](https://docs.ruby-lang.org/en/2.1.0/URI.html)[^1] and
[`File.fnmatch`](https://apidock.com/ruby/v2_5_5/File/fnmatch/class)[^2].
For example, it made incorrect assumptions about source expression
schemes, leading to the following series of events:

[^1]: Which allows wildcards in domains but not for ports, as it is not
designed to parse URL source expressions.
[^2]: Which has general glob matching that is not designed for URL
source expressions either.

- 2017-03: A [bug was reported and
confirmed](#317)
- 2022-04: The bug was finally [fixed by `@keithamus` (a Hubber) in
2022](#478) due to our use
of web sockets.
- 2022-06: This fix in turn triggered a [new
bug](#491) with source
expressions like `data:`.
- 2022-06: An external contributor [submitted a fix for the bew
bug](#490), but this still
doesn't address some of the "fast and loose" semantic issues of the
underlying implementation.
- 2022-08: `@lgarron` [drafted a new
implementation](#498) that
semantically parses and compares source expressions based on the
specification for source expressions.
- This implementation already proved to have some value in early
testing, as its stricter validation caught an issue in `github.com`'s
CSP. However, it would take additional work to make this implementation
fully aware of CSP syntax (e.g. not allowing URL source expressions in a
source directive when only special keywords are allowed, and
vice-versa), and it relies on a new regex-based implementation of source
expression parsing that may very well lead to more subtle bugs.

In effect, this is a half feature whose maintenance cost has outweighed
its functionality:

- The relevant code has suffered from continued bugs, described as
above.
- Deduplication is purely a "nice-to-have" — it is not necessary for the
security or correct functionality of `secure_headers`.
- It was [introduced by `@oreoshake` (the then-maintainer) without
explanation in
2015](32bb3f5),
never "officially" documented. We have no concrete data on whether it
has any performance impact on any real apps — for all we know, uncached
deduplication calculations might even cost more than the saved header
bytes.
- Further, in response to the first relevant bug, `@oreoshake` himself
[said](#317 (comment)):

> I've never been a fan of the deduplication based on `*` anyways. Maybe
we should just rip that out.

> Like people trying to save a few bytes can optimize elsewhere.

So this PR completely removes the functionality. If we learn of a use
case where this was very important (and the app somehow can't preprocess
the list before passing it to `secure_headers`), we can always resume
consideration of one of:

- #490
- #498
@lgarron
Copy link
Contributor Author

lgarron commented Oct 25, 2022

Abandoned in favor of #499

@lgarron lgarron closed this Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants