Semantically parse and deduplicate source expressions #498

lgarron · 2022-08-17T19:08:04Z

Recently, we've had a spate of fixes for parsing directives and source expressions. They stem from the fact that the code doesn't understand the format of valid expressions and instead makes local assumptions about what they look like. In particular, deduplication assumes a resemblance to URLs, even though many of the possible values are not URLs.

#490
#478

This PR is an attempt to "bite the bullet" and parse source expressions so we can semantically deduplicate matching URLs. In the future, we could use this to add more validation.

All PRs:

Has tests
~~Documentation updated~~ (N/A)

Adding a new header

lgarron · 2022-08-17T19:08:33Z

spec/lib/secure_headers/headers/content_security_policy_spec.rb

-        expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "google.com;script-src *;.;" which will raise an error in future versions. It has been replaced with a blank space.))
-        expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://*;.;)).value).to eq("frame-ancestors google.com script-src * .")
+        expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "https://google.com;script-src https://localhost;example.com;" which will raise an error in future versions. It has been replaced with a blank space.))
+        expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://localhost;example.com;)).value).to eq("frame-ancestors google.com script-src localhost example.com")


Based on the CSP spec/MDN docs, it looks like `.` is not a valid expression? https://www.w3.org/TR/CSP3/#framework-directive-source-list

This test seems to come from #418

lgarron · 2022-08-17T19:10:52Z

lib/secure_headers/headers/content_security_policy/source_expression.rb

+
+module SecureHeaders
+  class ContentSecurityPolicy
+    class SourceExpression


I wonder if we should just call this an "expression" or "entry", or even something more general, rather than a "source expression".

My goal was to support just "source lists", as defined in the spec and implied in content_security_policy.rb. But for now we still need to parse things that don't technically belong in source lists, such as report paths and unquoted directives like script-src (see code comments in other files).

JackMc · 2022-08-18T12:06:39Z

@machisuji if you have time, your review would also be appreciated here having worked in this area very recently 🙇

JackMc · 2022-08-18T12:08:31Z

Meta-concern: is this change a breaking one? I don’t think we have an easy way to differentiate between the quirks of this implementation and the other one. Those quirks weren’t documented, but they likely are now relied upon.

JackMc

Didn’t make it through the entire PR, but some initial feedback. Will attempt to remember to come back and do more, but please ping me if I forget.

JackMc · 2022-08-18T12:10:57Z

lib/secure_headers/headers/content_security_policy.rb

+      source_list.map do |expression|
+        if expression =~ /(\n|;)/
+          if !semicolon_warned_yet
+            Kernel.warn("#{directive} contains a #{$1} in #{source_list.join(" ").inspect} which will raise an error in future versions. It has been replaced with a blank space.")


Likely the $1 in the warning here will be very confusing if it's an actual newline. Could we do $1.inspect? I think that will show "\n".
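To illustrate the difference, here is a minimal sketch. The `warning_for` helper is hypothetical and only mirrors the shape of the message in the diff; it uses `match[1]` instead of `$1` so that it does not depend on global match state.

```ruby
# Hypothetical helper (not part of secure_headers): builds a warning for the
# first newline or semicolon found in `expression`, with or without `inspect`.
def warning_for(directive, expression, inspect_match:)
  match = expression.match(/(\n|;)/)
  return nil unless match

  found = inspect_match ? match[1].inspect : match[1]
  "#{directive} contains a #{found} in #{expression.inspect}"
end

raw = warning_for("frame_ancestors", "example.com\nscript-src", inspect_match: false)
escaped = warning_for("frame_ancestors", "example.com\nscript-src", inspect_match: true)

puts raw      # the literal newline splits the message across two lines
puts escaped  # the newline is rendered as an escaped, quoted "\n"
```

With `inspect`, a semicolon would also be quoted (`";"`), so any existing spec that matches the exact warning text would need to be updated.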

JackMc · 2022-08-18T12:18:14Z

lib/secure_headers/headers/content_security_policy.rb

+    def clean_malformatted_sources(directive, source_list)
+      cleaned_source_list = []
+      semicolon_warned_yet = false
+      source_list.map do |expression|


This isn’t a map operation since it doesn’t have a meaningful array result. However we could do this using a reduce pattern, what do you think? reduce is a weird name for the operation in this context. Ruby also offers inject which is the same thing just aliased.

    cleaned_source_list = source_list.inject([]) do |arr, source|
      # … add elements to `arr` as needed.
    end

If we like the code as-is, we could just change this to each.
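As a rough sketch of the two shapes being discussed: the method names below are hypothetical, and the body assumes the cleanup amounts to splitting each entry on newlines and semicolons, which simplifies what the diff actually does (it also emits the warning).

```ruby
# Hypothetical: accumulate with `inject`. The block must return the
# accumulator; Array#concat returns the receiver, so that holds here.
def clean_with_inject(source_list)
  source_list.inject([]) do |arr, source|
    arr.concat(source.split(/[\n;]/))
  end
end

# Hypothetical: `each_with_object` passes the same accumulator to every
# iteration, so the block's return value does not matter.
def clean_with_each_with_object(source_list)
  source_list.each_with_object([]) do |source, arr|
    arr.concat(source.split(/[\n;]/))
  end
end

input = ["https://google.com;script-src", "https://localhost;example.com;"]
p clean_with_inject(input)
p clean_with_each_with_object(input)
```

`each_with_object` may be slightly less error-prone, since forgetting to return the accumulator from an `inject` block is a common mistake.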

JackMc · 2022-08-18T12:20:59Z

lib/secure_headers/headers/content_security_policy.rb

      end
+      cleaned_source_list.select { |value| value != "" }


I can’t remember if this library only works in Rails, but if it does you can use compact_blank here.

Suggested change

- cleaned_source_list.select { |value| value != "" }
+ cleaned_source_list.compact_blank

Otherwise, I would do something like:

Suggested change

- cleaned_source_list.select { |value| value != "" }
+ cleaned_source_list.reject(&:blank?)

Additionally, with the suggestion above returning the array, you could call this on the return value and remove the need for cleaned_source_list.
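Note that `compact_blank` and `blank?` both come from Active Support, so neither is available in plain Ruby. If that dependency cannot be assumed, a sketch using only core methods could look like this (the sample data is made up for illustration):

```ruby
cleaned_source_list = ["example.com", "", "script-src", "  "]

# Plain Ruby: drop empty strings only, equivalent to `value != ""`.
without_empty = cleaned_source_list.reject(&:empty?)

# Plain Ruby: also drop whitespace-only strings, which is closer to what
# Active Support's `blank?` checks for strings.
without_blank = cleaned_source_list.reject { |value| value.strip.empty? }

p without_empty
p without_blank
```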

JackMc · 2022-08-18T12:21:56Z

lib/secure_headers/headers/content_security_policy.rb

@@ -156,17 +174,14 @@ def reject_all_values_if_none(source_list)
    # e.g. *.github.com asdf.github.com becomes *.github.com
    def dedup_source_list(sources)
      sources = sources.uniq
-      wild_sources = sources.select { |source| source =~ STAR_REGEXP }
+      host_source_expressions = sources.map { |source| parse_source_expression(source) }
+      # TODO: Split by source expression type.


Is this TODO meant to be resolved before the PR merges? If not, we should likely document it in a ticket and remove this comment.

JackMc · 2022-08-18T12:23:18Z

lib/secure_headers/headers/content_security_policy.rb

@@ -156,17 +174,14 @@ def reject_all_values_if_none(source_list)
    # e.g. *.github.com asdf.github.com becomes *.github.com
    def dedup_source_list(sources)
      sources = sources.uniq
-      wild_sources = sources.select { |source| source =~ STAR_REGEXP }
+      host_source_expressions = sources.map { |source| parse_source_expression(source) }


If you want to not copy the array several times, you can use the in-place variants of these methods. Generally they end with an !.

Suggested change

- host_source_expressions = sources.map { |source| parse_source_expression(source) }
+ sources.map! { |source| parse_source_expression(source) }

Same for uniq/uniq! and select/select!
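Two caveats apply to the in-place variants, shown in the sketch below with made-up data: `uniq!` and `select!` return `nil` when they change nothing, and all bang variants mutate the array the caller passed in.

```ruby
sources = ["*.github.com", "example.com", "*.github.com"]

# Non-bang variants return a new array and leave the receiver alone.
unique = sources.uniq

# `uniq!` (like `select!`) returns nil when nothing was removed, so its
# return value should not be assigned or chained.
already_unique = ["a.example", "b.example"]
result = already_unique.uniq!

# `map!` always returns the receiver, but it overwrites the caller's array.
lengths = ["ab", "c"]
lengths.map!(&:length)

p unique, result, lengths
```

So `sources = sources.uniq!` would set `sources` to `nil` whenever the list has no duplicates; calling `sources.uniq!` as a standalone statement avoids that.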

JackMc · 2022-08-18T12:26:05Z

lib/secure_headers/headers/content_security_policy.rb

@@ -214,5 +229,9 @@ def strip_source_schemes(source_list)
    def symbol_to_hyphen_case(sym)
      sym.to_s.tr("_", "-")
    end
+
+    def source_scheme(source)
+      source.match(/^([A-Za-z0-9\-\+.]+):\/\//)&.values_at(1)


Is doing this using a regex the most reliable way? I imagine we had a good reason for not using a URI parser.
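For context, a small sketch of why a general URI parser is an awkward fit: Ruby's `URI.parse` rejects a wildcard port, which is valid in a CSP host source. The `uri_parseable?` helper is hypothetical. The sketch also returns the scheme as a `String`, since `MatchData#values_at` in the diff returns an `Array`.

```ruby
require "uri"

# Same character class as the regex in the diff, anchored with \A.
SCHEME_PATTERN = %r{\A([A-Za-z0-9\-+.]+)://}

# Returns the scheme as a String, or nil when there is no "scheme://" prefix.
def source_scheme(source)
  source[SCHEME_PATTERN, 1]
end

# Hypothetical helper: true if Ruby's URI parser accepts the string.
def uri_parseable?(source)
  URI.parse(source)
  true
rescue URI::InvalidURIError
  false
end

p source_scheme("wss://example.com:*")
p source_scheme("'self'")
p uri_parseable?("https://example.com:*")

# The form used in the diff wraps the scheme in an Array.
p "https://example.com".match(/^([A-Za-z0-9\-\+.]+):\/\//)&.values_at(1)
```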

KyFaSt · 2022-08-22T21:45:11Z

spec/lib/secure_headers/headers/content_security_policy_spec.rb

-        expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "google.com;script-src *;.;" which will raise an error in future versions. It has been replaced with a blank space.))
-        expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://*;.;)).value).to eq("frame-ancestors google.com script-src * .")
+        expect(Kernel).to receive(:warn).with(%(frame_ancestors contains a ; in "https://google.com;script-src https://localhost;example.com;" which will raise an error in future versions. It has been replaced with a blank space.))
+        expect(ContentSecurityPolicy.new(frame_ancestors: %w(https://google.com;script-src https://localhost;example.com;)).value).to eq("frame-ancestors google.com script-src localhost example.com")


Does the original test fail on this PR? I hesitate to change the test if functionality under it didn't change.

KyFaSt · 2022-08-22T22:03:26Z

lib/secure_headers/headers/content_security_policy.rb

+      source_list.map do |expression|
+        if expression =~ /(\n|;)/
+          if !semicolon_warned_yet
+            Kernel.warn("#{directive} contains a #{$1} in #{source_list.join(" ").inspect} which will raise an error in future versions. It has been replaced with a blank space.")


This may be out of scope for this PR but this kernel warning message and cleaning of ; has been present since Jan 2020. Maybe the version in which this change is released is the version that raises an error? If this method raised an error, it could reduce the logic we need to verify in test.

This PR removes `dedup_source_list` and replaces it with a simple `.uniq` call. This resolves #491, which is only the latest in a series of ongoing issues with source expression deduplication.

`secure_headers` has had this feature [since 2015](32bb3f5) that [deduplicates redundant URL source expressions](https://github.com/github/secure_headers/blob/494b75ff927464ed8d1c43e98e41fe4d15ce2bdf/lib/secure_headers/headers/content_security_policy.rb#L157-L170). For example, if `*.github.com` is listed as a source expression for a given [directive](https://w3c.github.io/webappsec-csp/#framework-directives), then the addition of `example.github.com` would have no effect, and so the latter can be safely removed by `secure_headers` to save bytes.

Unfortunately, this implementation has had various bugs due to the use of "impedance mismatched" APIs like [`URI`](https://docs.ruby-lang.org/en/2.1.0/URI.html)[^1] and [`File.fnmatch`](https://apidock.com/ruby/v2_5_5/File/fnmatch/class)[^2]. For example, it made incorrect assumptions about source expression schemes, leading to the following series of events:

- 2017-03: A [bug was reported and confirmed](#317)
- 2022-04: The bug was finally [fixed by `@keithamus` (a Hubber) in 2022](#478) due to our use of web sockets.
- 2022-06: This fix in turn triggered a [new bug](#491) with source expressions like `data:`.
- 2022-06: An external contributor [submitted a fix for the new bug](#490), but this still doesn't address some of the "fast and loose" semantic issues of the underlying implementation.
- 2022-08: `@lgarron` [drafted a new implementation](#498) that semantically parses and compares source expressions based on the specification for source expressions.
  - This implementation already proved to have some value in early testing, as its stricter validation caught an issue in `github.com`'s CSP. However, it would take additional work to make this implementation fully aware of CSP syntax (e.g. not allowing URL source expressions in a source directive when only special keywords are allowed, and vice-versa), and it relies on a new regex-based implementation of source expression parsing that may very well lead to more subtle bugs.

In effect, this is a half feature whose maintenance cost has outweighed its functionality:

- The relevant code has suffered from continued bugs, as described above.
- Deduplication is purely a "nice-to-have": it is not necessary for the security or correct functionality of `secure_headers`.
- It was [introduced by `@oreoshake` (the then-maintainer) without explanation in 2015](32bb3f5) and never "officially" documented. We have no concrete data on whether it has any performance impact on any real apps; for all we know, uncached deduplication calculations might even cost more than the saved header bytes.
- Further, in response to the first relevant bug, `@oreoshake` himself [said](#317 (comment)):

  > I've never been a fan of the deduplication based on `*` anyways. Maybe we should just rip that out.
  >
  > Like people trying to save a few bytes can optimize elsewhere.

So this PR completely removes the functionality. If we learn of a use case where this was very important (and the app somehow can't preprocess the list before passing it to `secure_headers`), we can always resume consideration of one of:

- #490
- #498

[^1]: Which allows wildcards in domains but not for ports, as it is not designed to parse URL source expressions.
[^2]: Which has general glob matching that is not designed for URL source expressions either.
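To make the trade-off concrete, here is a minimal sketch of glob-based deduplication next to a plain `.uniq`. The `naive_dedup` method is hypothetical and is not the removed `dedup_source_list`; it only shows how `File.fnmatch` can be "impedance mismatched" with CSP semantics, where `*` does not match scheme sources such as `data:`.

```ruby
# Hypothetical glob-based deduplication: drops any source that is matched by
# another source containing a wildcard.
def naive_dedup(sources)
  sources = sources.uniq
  wildcards = sources.select { |source| source.include?("*") }
  sources.reject do |source|
    wildcards.any? { |wildcard| wildcard != source && File.fnmatch(wildcard, source) }
  end
end

# The intended saving: the specific host is covered by the wildcard.
p naive_dedup(%w[*.github.com example.github.com])

# The mismatch: the glob `*` swallows `data:`, although in CSP `*` does not
# allow the `data:` scheme, so the resulting policy is more restrictive.
p naive_dedup(%w[* data:])

# Plain `.uniq` only removes exact duplicates and never changes the meaning.
p %w[*.github.com example.github.com *.github.com].uniq
```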

lgarron · 2022-10-25T03:57:56Z

Abandoned in favor of #499

machisuji and others added 30 commits August 16, 2022 22:54

fix source dedup breaking with port wildcards

8ec819d

FREEZE.unindexed

48c4256

FREEZE.unindexed

c95f2c1

FREEZE.unindexed

c0b1d6b

FREEZE.unindexed

702ddd5

FREEZE.unindexed

4df132a

FREEZE.unindexed

107be10

FREEZE.unindexed

ffbdbbb

FREEZE.unindexed

e7be2a9

FREEZE.indexed

1d1f90b

FREEZE.unindexed

3455694

FREEZE.unindexed

c81f72c

FREEZE.unindexed

4c04e48

FREEZE.unindexed

9cfb165

Update test to handle stronger deduplication.

749b4ec

FREEZE.unindexed

9d6a280

FREEZE.indexed

ca8b9a0

FREEZE.unindexed

3681ba5

FREEZE.unindexed

11ff889

FREEZE.unindexed

c6e9369

FREEZE.indexed

4e2fb9f

FREEZE.indexed

fd94e94

FREEZE.unindexed

87626fb

FREEZE.unindexed

7a6b908

FREEZE.unindexed

aff76de

FREEZE.indexed

74f6e23

Remove quick 'n' dirty test code.

1a1c607

Fix quoted expression regex.

9babd31

Fix typo.

5511b66

Remove all empty expressions.

801a002

lgarron mentioned this pull request Aug 17, 2022

fix source dedup breaking with port wildcards #490

Closed

lgarron commented Aug 17, 2022

Format.

60dd23f

lgarron force-pushed the parse-source-expressions branch from f9ebbfb to 60dd23f Compare August 17, 2022 19:16

lgarron marked this pull request as ready for review August 17, 2022 19:18

lgarron requested review from vcsjones and JackMc August 17, 2022 20:04

JackMc reviewed Aug 18, 2022

KyFaSt reviewed Aug 22, 2022

JackMc assigned lgarron Oct 18, 2022

lgarron mentioned this pull request Oct 19, 2022

Remove source expression deduplication. #499

Merged

lgarron closed this Oct 25, 2022
