Convert :whitespace_elements into a Hash. #94

Closed
wants to merge 1 commit into from
13 changes: 6 additions & 7 deletions README.md
@@ -209,15 +209,14 @@ traversal. See the Transformers section below for details.
Custom transformer or array of custom transformers to run using breadth-first
traversal. See the Transformers section below for details.

#### :whitespace_elements (Array)
#### :whitespace_elements (Hash)

Array of lowercase element names that should be replaced with whitespace when
removed in order to preserve readability. For example,
`foo<div>bar</div>baz` will become
`foo bar baz` when the `<div>` is removed.
Hash mapping lowercase element names to the `:before` and `:after` strings
inserted in their place when they are removed, in order to preserve
readability. For example, with the default single-space values,
`foo<div>bar</div>baz` will become `foo bar baz` when the `<div>` is removed.

By default, the following elements are included in the
`:whitespace_elements` array:
By default, the following elements (as keys) are included in the
`:whitespace_elements` hash:

```
address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
39 changes: 31 additions & 8 deletions lib/sanitize/config.rb
@@ -73,14 +73,37 @@ module Config
:transformers_breadth => [],

# Elements which, when removed, should have their contents surrounded by
# space characters to preserve readability. For example,
# `foo<div>bar</div>baz` will become 'foo bar baz' when the <div> is
# removed.
:whitespace_elements => %w[
address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
h6 header hgroup hr li nav ol p pre section ul
]

# values specified with `before` and `after` keys to preserve readability.
# For example, `foo<div>bar</div>baz` will become 'foo bar baz' when the
# <div> is removed.
:whitespace_elements => {
'address' => { :before => ' ', :after => ' ' },
'article' => { :before => ' ', :after => ' ' },
'aside' => { :before => ' ', :after => ' ' },
'blockquote' => { :before => ' ', :after => ' ' },
'br' => { :before => ' ', :after => ' ' },
'dd' => { :before => ' ', :after => ' ' },
'div' => { :before => ' ', :after => ' ' },
'dl' => { :before => ' ', :after => ' ' },
'dt' => { :before => ' ', :after => ' ' },
'footer' => { :before => ' ', :after => ' ' },
'h1' => { :before => ' ', :after => ' ' },
'h2' => { :before => ' ', :after => ' ' },
'h3' => { :before => ' ', :after => ' ' },
'h4' => { :before => ' ', :after => ' ' },
'h5' => { :before => ' ', :after => ' ' },
'h6' => { :before => ' ', :after => ' ' },
'header' => { :before => ' ', :after => ' ' },
'hgroup' => { :before => ' ', :after => ' ' },
'hr' => { :before => ' ', :after => ' ' },
'li' => { :before => ' ', :after => ' ' },
'nav' => { :before => ' ', :after => ' ' },
'ol' => { :before => ' ', :after => ' ' },
'p' => { :before => ' ', :after => ' ' },
'pre' => { :before => ' ', :after => ' ' },
'section' => { :before => ' ', :after => ' ' },
'ul' => { :before => ' ', :after => ' ' }
}
}
end
end
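
The default Hash above repeats the same `:before`/`:after` pair for every element. As a minimal sketch (not part of this patch), the same structure could be generated from the original word list. The names `WHITESPACE_ELEMENT_NAMES` and `default_whitespace_elements` are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper, not part of the patch: builds the default
# :whitespace_elements Hash from the original list of element names.
WHITESPACE_ELEMENT_NAMES = %w[
  address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
  h6 header hgroup hr li nav ol p pre section ul
].freeze

def default_whitespace_elements(names = WHITESPACE_ELEMENT_NAMES)
  names.each_with_object({}) do |name, hash|
    # Each element gets its own Hash so that changing one entry later
    # does not affect the others.
    hash[name] = { :before => ' ', :after => ' ' }
  end
end
```

Generating the Hash keeps the element list readable in one place, at the cost of making each element's values less visible in the config file itself.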
15 changes: 12 additions & 3 deletions lib/sanitize/transformers/clean_element.rb
@@ -11,7 +11,16 @@ def initialize(config)
@protocols = config[:protocols]
@remove_all_contents = false
@remove_element_contents = Set.new
@whitespace_elements = Set.new(config[:whitespace_elements])
@whitespace_elements = Hash.new

# Convert a legacy :whitespace_elements Array into a Hash for backwards
# compatibility.
if config[:whitespace_elements].is_a?(Array)
config[:whitespace_elements].each do |element|
@whitespace_elements[element] = { :before => ' ', :after => ' ' }
end
else
@whitespace_elements = config[:whitespace_elements]
end

if config[:remove_contents].is_a?(Array)
@remove_element_contents.merge(config[:remove_contents].map(&:to_s))
@@ -31,10 +40,10 @@ def call(env)
# Elements like br, div, p, etc. need to be replaced with whitespace in
# order to preserve readability.
if @whitespace_elements.include?(name)
node.add_previous_sibling(Nokogiri::XML::Text.new(' ', node.document))
node.add_previous_sibling(Nokogiri::XML::Text.new(@whitespace_elements[name][:before].to_s, node.document))

unless node.children.empty?
node.add_next_sibling(Nokogiri::XML::Text.new(' ', node.document))
node.add_next_sibling(Nokogiri::XML::Text.new(@whitespace_elements[name][:after].to_s, node.document))
end
end

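
The Array-to-Hash conversion in `initialize` can be sketched as a standalone function. This is an illustration under stated assumptions, not the patch's code: `normalize_whitespace_elements` and `DEFAULT_PADDING` are hypothetical names, and the `nil` and error branches are additions the patch does not include (the patch assigns any non-Array value as-is).

```ruby
# Hypothetical helper (not in the patch): normalizes a :whitespace_elements
# setting to a Hash of name => { :before => ..., :after => ... }.
DEFAULT_PADDING = { :before => ' ', :after => ' ' }.freeze

def normalize_whitespace_elements(value)
  case value
  when nil
    {} # assumption: an unset option means "pad nothing"
  when Hash
    value # already in the new format; returned unchanged
  when Array
    # Legacy format: every listed element is padded with single spaces.
    value.each_with_object({}) do |name, hash|
      hash[name.to_s] = DEFAULT_PADDING.dup
    end
  else
    raise ArgumentError, ':whitespace_elements must be an Array or a Hash'
  end
end
```

Handling `nil` explicitly avoids calling `include?` on `nil` later in `call`, which the old `Set.new(config[:whitespace_elements])` tolerated.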
28 changes: 28 additions & 0 deletions test/test_sanitize.rb
@@ -421,6 +421,21 @@
Sanitize.clean('<b data-éfoo="valid"></b>', config)
.must_equal('<b></b>') # Another annoying Nokogiri quirk.
end

it 'should replace whitespace_elements with configured :before and :after values' do
config = {
:whitespace_elements => {
'p' => { :before => "\n", :after => "\n" },
'div' => { :before => "\n", :after => "\n" },
'br' => { :before => "\n", :after => "\n" },
}
}

Sanitize.clean('<p>foo</p>', config).must_equal("\nfoo\n")
Sanitize.clean('<p>foo</p><p>bar</p>', config).must_equal("\nfoo\n\nbar\n")
Sanitize.clean('foo<div>bar</div>baz', config).must_equal("foo\nbar\nbaz")
Sanitize.clean('foo<br>bar<br>baz', config).must_equal("foo\nbar\nbaz")
end
end

describe 'Sanitize.clean' do
@@ -645,3 +660,16 @@
Sanitize.clean!('foo <style>bar').must_equal('foo bar')
end
end

describe 'backwards compatibility' do
it 'should work with legacy :whitespace_elements Arrays' do
config = {
:whitespace_elements => %w[p div br]
}

Sanitize.clean('<p>foo</p>', config).must_equal(' foo ')
Sanitize.clean('<p>foo</p><p>bar</p>', config).must_equal(' foo bar ')
Sanitize.clean('foo<div>bar</div>baz', config).must_equal('foo bar baz')
Sanitize.clean('foo<br>bar<br>baz', config).must_equal('foo bar baz')
end
end
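
The expectations in these tests follow from one padding rule: insert `:before` ahead of the removed element, and insert `:after` only when the element has children. The sketch below models that rule on plain strings instead of Nokogiri nodes. `pad_removed_element` is a hypothetical name, and treating empty inner text as "no children" is a simplification of `node.children.empty?`.

```ruby
# Simplified model of the padding rule; plain strings stand in for nodes.
def pad_removed_element(name, inner_text, whitespace_elements)
  padding = whitespace_elements[name]
  return inner_text if padding.nil? # not a whitespace element: no padding

  before = padding[:before].to_s # a missing key becomes ''
  # Mirrors `unless node.children.empty?`: no trailing padding for
  # childless elements such as <br>.
  after = inner_text.empty? ? '' : padding[:after].to_s
  before + inner_text + after
end

newlines = {
  'p'  => { :before => "\n", :after => "\n" },
  'br' => { :before => "\n", :after => "\n" }
}

pad_removed_element('p', 'foo', newlines)               # => "\nfoo\n"
'foo' + pad_removed_element('br', '', newlines) + 'bar' # => "foo\nbar"
pad_removed_element('b', 'foo', newlines)               # => "foo"
```

This explains why `foo<br>bar<br>baz` yields a single separator per `<br>`, while `<p>foo</p><p>bar</p>` yields two between the paragraphs.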