Filter or remove rules to filter/remove by regexp/wildcard #423

Flashwalker · 2023-01-14T04:41:07Z

Can we have filter or remove rules that match by regexp or wildcard?

E.g.:

1.

Zero-width space and/or non-breaking space:
<a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML with zero visual content?
Something like:

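turndownService.addRule('al_spaces', {
    // regexFilter is the option proposed in this issue, not an existing Turndown option.
    // JavaScript regexes have no [[:space:]] class, so \s plus the zero width space is used instead.
    regexFilter: /<[^<>]+?>[\s\u200B]+<\/[^<>]+?>/g,
    replacement: function (content) {
        return ''
    }
})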

List of spaces for reference:

Number	Character name
\u0020	space
\u00A0	no-break space
\u1680	Ogham space mark
\u180E	Mongolian vowel separator
\u2000	en quad
\u2001	em quad
\u2002	en space (nut)
\u2003	em space (mutton)
\u2004	three-per-em space (thick space)
\u2005	four-per-em space (mid space)
\u2006	six-per-em space
\u2007	figure space
\u2008	punctuation space
\u2009	thin space
\u200A	hair space
\u200B	zero width space
\u202F	narrow no-break space
\u205F	medium mathematical space
\u3000	ideographic space
\uFEFF	zero width no-break space
\uFFFC	object replacement character
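
For comparison, something close to this can probably be done with the existing function form of `filter`, without a new `regexFilter` option. The snippet below is only a sketch: `isVisuallyEmpty` and `dropEmptyAnchors` are made-up names, the character class is copied from the list above, and it assumes the node passed to `filter` exposes `nodeName`, `textContent` and `querySelector` like a regular DOM element.

```javascript
// Characters from the list above that render as nothing visible.
const INVISIBLE = /^[\u0020\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\uFFFC]*$/;

// Hypothetical helper: true when the element has no visible text and
// contains no media element that would still render.
function isVisuallyEmpty(node) {
  const hasMedia = typeof node.querySelector === 'function' &&
    node.querySelector('img, video, audio, svg, iframe') !== null;
  return !hasMedia && INVISIBLE.test(node.textContent || '');
}

// Hypothetical rule object in the { filter, replacement } shape used by addRule.
const dropEmptyAnchors = {
  filter: (node) => node.nodeName === 'A' && isVisuallyEmpty(node),
  replacement: () => ''
};

// Usage, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('dropEmptyAnchors', dropEmptyAnchors);
```

This only covers anchors. Widening the `nodeName` check would cover other inline elements, at the risk of dropping elements that are empty on purpose.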

2.

Line break that breaks the Markdown markup:
<strong>bla-bla-bla<br></strong> <br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede a closing tag?
Something like:

turndownService.removeAllBefore('<br>', '</*>')

Here are some regex examples:

Remove an anchor that contains only zero-width spaces (you can't see them until you paste the string into the dev console):

selectedHTML='<i>bla</i><b><a href="https://bla-bla-bla"></a>text-text-text</b><i>bla</i>'
selectedHTML.replace(/<[^<>]+?>[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\u0020\uFFFC]+<\/[^<>]+?>/gm, '')

Remove the line break that precedes a closing tag:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')

Swap the closing tag with the line break that precedes it:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/((<br ?\/?>)+)(<\/[^<>]+?>)/gi, '$3$1')
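
One pass of the swap only moves the break past a single closing tag, so with nested tags such as `<b><i>x<br></i></b>` the break still ends up inside the outer element. A rough sketch that repeats the swap until nothing changes (`hoistTrailingBreaks` is a made-up name, and the pattern is the same one as above with a non-capturing inner group):

```javascript
// Hypothetical helper: move runs of <br> out of every element they close,
// repeating the swap until the string stops changing.
function hoistTrailingBreaks(html) {
  const pattern = /((?:<br ?\/?>)+)(<\/[^<>]+?>)/gi;
  let previous;
  do {
    previous = html;
    html = html.replace(pattern, '$2$1');
  } while (html !== previous);
  return html;
}

console.log(hoistTrailingBreaks('<b><i>x<br></i></b>'));
// <b><i>x</i></b><br>
```

Each pass can only move a break to the right past a closing tag, so the loop stops after at most as many passes as there are nested closing tags.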

It would be nice if the regex filter skipped the contents of code and pre tags.
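
As a preprocessing workaround, the skipping could look roughly like this: cut the HTML into `pre`/`code` segments and everything else, and run the replacement only on the "everything else" parts. This is a sketch, not a proper HTML parser. `replaceOutsideCode` is a made-up name, and it assumes `pre` and `code` elements are not nested inside elements of the same name.

```javascript
// Hypothetical helper: apply a regex replacement to HTML, leaving the
// contents of <pre> and <code> elements untouched.
function replaceOutsideCode(html, pattern, replacement) {
  // Whole <pre>...</pre> or <code>...</code> elements (no same-name nesting).
  const PROTECTED = /<(pre|code)\b[^>]*>[\s\S]*?<\/\1\s*>/gi;
  let result = '';
  let last = 0;
  for (const match of html.matchAll(PROTECTED)) {
    // Text before the protected element gets the replacement applied.
    result += html.slice(last, match.index).replace(pattern, replacement);
    // The protected element itself is copied through unchanged.
    result += match[0];
    last = match.index + match[0].length;
  }
  return result + html.slice(last).replace(pattern, replacement);
}

const input = '<strong>a<br></strong><pre>x<br></pre>';
console.log(replaceOutsideCode(input, /(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2'));
// <strong>a</strong><pre>x<br></pre>
```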

P.S.
And also:

// Drop anchor tags that contain only dots or commas
selectedHTML = '<a href="#">,</a>'
selectedHTML.replace(/<a [^<>]+?>[.,]+<\/a>/gim, '')

And

// Drop emoji images, keep emoji unicode (from alt attr)
selectedHTML = '<img src="img-apple-64/1f914.png" class="emoji" alt="🤔">'
selectedHTML.replace(/<img [^<>]+?alt=['"]([\p{Emoji}\u200d]+)['"][^<>]*?\/?>/gimu, '$1')
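
One caveat with `\p{Emoji}`: the property also covers the ASCII digits, `#` and `*`, so an image with `alt="1"` would be replaced too. A stricter variant, as a sketch only (`EMOJI_IMG` and `unwrapEmojiImages` are made-up names, and the set of allowed characters is an assumption about what the alt text can contain):

```javascript
// Pictographic characters, skin tone modifiers, zero width joiner and
// variation selector-16; digits, # and * are deliberately not included.
const EMOJI_IMG = /<img [^<>]+?alt=['"]((?:\p{Extended_Pictographic}|\p{Emoji_Modifier}|\u200d|\ufe0f)+)['"][^<>]*?\/?>/giu;

// Hypothetical helper: replace emoji images with their alt text.
function unwrapEmojiImages(html) {
  return html.replace(EMOJI_IMG, '$1');
}

console.log(unwrapEmojiImages('<img src="img-apple-64/1f914.png" class="emoji" alt="🤔">'));
// 🤔
console.log(unwrapEmojiImages('<img src="one.png" alt="1">'));
// <img src="one.png" alt="1">
```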


Flashwalker mentioned this issue Jan 14, 2023

Span rules + br can break commonmark standard #405

Open