autolink when Goodreads breaks the URL #82

mirabilos · 2023-01-07T15:32:49Z

Goodreads’ RSS feeds hide parts of the link: […]URL: <a href="https://archiveofourown.org/works/21085382" rel="nofollow noopener" target="_blank">https://archiveofourown.org/works/210...</a><br />[…]

Fix:

diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index e15ecd4..36e15e7 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -221,11 +221,17 @@ class MarkdownConverter(object):
         title = el.get('title')
         # For the replacement see #29: text nodes underscores are escaped
         if (self.options['autolinks']
-                and text.replace(r'\_', '_') == href
                 and not title
                 and not self.options['default_title']):
-            # Shortcut syntax
-            return '<%s>' % href
+            rtext = text.replace(r'\_', '_')
+            if rtext.endswith('...') and rtext.startswith('http'):
+                # Goodreads-shortened link?
+                if href.startswith(rtext.rstrip('.')):
+                    # force match
+                    rtext = href
+            if rtext == href:
+                # Shortcut syntax
+                return '<%s>' % href
         if self.options['default_title'] and not title:
             title = href
         title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
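
For readers who do not want to trace the diff, the changed condition can be sketched as a standalone function. This is only an illustration of the check in the patch above; the name `matches_for_autolink` is hypothetical and is not part of markdownify.

```python
def matches_for_autolink(text, href):
    """Return True if the link text should count as equal to href."""
    # Undo the underscore escaping applied to text nodes (see #29).
    rtext = text.replace(r'\_', '_')
    # Possibly truncated link text: looks like a URL and ends in three dots.
    if rtext.startswith('http') and rtext.endswith('...'):
        # Accept it only if the visible part is a prefix of the real target.
        if href.startswith(rtext.rstrip('.')):
            return True
    # Otherwise fall back to the original exact comparison.
    return rtext == href
```

Note that `rstrip('.')` removes every trailing dot, not only three, so link text that was cut off right after a dot is still compared by its remaining prefix.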

chrispy-snps · 2024-01-14T16:48:23Z

@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How about preprocessing the HTML in Beautiful Soup to replace truncated link text with the @href value:

import re

from bs4 import BeautifulSoup

html = """
<p>Link: <a href="https://archiveofourown.org/works/21085382"
     rel="nofollow noopener"
     target="_blank">https://archiveofourown.org/works/210...</a></p>
"""
soup = BeautifulSoup(html, 'lxml')

# Replace truncated link text (ending in "...") with the full href value.
for a in soup.find_all('a', href=True, string=re.compile(r'\.\.\.$')):
    a.string = a['href']

then converting the soup object:

from markdownify import MarkdownConverter
def md(soup, **options):
    return MarkdownConverter(**options).convert_soup(soup)

which should give you the autolinks you want:

>>> print(md(soup))
Link: <https://archiveofourown.org/works/21085382>

mirabilos · 2024-01-14T19:56:14Z

Chris Papademetrious dixit:

> @mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content.

But the package already has an autolinks feature, and this fits in well and would help others with this problem (e.g. some Fedi instances also trim links like that). It also has much less overhead.

> How about preprocessing the HTML in Beautiful Soup:

[…] Thanks for giving this sample code (I haven’t worked with bs4 myself yet). If you’re not merging this (if the arguments above didn’t persuade you) but are merging #92, my only other diff, I might go that way, so that I have no local diff.

bye, //mirabilos
--
"Using Lynx is like wearing a really good pair of shades: cuts out the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL." -- Henry Nelson, March 1999

mirabilos · 2024-01-29T02:29:49Z

Not exactly, but…

_cleanup_traildots = re.compile('\\.\\.\\.$')
[…]
for e in html.find_all('a', href=True, string=_cleanup_traildots):
    href = str(e['href'])
    if href.startswith(str(e.string).rstrip('.')):
        e.string.replace_with(href)

… will do. (The docs say to never assign directly to .string for example.)

chrispy-snps · 2024-01-31T10:16:30Z

@mirabilos - nice solution! I like the use of multiple filter criteria.

In my own code, I use raw strings for regex expressions to simplify escaping (r'\.\.\.$') but regular strings work too.
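
As a small illustration of the point about raw strings (variable names here are only for the example), both spellings produce the same pattern, so the choice is purely about which is easier to read and type:

```python
import re

raw = r'\.\.\.$'        # raw string: backslashes are kept literally
regular = '\\.\\.\\.$'  # regular string: each backslash must be doubled

# Both spellings yield the same seven-character pattern text.
same_pattern = raw == regular

# Either one compiles to a regex matching three literal dots at the end.
trailing_dots = re.compile(raw)
is_truncated = bool(trailing_dots.search('https://archiveofourown.org/works/210...'))
```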

matthewwithanm · 2024-01-31T15:08:19Z

Don't forget that subclassing is always an option!

mirabilos · 2024-01-31T23:14:47Z

Chris Papademetrious dixit:

> @mirabilos - nice solution! I like the use of multiple filter criteria.

Thanks!

> In my own code, I use raw strings for regex expressions to simplify escaping (`r'\.\.\.$'`) but regular strings work too.

I don’t use Python/py3k raw strings because they make escaping more complicated (e.g. it is impossible to write a single quote), and writing strings for Python/py3k is too hard already anyway, compared with shell, and I’m used to nesting levels of escaping. (Maybe it is visible that I don’t program much in py3k…)

bye, //mirabilos
--
“Cool, /usr/share/doc/mksh/examples/uhr.gz is a reason to install mksh on every system.” -- XTaran at OpenRheinRuhr, thoroughly delighted

mirabilos mentioned this issue Jan 7, 2023

Newlines lead to ugly output edsu/feediverse#36 (Open)

mirabilos closed this as not planned · Jan 29, 2024
