autolink when Goodreads breaks the URL #82

Closed
mirabilos opened this issue Jan 7, 2023 · 6 comments

@mirabilos

Goodreads’ RSS feeds hide parts of the link: […]URL: <a href="https://archiveofourown.org/works/21085382" rel="nofollow noopener" target="_blank">https://archiveofourown.org/works/210...</a><br />[…]

Fix:

diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index e15ecd4..36e15e7 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -221,11 +221,17 @@ class MarkdownConverter(object):
         title = el.get('title')
         # For the replacement see #29: text nodes underscores are escaped
         if (self.options['autolinks']
-                and text.replace(r'\_', '_') == href
                 and not title
                 and not self.options['default_title']):
-            # Shortcut syntax
-            return '<%s>' % href
+            rtext = text.replace(r'\_', '_')
+            if rtext.endswith('...') and rtext.startswith('http'):
+                # Goodreads-shortened link?
+                if href.startswith(rtext.rstrip('.')):
+                    # force match
+                    rtext = href
+            if rtext == href:
+                # Shortcut syntax
+                return '<%s>' % href
         if self.options['default_title'] and not title:
             title = href
         title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
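
Pulled out of the converter, the check this patch adds could be sketched as a standalone function. This is only an illustration of the matching rule, not part of the patch; the name `expand_truncated` is hypothetical.

```python
def expand_truncated(text, href):
    """Return href if text looks like a '...'-shortened prefix of it, else text.

    Hypothetical helper mirroring the rule in the patch: the visible text
    must start with 'http', end with '...', and be a prefix of the href
    once the trailing dots are removed.
    """
    if text.startswith('http') and text.endswith('...'):
        if href.startswith(text.rstrip('.')):
            return href
    return text


full = 'https://archiveofourown.org/works/21085382'
print(expand_truncated('https://archiveofourown.org/works/210...', full))
print(expand_truncated('read more...', full))
```

Requiring the prefix match keeps ordinary link text that merely ends in dots from being rewritten.
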
@chrispy-snps
Collaborator

chrispy-snps commented Jan 14, 2024

@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How about preprocessing the HTML in Beautiful Soup to replace truncated link text with the @href value:

import re
from bs4 import BeautifulSoup
html = """
<p>Link: <a href="https://archiveofourown.org/works/21085382"
     rel="nofollow noopener"
     target="_blank">https://archiveofourown.org/works/210...</a></p>
"""
soup = BeautifulSoup(html, 'lxml')

# replace link text ending in "..." with the full href
for a in soup.find_all('a', href=True, string=re.compile(r'\.\.\.$')):
    a.string = a['href']

then converting the soup object:

from markdownify import MarkdownConverter
def md(soup, **options):
    return MarkdownConverter(**options).convert_soup(soup)

which should give you the autolinks you want:

>>> print(md(soup))
Link: <https://archiveofourown.org/works/21085382>

@mirabilos
Author

mirabilos commented Jan 14, 2024 via email

@mirabilos
Author

Not exactly, but…

_cleanup_traildots = re.compile('\\.\\.\\.$')
[…]
for e in html.find_all('a', href=True, string=_cleanup_traildots):
    href = str(e['href'])
    if href.startswith(str(e.string).rstrip('.')):
        e.string.replace_with(href)

… will do. (The docs say to never assign directly to .string for example.)

mirabilos closed this as not planned on Jan 29, 2024
@chrispy-snps
Collaborator

@mirabilos - nice solution! I like the use of multiple filter criteria.

In my own code, I use raw strings for regular expressions to simplify escaping (r'\.\.\.$'), but regular strings work too.

@matthewwithanm
Owner

Don't forget that subclassing is always an option!

@mirabilos
Author

mirabilos commented Jan 31, 2024 via email
