"Clean" URLs without ending slash cause redirects to "Cleaner" URLs with ending slash #81

Open
forbytten opened this issue Mar 9, 2025 · 2 comments

Comments

@forbytten

Hi,

Soupault generates clean URLs without a trailing slash, resulting in a redirect for each requested URL from, for example, "/blog" to "/blog/". Both https://soupault.app and https://baturin.org contain URLs of the former type, as does the blueprints blog.

Impact:

  1. Potential performance impact from unnecessary requests.
  2. Potential SEO penalty.

Details

The reference manual (https://soupault.app/reference-manual/#clean-urls) states:

Soupault uses clean URLs by default. If you add a page to site/, for example, site/about.html, it will turn into build/about/index.html so that it can be accessed as https://mysite.example.com/about.

That statement seems to imply that a request for https://mysite.example.com/about should resolve directly to https://mysite.example.com/about/index.html. Both https://soupault.app and the blueprints blog contain links without a trailing slash, such as the https://soupault.app/blog link in the header.
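The mapping described in the manual can be sketched roughly as follows. This is only an illustration of the path arithmetic, not Soupault's actual implementation: the function name clean_url_paths is hypothetical, and the handling of index pages is an assumption.

```python
from pathlib import PurePosixPath


def clean_url_paths(source, site_dir="site", build_dir="build"):
    """Return (output file, URL with trailing slash) for a source page.

    Hypothetical helper for illustration; not Soupault's own code.
    """
    rel = PurePosixPath(source).relative_to(site_dir)
    if rel.stem == "index":
        # Assumption: index pages stay in their own directory.
        page_dir = rel.parent
    else:
        # about.html becomes about/index.html
        page_dir = rel.parent / rel.stem
    output = PurePosixPath(build_dir) / page_dir / "index.html"
    url = "/" if page_dir == PurePosixPath(".") else f"/{page_dir}/"
    return str(output), url


print(clean_url_paths("site/about.html"))
```

The URL returned here ends in a slash because that is the form a typical web server treats as canonical for a directory containing index.html.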

However, those URLs cause a redirect in a number of web servers, including those used by:

  1. https://soupault.app
  2. https://baturin.org
  3. GitLab Pages sites, the provider I am using.

I have attached screenshots from different tools showing how https://soupault.app/blog is being redirected to https://soupault.app/blog/. I have used multiple tools because it was confusing at first, as the response body of the 301 Moved Permanently is actually the same response body as the subsequent 200 OK response.

  1. Burp Suite Community Edition: the 301 redirect response is 10247 bytes and appears to contain a gzipped body, which Burp does not decompress because the status is 301. The subsequent 200 OK response is 44185 bytes after Burp decompresses it.

  2. Chromium seems to throw away the 301 redirect body so the 301 response is only 226 B.

  3. Firefox seems to download both responses, so they are both about 10 kB.

  4. The redirect can also be seen with curl:

    1. curl with explicit --no-location option so it does not follow the redirect:

      curl -i --no-location --no-progress-meter https://soupault.app/blog |head -20
      HTTP/2 301
      accept-ranges: bytes
      age: 1004
      cache-control: public,max-age=0,must-revalidate
      cache-status: "Netlify Edge"; hit
      content-type: text/html; charset=UTF-8
      date: Sun, 09 Mar 2025 09:53:19 GMT
      etag: "122509aae14b19098ca372918952081b-ssl"
      location: /blog/
      server: Netlify
      strict-transport-security: max-age=31536000
      x-nf-request-id: 01JNX55GXJ0T9XZ9PREDS2HFQT
      content-length: 43771
      
      <!DOCTYPE html>
      <html lang="en">
       <head>
        <meta name="generator" content="soupault 4.11.0">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
      
    2. curl with --location option so it does follow the redirect. The -i option makes it
      show the response headers for both the original request and the second request:

      curl -i --location --no-progress-meter https://soupault.app/blog |head -30
       HTTP/2 301
       accept-ranges: bytes
       age: 1067
       cache-control: public,max-age=0,must-revalidate
       cache-status: "Netlify Edge"; hit
       content-type: text/html; charset=UTF-8
       date: Sun, 09 Mar 2025 09:54:22 GMT
       etag: "122509aae14b19098ca372918952081b-ssl"
       location: /blog/
       server: Netlify
       strict-transport-security: max-age=31536000
       x-nf-request-id: 01JNX57E809J8AEDZV6TD5M1E0
       content-length: 43771
      
       HTTP/2 200
       accept-ranges: bytes
       age: 0
       cache-control: public,max-age=0,must-revalidate
       cache-status: "Netlify Edge"; fwd=miss
       content-type: text/html; charset=UTF-8
       date: Sun, 09 Mar 2025 09:54:22 GMT
       etag: "122509aae14b19098ca372918952081b-ssl"
       server: Netlify
       strict-transport-security: max-age=31536000
       x-nf-request-id: 01JNX57E9K6KJ8XJ535SV68ZPV
       content-length: 43771
      
       <!DOCTYPE html>
       <html lang="en">
        <head>
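To check many URLs at once, the status and location headers from output like the above could be classified with a small helper. This is a minimal sketch: the function name trailing_slash_redirect and the grouping of status codes into permanent and temporary are my own choices for illustration.

```python
from urllib.parse import urlsplit

PERMANENT = {301, 308}
TEMPORARY = {302, 303, 307}


def trailing_slash_redirect(requested_url, status, location):
    """Return "permanent" or "temporary" if the response merely appends
    a trailing slash to the requested path, otherwise None."""
    if status not in PERMANENT | TEMPORARY or not location:
        return None
    requested_path = urlsplit(requested_url).path
    target_path = urlsplit(location).path
    if target_path != requested_path + "/":
        return None
    return "permanent" if status in PERMANENT else "temporary"


# Values taken from the first curl response above.
print(trailing_slash_redirect("https://soupault.app/blog", 301, "/blog/"))
```

A 302 response with the same location header would be reported as "temporary", which matches the GitLab Pages behavior described in this issue.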
      

Impact

  1. Potential performance impact from unnecessary requests, the degree of impact depending on the end user's environment, the weight of the requested page and the behavior and configuration of the site's webserver.

  2. Potential SEO penalty. I'm unsure of this because SEO is confusing but:

    1. Google's Search Console explicitly complained about the redirects on two of my URLs. I don't know why it didn't complain about the others, but I have generally had trouble with Google not indexing a number of pages. Notably, GitLab Pages, which I use, responds with a 302 Found redirect instead of a 301 Moved Permanently, and Google describes the former as only a weak signal that the redirect target should be canonical: https://developers.google.com/search/docs/crawling-indexing/301-redirects

    2. Some other SEO information on the internet also seems to suggest that 302 Found should only be used for temporary redirects, as it will not pass SEO value to the target URL.

Potential Remediation

I was able to fix the issue in my blog without needing any changes to Soupault itself. I have attached the equivalent diff for the blueprint blog, generated with git diff > clean-url-issue/soupault-blueprints-blog.patch and updated to use the latest Soupault, 4.11.0. A fix in Soupault itself might be better, but I am unsure. Some potential schools of thought:

  1. Some URLs are generated by Soupault's indexing and used in templates, such as {{e.url}} and {{t.url}}, so it seems reasonable to expect those URLs to already have a trailing slash?

  2. Changing Soupault's generated URLs may break existing blogs, resulting in two slashes being added.

  3. If Soupault's generated URLs were changed to end in a slash, then the code in the blueprint blog would need a bit more care to edit with two conventions to be followed:

    1. If using a URL generated by Soupault, don't add a trailing slash in the HTML.
    2. Otherwise, add a trailing slash in the HTML.
      In contrast, my attached patch has a simpler rule: every HTML href should end in an explicit slash.
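The "every href ends in a slash" rule, and the double-slash concern from point 2, can be illustrated with a normalizer that only appends a slash when one is missing. This is a sketch: with_trailing_slash is a hypothetical helper, and treating any path with a file extension as a file rather than a page is an assumption that would misfire on page names containing a dot.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(href):
    """Append a trailing slash to page-like URLs, leaving others untouched."""
    parts = urlsplit(href)
    path = parts.path
    if not path or path.endswith("/"):
        # Already canonical, or a bare fragment/query such as "#top".
        return href
    if posixpath.splitext(path)[1]:
        # Assumption: an extension means a file such as /atom.xml.
        return href
    return urlunsplit(parts._replace(path=path + "/"))


for href in ["/blog", "/blog/", "/blog/tag/soupault", "/atom.xml"]:
    print(href, "->", with_trailing_slash(href))
```

Because the function is idempotent, applying it to a URL that already ends in a slash leaves it alone, so it would not produce the two-slash result mentioned above.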

The issue can also sometimes be addressed through web server configuration, but not all providers allow that level of access; GitLab Pages, for example, does not.

If you think changing the blueprint blog is the way to go, I would be happy to raise a pull request with my changes.

Attachments

Burp:

Image

Chromium:

Image

Firefox:

Image

Blueprint blog patch:

diff --git a/helpers/blog-index.lua b/helpers/blog-index.lua
index 0f4f20b..7a73458 100644
--- a/helpers/blog-index.lua
+++ b/helpers/blog-index.lua
@@ -112,7 +112,7 @@ local template = [[
 <h1>Posts by tag</h1>
 <ul>
 {% for t in tag_links %}
-  <li> <a href="{{t.url}}">{{t.title}}</a> </li>
+  <li> <a href="{{t.url}}/">{{t.title}}</a> </li>
 {% endfor %}
 </ul>
 ]]
diff --git a/netlify.sh b/netlify.sh
index 2d3d9f6..50b4d5d 100755
--- a/netlify.sh
+++ b/netlify.sh
@@ -1,6 +1,6 @@
 #!/bin/sh
 
-SOUPAULT_VERSION="4.0.0-beta1"
+SOUPAULT_VERSION="4.11.0"
 
 if [ -z "${SOUPAULT_VERSION}" ]; then
     echo "Error: soupault version is undefined, cannot decide what to download"
diff --git a/soupault.toml b/soupault.toml
index c5f62d6..1df50d7 100644
--- a/soupault.toml
+++ b/soupault.toml
@@ -99,19 +99,19 @@
   # Jingoo template for rendering extracted metadata
   index_template = """
     {% for e in entries %}
-    <h2><a href="{{e.url}}">{{e.title}}</a></h2>
+    <h2><a href="{{e.url}}/">{{e.title}}</a></h2>
     <div><strong>Last update:</strong> {{e.date}}</div>
     {% if e.tags %}
     <div class="post-tags">
        <strong>Tags: </strong>
        {%- for t in e.tags -%}
-         <a href="/blog/tag/{{t}}"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
+         <a href="/blog/tag/{{t}}/"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
        {%- endfor -%}
     </div>
     {% endif %}
     <div><strong>Reading time:</strong> {{e.reading_time}}</div>
     <p>{{e.excerpt}}</p>
-    <a href="{{e.url}}">Read more</a>
+    <a href="{{e.url}}/">Read more</a>
     {% endfor %}
   """
 
@@ -132,7 +132,7 @@
     <dl>
       {% for e in sublist(0, limit, entries) %}
       <dt>{{e.date}}</dt>
-      <dd> <a href="{{e.url}}">{{e.title}}</a> </dd>
+      <dd> <a href="{{e.url}}/">{{e.title}}</a> </dd>
       {% endfor %}
       </ul>
     </dl>
@@ -195,7 +195,7 @@
         <div class="post-tags">
          <span><strong>Tags:</strong> </span>
          {%- for t in tags -%}
-           <a href="/blog/tag/{{t}}"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
+           <a href="/blog/tag/{{t}}/"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
          {%- endfor -%}
          </div>
         {% endif %}
diff --git a/templates/header.html b/templates/header.html
index 89af48c..963f822 100644
--- a/templates/header.html
+++ b/templates/header.html
@@ -3,6 +3,6 @@
 </header>
 <nav>
   <a href="/">Home</a> |
-  <a href="/blog">Blog</a> |
-  <a href="/about">About</a>
+  <a href="/blog/">Blog</a> |
+  <a href="/about/">About</a>
 </nav>
dmbaturin added a commit that referenced this issue Mar 9, 2025
but give the user an option to disable that
@dmbaturin
Collaborator

I never paid attention to that, but you are right. In fact, even the simple HTTP server started by python -m http.server behaves exactly that way and responds to a request for a directory like /about with a 301 redirect to /about/.
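For anyone who wants to reproduce that locally, here is a minimal sketch using only the Python standard library. It serves a throwaway directory containing about/index.html and makes one request without following redirects. The names QuietHandler and probe are made up for this example.

```python
import functools
import http.client
import tempfile
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep the request log out of the output


def probe(path):
    """Request `path` from a temporary site; return (status, Location)."""
    with tempfile.TemporaryDirectory() as root:
        page = Path(root, "about", "index.html")
        page.parent.mkdir()
        page.write_text("<!DOCTYPE html><title>About</title>")

        handler = functools.partial(QuietHandler, directory=root)
        server = HTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            conn = http.client.HTTPConnection(
                "127.0.0.1", server.server_port, timeout=5
            )
            conn.request("GET", path)  # http.client never follows redirects
            response = conn.getresponse()
            response.read()
            result = (response.status, response.getheader("Location"))
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()


print(probe("/about"))
print(probe("/about/"))
```

The first request should come back as a 301 with a Location header ending in a slash, and the second as a plain 200.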

I think trailing slashes should be the default for URLs produced by site indexing, since that is the de facto canonical URL form in most web servers. 6232d56 makes it so, but also adds a settings.clean_url_trailing_slash option to revert to the old behavior, if anyone wants it.

@forbytten
Author

Sounds good! Thanks for the very speedy response!
