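Hi,
Soupault generates clean URLs without a trailing slash, resulting in a redirect for each requested URL from, for example, "/blog" to "/blog/". Both https://soupault.app and https://baturin.org contain URLs of the former type, as does the blueprints blog.
Impact:
Potential performance impact from unnecessary requests.
Potential SEO penalty.
Details
The reference manual (https://soupault.app/reference-manual/#clean-urls) states:
Soupault uses clean URLs by default. If you add a page to site/, for example, site/about.html, it will turn into build/about/index.html so that it can be accessed as https://mysite.example.com/about.
That statement seems to imply that a request for https://mysite.example.com/about should resolve to https://mysite.example.com/about/index.html, and both https://soupault.app and the blueprints blog contain links without a trailing slash, such as https://soupault.app/blog in the header.
However, those URLs cause a redirect on a number of web servers.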
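To make the mapping described in the manual concrete, here is a minimal Python sketch. It is not Soupault's implementation: the function name `clean_url`, its parameters, and the special handling of index pages are assumptions made for illustration. It only shows how one source page yields an output file plus a link with or without the trailing slash.

```python
from pathlib import PurePosixPath


def clean_url(page, site_dir="site", build_dir="build", trailing_slash=True):
    """Return (output_file, link) for a source page under a clean-URL scheme.

    Hypothetical helper for illustration only, not part of Soupault.
    """
    rel = PurePosixPath(page).relative_to(site_dir)
    # Assumption: an index page keeps its directory; any other page gets its own.
    section = rel.parent if rel.stem == "index" else rel.with_suffix("")
    output = PurePosixPath(build_dir) / section / "index.html"
    link = "/" + "/".join(section.parts)
    if trailing_slash and not link.endswith("/"):
        link += "/"
    return str(output), link


print(clean_url("site/about.html", trailing_slash=False))
print(clean_url("site/about.html"))
```

The first call gives the link form that triggers the redirect discussed in this issue (`/about`); the second gives the form that maps directly onto the generated directory (`/about/`).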
I have attached screenshots from different tools showing how https://soupault.app/blog is being redirected to https://soupault.app/blog/. I have used multiple tools because it was confusing at first, as the response body of the 301 Moved Permanently is actually the same response body as the subsequent 200 OK response.
Burp Suite Community Edition: the redirect response is 10247 bytes and seems to contain a gzipped body that Burp does not decompress because the status is 301, whereas the subsequent 200 OK response is 44185 bytes once Burp has decompressed it.
Chromium seems to throw away the 301 redirect body so the 301 response is only 226 B.
Firefox seems to download both responses, so they are both about 10 kB.
The redirect can also be seen with curl:
curl with explicit --no-location option so it does not follow the redirect:
curl with --location option so it does follow the redirect. The -i option makes it show the response headers for both the original request and the second request:
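The same kind of redirect can be reproduced locally without curl. The sketch below is an illustration under stated assumptions: it uses Python's built-in `http.server` as the web server (not the server behind soupault.app), a throwaway directory containing `about/index.html`, and a hypothetical helper named `probe`. `http.client` does not follow redirects, so the first response is visible as-is.

```python
import functools
import http.client
import http.server
import pathlib
import tempfile
import threading


def probe(path):
    """Serve a throwaway clean-URL site and return (status, Location) for path."""
    with tempfile.TemporaryDirectory() as root:
        # Mimic the generated layout: build/about/index.html
        page_dir = pathlib.Path(root) / "about"
        page_dir.mkdir()
        (page_dir / "index.html").write_text("<h1>About</h1>")

        handler = functools.partial(
            http.server.SimpleHTTPRequestHandler, directory=root
        )
        # Port 0 lets the OS pick a free port.
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body before closing
            result = (response.status, response.getheader("Location"))
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()


print(probe("/about"))   # request without the trailing slash
print(probe("/about/"))  # request with the trailing slash
```

With this particular server, the request without the slash is answered with a 301 pointing at `/about/`, while the request with the slash is served directly. Other servers may answer with a different status code, such as the 302 mentioned for Gitlab Pages.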
Potential performance impact from unnecessary requests, the degree of impact depending on the end user's environment, the weight of the requested page and the behavior and configuration of the site's webserver.
Potential SEO penalty. I'm unsure of this because SEO is confusing but:
Google's Search Console explicitly complained about the redirects on two of my URLs. I don't know why it didn't complain about the others, but I have generally had trouble with Google not indexing a number of pages. Notably, Gitlab Pages, which I use, responds with a 302 Found redirect instead of a 301 Moved Permanently, and Google describes the former as providing only a weak signal that the redirect target should be canonical: https://developers.google.com/search/docs/crawling-indexing/301-redirects
Other SEO advice on the internet also suggests that 302 Found should only be used for temporary redirects, as it will not pass SEO value to the target URL.
Potential Remediation
I was able to fix the issue in my blog without needing any changes from Soupault itself. I have attached the equivalent diff as applied to the blueprint blog, generated by git diff > clean-url-issue/soupault-blueprints-blog.patch and updated to use the latest Soupault 4.11.0. The fix may be better made in Soupault itself, but I am unsure. Some potential schools of thought:
Some URLs are generated by Soupault's indexing and used in templates, such as {{e.url}} and {{t.url}}, so it seems reasonable to expect those URLs to already have a trailing slash?
Changing Soupault's generated URLs may break existing blogs, resulting in two slashes being added.
If Soupault's generated URLs were changed to end in a slash, then editing the blueprint blog would need a bit more care, because there would be two conventions to follow:
If using a URL generated by Soupault, don't add a trailing slash in the HTML.
Otherwise, add a trailing slash in the HTML.
In contrast, my attached patch has a simpler rule: every HTML href should end in an explicit slash.
The issue can also sometimes be addressed by webserver configuration, but not all providers allow such access; Gitlab Pages, for example, does not.
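The double-slash concern raised above can be avoided if the slash is added idempotently rather than by appending "/" in a template. The following Python sketch illustrates the idea; `with_trailing_slash` is a hypothetical helper, not something Soupault or the blueprint blog provides, and the rule for skipping file-like paths (a dot in the last segment) is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(href):
    """Append a trailing slash to the path of href, at most once.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(href)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: a dot in the last segment means a file such as /feed.xml,
    # which should be left alone. Empty paths are left alone as well.
    if path and not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


for href in ["/blog", "/blog/", "/blog/tag/ocaml?x=1#top", "/feed.xml"]:
    print(href, "->", with_trailing_slash(href))
```

Because applying the function twice gives the same result as applying it once, links stay correct whether or not the generated URL already ends in a slash.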
If you think changing the blueprint blog is the way to go, I would be happy to raise a pull request with my changes.
Attachments
Burp:
Chromium:
Firefox:
Blueprint blog patch:
diff --git a/helpers/blog-index.lua b/helpers/blog-index.lua
index 0f4f20b..7a73458 100644
--- a/helpers/blog-index.lua
+++ b/helpers/blog-index.lua
@@ -112,7 +112,7 @@ local template = [[
<h1>Posts by tag</h1>
<ul>
{% for t in tag_links %}
- <li> <a href="{{t.url}}">{{t.title}}</a> </li>
+ <li> <a href="{{t.url}}/">{{t.title}}</a> </li>
{% endfor %}
</ul>
]]
diff --git a/netlify.sh b/netlify.sh
index 2d3d9f6..50b4d5d 100755
--- a/netlify.sh
+++ b/netlify.sh
@@ -1,6 +1,6 @@
#!/bin/sh
-SOUPAULT_VERSION="4.0.0-beta1"
+SOUPAULT_VERSION="4.11.0"
if [ -z "${SOUPAULT_VERSION}" ]; then
echo "Error: soupault version is undefined, cannot decide what to download"
diff --git a/soupault.toml b/soupault.toml
index c5f62d6..1df50d7 100644
--- a/soupault.toml
+++ b/soupault.toml
@@ -99,19 +99,19 @@
# Jingoo template for rendering extracted metadata
index_template = """
{% for e in entries %}
- <h2><a href="{{e.url}}">{{e.title}}</a></h2>
+ <h2><a href="{{e.url}}/">{{e.title}}</a></h2>
<div><strong>Last update:</strong> {{e.date}}</div>
{% if e.tags %}
<div class="post-tags">
<strong>Tags: </strong>
{%- for t in e.tags -%}
- <a href="/blog/tag/{{t}}"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
+ <a href="/blog/tag/{{t}}/"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
{%- endfor -%}
</div>
{% endif %}
<div><strong>Reading time:</strong> {{e.reading_time}}</div>
<p>{{e.excerpt}}</p>
- <a href="{{e.url}}">Read more</a>
+ <a href="{{e.url}}/">Read more</a>
{% endfor %}
"""
@@ -132,7 +132,7 @@
<dl>
{% for e in sublist(0, limit, entries) %}
<dt>{{e.date}}</dt>
- <dd> <a href="{{e.url}}">{{e.title}}</a> </dd>
+ <dd> <a href="{{e.url}}/">{{e.title}}</a> </dd>
{% endfor %}
</ul>
</dl>
@@ -195,7 +195,7 @@
<div class="post-tags">
<span><strong>Tags:</strong> </span>
{%- for t in tags -%}
- <a href="/blog/tag/{{t}}"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
+ <a href="/blog/tag/{{t}}/"><span class="post-tag">{{t}}</span></a>{% if not loop.last %}, {% endif %}
{%- endfor -%}
</div>
{% endif %}
diff --git a/templates/header.html b/templates/header.html
index 89af48c..963f822 100644
--- a/templates/header.html
+++ b/templates/header.html
@@ -3,6 +3,6 @@
</header>
<nav>
<a href="/">Home</a> |
- <a href="/blog">Blog</a> |
- <a href="/about">About</a>
+ <a href="/blog/">Blog</a> |
+ <a href="/about/">About</a>
</nav>
I never paid attention to that, but you are right. In fact, even SimpleHTTP of python -m http.server behaves exactly that way and responds to everything like /about with a 301 redirect to /about/.
I think trailing slashes should be the default for URLs produced by site indexing, since it's the de facto canonical URL in most web servers. 6232d56 makes it so, but also adds an option to revert to the old behavior with settings.clean_url_trailing_slash = true option, if anyone wants it.