Add lastmod to sitemap #2604

Closed
jdevalk opened this issue Apr 14, 2020 · 33 comments · Fixed by #9954
Labels
bug: An error in the Docusaurus core causing instability or issues with its execution
difficulty: intermediate: Issues that are medium difficulty level, e.g. moderate refactoring with a clear test plan.
help wanted: Asking for outside help and/or contributions to this particular issue or PR.

Comments

@jdevalk

jdevalk commented Apr 14, 2020

🐛 Bug Report

The XML sitemaps currently output loc, changefreq and priority for every url set. I would propose dropping the changefreq and priority fields, as none of the search engines use these, and instead adding the lastmod field, with the last modification date of the file.

Have you read the Contributing Guidelines on issues?

Yes.

To Reproduce


  1. Open any Docusaurus v2 sitemap :)

Expected behavior

The current output would be:

<url>
        <loc>https://developer.yoast.com/features/canonical-urls/api</loc>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
</url>


Actual Behavior

I propose changing it to:

<url>
        <loc>https://developer.yoast.com/features/canonical-urls/api</loc>
        <lastmod>2020-04-14T11:22:05+00:00</lastmod>
</url>

Your Environment

  • Docusaurus version used: v2
@jdevalk jdevalk added bug An error in the Docusaurus core causing instability or issues with its execution status: needs triage This issue has not been triaged by maintainers labels Apr 14, 2020
@RDIL
Contributor

RDIL commented Apr 14, 2020

I think this would be a good addition, but I do know web crawlers that use the priority field.

@jdevalk
Author

jdevalk commented Apr 14, 2020

@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.

@RDIL
Contributor

RDIL commented Apr 14, 2020

Fair enough.

@yangshun yangshun added difficulty: intermediate Issues that are medium difficulty level, e.g. moderate refactoring with a clear test plan. help wanted Asking for outside help and/or contributions to this particular issue or PR. v2 and removed status: needs triage This issue has not been triaged by maintainers labels Apr 14, 2020
@yangshun
Contributor

Great idea! Thanks for the suggestion!

@yangshun yangshun changed the title XML Sitemaps are incomplete [v2] XML Sitemaps are incomplete Apr 14, 2020
@AvroraPolnareff

Hello! I want to help solve this issue.
As I can see, there are several implementation options here:

  1. Should I leave the old tags and add new ones or replace them?
  2. Which date should be specified in the "lastmod" tag: the date of the last build of the project or the date of the last page change? If the second, are there any easier ways to do it?

@RDIL
Contributor

RDIL commented Oct 23, 2020

Most likely the last build time, since even tiny changes end up changing the chunk hashes, so it's constantly being modified.

@slorber
Collaborator

slorber commented Oct 23, 2020

@RDIL FYI Webpack 5 might help to make the js chunks more "stable" (see my recent comment in #3383), we may try to migrate after i18n is ready.

Not sure what we should do for this date. Also not sure how the sitemaps plugin could access the "last modification date" of the page, as this plugin is decoupled from the others.

Is it mandatory to add it to the sitemaps? It could likely be easier to handle this by adding a meta directly on the page, otherwise, we'd have to find a way to provide such metadata per path to the sitemap plugin.

Asking this, because for my work on i18n I'll also have to think about how to set up useful headers for localization (hreflang), and thought about adding them to the page directly instead of the sitemaps.

@jdevalk as it seems you know more about SEO than the rest of us, can you give us some insights?

@jdevalk
Author

jdevalk commented Oct 23, 2020

Last modified is somewhat of a must for XML sitemaps indeed.

I think for hreflang I'd go for adding it to the page instead of the XML sitemaps as that makes debugging a lot easier and maybe even makes it accessible to other features within docusaurus, like a language switcher.

@slorber
Collaborator

slorber commented Oct 23, 2020

Thanks, will do that.

About lastModified, some plugins already read git history to get the last modified date. We can also allow hardcoding it through front matter.

I think we should:

  • call addRoute apis with lastModified: lastModifiedFrontmatter || lastModifiedGit || lastModifiedFS || undefined
  • use that data when generating the sitemaps. If not available, add the date of the build?

If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?

We agree that this date should rather be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?
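
The fallback chain proposed above could be sketched roughly as follows. This is only an illustration: the `LastModifiedSources` interface, its field names, and `resolveLastModified` are hypothetical and do not come from the Docusaurus codebase.

```typescript
// Hypothetical shape: these field names are assumptions made for illustration.
interface LastModifiedSources {
  frontMatter?: number; // timestamp (ms) hardcoded by the author in front matter
  git?: number; // timestamp (ms) of the last commit touching the source file
  fileSystem?: number; // mtime (ms) of the source file
}

// Returns the first available timestamp, or undefined when nothing is known.
// `??` is used instead of `||` so that a timestamp of 0 is not skipped.
function resolveLastModified(sources: LastModifiedSources): number | undefined {
  return sources.frontMatter ?? sources.git ?? sources.fileSystem ?? undefined;
}
```

Returning `undefined` rather than the build time leaves the caller free to omit the entry entirely.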

@jdevalk
Author

jdevalk commented Oct 23, 2020

If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?

I would not add it then. Having it change all the time when it's actually not changing is also not beneficial.

We agree that this date should rather be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?

Agreed.

@Josh-Cena Josh-Cena removed the v2 label Oct 30, 2021
@Josh-Cena Josh-Cena changed the title [v2] XML Sitemaps are incomplete Add lastmod to sitemap Apr 10, 2022
@Ali-Shafiyev

Hi!
Make the suggested changes to the code that generates the XML sitemaps.
Test the changes locally to ensure the desired structure with the lastmod field is generated.

@saul-data

This would be super useful as we are busy automating spell checking and grammar using AI. I was hoping to use the lastmod to understand when a page has changed to do a spell check and grammar check before deploying to live. I wouldn't want to do this for the entire website.

I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.

Maybe it can be an input in Layout tag:

<Layout title="Dataplane Data &amp; Automation Platform | Open Source" lastmod="2020-04-14T11:22:05+00:00">

@jdevalk
Author

jdevalk commented Jul 19, 2023

While I understand @saul-data has different needs, for SEO / crawl efficiency reasons I’d only change the lastmod when the content changes. I’d say basing it on the lastmod date of the underlying source document is probably easiest.

Note that search engines have recently been putting more emphasis on lastmod, so I’d prioritize this issue a bit higher.

@saul-data

Would this be linked to https://docusaurus.io/docs/blog#blog-post-date ?

I couldn't see a date reference for pages and docs (only versions).

I feel this should be an input by the user when the content or page has changed.

@slorber
Collaborator

slorber commented Jul 20, 2023

Note: there's a related issue to add an explicit last update date for blog posts, that could be used as the sitemap lastmod

#8657

@pmarschik

I have a prototype for adding <lastmod> to the sitemap.xml here https://github.com/facebook/docusaurus/pull/9234/files.

@slorber Is this how you envisioned the feature in #2604 (comment)?

@johnnyreilly
Contributor

I solved this problem for my own site with a post build script; I blogged about it here: https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date

@scaleoutsean

@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.

https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date#updated-12th-november-2023-googles-view-on-lastmod-changefreq-and-priority

@jdevalk
Author

jdevalk commented Feb 3, 2024

Yeah I’m sorry it’s basically a requirement now.

@slorber
Collaborator

slorber commented Mar 15, 2024

Hey

We have merged support for git/front matter last update metadata for blog posts (#8657) which now means both blog and docs have unified support for this feature. (note that the pages plugin doesn't have support, although we could also add it there)

Now is a good time to add "lastmod" to the sitemap as well.

I'll review your PR soon @pmarschik, sorry for the delay.

In the meantime let's decide what should be implemented exactly here, using the Google sitemap doc as a ref:
https://developers.google.com/search/blog/2023/06/sitemaps-lastmod-ping#the-lastmod-element


I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.

@saul-data this is not what we will implement because it's not what Google recommends:

And when we say "last modification", we actually mean "last significant modification". If your CMS changed an insignificant piece of text in the sidebar or footer, you don't have to update the lastmod value for that page.


I would propose dropping the changefreq and priority fields

@jdevalk I'd rather keep them for now, and maybe we'll remove those later. I guess we can consider the removal as a breaking change? 🤷‍♂️


I solved this problem for my own site with a post build script; I blogged about it here: johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date

@johnnyreilly note that your solution filters pages from the sitemap such as the tags and paginated lists pages, since they do not match your regexp pattern.

To implement this feature properly, we should also consider that there isn't always a Markdown document per sitemap URL, and some pages are also displaying multiple documents at once.

It's more difficult to define a "lastmod" date for those URLs for example:

My suggestion is to initially keep things simple, and only add a "lastmod" date when the page is backed by a Markdown document.

The Google doc says:

You can use a lastmod element for all the pages in your sitemap, or just the ones you're confident about. For instance, some site software may not be able to easily tell the last modification date of the homepage or a category page because it just aggregates the other pages on the site. In these cases it's fine to leave out lastmod for those pages.


Do we agree on this plan?
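
For illustration, the "only emit lastmod when we are confident" rule could look something like the sketch below. The `SitemapEntry` type and the `escapeXml` and `renderUrl` helpers are hypothetical names, not part of the sitemap plugin.

```typescript
// Hypothetical entry shape: lastmod is optional and already formatted.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // e.g. "2023-01-05"
}

// Escapes the characters that XML treats specially in text content.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Renders one <url> element, leaving out <lastmod> when no date is known
// (for example, for a blog index or tag page that aggregates several documents).
function renderUrl(entry: SitemapEntry): string {
  const lines = ['<url>', `  <loc>${escapeXml(entry.loc)}</loc>`];
  if (entry.lastmod !== undefined) {
    lines.push(`  <lastmod>${escapeXml(entry.lastmod)}</lastmod>`);
  }
  lines.push('</url>');
  return lines.join('\n');
}
```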

@slorber
Collaborator

slorber commented Mar 15, 2024

Something important to also consider: reading the file history from git is quite expensive (particularly for large sites), and we probably shouldn't do this by default unless the user opts in.

We only read from git when the showLastUpdateTime: true plugin option is provided, which means only in that case we would add the "lastmod" field to the sitemap.

Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?

I'd like to refactor the APIs and do breaking changes to make things less confusing, but I wonder if having the behavior above (a bit awkward) can be a problem to some of you?

@wparad

wparad commented Mar 15, 2024

Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?

If you are using either the sitemap OR showLastUpdateTime then it should work. It doesn't make sense to require showLastUpdateTime to be set: that property has nothing to do with RSS feeds/SEO, and coupling them together will just be confusing for everyone.

@johnnyreilly
Contributor

Decent plan - happy with it. Do the breaking changes - good default

@slorber
Collaborator

slorber commented Mar 19, 2024

Thanks for your feedback

Agree @wparad, will try to find a solution so that the sitemap lastmod can be used independently from the docs/blog plugin options. We still need to avoid reading the lastmod date from Git twice, for performance reasons (this can be expensive for thousands of files).

@slorber
Collaborator

slorber commented Mar 19, 2024

New sitemap options are implemented in PR, ready to review: #9954

{
  lastmod: null | 'date' | 'datetime',
  priority: null,
  changefreq: null,
}

Example with our Docusaurus website sitemap:
https://deploy-preview-9954--docusaurus-2.netlify.app/sitemap.xml

<urlset
	xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
	xmlns:xhtml="http://www.w3.org/1999/xhtml"
	xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
	xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
	<url>
		<loc>https://docusaurus.io/blog/</loc>
	</url>
	<url>
		<loc>https://docusaurus.io/blog/2017/12/14/introducing-docusaurus/</loc>
		<lastmod>2023-01-05</lastmod>
	</url>
	<url>
		<loc>https://docusaurus.io/blog/2018/04/30/How-I-Converted-Profilo-To-Docusaurus/</loc>
		<lastmod>2023-01-04</lastmod>
	</url>
	<url>
		<loc>https://docusaurus.io/blog/2018/09/11/Towards-Docusaurus-2/</loc>
		<lastmod>2023-04-21</lastmod>
	</url>
	<url>
		<loc>https://docusaurus.io/docs/versioning/</loc>
		<lastmod>2024-01-04</lastmod>
	</url>
	<url>
		<loc>https://docusaurus.io/</loc>
		<lastmod>2023-10-31</lastmod>
	</url>

	<!-- ... Other URLs, this is just a sample -->
</urlset>

You will notice that not all the URLs have a lastmod element (e.g. /blog/). This is on purpose, according to the Google guidelines above.

For now, I'm not changing defaults in Docusaurus v3, and the base sitemap for existing sites will stay the same as before. However, these options should help you remove priority + changefreq, and add lastmod. I do agree that according to Google recommendations, using the exact same priority and changefreq for all the URLs is kind of an anti-pattern, and we are likely to remove these options in v4.
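
As a rough sketch of opting in, the three options could be set together as below. The option names and values are the ones listed in this comment; the surrounding `presets` structure is an assumption about a typical site config and may not match your setup.

```typescript
// Option names and values come from the comment above.
const sitemapOptions = {
  lastmod: 'date', // or 'datetime'; null omits <lastmod>
  changefreq: null, // omit <changefreq>
  priority: null, // omit <priority>
} as const;

// Assumed placement: passed to the sitemap plugin through the classic preset.
const config = {
  presets: [['classic', {sitemap: sitemapOptions}]],
};
```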

The sitemap plugin will give priority to the route metadata lastModifiedAt provided by plugins (and our 3 content plugins eventually add that metadata).

But the sitemap plugin can also work in isolation: it will read the git history itself when lastmod !== null and plugins did not provide the lastModifiedAt route metadata. This way, we read the git history at most once per source file, instead of potentially making the same expensive call twice.
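
The lookup order described here might be sketched as follows. This is not the code from the PR: `GitDateReader`, `memoizeBySourceFile`, `getLastModified`, and the `sourceFile` field are hypothetical names, and the git reader is injected so the sketch stays independent of how git is actually called. Only `lastModifiedAt` and the `lastmod` option values come from the thread.

```typescript
// Hypothetical reader: returns a timestamp (ms) or undefined if git knows nothing.
type GitDateReader = (sourceFile: string) => number | undefined;

// Wraps an expensive lookup so that each source file is read at most once.
function memoizeBySourceFile(read: GitDateReader): GitDateReader {
  const cache = new Map<string, number | undefined>();
  return (sourceFile) => {
    if (!cache.has(sourceFile)) {
      cache.set(sourceFile, read(sourceFile));
    }
    return cache.get(sourceFile);
  };
}

// Prefers metadata supplied by a content plugin, and only falls back to git
// when lastmod is enabled and the route carries no lastModifiedAt.
function getLastModified(
  route: {sourceFile?: string; lastModifiedAt?: number},
  lastmodOption: 'date' | 'datetime' | null,
  readGitDate: GitDateReader,
): number | undefined {
  if (lastmodOption === null) return undefined;
  if (route.lastModifiedAt !== undefined) return route.lastModifiedAt;
  return route.sourceFile ? readGitDate(route.sourceFile) : undefined;
}
```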

Does it look good to you, or do you see any issues with the implementation above?

@johnnyreilly
Contributor

This seems pretty good. I note that lastmod is date only, not datetime. I used datetime on my handrolled implementation:

<url>
<loc>https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date</loc>
<lastmod>2023-11-12T08:33:51+00:00</lastmod>
</url>

I suspect the time portion isn't that important. Most blogs won't be meaningfully updated more than once a day and crawlers may run less frequently than that.

Looks good!

@slorber
Collaborator

slorber commented Mar 20, 2024

Thanks for the review

You can choose either the date or the datetime plugin option; they are formatted differently:

const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
  date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

That date is "relative" and only helps Google prioritize page crawls within your own site, so I will probably use "date" as a default in v4. datetime takes more space, and I doubt the default Docusaurus sites are updated enough for time to be useful. So if you want datetime, it will remain opt-in.
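
To make the difference between the two formats concrete, here is a self-contained version of the snippet above with a sample timestamp. The `LastModOption` and `LastModFormatter` type definitions are assumptions, since the thread does not show them.

```typescript
// Assumed type definitions; only the formatter object itself appears in the thread.
type LastModOption = 'date' | 'datetime';
type LastModFormatter = (timestamp: number) => string;

const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
  date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

// 12 Nov 2023, 08:33:51 UTC (months are zero-based in Date.UTC).
const sampleTimestamp = Date.UTC(2023, 10, 12, 8, 33, 51);
const asDate = LastmodFormatters.date(sampleTimestamp); // "2023-11-12"
const asDatetime = LastmodFormatters.datetime(sampleTimestamp); // "2023-11-12T08:33:51.000Z"
```

Note that `toISOString()` always outputs UTC with a `Z` suffix and milliseconds, so the datetime form differs slightly from the `+00:00` style shown earlier in the thread.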

@johnnyreilly
Contributor

I think I'll stick with the default of date - nice to have options though.

@slorber
Collaborator

slorber commented Apr 11, 2024

Hey, not related to lastmod, but should Docusaurus support sitemap images?

Apparently, this is a thing:

@johnnyreilly
Contributor

Oh wow! Never heard of this. Despite all the links, I can't work out if there's a compelling reason to have them. Hmmmmm

@slorber
Collaborator

slorber commented Apr 15, 2024

Yes 😄 TIL there are also video and news sitemaps in @stefanjudis's article:
https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/

I'm not sure it's worth supporting officially or by default, but we could do like the blog plugin and let users provide a createSitemapItem hook to add extra attributes if they want to? 🤷‍♂️

@johnnyreilly
Contributor

I think the hook is a good idea - I already manually amend my sitemap to exclude tags and pagination pages. Having a hook in the box would support that use case as well as this.
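
A hook along these lines might look like the sketch below. It is purely illustrative: the thread only floats the idea, so the `SitemapItem` shape, the `CreateSitemapItems` signature, and the URL patterns are all assumptions.

```typescript
// Hypothetical item shape for illustration.
interface SitemapItem {
  url: string;
  lastmod?: string;
}

// Hypothetical hook signature: receives the default items and returns the final list.
type CreateSitemapItems = (items: SitemapItem[]) => SitemapItem[];

// Drops tag pages and paginated list pages; the URL patterns are assumptions
// about how such pages are typically routed.
const excludeTagsAndPagination: CreateSitemapItems = (items) =>
  items.filter(
    (item) =>
      !/\/tags(\/|$)/.test(item.url) && !/\/page\/\d+\/?$/.test(item.url),
  );
```

The same hook shape could also map over the items to attach extra attributes instead of filtering them.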

@johnnyreilly
Contributor

johnnyreilly commented Apr 19, 2024

This made me laugh BTW: 🤣

Will I now drop everything and add these to all my sites? Naaaah, I think I'm fine.

https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/
