Sitemap improvements #4061

Closed
2 tasks done
robert-bryson opened this issue Nov 15, 2022 · 11 comments

@robert-bryson
Contributor

robert-bryson commented Nov 15, 2022

User Story

In order to [improve catalog SEO], [datagov team] wants to [fix/improve the current sitemap].

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN [visiting catalog sitemap]
    WHEN [the page load] happens
    THEN [an HTML page should be loaded]
    [AND the XML file is not downloaded, but a download link is provided]

  • GIVEN [checking Google Search Console]
    WHEN [checking sitemap Discovered URLs] happens
    THEN [the number is growing]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Based on the meeting with Freddie Blicher and the datagov team on 11/14, we need to take a couple of actions to improve the effectiveness of our sitemap for catalog.

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Nov 17, 2022
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Nov 17, 2022
@hkdctol hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Nov 17, 2022
@hkdctol
Contributor

hkdctol commented Nov 17, 2022

Would be good to pair to spread background on this issue.

@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Nov 28, 2022
@nickumia-reisys
Contributor

So, with the current sitemap design, I don't think this is possible. Right now, we are hosting the sitemap files in an S3 bucket. Since the bucket is not configured as a static site domain, I don't think it will be possible to use S3 alone as the hosting solution. We could build a sitemap template into ckan catalog, probably in ckanext-datagovtheme, that looks for the S3 files and then dynamically generates these pages on load. I'm not sure how time-consuming it will be, or whether the added complexity is worth it, but I don't see another option for implementing this. @robert-bryson @jbrown-xentity @FuhuXia @btylerburton @Jin-Sun-tts thoughts?
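To make the "dynamically generate the page" idea concrete, here is a minimal sketch of the rendering step only. It assumes the sitemap XML has already been fetched from S3 as a string and that it follows the standard sitemaps.org schema. The function name `sitemap_to_html` and the `download_url` parameter are hypothetical, not part of ckanext-datagovtheme.

```python
import html
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_html(xml_text: str, download_url: str) -> str:
    """Render the <loc> entries of a sitemap document as a simple HTML page.

    Hypothetical helper: the page lists every URL and links to the raw XML
    instead of serving the XML directly.
    """
    root = ET.fromstring(xml_text)
    # <urlset> and <sitemapindex> both nest <loc> one level below the root.
    locs = [
        el.text.strip()
        for el in root.findall("./*/sm:loc", SITEMAP_NS)
        if el.text
    ]
    items = "\n".join(
        f'<li><a href="{html.escape(loc, quote=True)}">{html.escape(loc)}</a></li>'
        for loc in locs
    )
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        f'<p><a href="{html.escape(download_url, quote=True)}" download>'
        "Download XML</a></p>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>"
    )
```

Fetching the file from S3 and wiring this into a CKAN route are left out on purpose, since those are the parts where the load-time cost would show up.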

@nickumia-reisys
Contributor

**Time-consuming in terms of: (1) implementation and (2) actual user load time.

@robert-bryson robert-bryson self-assigned this Dec 6, 2022
@robert-bryson robert-bryson moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 6, 2022
@robert-bryson
Contributor Author

After experimenting with explicitly setting the Content-Type in nginx yesterday, I have been looking more into how S3 behaves with the CLI. I uploaded the same sitemap.xml file with aws s3 cp.. with and without setting the type, and both correctly return application/xml:
image
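The same check can be scripted rather than eyeballed. The sketch below sends a HEAD request and inspects the Content-Type header. The helper names are hypothetical, and accepting both application/xml and text/xml as "XML" is an assumption of this sketch, not something stated in the thread.

```python
import urllib.request

# Assumption: either of these media types counts as XML for our purposes.
XML_TYPES = {"application/xml", "text/xml"}


def is_xml_content_type(header_value: str) -> bool:
    """Return True if a Content-Type header value names an XML media type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";")[0].strip().lower()
    return media_type in XML_TYPES


def fetch_content_type(url: str) -> str:
    """Issue a HEAD request and return the raw Content-Type header ("" if absent)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get("Content-Type", "")
```

Usage would be along the lines of `is_xml_content_type(fetch_content_type(url))` for each uploaded sitemap URL.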

@robert-bryson
Contributor Author

With GSA/catalog.data.gov#702, the sitemaps should be getting created with the correct content-type and should be able to be consumed by Google. Once the PR is merged, a new sitemap job can be dispatched (or one will run overnight), and then Google Search Console should start showing numbers go up.

There are still a number of changes to the sitemaps we can and should make, but hopefully this will enable Google to start picking up URLs.

@robert-bryson
Contributor Author

Well, with GSA/catalog.data.gov#703:
image

@robert-bryson
Contributor Author

image

image

Sitemap URLs are being generated correctly, and sitemap files are being generated correctly, but the connection between the two is not working. It should be a simple nginx fix (and probably is), but I haven't gotten it to work correctly yet. It's possible that there is an S3 config that needs to be changed as well.

@robert-bryson
Contributor Author

It's working! It's working!

Image

(The 'Couldn't Fetch' ones are slowly flipping to 'Success' as the scan runs)

@robert-bryson
Contributor Author

After letting Google do its thing for the weekend, it has made some limited forward progress.

sitemap.xml remains at 5k URLs, and there is now a chart of index coverage, but Google has not grabbed the rest of the sitemap files (showing 5 files successfully read, 76 as couldn't fetch, and the other ~300 not present in the UI for an unknown reason).

I submitted sitemap-20 by hand last week to check as an example of a missing file from above. Initially it was also showing 403, but it has now flipped to success with its 1k URLs, though with no chart of index coverage.

I'm not sure what to conclude or which actions to take. It does seem like Google will slowly crawl and index these URLs, but (from what I can tell) only on the limited set of 5 or 81 sitemap files. I will continue to look for a way to account for the missing (in the UI at least) sitemap files and overcome Google's alleged 403s.
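One way to pin down which files are unaccounted for is to diff the list of child sitemaps against whatever Search Console reports. The sketch below assumes the child files are listed in a standard sitemapindex document (the thread does not say how they are linked) and that the set of URLs seen in Search Console has been copied out by hand. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if el.text
    ]


def missing_from_console(index_xml: str, seen_in_console: set[str]) -> list[str]:
    """Return child sitemap URLs from the index that Search Console does not list."""
    return [
        loc for loc in list_child_sitemaps(index_xml)
        if loc not in seen_in_console
    ]
```

This only identifies the gap; it does not explain why Google reports 403s for files that load fine elsewhere.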

@robert-bryson robert-bryson moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board Jan 12, 2023
@robert-bryson
Contributor Author

robert-bryson commented Jan 19, 2023

As a bump: Search Console is now showing 17k URLs and (perhaps importantly?) 86 sitemap files found in the index (up from the 81 it had been finding, but still short of the total of 374).

Image

Since it does look like Google is slowly making its way through the files, perhaps we should close this for now and make a 'Check back in a month'-type ticket? We do have a lot of URLs after all.

@hkdctol
Contributor

hkdctol commented Jan 19, 2023

We will make a follow up ticket, but close this one as complete.
