
SEO - Page is blocked from indexing - robots.txt #5273

Closed
ethcryptodev opened this issue May 19, 2018 · 11 comments · Fixed by #5293

Comments

@ethcryptodev

I wrote about this bug before, and someone correctly pointed out an issue on one of the sites in question, which made me question whether it was a bug at all. I have since corrected that issue, and after more testing, every single site I try to analyze with Lighthouse 3.0.0-alpha.2 says "Page is blocked from indexing" and applies an SEO penalty. Older versions of Lighthouse seem to be fine. This is the version of Lighthouse that automatically downloads and installs when a user adds the Lighthouse extension from the Chrome Web Store for easier one-click testing. Other reviews seem to indicate that this version was perhaps pushed to the Chrome Web Store a little too early.

The version info on the Chrome Web Store indicates it is 2.10.0.3002 (May 16, 2018); however, what actually downloads and installs as the browser extension is 3.0.0-alpha.2.

[screenshot]

[screenshot]

@ethcryptodev ethcryptodev changed the title SEO - Page is blocked from indexing SEO - Page is blocked from indexing - ROBOTS.TXT May 19, 2018
@ethcryptodev ethcryptodev changed the title SEO - Page is blocked from indexing - ROBOTS.TXT SEO - Page is blocked from indexing - robots.txt May 19, 2018
@arjunthakur08

@strayangelfilms I'm wondering if this is a bug, or if it's possible that Google's search algorithm now requires the robots.txt file to be indexed, even though there is no use in indexing a robots.txt file. Please shine a light on this issue.

@patrickhulce or @ebidel or @googlebot - can anyone of you help us with this?

@darwing1210

Same issue

@patrickhulce
Collaborator

Thanks for following up! It seems like part of the issue is that these robots.txt files manually enumerate the bot user agents they allow indexing from, and then include something along the lines of GitHub's fallback `*` rule:

[screenshot]
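
A hypothetical robots.txt of roughly that shape illustrates the pattern (the bot names and rules here are illustrative, not copied from GitHub's actual file):

    User-agent: Googlebot
    Allow: /

    User-agent: bingbot
    Allow: /

    # Fallback for every other crawler
    User-agent: *
    Disallow: /

A check that only consults the `User-agent: *` group would conclude the page is blocked, even though the major crawlers are explicitly allowed.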

Perhaps we should check against typical bot user agents like Googlebot instead? @rviscomi @kdzwinel

The other problem seems to be extension-specific. I don't see the wordpress.com issue when running from the CLI; we'll have to investigate what's going on there.

@rviscomi
Member

It's unclear from the screenshot whether the audit is actually failing or encountering a fatal error; I think the red icon indicates an error.

In the LH test I just ran on this page, I'm getting errors:

[screenshot]

In the JSON results, I'm seeing this:

    "is-crawlable": {
      "id": "is-crawlable",
      "title": "Page is blocked from indexing",
      "description": "Search engines are unable to include your pages in search results if they don't have permission to crawl them. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/indexing).",
      "score": null,
      "scoreDisplayMode": "error",
      "rawValue": null,
      "errorMessage": "Audit error: Unable to identify the main resource"
    }

So the `Audit error: Unable to identify the main resource` error message indicates that something is wrong with the main-resource gatherer (the component that identifies which network request corresponds to the main HTML document).
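
As a rough sketch (not Lighthouse's actual implementation; `records` and `finalUrl` stand in for the real artifacts), the lookup amounts to matching the page's final URL against the recorded network requests:

    // Hypothetical sketch of a main-resource lookup.
    function findMainResource(records, finalUrl) {
      const mainResource = records.find(record => record.url === finalUrl);
      if (!mainResource) {
        // The situation behind "Unable to identify the main resource".
        throw new Error('Unable to identify the main resource');
      }
      return mainResource;
    }

If redirects or a URL mismatch leave no record matching `finalUrl`, every audit that depends on the main resource errors out this way.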

@kdzwinel is OOO this week. @patrickhulce you're probably more familiar with this than me, is this something you can investigate further?

@arjunthakur08

@rviscomi So you're saying this is a Lighthouse error, right?

@patrickhulce
Collaborator

@rviscomi what you're bumping into is actually #5266 :)

The screenshots @strayangelfilms provided aren't showing errors; it just seems to be a logic error in the robots-parser library when browserified. We can look into it.

@ethcryptodev
Author

ethcryptodev commented May 21, 2018 via email

@rviscomi
Member

rviscomi commented May 21, 2018

Ok, thanks for clarifying. I see what's happening now. The audit was recently updated in v3 to look at robots.txt in addition to meta tags. It only looks at the `User-agent: *` group, which in this case is an uncommon fallback directive because other bot-specific directives come first. So the audit's failure is a false negative.

This question came up during development, and we decided to avoid distinguishing between crawlers (e.g. looking only at Googlebot). `*` seemed like the common case, but this counter-example shows how that can be an incorrect assumption.
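
As a sketch of the distinction, using the robots-parser library mentioned above (the URL and rules are hypothetical):

    const robotsParser = require('robots-parser');

    const robots = robotsParser('https://example.com/robots.txt', [
      'User-agent: Googlebot',
      'Allow: /',
      '',
      'User-agent: *',
      'Disallow: /',
    ].join('\n'));

    robots.isAllowed('https://example.com/page', 'Googlebot');    // true
    robots.isAllowed('https://example.com/page', 'SomeOtherBot'); // false: falls through to *

The same URL is indexable to Googlebot and blocked for everyone else, so an audit keyed only to `*` reports it as blocked.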

The audit answers the question "Is this page indexable?" but I think we're missing an important follow-up question: "to whom?" Because it can be indexable to some and not others.

I think the safest path forward is to handle these indeterminate cases as a warning rather than a failure. The audit help text should convey that the page is indexable to some crawlers and not others, and enumerate the bots to which it's not indexable. Do we have the ability to pass/fail/warn dynamically? If so, that would be a good compromise to avoid penalizing the SEO score while drawing attention to potentially serious misconfigurations.

@patrickhulce
Collaborator

Do we have the ability to pass/fail/warn dynamically?

Yep! We can handle this, though we just decided in #5270 to not show passed audits with warnings as failures, which we might want to reverse if this is our solution to this issue :)
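
A rough sketch of what that could look like in an audit result, with `blockedAgents` as an assumed input and the property names following the `rawValue` field visible in the JSON above rather than a confirmed API:

    // Hypothetical: pass the audit overall but surface a warning listing
    // the user agents the page is not indexable to.
    function auditResult(blockedAgents) {
      if (blockedAgents.length === 0) return {rawValue: true};
      return {
        rawValue: true,
        warnings: [`Page is not indexable to: ${blockedAgents.join(', ')}`],
      };
    }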

@patrickhulce
Collaborator

OK, so the secondary piece of this is an extension failure and a regression of #4794, which was fixed but then broken again by #4875. While browserify does indeed shim URL such that no more errors appear, its logic is useless and causes robots-parser to always return a failure when a robots.txt is present. The fact that the audit only fails when a robots.txt is present is why it wasn't caught earlier.
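
A minimal sketch of that failure mode, assuming the shim fails to produce a usable pathname (the robots.txt contents here are illustrative):

    const robotsParser = require('robots-parser');

    const robots = robotsParser('https://example.com/robots.txt',
      'User-agent: *\nAllow: /');

    // With a working URL implementation this returns true. With a broken
    // URL shim that can't parse the page URL, the Allow rule never matches,
    // so the check can come back negative and the page looks blocked
    // whenever a robots.txt exists at all.
    robots.isAllowed('https://example.com/', '*');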

Thank you for reporting and persisting through the many layers of bugs/non-bugs here @strayangelfilms! 👍

Fix is up at #5293.

@ethcryptodev
Author

@patrickhulce Anything for you, Patrick, after you helped me catch that meta robots nofollow problem on our site. I work with an optimization company, and we use Lighthouse all day to evaluate results. It's the best and most robust tool out there! I have another possible issue to report as well, which I'll file as a separate report.
