
SEO - Page is blocked from indexing - robots.txt #5273

Closed
ethcryptodev opened this issue May 19, 2018 · 11 comments · Fixed by #5293

Comments

@ethcryptodev

I wrote about this bug before, and someone correctly pointed out an issue on one of the sites in question, which made me question whether it was a bug at all. I have since corrected that issue, and after more testing, every single site I try to analyze with Lighthouse 3.0.0-alpha.2 says "Page is blocked from indexing" and applies an SEO penalty. Older versions of Lighthouse seem to be fine. This is the version of Lighthouse that automatically downloads and installs when a user adds the Lighthouse extension from the Chrome Web Store for easier one-click testing. Other reviews seem to indicate that this version was perhaps pushed to the Chrome Web Store a little too early.

The version info on the Chrome Web Store indicates it is 2.10.0.3002 (May 16, 2018); however, what actually downloads and installs as the browser extension is 3.0.0-alpha.2.

[screenshot]

[screenshot]

@ethcryptodev ethcryptodev changed the title SEO - Page is blocked from indexing SEO - Page is blocked from indexing - ROBOTS.TXT May 19, 2018
@ethcryptodev ethcryptodev changed the title SEO - Page is blocked from indexing - ROBOTS.TXT SEO - Page is blocked from indexing - robots.txt May 19, 2018
@arjunthakur08

@strayangelfilms I'm wondering if this is a bug, or if it's possible that Google's search algorithm now requires the robots.txt file to be indexed, even though there is no use in indexing a robots.txt file. Please shine a light on this issue.

@patrickhulce or @ebidel or @googlebot - can anyone of you help us with this?

@darwing1210

Same issue

@patrickhulce
Collaborator

Thanks for following up! It seems like part of the issue is that these robots.txt files manually enumerate the bot user agents they allow indexing from, and then include something along the lines of GitHub's fallback `*` rule:

[screenshot]
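
A hypothetical robots.txt of roughly that shape illustrates the pattern (the bot names and rules here are illustrative, not copied from GitHub's actual file):

    User-agent: Googlebot
    Allow: /

    User-agent: bingbot
    Allow: /

    # Fallback for every other crawler
    User-agent: *
    Disallow: /

A check that only consults the `User-agent: *` group would conclude the page is blocked, even though the major crawlers are explicitly allowed.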

Perhaps we should check against typical bot user agents like Googlebot instead? @rviscomi @kdzwinel

The other problem seems to be extension-specific. I don't see the wordpress.com issue when running from the CLI; we'll have to investigate what's going on there.

@rviscomi
Member

It's unclear from the screenshot whether the audit is actually failing or encountering a fatal error; I think the red icon indicates an error.

In the LH test I just ran on this page, I'm getting errors:

[screenshot]

In the JSON results, I'm seeing this:

    "is-crawlable": {
      "id": "is-crawlable",
      "title": "Page is blocked from indexing",
      "description": "Search engines are unable to include your pages in search results if they don't have permission to crawl them. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/indexing).",
      "score": null,
      "scoreDisplayMode": "error",
      "rawValue": null,
      "errorMessage": "Audit error: Unable to identify the main resource"
    }

So the `Audit error: Unable to identify the main resource` error message indicates that something is wrong with the main-resource gatherer (the component that identifies which network request corresponds to the main HTML document).
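
As a rough sketch (not Lighthouse's actual implementation; `records` and `finalUrl` stand in for the real artifacts), the lookup amounts to matching the page's final URL against the recorded network requests:

    // Hypothetical sketch of a main-resource lookup.
    function findMainResource(records, finalUrl) {
      const mainResource = records.find(record => record.url === finalUrl);
      if (!mainResource) {
        // The situation behind "Unable to identify the main resource".
        throw new Error('Unable to identify the main resource');
      }
      return mainResource;
    }

If redirects or a URL mismatch leave no record matching `finalUrl`, every audit that depends on the main resource errors out this way.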

@kdzwinel is OOO this week. @patrickhulce you're probably more familiar with this than me, is this something you can investigate further?

@arjunthakur08

@rviscomi So you're saying this is a Lighthouse error, right?

@patrickhulce
Collaborator

@rviscomi what you're bumping into is actually #5266 :)

The screenshots @strayangelfilms provided aren't showing errors; it just seems to be a logic error in the robots-parser library when browserified. We can look into it.

@ethcryptodev
Author

ethcryptodev commented May 21, 2018 via email

@rviscomi
Member

rviscomi commented May 21, 2018

Ok, thanks for clarifying. I see what's happening now. The audit was recently updated in v3 to look at robots.txt in addition to meta tags. It only looks at the `User-agent: *` group, which in this case is an uncommon fallback directive because other bot-specific directives come first. So the audit's failure is a false negative.

This question came up during development, and we decided to avoid distinguishing between crawlers (e.g. looking only at Googlebot). `*` seemed like the common case, but this counter-example shows how that can be an incorrect assumption.
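
As a sketch of the distinction, using the robots-parser library mentioned above (the URL and rules are hypothetical):

    const robotsParser = require('robots-parser');

    const robots = robotsParser('https://example.com/robots.txt', [
      'User-agent: Googlebot',
      'Allow: /',
      '',
      'User-agent: *',
      'Disallow: /',
    ].join('\n'));

    robots.isAllowed('https://example.com/page', 'Googlebot');    // true
    robots.isAllowed('https://example.com/page', 'SomeOtherBot'); // false: falls through to *

The same URL is indexable to Googlebot and blocked for everyone else, so an audit keyed only to `*` reports it as blocked.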

The audit answers the question "Is this page indexable?" but I think we're missing an important follow-up question: "to whom?" Because it can be indexable to some and not others.

I think the safest path forward is to handle these indeterminate cases as a warning rather than a failure. The audit help text should convey that the page is indexable to some crawlers and not others, and enumerate the bots to which it's not indexable. Do we have the ability to pass/fail/warn dynamically? If so, that would be a good compromise to avoid penalizing the SEO score while drawing attention to potentially serious misconfigurations.

@patrickhulce
Collaborator

Do we have the ability to pass/fail/warn dynamically?

Yep! We can handle this, though we just decided in #5270 to not show passed audits with warnings as failures, which we might want to reverse if this is our solution to this issue :)
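
A rough sketch of what that could look like in an audit result, with `blockedAgents` as an assumed input and the property names following the `rawValue` field visible in the JSON above rather than a confirmed API:

    // Hypothetical: pass the audit overall but surface a warning listing
    // the user agents the page is not indexable to.
    function auditResult(blockedAgents) {
      if (blockedAgents.length === 0) return {rawValue: true};
      return {
        rawValue: true,
        warnings: [`Page is not indexable to: ${blockedAgents.join(', ')}`],
      };
    }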

@patrickhulce
Collaborator

OK, so the secondary piece of this is an extension failure and a regression of #4794, which was fixed but then broken again by #4875. While browserify does indeed shim URL such that no more errors appear, its logic is useless and causes robots-parser to always return a failure when a robots.txt is present. The fact that the audit only fails when a robots.txt is present is why it wasn't caught earlier.
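
A minimal sketch of that failure mode, assuming the shim fails to produce a usable pathname (the robots.txt contents here are illustrative):

    const robotsParser = require('robots-parser');

    const robots = robotsParser('https://example.com/robots.txt',
      'User-agent: *\nAllow: /');

    // With a working URL implementation this returns true. With a broken
    // URL shim that can't parse the page URL, the Allow rule never matches,
    // so the check can come back negative and the page looks blocked
    // whenever a robots.txt exists at all.
    robots.isAllowed('https://example.com/', '*');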

Thank you for reporting and persisting through the many layers of bugs/non-bugs here @strayangelfilms! 👍

Fix is up at #5293.

@ethcryptodev
Author

@patrickhulce Anything for you, Patrick, after you helped me catch that meta robots nofollow problem on our site. I work with an optimization company, and we use Lighthouse all day to evaluate results. It's the best and most robust tool out there! I have another possible issue to report as well, which I'll file as a separate report.
