Skip to content
This repository has been archived by the owner on Nov 6, 2023. It is now read-only.

Update National_Geographic.xml #14211

Merged
merged 9 commits into from
Jan 23, 2018
Merged

Conversation

episkorowski
Copy link
Contributor

@episkorowski episkorowski commented Jan 6, 2018

#12901
@Bisaloo New PR. Added a barebones list of subdomains and merged National_Geographic.com-problematic.xml into the primary ruleset in order to use the simplified regex.

Ruleset no longer uses www-s, so the website should load now.

@Hainish Hainish added the top-10k label Jan 6, 2018
@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 6, 2018

Great! It is passing tests now. Some things to change:

	<target host="media-mmdb.nationalgeographic.com" />
		<test url="http://media-mmdb.nationalgeographic.com/static-media/img/favicon.ico" />
  • press.nationalgeographic.com has mixed content as well but as far as I can tell, it is only fonts and ads so it is okay to keep it.
  • Rename the file to NationalGeographic.com.xml

@episkorowski
Copy link
Contributor Author

I've implemented most of the changes, but I'm running into some trouble with getting a test url for media-yourshot.nationalgeographic.com. The one you provided works fine for http://media-members.nationalgeographic.com and http://media-mmdb.nationalgeographic.com, but results in a 404 for media-yourshot, which means it fails the testing.

This might just be lack of experience on my part, but I'm not sure how to get a test URL for it either, since the URL by itself results in the 403. Do you have any suggestions? For now I just left it as is.

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 7, 2018

Yes, finding test urls can be tricky.

Usually, you can find them with your search engine. site:media-yourshot.nationalgeographic.com or "media-yourshort.nationalgeographic.com. But it doesn't work in this case.

According to the name, it must be a cdn for yourshort.nationalgeographic.com so I tried visiting this subdomain but still nothing.

A last thing that can sometimes help for cdn is visiting reddit. Still nothing.

If we can't find a working test url, it is safer to remove the target and add it to the top comment under No working URL known.

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 7, 2018

As you have noticed, it is easy to miss the mixed content on archive.nationalgeographic.com. Could you add the URL I provided to the top comment as example? It will greatly help the next person who works on this ruleset.


<!-- Tracking cookies:
-->
<securecookie host="^\.nationalgeographic\.com$" name="^(?:gpv_p\d+|s_\w+|__utm)\w$" />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are at it, could you use a capturing group in the regex?

-->
<ruleset name="National Geographic">

<target host="www.nationalgeographic.com" />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add

	<target host="nationalgeographic.com" />

	<rule from="^http://nationalgeographic.com/"
		to="https://www.nationalgeographic.com/" />

with a note in the top comment explaining that ^ times out over https.

@episkorowski
Copy link
Contributor Author

episkorowski commented Jan 7, 2018

I'm not totally sure if what I did with the regex is correct, but I've implemented the other changes.

EDIT:
Looks like the CI build gave an error I didn't get when I was testing locally:
failure: /opt/utils/../src/chrome/content/rules/National_Geographic.com.xml failed test: The 'from' rule contains unescaped period in regular expression. Try escaping it with a backslash. ERROR: Validation of rulesets failed.

I'm guessing I need an escape character before the period. I'll try it with:
<rule from="^http://nationalgeographic\.com/" to="https://www.nationalgeographic.com/" />

I'm not totally sure what the other error is about, though:
test/selenium/test_navigation.py::TestNavigation::test_httpnowhere_blocks FAILED [ 22%]
Probably messed up the regex.

EDIT2:
Well, looks like it fixed both.


Redirect to HTTP:

- adventureblog
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have Invalid certificate for this one.


- adventureblog
- channel
- help
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid certificate for me.

- adventureblog
- channel
- help
- yourshot
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me over https

- channel
- help
- yourshot
- yourshotblog
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid certificate


- archive (Example: https://archive.nationalgeographic.com/?iid=133596#folio=CV1)
- intelligenttravel
- news
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any page with mixed content for news. Do you have an example?

- intelligenttravel
- news
- press
- video
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any mixed content, do you have an example?

(Note that we only care about active mixed content)

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 8, 2018

I don't have the same results as you for some subdomains listed in the top comment, can you double check?

@episkorowski
Copy link
Contributor Author

Sorry for the delay. I ran through those subdomains again and most of it was consistent with what you said, so I made the changes. Also, I wasn't sure of the difference between active and passive mixed content at first, so I fixed that as well.

I couldn't get channel to work over HTTPS, I was getting the "too many redirects" error in the tests, so I made a note of that in the comment.

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 17, 2018

@episkorowski, thank you and sorry for the hassle. As I said, this is a very tricky ruleset. To be honest, I have been wanting to update it myself for a long time but hadn't found the time yet. I knew it was going to be tedious.

@jeremyn @J0WI, feel free to add some comments but this should be about ready to merge (follow up of #14164).

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 17, 2018

I couldn't get channel to work over HTTPS, I was getting the "too many redirects" error in the tests, so I made a note of that in the comment.

This means that the target redirects to plain http.

@episkorowski
Copy link
Contributor Author

Ah, okay. I'll correct that real quick.

@J0WI
Copy link
Contributor

J0WI commented Jan 18, 2018

Thanks @Bisaloo for the comments.
@episkorowski please make sure that all previous comments/rules are present in the new ruleset (as long as the host does still exist). E. g. mshop does still exist, but is no longer present in the new ruleset.

@episkorowski
Copy link
Contributor Author

I added the rules I missed. I also added the sections for CDN buckets and the mixed content on news from the old ruleset. I'm not totally sure how relevant the info still is, though.

images doesn't appear to exist anymore, so I removed mentions of it.

I wasn't sure what to do with metric, since the link in the rule (https://nationalgeographic-com.112.2o7.net/) leads to a 404 now.

@Bisaloo
Copy link
Collaborator

Bisaloo commented Jan 22, 2018

For metric, you are right we don't want to keep the old rule but you can still list the subdomain under Invalid certificate.

Copy link
Contributor

@J0WI J0WI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still missing images and media-channel.

<ruleset name="National Geographic">

<target host="www.nationalgeographic.com" />
<target host="nationalgeographic.com" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be the first target.

@episkorowski
Copy link
Contributor Author

I've added media-channel to the "Redirects to HTTP" section.

I'm not totally sure what to do with images, though. https://images.nationalgeographic.com/ gives me a 502 Error, same as https://www-s.nationalgeographic.com/. Should I make a category for it or just put it into one of the existing ones?

@J0WI
Copy link
Contributor

J0WI commented Jan 23, 2018

I've added media-channel to the "Redirects to HTTP" section.

There is only channel

I'm not totally sure what to do with images, though.

Since http and https doesn't behave the same way I would add a new category for it.

@episkorowski
Copy link
Contributor Author

Added a category for images and put media-channel in the "Redirects to HTTP" section.

Reordered targets and subdomains in the Invalid certificate section to meet standards.

@J0WI J0WI closed this Jan 23, 2018
@J0WI J0WI reopened this Jan 23, 2018
Copy link
Contributor

@J0WI J0WI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@J0WI J0WI merged commit 4f9df9d into EFForg:master Jan 23, 2018
@episkorowski episkorowski deleted the nationalgeographic.com branch January 24, 2018 03:23
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants