Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avatars: Fetch avatars for websites with taken usernames #42

Merged
merged 2 commits into from
Dec 21, 2017
Merged

Avatars: Fetch avatars for websites with taken usernames #42

merged 2 commits into from
Dec 21, 2017

Conversation

nalinbhardwaj
Copy link
Contributor

@nalinbhardwaj nalinbhardwaj commented Dec 20, 2017

Adds get_avatar to fetch avatars from all websites. Currently works for everything except Behance. Only works for Facebook if profile == visible(as expected). The function expects the user's page to exist, but handles most errors gracefully even if it doesn't(['pinterest', 'gitlab'] are offenders of this).

Adds appropriate tests, maybe they should be mixed with test_username_check, but I think the repo could use some more breaking down into parts, perhaps this could be a start.

Closes #35

username_api.py Outdated
import sys
import re
import yaml

app = Flask(__name__)
cors = CORS(app)
cors = CORS(app)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change

username_api.py Outdated
app.run(host='0.0.0.0', port=8521)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change, #41 will take care of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, none of these are conflicting/breaking changes, I guess it's fine for them to be here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're gonna keep them here, you should do them in a new commit so there's not a bunch of random changes in a commit that adds a new feature 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, it's just a newline at EOF, nonetheless, I've undone those changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, sorry for being nitpicky, it's just generally distracting from the purpose of the PR, and makes the git blame look weird.

@@ -1,5 +1,8 @@
---
facebook:
avatar_usernames:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need different usernames for facebook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in the PR description, profiles on Facebook can be “hidden”, therefore having no avatar to display. These are public profile links.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

websites.yml Outdated
type: meta
property: og:image
url: null
behance:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use behance: false , and order the keys alphabetically, so behance is at the top of the avatar group.

websites.yml Outdated
pinterest:
type: regex
property: image_xlarge_url
url: null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not have empty keys in yaml unless strictly necessary.

websites.yml Outdated
@@ -46,3 +46,40 @@ username_patterns:
invalid_patterns:
- "\\.(com|net)"
- "(\\.)\\1{1,}" # consecutive '.' not allowed
avatar:
pinterest:
type: regex
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type isnt needed. it can be inferred. we need property: , url: , otherwise the entries can be short and clear gitlab: opengraph , as that is the most common.


class TestGet_avatar(object):

@pytest.mark.parametrize('user', data['github']['taken_usernames'])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have better test parameterisation in this PR? (specifically for the newly added tests) There is lot of duplicated logic already in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will look into parameterisation. I would assume it's okay to use the same methods/parameterisation in the username_check tests. Is that fine?(Of course, I'll do those in a separate PR later)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If more parameterisation is not easy, it can be deferred to GCI task #36

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to shift us from pytest to nosetests as part of #36 I think. It has a pretty powerful library parameterized, which although supports py.test, isn't as well supported(not as pretty and I see quite a few bug reports) (see FAQ for more details).

This is probably a large change(to builds etc.) , so I'd want it to be in it's own PR, but let me know whatever you decide.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nosetests is dead, and nose2 is stillborn.
It works well in pytest. The caveat about the generated method names is of little consequence as usually test method names are hidden, and name_func can be used to create new names that are meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we okay dealing parameterized/#34 then?
Or can we just shift to unittest(since we probably won't need pytest with that)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not our problem. pytest devs are not going to delete remove support for that until their main plugins are fixed. That is just an early warning system, and the alarm has been heard correctly.

username_api.py Outdated
soup = BeautifulSoup(response.text, 'html.parser')
if data['type'] == 'meta':
# Look in metadata for `property` attributed image.
result = [item.attrs['content'] for item in soup('meta') if item.has_attr('property') and item.attrs['property'].lower()==data['property']]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split this line

username_api.py Outdated
result = result[0]
elif data['type'] == 'regex':
# Searches for "`property`": "`link`"
regex = re.compile('[\'\"]' + re.escape(data['property']) + '[\'\"]:(\s)?[\'\"](?P<link>[^\s]+)[\'\"]')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split

username_api.py Outdated
@@ -12,11 +14,56 @@

patterns = yaml.load(open('websites.yml'))

def check_username(website, username):
url = patterns['urls'].get(website, 'https://{w}.com/{u}').format(
def get_url(website, username):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_profile_url ?

websites.yml Outdated
pinterest:
property: image_xlarge_url
gitlab:
property: og:image
Copy link
Collaborator

@jayvdb jayvdb Dec 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isnt a property. This is a standard. It is the standard. If you dont like gitlab: true or gitlab: opengraph to reflect using OpenGraph, then please use

gitlab:
  opengraph: true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayvdb : Not sure if I understand what you mean correctly but this is for future compatibility, if there is a website added in future which has twitter:image( see this ) but not og:image, all that needs to be done is just write twitter:image here(instead of og:image) for it to work.
This also allows for supporting arbitrary meta tags for the image.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you mean this should be renamed to meta-name or something?
Or that this is an unnecessary feature? Let me know if it's the second case, I'll remove it.

Copy link
Collaborator

@jayvdb jayvdb Dec 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I want is gitlab: true. OpenGraph is the default. No extra information is needed for this case. Any other alternative is an override. And that can be designed later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I'll shift this to

gitlab:
  opengraph: true

so we can still have url overrides easily.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gitlab: true just coexists fine with

facebook:
  url: ...

This is YAML.
You are using YAML to store data in a JSON-like structure. That is not YAML.

websites.yml Outdated
gitlab: opengraph
github: opengraph
tumblr: opengraph
behance: null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behance: false . That is more user-friendly . It has a clearer meaning.

username_api.py Outdated
if not result:
return None
result = result.group('link')
elif 'image' in response.headers.get('content-type'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.headers.get('content-type') can return None.

...startswith('image/')

username_api.py Outdated
if response.status_code == 404:
return None

soup = BeautifulSoup(response.text, 'html.parser')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be inside the relevant branch of the if ; it is an expensive object to create

Copy link
Collaborator

@jayvdb jayvdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only a few things; you can submit when they are done

@parameterized.expand(load_test_cases('with'),
testcase_func_name=custom_name_func)
def test_with_avatar(self, website, user):
if website == 'behance':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this belongs in load_test_cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we shouldn't even load those test cases? I'd rather keep a "skipped" notification no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can generate a skipped test in load_test_cases

pytest.skip("behance doesn't work")
link = username_api.check_username(website, user)['avatar']
response = r.get(link)
assert('image' in response.headers.get('content-type') or
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startswith(image/)

username_api.py Outdated
w=website,
u=username
)
u=username)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not a very good diff comparison, I have a fresh function in those lines, for some reason because of the repetition from the old one, git decides it's a modification of that. I'll change it nonetheless.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a very good comparison. :P
Splitting a function like this is very common.
not changing the lines in the middle helps code review and understanding what is changing and what isnt.

username_api.py Outdated
if not result:
return None
result = result.group('link')
elif (response.headers.get('content-type') and
Copy link
Collaborator

@jayvdb jayvdb Dec 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use response.headers.get('content-type', '').startswith('image/') to make it more readable and only perform slightly worse in the very unlikely uncommon scenario that one of these sites sends a response without a content-type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, that would still perform better, because every . is a perf hit using getattribute so your and here is doubling the number of times response.headers.get('content-type') occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, optimally, by locality of reference, the perf hit would be negligible, but we're dealing with python so performance^TM 🙃

websites.yml Outdated
@@ -46,3 +46,15 @@ username_patterns:
invalid_patterns:
- "\\.(com|net)"
- "(\\.)\\1{1,}" # consecutive '.' not allowed
avatar:
pinterest:
property: image_xlarge_url
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is property a good term for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about key if you don't like it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think key would be better since we're picking from json data. property seems vague here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

result = [item.attrs['content'] for item in soup('meta')
if item.has_attr('property') and
item.attrs['property'].lower() == 'og:image']
if not result or not result[0]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is check for both result and result[0] needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, some website returned an empty string for some reason. This is for that. (I can't remember which one, I think it was fb)

if not result:
return None
result = result.group('link')
elif response.headers.get('content-type', '').startswith('image/'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one applies for twitter?

Copy link
Contributor Author

@nalinbhardwaj nalinbhardwaj Dec 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. It redirects to homepage if the user doesn't exist or something else.


class TestGet_avatar(object):

@parameterized.expand(load_test_cases('with'),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do with_avatar and without_avatar here? Other than that this PR is good to go.

websites.yml Outdated
@@ -46,3 +46,15 @@ username_patterns:
invalid_patterns:
- "\\.(com|net)"
- "(\\.)\\1{1,}" # consecutive '.' not allowed
avatar:
pinterest:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alpha sort the websites?
(it isnt done elsewhere, but that is a cleanup for later.)

Nalin Bhardwaj added 2 commits December 22, 2017 00:17
Makes function get_avatar() and adds avatar related
info to websites.yml

Closes #35
Adds tests for get_avatar. Checks if
given links are images or not.
@manu-chroma
Copy link
Owner

good job @nalinbhardwaj 😄

@manu-chroma manu-chroma merged commit c738d8a into manu-chroma:master Dec 21, 2017
@nalinbhardwaj nalinbhardwaj deleted the avatars branch December 21, 2017 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants