Add support for instagram.com user profiles and pages #134
Conversation
Whoops, sorry for that! (I've run the single tests for instagram but not the whole test suite.) I will fix that ASAP!
The extractor scrapes `instagram.com/<user>` timelines and `instagram.com/p/<shortcode>` pages by mimicking the behaviour of a web browser and extracting the sharedData JSON of the single pages. Please note that this means that for user timelines we also do an extra request to the `instagram.com/p/<shortcode>` page, but this permits having consistent (and complete) information about the fetched media. The MD5 logic used for X-Instagram-GIS was documented in <https://stackoverflow.com/questions/49786980/>
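For illustration, here is a minimal sketch of the two mechanisms described above, assuming Python and hypothetical helper names (this is not the actual extractor code): pulling the sharedData JSON out of a fetched page, and deriving the X-Instagram-GIS header with the MD5 scheme from the Stack Overflow answer linked above.

```python
import hashlib
import json
import re

def extract_shared_data(page_html):
    # Each Instagram page embeds its state as
    # "window._sharedData = {...};" inside a <script> tag;
    # grab the JSON literal and decode it.
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.+?\});</script>", page_html)
    return json.loads(match.group(1)) if match else None

def instagram_gis(rhx_gis, variables_json):
    # X-Instagram-GIS is the MD5 hex digest of "<rhx_gis>:<variables>",
    # where rhx_gis comes from a previously fetched sharedData blob and
    # variables_json is the JSON-encoded GraphQL query-variables string
    # exactly as sent in the request.
    payload = "{}:{}".format(rhx_gis, variables_json)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```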
I have fixed the pattern. Unfortunately it seems that the first two tests of the instagram extractor still fail:
URLs returned by instagram seem not to be stable, so avoid testing for them and instead test for the keywords returned.
Also check the count of media returned.
I have adjusted the tests in order to avoid checking the URLs. I have added a second commit to also check the count of media returned.
Thanks for the instagram support! It mostly looks really solid, but there are some small things that should be changed.
For example, extractor and subcategory names: the current ones (`postpage`, `profilepage`) are a bit too verbose, I think. Just `post` and `profile` are better, but I would even consider calling them `image` and `user` to be in line with other extractors. Your choice.
I guess that the URLs are not stable. Any suggestion on how to address that?
Use `pattern` instead of `url` and specify a regex that should match all returned URLs. (example)
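As a sketch, a test entry along those lines could look like the following (the regex and count are made-up values for illustration, following gallery-dl's test-tuple convention):

```python
("https://www.instagram.com/instagram/", {
    # exact URLs are unstable (the subdomain changes between
    # requests), so match them against a regex instead ...
    "pattern": r"https?://[^/]+\.(cdninstagram\.com|fbcdn\.net)/",
    # ... and additionally verify the number of returned media
    "count": 12,
}),
```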
- Change description and subcategories to generate a better description in docs/supportedsite.rst
- Remove unneeded `InstagramExtractor.__init__()`
- Use `text.parse_int()` instead of directly using `int()` (the former is more robust)
- Use `self.request().json()` instead of `json.loads()` on `self.request().text`
- Add `pattern:` to check the URLs where we do not have stable URLs. It seems that only the subdomain is not stable.

Thanks to @mikf!
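A small before/after sketch of the last two code changes, as they might appear inside an extractor method (the URL variable and GraphQL field names here are just examples, not the actual patch):

```python
# before: decode the response body manually and convert with int(),
# which raises on malformed input
data = json.loads(self.request(url).text)
count = int(data["user"]["edge_owner_to_timeline_media"]["count"])

# after: let the response object decode its own JSON and use
# text.parse_int(), which falls back to a default on malformed input
data = self.request(url).json()
count = text.parse_int(
    data["user"]["edge_owner_to_timeline_media"]["count"])
```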
Hello Mike,
Mike Fährmann writes:

> mikf commented on this pull request.
>
> Thanks for the instagram support! It mostly looks really solid, but there are some small things that should be changed.
>
> [...]
Thank you very much for your review! Sure, let's adjust it!
I have pushed everything via a separate commit and rerun the tests locally... I hope there will not be new regressions in the CI; I'll keep an eye on it and adjust accordingly if there are any!
Thank you very much again, Mike, for the review and all the feedback!
Keep up the good work!
EDIT: I've copy-pasted all previous comments from the email into the individual review comments.
Very nice, thank you very much, both of you! What you call [...]

Additional bonus comment:
The current known (and documented via a TODO comment) limitation of the extractor is that it treats `GraphSidecar` media (which contain multiple images) like `GraphImage`. So, for `GraphSidecar` media - at the moment - only one image is retrieved, not all of them. (I would prefer to add support for it once this patch is reviewed and merged!)

Thank you very much!
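For context on what full `GraphSidecar` support would involve: in Instagram's GraphQL data, a sidecar node carries its images under `edge_sidecar_to_children`. A rough sketch of how the media list could be expanded later (a hypothetical helper, not part of this patch):

```python
def expand_media(media):
    # A GraphSidecar node groups several child images/videos;
    # yield each child node individually. Plain GraphImage and
    # GraphVideo nodes are yielded as-is.
    if media.get("__typename") == "GraphSidecar":
        for edge in media["edge_sidecar_to_children"]["edges"]:
            yield edge["node"]
    else:
        yield media
```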