Add support for instagram.com user profiles and pages #134
Conversation
Whoops, sorry for that! (I've run the single tests for instagram but not the whole test suite.) I will fix that ASAP!
The extractor scrapes `instagram.com/<user>` timelines and `instagram.com/p/<shortcode>` pages by mimicking the behaviour of a web browser and extracting the sharedData JSON of the single pages. Please note that this means that for user timelines we also do an extra request to the `instagram.com/p/<shortcode>` page, but this permits having consistent (and complete) information about the fetched media. The MD5 logic used for X-Instagram-GIS was documented in <https://stackoverflow.com/questions/49786980/>
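For illustration, here is a minimal sketch of the two mechanisms described above, assuming Python and hypothetical helper names (this is not the actual extractor code): pulling the sharedData JSON out of a fetched page, and deriving the X-Instagram-GIS header with the MD5 scheme from the Stack Overflow answer linked above.

```python
import hashlib
import json
import re

def extract_shared_data(page_html):
    # Each Instagram page embeds its state as
    # "window._sharedData = {...};" inside a <script> tag;
    # grab the JSON literal and decode it.
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.+?\});</script>", page_html)
    return json.loads(match.group(1)) if match else None

def instagram_gis(rhx_gis, variables_json):
    # X-Instagram-GIS is the MD5 hex digest of "<rhx_gis>:<variables>",
    # where rhx_gis comes from a previously fetched sharedData blob and
    # variables_json is the JSON-encoded GraphQL query-variables string
    # exactly as sent in the request.
    payload = "{}:{}".format(rhx_gis, variables_json)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```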
I have fixed the pattern. Unfortunately it seems that the first two tests of the instagram extractor still fail:
URLs returned by instagram seem not to be stable, so avoid testing for them and instead test for the keywords returned.
Also check the count of media returned.
I have adjusted the tests in order to avoid checking the URLs. I have added a second commit to also check the count of media returned.
Thanks for the instagram support! It mostly looks really solid, but there are some small things that should be changed.
For example, extractor and subcategory names: the current ones (`postpage`, `profilepage`) are a bit too verbose, I think. Just `post` and `profile` are better, but I would even consider calling them `image` and `user` to be in line with other extractors. Your choice.
I guess that the URLs are not stable. Any suggestion on how to address that?
Use `pattern` instead of `url` and specify a regex that should match all returned URLs. (example)
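As a sketch, a test entry along those lines could look like the following (the regex and count are made-up values for illustration, following gallery-dl's test-tuple convention):

```python
("https://www.instagram.com/instagram/", {
    # exact URLs are unstable (the subdomain changes between
    # requests), so match them against a regex instead ...
    "pattern": r"https?://[^/]+\.(cdninstagram\.com|fbcdn\.net)/",
    # ... and additionally verify the number of returned media
    "count": 12,
}),
```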
- Change description and subcategories to generate a better description in docs/supportedsite.rst
- Remove unneeded `InstagramExtractor.__init__()`
- Use `text.parse_int()` instead of directly using `int()` (the former is more robust)
- Use `self.request().json()` instead of `json.loads()` on `self.request().text`
- Add `pattern:` to check the URLs where we do not have stable URLs. It seems that only the subdomain is not stable.

Thanks to @mikf!
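A small before/after sketch of the last two code changes, as they might appear inside an extractor method (the URL variable and GraphQL field names here are just examples, not the actual patch):

```python
# before: decode the response body manually and convert with int(),
# which raises on malformed input
data = json.loads(self.request(url).text)
count = int(data["user"]["edge_owner_to_timeline_media"]["count"])

# after: let the response object decode its own JSON and use
# text.parse_int(), which falls back to a default on malformed input
data = self.request(url).json()
count = text.parse_int(
    data["user"]["edge_owner_to_timeline_media"]["count"])
```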
Hello Mike,
Mike Fährmann writes:

> mikf commented on this pull request.
>
> Thanks for the instagram support! It mostly looks really solid, but there are some small things that should be changed.
>
> [...]
Thank you very much for your review! Sure, let's adjust it!
I have pushed everything via a separate commit and rerun the tests locally... I hope there will not be new regressions in the CI; I'll keep an eye on it and adjust accordingly if there are any!
Thank you very much again, Mike, for the review and all the feedback!
Keep up the good work!
EDIT: I've copy-pasted all previous comments from the email into the individual review comments.
Very nice, thank you very much, both of you! What you call [...]

Additional bonus comment:
The current known (and documented via a TODO comment) limitation of the extractor is that it treats `GraphSidecar` media (which contain multiple images) like `GraphImage`. So, for `GraphSidecar` media - at the moment - only one image is retrieved, not all of them. (I would prefer to add support for it once this patch is reviewed and merged!)

Thank you very much!
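For context on what full `GraphSidecar` support would involve: in Instagram's GraphQL data, a sidecar node carries its images under `edge_sidecar_to_children`. A rough sketch of how the media list could be expanded later (a hypothetical helper, not part of this patch):

```python
def expand_media(media):
    # A GraphSidecar node groups several child images/videos;
    # yield each child node individually. Plain GraphImage and
    # GraphVideo nodes are yielded as-is.
    if media.get("__typename") == "GraphSidecar":
        for edge in media["edge_sidecar_to_children"]["edges"]:
            yield edge["node"]
    else:
        yield media
```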