Twitter - API limit? #544

Closed
panhartstuff opened this issue Dec 26, 2019 · 38 comments

Comments

@panhartstuff

panhartstuff commented Dec 26, 2019

It seems that there is an API limit for Twitter? Downloading https://twitter.com/k3_spaceybear/media doesn't download anything beyond: https://twitter.com/i/web/status/802921901689425921
twMediaDownloader and Hitomi Downloader don't seem to have this problem, though.
Is there any way around this limit?

EDIT:
I don't know how the code from the twMediaDownloader above works because I'm not bigbrained enough to read through them, but I found some simple solutions searching online:
https://github.com/bpb27/twitter_scraping
https://github.com/MatthewWolff/TwitterScraper
https://stackoverflow.com/questions/8471489/find-all-tweets-from-a-user-not-just-the-first-3-200
It seems the general opinion is to use a Selenium headless browser and query a Twitter search.
I like MatthewWolff's solution, where he searches month by month. It seems very effective, since you're probably not going to get blocked by Twitter's scrolling limit that way. Even if we are, we can still narrow the range from monthly to weekly or even daily.

As an example, these are the results from k3_spaceybear's join date (March 2009)
from:k3_spaceybear since:2009-03-01 until:2009-03-31

Maybe there are better solutions, but this looks pretty feasible
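
A minimal sketch of the month-by-month idea, assuming only the `from:`, `since:` and `until:` search operators shown above. The function name and the date range are made up for illustration, and whether `until:` includes the boundary day is not established in this thread, so each window simply ends on the first day of the following month.

```python
from datetime import date


def monthly_queries(user, start, end):
    """Yield one 'from:USER since:... until:...' query per calendar month.

    Each window runs from the first of a month to the first of the next,
    so consecutive windows share a boundary date rather than leaving a gap.
    """
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        since = date(year, month, 1)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        until = date(year, month, 1)
        yield f"from:{user} since:{since.isoformat()} until:{until.isoformat()}"


# Hypothetical range: the account's join month through the end of 2009.
queries = list(monthly_queries("k3_spaceybear", date(2009, 3, 1), date(2009, 12, 31)))
print(queries[0])
```

Narrowing to weekly or daily windows would only change how the boundaries are stepped.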

@wankio
Contributor

wankio commented Dec 30, 2019

See https://developer.twitter.com/en/docs/basics/rate-limiting. I think they changed the limit recently; before the change, I could download almost all reachable tweets.

@panhartstuff
Author

I'm not sure the rate limit is the cause; I can download dozens of Twitter users at a time just fine. There's also no error, so gallery-dl probably assumes it has reached the end of the timeline when it hasn't.

@alice945
Contributor

I had this issue recently and I used the search method to fix it (take a look at #448). I skimmed through the code of twMediaDownloader and it looks like they might be using this same method. To get all the images from big accounts, I first download normally and then use a search from the date that I'm blocked at to the date of the account creation, with "filter:images" to remove the text posts. You may need to break it up into multiple searches if the account is super big but I haven't run into that yet. Make sure that when you are searching you click on "Latest", the "Top" results sometimes mix up the order and miss some tweets.

@panhartstuff
Author

panhartstuff commented Jan 11, 2020

Thank you @alice945, I just realized that gallery-dl supports twitter search.
I tried it out and I was able to get more tweets this way.

But I want to ask more about your code since I don't quite get a couple of things:
For example, with searching: "https://twitter.com/search?q=from%3Audon0531%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typed_query&f=live"
I was missing a few of their older tweets. So I needed to get the id of the last tweet from that search and do a max_id search:
"https://twitter.com/search?q=from%3Audon0531%20max_id%3A260351044213698560%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typed_query&f=live"

Does this have something to do with the max_position / min_position you were talking about? Is this a bug in the code or is this just how Twitter behaves?
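
For anyone repeating this by hand, here is a rough sketch of how the two URLs above can be generated. It only reproduces the query terms and parameters already quoted in this comment; the helper name is made up, and nothing here is taken from gallery-dl's code.

```python
from urllib.parse import quote


def search_url(user, max_id=None):
    """Build a search URL shaped like the two quoted in this comment."""
    terms = [f"from:{user}"]
    if max_id is not None:
        # Continue from a known tweet ID.
        terms.append(f"max_id:{max_id}")
    terms += ["exclude:retweets", "filter:media", "-filter:periscope"]
    # safe="" percent-encodes ":" as well, matching the URLs in this thread.
    query = quote(" ".join(terms), safe="")
    return f"https://twitter.com/search?q={query}&src=typed_query&f=live"


print(search_url("udon0531"))
print(search_url("udon0531", max_id=260351044213698560))
```

The two printed URLs should match the ones quoted above character for character.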

@alice945
Contributor

Like I said in my previous post, make sure you are using "Latest" when searching. Twitter does a lot of weird things with the "Top" search and it's unreliable. Try using this search url and let me know if it gets all the posts: https://twitter.com/search?f=tweets&vertical=default&q=from%3Audon0531%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typd

@panhartstuff
Author

panhartstuff commented Jan 11, 2020

Oh I thought the "&f=live" ensures you're searching from Latest?
Clicking your link actually sends me to the "Top" tab, while my link sends me directly to the "Latest" tab

@alice945
Contributor

The "f=tweets" at the beginning makes it Latest. I think you might be using the new Twitter interface, which messes up the URL because it searches differently. gallery-dl uses the old interface, so use that for getting search links.

@panhartstuff
Author

panhartstuff commented Jan 11, 2020

Thank you, I didn't know gallery-dl was using the old interface.

But it seems that even with that URL, the search ends in this tweet: https://twitter.com/i/web/status/306762324792983552
forcing me to do the following max_id search:
"https://twitter.com/search?f=tweets&vertical=default&q=from%3Audon0531%20max_id%3A306762324792983552%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typed_query"
in order to get the rest of the tweets.

Not sure if I should use this method or max_position, but I don't really know how the latter works.

@panhartstuff
Author

panhartstuff commented Jan 11, 2020

Ok, I think this is starting to look like a bug. Another URL to try:
https://twitter.com/search?f=tweets&q=from%3Ak3_spaceybear%20max_id%3A802921901689425921%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typd

gallery-dl --no-download "https://twitter.com/search?f=tweets&q=from%3Ak3_spaceybear%20max_id%3A802921901689425921%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typd"

The gallery-dl run ends as early as a tweet from 2016: https://twitter.com/i/web/status/760878547590258688
But simply browsing the page in a web browser for a short while gets you tweets from at least 2015.
(Note that this discrepancy happens without max_id too, I added it to make examining the issue a bit faster for other people)

I think there's something wrong in the way that gallery-dl navigates the search page. I noticed that when speeding through the search results by holding "PgDown", there's sometimes a temporary fake "Back to top" message as if it has reached the end but it quickly disappears and the page loads the next batch of results. Maybe gallery-dl is misinterpreting this message.
EDIT: I rechecked and can confirm that the temporary "Back to top" message appears exactly at the tweet where gallery-dl stops.

A note from another user from a Hitomi-Downloader issue might be relevant:

> When it reaches the limit of the search page, it just continues sending next request.
> I may be wrong, but seems like your program stops when new_latent_count=0 or items_html is empty, but that's not the end, it still sends new min_position. Addon just keeps sending new requests until min_position begins to repeat in sequence, and gets older posts without new search queries.
> So just check min_position, when it repeats like 5 times in sequence thats probably the true end.
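
The "repeats like 5 times" heuristic from that note could be sketched roughly as follows. The function name and the cursor values are invented for illustration; this is not code from either downloader.

```python
def looks_like_true_end(positions, repeats=5):
    """Return True if the last `repeats` min_position values are identical."""
    if len(positions) < repeats:
        return False
    tail = positions[-repeats:]
    return all(p == tail[0] for p in tail)


# Made-up cursor values, for illustration only.
seen = ["a", "b", "c", "c", "c", "c", "c"]
print(looks_like_true_end(seen))
print(looks_like_true_end(seen[:5]))
```

The first call reports an end because the cursor repeated five times in a row; the second does not, because only three of the five values match.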

@alice945
Contributor

Yup, this looks like a bug. I tried downloading the url I gave you and it indeed stopped just like you said. A quick look at the json the request returns reveals that the next "min_position" exists but since "has_more_items" is false, gallery-dl stops. The fix for this is to check to make sure that both "min_position" is null and "has_more_items" is false before stopping. I opened a pull request fixing this issue #573. Using this I was able to download everything with that url. Check and see if it works for you.
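
As an illustration of the stop condition described here (not the actual patch in #573), a pagination loop might look like the sketch below. `fetch_page` and the canned responses are stand-ins; only the "min_position" and "has_more_items" field names come from this thread.

```python
def iter_pages(fetch_page):
    """Yield pages until the response has no next cursor AND no more items."""
    position = None
    while True:
        data = fetch_page(position)
        yield data
        position = data.get("min_position")
        # Stop only when both signals agree; "has_more_items" alone can be
        # false while a usable "min_position" is still returned.
        if not data.get("has_more_items") and position is None:
            break


# Stand-in for the real request: three canned responses keyed by cursor.
PAGES = {
    None: {"items": [1, 2], "has_more_items": True, "min_position": "p1"},
    "p1": {"items": [3], "has_more_items": False, "min_position": "p2"},
    "p2": {"items": [4], "has_more_items": False, "min_position": None},
}

items = [i for page in iter_pages(PAGES.__getitem__) for i in page["items"]]
print(items)
```

With the old behaviour the loop would have stopped after the second page, where "has_more_items" first becomes false.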

@panhartstuff
Author

Thank you so much! I tested it around a bunch of users and it's working perfectly!

One minor edge case though, when a user's media timeline has only one page, it'll spit out this error:

Traceback (most recent call last):
  File "c:\users\name\appdata\local\programs\python\python38-32\lib\site-packages\gallery_dl\job.py", line 49, in run
    for msg in self.extractor:
  File "c:\users\name\appdata\local\programs\python\python38-32\lib\site-packages\gallery_dl\extractor\twitter.py", line 45, in items
    for tweet in self.tweets():
  File "c:\users\name\appdata\local\programs\python\python38-32\lib\site-packages\gallery_dl\extractor\twitter.py", line 234, in _tweets_from_api
    if not data["has_more_items"] and data["min_position"] == None:
KeyError: 'min_position'

Probably because min_position wouldn't exist in that situation.
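
One possible way to avoid the KeyError, sketched under the assumption that a single-page response simply omits "min_position" (the helper name is made up):

```python
def reached_end(data):
    """True only when there are no more items and no next cursor."""
    # dict.get returns None for a missing key instead of raising KeyError.
    return not data.get("has_more_items") and data.get("min_position") is None


# A single-page response, assumed here to omit "min_position" entirely.
single_page = {"has_more_items": False}
print(reached_end(single_page))

# The case from the earlier fix: the flag is false but a cursor is present.
print(reached_end({"has_more_items": False, "min_position": "abc"}))
```

Using `is None` rather than `== None` is the usual Python idiom for this comparison.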

@panhartstuff
Author

panhartstuff commented Jan 12, 2020

Going to tack on some notes about weird Twitter search edge cases just in case someone is reading:

  • In certain cases, using no filters (filter:media, etc.) yields a more complete result. I'm still not sure how universally applicable this is, but I've confirmed it in a few cases. (This makes the search much slower, since gallery-dl has to go through a lot more tweets.)

  • Cycling between different user agents can also reveal more results that would otherwise be invisible

  • I'm not sure if the new UI and the legacy UI will yield different results, but right now I'm just hamming in both URL types and merging the results together just for completeness

All in all twitter scraping is a nightmare

@alice945
Contributor

> Probably because min_position wouldn't exist in that situation.

Whoops, didn't realize that. I fixed it and it should be fine now. Check and see if it works.

> In certain cases, using no filters (filter:media, etc.) will yield a more complete result. I'm still not sure how universally applicable this is, but I've confirmed it in a few cases. (this will make the search way slower since gallery-dl has to go through a lot more tweets)

While testing my update I ran into this: some images would download from the media timeline but not from the search timeline. I compared both timelines side by side, and indeed one image doesn't show up in the search. After removing filter:media it showed up, but as a pic.twitter.com link. For some reason this link is recognized as media and shows up as a normal picture in the media timeline, but in the search timeline it is not and gets filtered out. Downloading the search timeline without the filter got the image normally. However, sometimes even that doesn't work, and no matter what I try, they don't appear in search. I found 3 examples of this from udon0531: 1183758726504730625, 1183709762354962432, 1183699603733897216. Let me know if you find out anything more about this.

> Cycling between different user agents can also reveal more results that would otherwise be invisible

Do you have any examples of this? This might somehow be related to the new UI and the issue above. Currently gallery-dl uses the legacy UI by switching the user agent to IE 11. Maybe the new UI doesn't have the issue above and by cycling between user agents, you are switching between the legacy and new UI, therefore getting different results. It doesn't seem that way to me from my tests but I thought I'd bring it up.

> I'm not sure if the new UI and the legacy UI will yield different results, but right now I'm just hamming in both URL types and merging the results together just for completeness

I just remembered that gallery-dl technically only uses the query part of the search URL you provide and plugs that into "https://twitter.com/i/search/timeline?f=tweets&q={}". It doesn't really matter what else is attached to the URL other than the query. Both UIs will therefore give you the same results.
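
To illustrate the point, here is a rough sketch of that behaviour, not gallery-dl's actual code. The function name is made up; the template is the one quoted above, and the two example URLs are shortened variants of the legacy and new-UI links from this thread.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Endpoint template as quoted in the comment above.
TIMELINE = "https://twitter.com/i/search/timeline?f=tweets&q={}"


def timeline_url(search_url):
    """Keep only the q parameter of a search URL and drop everything else."""
    params = parse_qs(urlsplit(search_url).query)
    query = params["q"][0]  # parse_qs already percent-decodes the value
    return TIMELINE.format(quote(query, safe=""))


legacy = "https://twitter.com/search?f=tweets&vertical=default&q=from%3Audon0531%20filter%3Amedia&src=typd"
new_ui = "https://twitter.com/search?q=from%3Audon0531%20filter%3Amedia&src=typed_query&f=live"
print(timeline_url(legacy) == timeline_url(new_ui))
```

Both inputs reduce to the same timeline URL, which is why the extra parameters make no difference.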

mikf pushed a commit that referenced this issue Jan 14, 2020
* [twitter] Fix stop before real end

Fix for #544. Makes sure that it really reached the end by checking that both "min_position" is null and "has_more_items" is false before stopping.

* [twitter] Fix stop before real end (update)
@panhartstuff
Author

> I found 3 examples of this from udon0531: 1183758726504730625, 1183709762354962432, 1183699603733897216

I managed to obtain those three images but I'm not sure precisely what caused it.
The way I scrape is that I go through the media timeline first, get the final id, then continue with the search timeline.
I was also cycling (using "--no-download") between ie, opera, chrome, firefox and safari user agents using fake-useragent, then I combine the output and dedupe it before sending it back to gallery-dl for the actual download.
At the time of the download I was still using the media filter too.
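
The combine-and-dedupe step can be as simple as the sketch below. It assumes each run's output has already been read into a list of URLs; the function name and the example.org URLs are placeholders.

```python
def merge_unique(*runs):
    """Merge several lists of URLs, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for run in runs:
        for url in run:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Placeholder URLs standing in for the output of separate runs.
run_ie = ["https://example.org/a.jpg", "https://example.org/b.jpg"]
run_firefox = ["https://example.org/b.jpg", "https://example.org/c.jpg"]
print(merge_unique(run_ie, run_firefox))
```

Keeping the original order makes it easier to see which run first surfaced a given file.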

> Do you have any examples of this?

This particular tweet: https://twitter.com/udon0531/status/785302504653197312
I couldn't access it even through the advanced search (nor the media tab): https://twitter.com/search?q=(from%3Audon0531)%20max_id%3A867420736897531905%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typed_query
Then I added the user agent thing to my workflow and the images showed up in my downloads. My method was very brute-forcey so I couldn't really pinpoint which user agent the image finally showed up on.
It could be that the image would show up had I not used filter, I'll try to do some more verification.

> I just remembered that gallery-dl technically only uses the query part of a search url

Oh thanks, that saves so much time since I was going through different variations of the url.

I did a few more tests without using media filter and it seems like it's working pretty well, it's scraping the right amount of stuff. Still need to do a lot more verification though.

@alice945
Contributor

> I managed to obtain those three images but I'm not sure precisely what caused it.
> The way I scrape is that I go through the media timeline first, get the final id, then continue with the search timeline.

I wasn't able to get those posts because I was just using the search timeline. Since you got the media timeline first, you got those 3 just fine.

> This particular tweet: https://twitter.com/udon0531/status/785302504653197312
> I couldn't access it even through the advanced search (nor the media tab): https://twitter.com/search?q=(from%3Audon0531)%20max_id%3A867420736897531905%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typed_query

Strange, I'm able to see that post just fine. I used this link: https://twitter.com/search?f=tweets&q=from%3Audon0531%20max_id%3A785302504653197312%20exclude%3Aretweets%20filter%3Amedia&src=typd
The user agent I'm using is: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
If you can find reproducible results with specific user agents that would be great. I'll keep testing and see what I can find out.

@kattjevfel
Contributor

Ran into this issue myself today, is there really no way to work around it within gallery-dl?

@mikf
Owner

mikf commented Oct 15, 2020

@kattjevfel If search queries by date etc. or anything else mentioned in this issue don't help, then no, there is nothing that I know of.

@ZenythFactor

Oh damn, this wasn't resolved yet?
I ran into a limit as well: at most 2,308 files from media tweets.

and yeah, I didn't get this problem with TMD, but man having to go back to that processor-eating nightmare again...one by one downloads...date by date...zip by zip....MB by MB....
has anything changed yet @mikf ?

@mikf
Owner

mikf commented Aug 23, 2021

@ZenythFactor It's not really possible to resolve as long as gallery-dl uses the unofficial API of the Twitter website itself. You can try your luck with search results, those usually go further than the media timeline. See also #1396 (comment).

@ZenythFactor

ZenythFactor commented Aug 23, 2021

> @ZenythFactor It's not really possible to resolve as long as gallery-dl uses the unofficial API of the Twitter website itself. You can try your luck with search results, those usually go further than the media timeline. See also #1396 (comment).

I'll try.
I also assume there's no alternative for /likes either, correct?

@mikf
Owner

mikf commented Aug 23, 2021

Correct, there isn't.

@ZenythFactor

> Correct, there isn't.

Aw shoot.
Maybe I could pitch an idea for that?

Have you tried adding an option to download images within a specific date range? Like "start from XX-XX-XXXX to YY-YY-YYYY"?
TMD has that option, and I've been using it to download images in waves of dates.

When a wave is done, the date it stopped on is set as the ending date, letting me proceed with the next wave of downloads.

Could you do something like that, please? It could be a temporary solution to the limited API problem until it's figured out.

@Hrxn
Contributor

Hrxn commented Sep 2, 2021

@ZenythFactor is TMD on GitHub? Or somewhere similar?

@ZenythFactor

> @ZenythFactor is TMD on GitHub? Or somewhere similar?

Sadly not, AFAIK.
I also presume it's Japanese-based.
I'll recheck however.

@ZenythFactor

Good news, @Hrxn, I'm wrong.
He does have a github!

https://github.com/furyutei/twMediaDownloader
@mikf you should check this out too if possible for #544 (comment)

@mikf
Owner

mikf commented Sep 3, 2021

@ZenythFactor limiting results by date can be done via search (https://twitter.com/search?lang=en&q=(from%3Agithub)%20until%3A2021-07-31%20since%3A2021-05-01) or with gallery-dl's --filter (--filter "datetime(2021, 5, 1) <= date <= datetime(2021, 7, 31) "), but these sadly don't help with getting more Tweets from a timeline.
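
Since the `--filter` value is a Python expression, the comparison itself can be tried out in plain Python. The sketch below assumes `date` is a naive `datetime`, as the expression above suggests; the helper name is made up.

```python
from datetime import datetime

START = datetime(2021, 5, 1)
END = datetime(2021, 7, 31)


def in_range(date):
    """Same chained comparison as the --filter expression above."""
    return START <= date <= END


print(in_range(datetime(2021, 6, 15)))         # inside the range
print(in_range(datetime(2021, 7, 31, 12, 0)))  # later on the end date
```

Note that `datetime(2021, 7, 31)` is midnight at the start of that day, so a timestamp later on 31 July falls outside the range; use `datetime(2021, 8, 1)` with `<` if the whole day should be included.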

@ZenythFactor

> @ZenythFactor limiting results by date can be done via search (https://twitter.com/search?lang=en&q=(from%3Agithub)%20until%3A2021-07-31%20since%3A2021-05-01) or with gallery-dl's --filter (--filter "datetime(2021, 5, 1) <= date <= datetime(2021, 7, 31) "), but these sadly don't help with getting more Tweets from a timeline.

So it's stuck at the api limit too?

@mikf
Owner

mikf commented Sep 4, 2021

Yeah, using https://twitter.com/search?q=from:github without any filters is going to return the same amount of Tweets as going step-by-step with

  • https://twitter.com/search?q=from:github since:2021-07-01 until:2021-10-01
  • https://twitter.com/search?q=from:github since:2021-04-01 until:2021-07-01
  • https://twitter.com/search?q=from:github since:2021-01-01 until:2021-04-01
  • etc

Restricting by date doesn't get around any limit imposed by Twitter.

The limit for searches is quite a bit looser compared to other timelines, but for some reason it doesn't work equally for all accounts from what I've noticed: Sometimes it returns 50k files for one account and sometimes only ~100 for another, even though /media returns 2k for both.

@ZenythFactor

Soooo any updates on the API by any chance?

@mikf
Owner

mikf commented Oct 15, 2022

Well, since version 1.22.0 / commit 3346f58, gallery-dl uses a strategy similar to TMD for twitter.com/USER URLs (/media + /search), which in my tests produced the same results as TMD.

The previous functionality for twitter.com/USER got moved to twitter.com/USER/tweets.

@ZenythFactor

> Well, since version 1.22.0 / commit 3346f58, gallery-dl uses a strategy similar to TMD for twitter.com/USER URLs (/media + /search), which in my tests produced the same results as TMD.
>
> The previous functionality for twitter.com/USER got moved to twitter.com/USER/tweets.

Aha, that's great. That should mean the API limit isn't the issue anymore, right?

@mikf
Owner

mikf commented Oct 15, 2022

It should, yeah. Hopefully. That was the reason this was implemented.
You still need to be logged in to download "adult"-rated content, or enable the syndication option (#2354).

@ZenythFactor

ZenythFactor commented Oct 15, 2022

I'm gonna test this out and report back soon, thanks!
Edit: Yup it works!!

@Twi-Hard

I was just thinking about opening an issue because most tweets in a large search I did didn't download. It only downloaded 30-something thousand tweets out of over 1 million. It started scraping just the pages and not the tweets. Could this be the same issue?

@mikf
Owner

mikf commented Dec 3, 2022

@Twi-Hard I'm pretty sure searching for Tweets with gallery-dl returns the exact same results as it would on Twitter itself, it's just that Twitter limits its results. You've probably already done this, but just in case you haven't: try using max_id and min_id or since and until and go chunk by chunk.
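
Going chunk by chunk with max_id could be sketched like this. The `search` callable and the fake ID list are stand-ins for a real search run, and the sketch assumes max_id is inclusive, which is why it steps one below the oldest ID seen.

```python
def fetch_all(search, user):
    """Repeat a search with a shrinking max_id until a chunk comes back empty."""
    collected = []
    max_id = None
    while True:
        query = f"from:{user}"
        if max_id is not None:
            query += f" max_id:{max_id}"
        chunk = search(query)
        if not chunk:
            return collected
        collected.extend(chunk)
        # Assumes max_id is inclusive, so step just below the oldest ID seen.
        max_id = min(chunk) - 1


# Stand-in search: returns at most 3 IDs per call, newest first.
ALL_IDS = [90, 80, 70, 60, 50, 40, 30]


def fake_search(query):
    limit = None
    for term in query.split():
        if term.startswith("max_id:"):
            limit = int(term.split(":", 1)[1])
    matching = [i for i in ALL_IDS if limit is None or i <= limit]
    return matching[:3]


print(fetch_all(fake_search, "example_user"))
```

The same loop shape works with since/until windows instead of IDs; only the way the next chunk boundary is computed changes.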

@mikf mikf closed this as completed Dec 3, 2022
@Twi-Hard

Twi-Hard commented Dec 3, 2022

I'm able to get the full list of tweets with snscrape which also just scrapes search results.

mikf added a commit that referenced this issue Dec 15, 2022
Only stop when list of all returned Tweets is empty
instead of when no valid Tweet was found.
@mikf
Owner

mikf commented Jan 5, 2023

I tried looking through the snscrape Twitter code but couldn't really find anything different from what gallery-dl already does, except 90a9c07. Maybe that makes a difference.

@Scrooge200FES

I'm trying to download from a user with 1.4k media tweets, but gallery-dl only gets around 300 images. Is this an API limit?

9 participants