Twitter - API limit? #544
Comments
See https://developer.twitter.com/en/docs/basics/rate-limiting. I think they changed the limit recently; before the change, I could download almost all reachable tweets. |
I'm not sure the rate limit is the cause; I can download dozens of Twitter users at a time just fine. There's also no error, so gallery-dl probably assumes it has reached the end of the timeline when it hasn't. |
I had this issue recently and I used the search method to fix it (take a look at #448). I skimmed through the code of twMediaDownloader and it looks like they might be using this same method. To get all the images from big accounts, I first download normally and then use a search from the date that I'm blocked at to the date of the account creation, with "filter:images" to remove the text posts. You may need to break it up into multiple searches if the account is super big but I haven't run into that yet. Make sure that when you are searching you click on "Latest", the "Top" results sometimes mix up the order and miss some tweets. |
Thank you @alice945, I just realized that gallery-dl supports twitter search. But I want to ask more about your code since I don't quite get a couple of things: Does this have something to do with the max_position / min_position you were talking about? Is this a bug in the code or is this just how Twitter behaves? |
Like I said in my previous post, make sure you are using "Latest" when searching. Twitter does a lot of weird things with the "Top" search and it's unreliable. Try using this search url and let me know if it gets all the posts: https://twitter.com/search?f=tweets&vertical=default&q=from%3Audon0531%20exclude%3Aretweets%20filter%3Amedia%20-filter%3Aperiscope&src=typd |
Oh I thought the "&f=live" ensures you're searching from Latest? |
The "f=tweets" at the beginning makes it Latest. I think you might be using the new Twitter interface, and that is messing up the URL because it searches differently. gallery-dl uses the old interface, so use that for getting search links. |
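For anyone who wants to build these links without clicking through the old interface, here is a rough sketch. It only assumes what is said above: the legacy search page takes "f=tweets" for Latest and the query in "q". The function name `latest_search_url` is made up for illustration and is not part of gallery-dl.

```python
from urllib.parse import quote, urlencode


def latest_search_url(query: str) -> str:
    """Build a legacy-interface search URL that sorts by Latest.

    Hypothetical helper: "f=tweets" selects Latest on the old
    interface, and the query text goes into "q".
    """
    params = {"f": "tweets", "q": query}
    # quote_via=quote encodes spaces as %20 instead of "+",
    # matching the style of the URLs posted in this thread.
    return "https://twitter.com/search?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(latest_search_url("from:udon0531 exclude:retweets filter:media"))
```

The resulting link can be passed to gallery-dl the same way as a link copied from the browser.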
Thank you, I didn't know gallery-dl was using the old interface. But it seems that even with that URL, the search ends in this tweet: https://twitter.com/i/web/status/306762324792983552 Not sure if I should use this method or max_position, but I don't really know how the latter works. |
Ok, I think this is starting to look like a bug. Another URL to try:
The gallery-dl run ends as early as a tweet from 2016: https://twitter.com/i/web/status/760878547590258688 I think there's something wrong in the way that gallery-dl navigates the search page. I noticed that when speeding through the search results by holding "PgDown", there's sometimes a temporary fake "Back to top" message, as if it has reached the end, but it quickly disappears and the page loads the next batch of results. Maybe gallery-dl is misinterpreting this message. A note from another user from a Hitomi-Downloader issue might be relevant:
|
Yup, this looks like a bug. I tried downloading the URL I gave you and it indeed stopped just like you said. A quick look at the JSON the request returns reveals that the next "min_position" exists, but since "has_more_items" is false, gallery-dl stops. The fix is to stop only when "min_position" is null and "has_more_items" is false. I opened pull request #573 to fix this issue. Using it, I was able to download everything with that URL. Check and see if it works for you. |
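To make the stop condition concrete, here is a minimal sketch of the check. It assumes the response has already been parsed into a dict with the two keys named above. The function name `reached_end` is made up for illustration, and this is not the actual code from the pull request.

```python
from typing import Any, Mapping


def reached_end(page: Mapping[str, Any]) -> bool:
    """Return True only when the response signals the real end.

    Hypothetical helper: stopping on "has_more_items" alone ends the
    run too early, so both conditions have to hold.
    """
    # .get() treats a missing "min_position" key the same as null.
    no_cursor = page.get("min_position") is None
    no_more = not page.get("has_more_items", False)
    return no_cursor and no_more


if __name__ == "__main__":
    # The case from this issue: a cursor exists but has_more_items is false.
    print(reached_end({"min_position": "cursor", "has_more_items": False}))
    # The real end: no cursor and nothing more to load.
    print(reached_end({"min_position": None, "has_more_items": False}))
```

Using `.get()` instead of indexing means a response without a "min_position" key counts as the end instead of raising a KeyError.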
Thank you so much! I tested it around a bunch of users and it's working perfectly! One minor edge case though, when a user's media timeline has only one page, it'll spit out this error:
Probably because min_position wouldn't exist in that situation. |
Going to tack on some notes about weird Twitter search edge cases just in case someone is reading:
All in all twitter scraping is a nightmare |
Whoops, didn't realize that. I fixed it and it should be fine now. Check and see if it works.
While testing my update I ran into this where some images would just not download in the search timeline vs the media timeline. I compared both timelines side by side and indeed just one image doesn't show up in the search. However, after removing
Do you have any examples of this? This might somehow be related to the new UI and the issue above. Currently gallery-dl uses the legacy UI by switching the user agent to IE 11. Maybe the new UI doesn't have the issue above and by cycling between user agents, you are switching between the legacy and new UI, therefore getting different results. It doesn't seem that way to me from my tests but I thought I'd bring it up.
I just remembered that gallery-dl technically only uses the query part of the search URL you provide and plugs it into "https://twitter.com/i/search/timeline?f=tweets&q={}". It doesn't really matter what else is attached to the URL other than the query. Both UIs will therefore give you the same results. |
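Roughly, the idea looks like the sketch below. This is only an illustration of "take the q parameter and plug it into the template" using the standard library, not how gallery-dl actually parses URLs. The function name `timeline_url_from_search` is an assumption.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Template quoted in the comment above.
TIMELINE_URL = "https://twitter.com/i/search/timeline?f=tweets&q={}"


def timeline_url_from_search(search_url: str) -> str:
    """Keep only the "q" parameter of a search URL (hypothetical helper).

    Everything else in the URL (f=live, src=typd, vertical=default, ...)
    is ignored, which is why both interfaces end up with the same request.
    """
    params = parse_qs(urlsplit(search_url).query)
    query = params.get("q", [""])[0]  # parse_qs already percent-decodes
    return TIMELINE_URL.format(quote(query, safe=""))


if __name__ == "__main__":
    old_ui = "https://twitter.com/search?f=tweets&vertical=default&q=from%3Audon0531%20filter%3Amedia&src=typd"
    new_ui = "https://twitter.com/search?q=from%3Audon0531%20filter%3Amedia&f=live"
    print(timeline_url_from_search(old_ui) == timeline_url_from_search(new_ui))
```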
* [twitter] Fix stop before real end
  Fix for #544. Makes sure that it really reached the end by checking that both "min_position" is null and "has_more_items" is false before stopping.
* [twitter] Fix stop before real end (update)
I managed to obtain those three images but I'm not sure precisely what caused it.
This particular tweet: https://twitter.com/udon0531/status/785302504653197312
Oh thanks, that saves so much time since I was going through different variations of the url. I did a few more tests without using media filter and it seems like it's working pretty well, it's scraping the right amount of stuff. Still need to do a lot more verification though. |
I wasn't able to get those posts because I was just using the search timeline. Since you got the media timeline first, you got those 3 just fine.
Strange, I'm able to see that post just fine. I used this link: https://twitter.com/search?f=tweets&q=from%3Audon0531%20max_id%3A785302504653197312%20exclude%3Aretweets%20filter%3Amedia&src=typd The user agent I'm using is |
Ran into this issue myself today, is there really no way to work around it within gallery-dl? |
@kattjevfel If search queries by date etc. or anything else mentioned in this issue don't help, then no, there is nothing that I know of. |
Oh damn, this wasn't resolved yet? and yeah, I didn't get this problem with TMD, but man having to go back to that processor-eating nightmare again...one by one downloads...date by date...zip by zip....MB by MB.... |
@ZenythFactor It's not really possible to resolve as long as gallery-dl uses the unofficial API of the Twitter website itself. You can try your luck with search results, those usually go further than the media timeline. See also #1396 (comment). |
I'll try. |
Correct, there isn't. |
Aw shoot. Have you considered adding an option to download images within a specific date range, like "start from XX-XX-XXXX to YY-YY-YYYY"? Once a run is done, its stop date would become the end date for the next run, letting me proceed with the next wave of downloads. Could you add something like that, please? It could be a temporary workaround for the API limit until it's figured out. |
@ZenythFactor is TMD on GitHub? Or somewhere similar? |
Sadly not, AFAIK. |
Good news, @Hrxn, I'm wrong. https://github.com/furyutei/twMediaDownloader |
@ZenythFactor limiting results by date can be done via search ( |
So it's stuck at the api limit too? |
Yeah, using
Restricting by date doesn't get around any limit imposed by Twitter. The limit for searches is quite a bit looser compared to other timelines, but for some reason it doesn't work equally for all accounts from what I've noticed: Sometimes it returns 50k files for one account and sometimes only ~100 for another, even though |
Soooo any updates on the API by any chance? |
Well, since version 1.22.0 / commit 3346f58, gallery-dl uses a strategy similar to TMD for The previous functionality for |
Aha, that's great. That should mean the API limit isn't the issue anymore, right? |
It should, yeah. Hopefully. That was the reason this was implemented. |
I'm gonna test this out and report back soon, thanks! |
I was just thinking about opening an issue because most tweets in a large search I did didn't download. It only downloaded 30-something thousand tweets out of over 1 million. It started scraping just the pages and not the tweets. Could this be the same issue? |
@Twi-Hard I'm pretty sure searching for Tweets with gallery-dl returns the exact same results as it would on Twitter itself, it's just that Twitter limits its results. You've probably already done this, but just in case you haven't: try using |
I'm able to get the full list of tweets with snscrape which also just scrapes search results. |
Only stop when list of all returned Tweets is empty instead of when no valid Tweet was found.
I tried looking through the snscrape Twitter code but couldn't really find anything different from what gallery-dl already does, except 90a9c07. Maybe that makes a difference. |
I'm trying to download from a user with 1.4k media tweets, but gallery-dl only gets around 300 images. Is this an API limit? |
It seems that there is an API limit for Twitter? Downloading https://twitter.com/k3_spaceybear/media doesn't download anything beyond: https://twitter.com/i/web/status/802921901689425921
twMediaDownloader and Hitomi Downloader don't seem to have this problem, though.
Is there any way around this limit?
EDIT:
I don't know how the twMediaDownloader code works because I'm not bigbrained enough to read through it, but I found some simple solutions searching online:
https://github.com/bpb27/twitter_scraping
https://github.com/MatthewWolff/TwitterScraper
https://stackoverflow.com/questions/8471489/find-all-tweets-from-a-user-not-just-the-first-3-200
It seems the general opinion is to use a Selenium headless browser and query a Twitter search.
I like MatthewWolff's solution, where he searches month by month. It seems very effective, since you're probably not going to hit Twitter's scrolling limit that way. Even if we do, we can still narrow the range from monthly to weekly, or maybe even daily.
As an example, these are the results from k3_spaceybear's join date (March 2009):
from: k3_spaceybear since: 2009-03-01 until: 2009-03-31
Maybe there are better solutions, but this looks pretty feasible
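A rough sketch of generating such month-by-month queries is below. The function names are made up for illustration. It assumes "until:" is exclusive, so each window ends on the first day of the next month instead of the last day of the current one; adjust if that assumption turns out to be wrong.

```python
from datetime import date
from typing import Iterator, List, Tuple


def month_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (since, until) pairs covering start..end one month at a time."""
    current = date(start.year, start.month, 1)
    while current < end:
        # First day of the following month, rolling over in December.
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        # Clamp the first and last window to the requested range.
        yield max(current, start), min(following, end)
        current = following


def monthly_queries(user: str, start: date, end: date) -> List[str]:
    """Build one search query per month (hypothetical helper)."""
    return [
        f"from:{user} since:{since.isoformat()} until:{until.isoformat()}"
        for since, until in month_windows(start, end)
    ]


if __name__ == "__main__":
    for query in monthly_queries("k3_spaceybear", date(2009, 3, 1), date(2009, 6, 1)):
        print(query)
```

Switching to weekly or daily windows would only change how the next boundary is computed; the query format stays the same.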