
Tweets of multiple Twitter-Accounts #136

Closed

respektive-reen opened this issue Nov 15, 2017 · 6 comments

Comments

@respektive-reen

Hey, thanks for fixing the issues about the authorization method and the data output yesterday :)

Now I'm a bit puzzled whether there's a way to get the maximum number of tweets (3,200 per account) from a large sample, e.g. 1,000 accounts.

I already tried something like this:
tmls_flw <- get_timelines(c("cnn", "BBCWorld", "foxnews"), n = 3200, retryonratelimit = TRUE)
But it didn't work the way I expected. I'm just getting a total of 3,200 tweets, not 3,200 from each of them.

Is there any workaround to get all the tweets of such a large number of accounts with the get_timelines() function, something like: "Hey R, give me the maximum number (3,200 per account) of recent tweets from these accounts"?

Or do I have to code it like this for every account I want to mine?
flw1 <- get_timeline("cnn", n = 3200)
flw2 <- get_timeline("bbc", n = 3200)
flw3 <- get_timeline("fox", n = 3200)

Thanks in advance

@mkearney
Collaborator

mkearney commented Nov 15, 2017

The code worked for me:

> tmls_flw <- get_timelines(c("cnn", "BBCWorld", "foxnews"), n = 3200, retryonratelimit = TRUE)
> tmls_flw
# A tibble: 9,649 x 42
            status_id          created_at user_id screen_name
 *              <chr>              <dttm>   <chr>       <chr>
 1 930797610092449792 2017-11-15 14:00:18  759251         CNN
 2 930794780812070913 2017-11-15 13:49:03  759251         CNN
 3 930792031496044544 2017-11-15 13:38:08  759251         CNN
 4 930789258218164224 2017-11-15 13:27:07  759251         CNN
 5 930786544159518720 2017-11-15 13:16:19  759251         CNN
 6 930784144744951808 2017-11-15 13:06:47  759251         CNN
 7 930783345948200961 2017-11-15 13:03:37  759251         CNN
 8 930778887403048960 2017-11-15 12:45:54  759251         CNN
 9 930775665586163714 2017-11-15 12:33:06  759251         CNN
10 930772881491005441 2017-11-15 12:22:02  759251         CNN
# ... with 9,639 more rows, and 38 more variables: text <chr>, source <chr>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_name <chr>,
#   place_full_name <chr>, place_type <chr>, country <chr>, country_code <chr>,
#   geo_coords <list>, coords_coords <list>, bbox_coords <list>

I actually haven't gotten around to adding retryonratelimit functionality to get_timelines() yet. Perhaps including that is causing some bug?

Otherwise, it looks like you'd burn through about 17 requests per user, which means you should be able to get the max number of statuses returned for 52 users every 15 minutes.

> rate_limit("get_timeline")
# A tibble: 1 x 6
                   query limit remaining         reset            reset_at
                   <chr> <int>     <int>        <time>              <dttm>
1 statuses/user_timeline   900       849 12.05776 mins 2017-11-15 08:15:46
# ... with 1 more variables: app <chr>
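
To spell out the arithmetic behind the 52-user estimate (an editor's note, not part of the original comment): each request returns up to 200 statuses, so 3,200 statuses take about 16-17 requests per user, and the 900-request window shown above covers roughly 900 / 17, or about 52 users:

## rough rate-limit arithmetic (assumes 200 statuses per request and the
## 900-requests-per-15-minutes limit reported by rate_limit() above)
ceiling(3200 / 200)   # 16 pages of 200 statuses (roughly 17 requests in practice)
floor(900 / 17)       # about 52 users per 15-minute window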

If you're dealing with more than 52 accounts, you'd probably want to set up a for loop. For example, let's say you have a vector, users, of screen names you'd like to get timeline data for. I'd execute a loop that looks something like this:

tmls <- vector("list", length(users))

for (i in seq_along(tmls)) {
  tmls[[i]] <- get_timeline(users[i], n = 3200)
  ## assuming full rate limit at start, wait for fresh reset every 52 users
  if (i %% 52L == 0L) {
    rl <- rate_limit("get_timeline")
    Sys.sleep(as.numeric(rl$reset, "secs"))
  }
  ## print update message
  cat(i, " ")
}

## merge into single data frame (do_call_rbind will preserve users data)
tmls <- do_call_rbind(tmls)
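
If the rate limit isn't fresh when the loop starts, a slightly more defensive version of the loop (an editor's hedged sketch, not from the original comment) checks the remaining request count before each user instead of counting off every 52 users:

for (i in seq_along(tmls)) {
  rl <- rate_limit("get_timeline")
  ## each user can take up to ~17 requests; wait out the window if too few remain
  if (rl$remaining < 17) {
    Sys.sleep(as.numeric(rl$reset, "secs"))
  }
  tmls[[i]] <- get_timeline(users[i], n = 3200)
  cat(i, " ")
}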

Side note, this actually returned slightly more than 3200 [unique] tweets per user, which I don't think I've seen before.

# A tibble: 3 x 3
      term     n   percent
     <chr> <int>     <dbl>
1 BBCWorld  3218 0.3335061
2  FoxNews  3216 0.3332988
3      CNN  3215 0.3331951

@respektive-reen
Author

respektive-reen commented Nov 16, 2017

Thanks for your reply.

I ran the loop code you suggested and it worked fine.

I just got multiple warnings that some pages do not exist. Could this be caused by timelines with no tweets on them? If so, is there a way to code something like "if statuses_count <= 1, dismiss this account"? It would save me a lot of time and processing power.

Thanks in advance RG

@mrmvergeer

Hi @renegro90.
I had a similar problem and had written a script much like @mkearney's.
Because some accounts are set to private, you can't get their tweets and the script stops. I fixed this by wrapping the get_timeline() call in try(). It may not be elegant, but it keeps collecting tweets from the remaining accounts. This is untested:

tmls <- vector("list", length(users))

for (i in seq_along(tmls)) {
  tmls[[i]] <- try(get_timeline(users[i], n = 3200))
  if (i %% 52L == 0L) {
    rl <- rate_limit("get_timeline")
    Sys.sleep(as.numeric(rl$reset, "secs"))
  }
  cat(i, " ")
}

tmls <- do_call_rbind(tmls)
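
One caveat (an editor's hedged note, not part of the original comment): for accounts where get_timeline() fails, try() leaves a "try-error" object in the list, which do_call_rbind() may not handle, so it can be safer to drop those elements before merging:

## keep only the elements where get_timeline() succeeded
ok <- !vapply(tmls, inherits, logical(1), what = "try-error")
tmls <- do_call_rbind(tmls[ok])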

@mkearney
Collaborator

@renegro90 @mrmvergeer Thanks for following up on this!

Question: with the newest version (0.6.0) of rtweet, are these empty timelines creating errors or warnings? They should be creating warnings... so please let me know if you experience anything different!

@respektive-reen
Author

@mkearney Yes, I got plenty of warnings when running the code with ~100 accounts. After the computation completed, R says:

There were 50 or more warnings (use warnings() to see the first 50)

and

1: 34 - Sorry, that page does not exist.
2: Sorry, that page does not exist.

@mrmvergeer Your code works. The original script by @mkearney worked as well (I got the same output with both versions) and didn't stop either, but with your addition it's possible to see which user the script is currently working on (a bit like a loading bar).

@mkearney Is it possible (maybe in an interim step between the lookup_users and the get_followers steps) to dismiss accounts that are protected (I think this indicates the timeline is set to private?), and/or have fewer than x statuses in their timeline, and/or post in a language other than English?

@mkearney
Collaborator

mkearney commented Nov 21, 2017

@renegro90 You should be able to filter users using the protected and account_lang variables.

> ## users with public/english, public/french, private/english accounts respectively
> sns <- c("kearneymw", "Vachier_Lagrave", "mikewaynesworld")
>
> ## lookup users data
> (usr <- lookup_users(sns))
# A tibble: 3 x 20
     user_id                     name     screen_name      location
       <chr>                    <chr>           <chr>         <chr>
1 2973406683 "Mike Kearney\U0001f4ca"       kearneymw  Columbia, MO
2  157070052                      MVL Vachier_Lagrave Paris, France
3  174454226                       mw mikewaynesworld         SMDHU
# ... with 16 more variables: description <chr>, url <chr>, protected <lgl>,
#   followers_count <int>, friends_count <int>, listed_count <int>,
#   statuses_count <int>, favourites_count <int>, account_created_at <dttm>,
#   verified <lgl>, profile_url <chr>, profile_expanded_url <chr>,
#   account_lang <chr>, profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>
>
> ## view protected variable values
> usr$protected
[1] FALSE FALSE  TRUE
>
> ## view account_lang variable values
> usr$account_lang
[1] "en" "fr" "en"

So you could create a function to filter those like this:

## function to keep only English-language and public (non-protected) accounts
filter_users <- function(x) {
  if (!is.data.frame(x) || !all(c("account_lang", "protected") %in% names(x))) {
    stop("Users data not found")
  }
  x$user_id[x$account_lang == "en" & !x$protected]
}

Apply the filter_users() function to the usr data from above:

> filter_users(usr)
[1] "2973406683"
