For retweets, consider extracting text of original tweet (when present) to provide fuller context for truncated retweets #294

Open
kerchner opened this issue Jan 6, 2015 · 2 comments

Comments

@kerchner
Member

kerchner commented Jan 6, 2015

Retweets commonly have the form: RT @original_tweeter Original tweet text.

Twitter appears to truncate the retweet, including the RT @original_tweeter prefix, to 140 characters.

In tweets which Twitter "recognizes" as retweets, i.e. where mytweet["retweeted_status"] is not None, the original, non-truncated tweet text is available as mytweet["retweeted_status"]["text"].

This could in theory be used to replace the original (and truncated) tweet text portion of the retweet in item_text.

We should verify that these always match; are there cases where the retweet text might diverge from the original tweet? If so, then replacing it might create an accuracy/integrity issue, and we might not want to overwrite it (although we would never change the raw JSON as stored - this discussion is only regarding item_text).
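A minimal sketch of how that check could look, assuming the retweet prefix has the form "RT @screen_name: " and that the original author's handle is available as ["retweeted_status"]["user"]["screen_name"]. The function names `full_retweet_text` and `diverges_from_original` are hypothetical, not existing code in this project:

```python
def full_retweet_text(tweet):
    """Rebuild an un-truncated retweet from the embedded original tweet.

    Falls back to the tweet's own text when Twitter did not mark it as a retweet.
    The "RT @name: " prefix format is an assumption.
    """
    original = tweet.get("retweeted_status")
    if not original:
        return tweet["text"]
    return "RT @{}: {}".format(original["user"]["screen_name"], original["text"])


def diverges_from_original(tweet):
    """Return True when the retweet text is not a prefix of the rebuilt text."""
    text = tweet["text"].rstrip()
    # Drop a trailing ellipsis left behind by truncation, if any.
    for marker in ("…", "..."):
        if text.endswith(marker):
            text = text[: -len(marker)].rstrip()
            break
    return not full_retweet_text(tweet).startswith(text)
```

Running `diverges_from_original` over stored retweets would show whether replacing the text is ever lossy; any tweet for which it returns True is a case where the retweet and the original disagree beyond simple truncation.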

More conservative options might include:

  • Adding a new field - e.g. original_text_of_retweet or something to that effect - to which we extract ["retweeted_status"]["text"] and make it available (optionally?) in extracts. We could include a column with the original tweet text, and another column with our best guess at "fixing" the retweet.
  • Adding a flag to the extract commands to indicate whether or not to "fix" item_text. This still entails the risk that extracts then include item_text values that don't match item_text in our database.
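Both options could be sketched roughly as follows. This is only an illustration: `extract_row`, `write_extract`, the `fix_item_text` flag, and the column set are hypothetical names rather than the existing extract code, and the "RT @screen_name: " prefix is an assumption.

```python
import csv

# Hypothetical column set; only item_text and original_text_of_retweet
# are names taken from the discussion above.
FIELDS = ["id", "item_text", "original_text_of_retweet"]


def extract_row(tweet, fix_item_text=False):
    """Build one extract row without touching the stored raw JSON."""
    original = tweet.get("retweeted_status") or {}
    original_text = original.get("text", "")
    item_text = tweet["text"]
    if fix_item_text and original:
        # Opt-in only: the extract will then differ from item_text in the database.
        item_text = "RT @{}: {}".format(original["user"]["screen_name"], original_text)
    return {
        "id": tweet["id_str"],
        "item_text": item_text,
        "original_text_of_retweet": original_text,
    }


def write_extract(tweets, out, fix_item_text=False):
    """Write tweets as CSV to an open text file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(extract_row(tweet, fix_item_text))
```

With the flag left at its default, item_text is passed through unchanged and the original text appears only in the extra column, which is the more conservative of the two behaviours.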

Note also that the ["truncated"] node seems to be unreliable. As an example, this retweet truncated the original tweet, but ["truncated"] is false: http://sfm.library.gwu.edu/twitter-item/7695264/
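Given that, a truncation check probably should not rely on ["truncated"] for retweets. One possible heuristic, sketched here with a hypothetical helper name `looks_truncated`, is to test whether the full original text still appears inside the retweet text:

```python
def looks_truncated(tweet):
    """Guess whether a tweet's text was cut short.

    For recognized retweets, ignore the "truncated" flag and check whether the
    complete original text is contained in the retweet. For other tweets there
    is nothing to compare against, so fall back to the flag.
    """
    original = tweet.get("retweeted_status")
    if not original:
        return bool(tweet.get("truncated"))
    return original["text"] not in tweet["text"]
```

This is a heuristic only: it would also report True for a retweet whose text differs from the original for reasons other than truncation.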

kerchner changed the title from "For retweets, consider using original tweet text (when present) to repair truncation by Twitter" to "For retweets, consider extracting text of original tweet (when present) to provide fuller context for truncated retweets" on Jan 6, 2015
@dchud
Contributor

dchud commented Jan 12, 2015

We do something like this for is_retweet: we add a column to the CSV output, using our own logic to catch retweets that didn't use Twitter's retweet function. Researchers asked for this.

Has someone asked us to do something like this?

At most, we should add a value rather than change anything received directly from Twitter.

@kerchner
Member Author

@dchud Yes, this was requested by the student project team from the Elliott School when they noticed that the text of some retweets is truncated (relative to the original tweet).

It sounds like you concur with the first bullet in the first comment above: at most, we should add a new value that surfaces ["retweeted_status"]["text"] when present, and/or a new value that computes a "complete" (i.e. un-truncated) retweet from ["retweeted_status"]["text"] when present.
