A collection of utility R scripts based around collecting and parsing Twitter discussion based around the Zika virus.
Current parsed variables include --
timestamp_ms
datetime
tweet_id
text
retweet_count
favorite_count
expanded_url
friends_count
screen_name
user_id_str
in_reply_to_screen_name
in_reply_to_user_id
rt_screen_name
rt_screen_id
full_name
followers_count
place_lat
place_lon
lat
lon
mentioned_users -";" separated multivalued column
mentioned_id -";" separated multivalued column
hashes -";" separated multivalued column
parsed_media_type -";" separated multivalued column
parsed_media_url -";" separated multivalued column
Collect tweets with Python (tweepy) or R (streamR) - key is streaming JSON output.
Run parse program as follows:
Rscript parse_twitter_zika.R ./zika_hagen_dat/data_0825.txt sample_output.csv 1
(1 for header, 2 for no header....used in bulk output)
parse_bulk_json.sh collates the 12 json files into 1 giant CSV file for now, supressing the CSV header output when writing at additional stages. Need to manually change folder if re-using this script right now, can be cleaner.