
essiviippola/whatsapp-analysis

Whatsapp analysis

The scripts in this repository can be used to extract statistics from WhatsApp group chats as well as to generate text based on WhatsApp messages. The main application was to analyze messages to celebrate the 4th anniversary of our group chat ("Tuuli Nousee").

Dataset

The data consist of a CSV file of messages exported from WhatsApp. Each row represents one message. Messages are written primarily in Finnish, with some English here and there.

The data contains the following fields:

Variable   Description
Datetime   Date and time in the following format: D.M.YYYY klo hh.mm
Author     The name of the author
Message    The contents of the message or the phrase "<Media jätettiin pois>" in case of a media message
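
As an illustration of the source format described above, the following is a minimal sketch of parsing one exported row with the Python standard library. The function name `parse_row`, the `is_media` key, and the sample values are assumptions made for this example and are not taken from the repository's scripts.

```python
from datetime import datetime

# Placeholder text that the Finnish-language export writes for media messages
# (roughly "<Media omitted>" in English).
MEDIA_PLACEHOLDER = "<Media jätettiin pois>"


def parse_row(raw_datetime, author, message):
    """Parse one exported row into a dict (hypothetical helper)."""
    # "klo" is a literal token in the export. strptime accepts both
    # zero-padded and non-padded day and month numbers for %d and %m.
    stamp = datetime.strptime(raw_datetime, "%d.%m.%Y klo %H.%M")
    return {
        "date": stamp.date(),
        "time": stamp.strftime("%H:%M"),
        "author": author,
        "message": message,
        # Assumed convention: a media message consists of the placeholder only.
        "is_media": message.strip() == MEDIA_PLACEHOLDER,
    }


row = parse_row("3.11.2020 klo 18.05", "Alice", "<Media jätettiin pois>")
```

Keeping the placeholder in a single constant makes it easy to adapt the sketch to exports made in another phone language, where the placeholder text differs.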

The following variables were derived from the above source data:

Variable             Type       Description
date                 datetime   Message date
time                 char       Message time
author               char       Message author
message              char       Message contents
message_count        num        Always 1 because each row contains one message. Used for aggregation.
media_count          num        Number of media attached to the message
letter_count         num        Number of letters in the message
word_count           num        Number of words in the message
message_with_emoji   bool       Indicator for messages with at least one emoji
link                 list       List of links in the message
link_count           num        Number of links in the message
message_with_link    bool       Indicator for messages with at least one link
weekday              char       Weekday of the message in English
weekend              bool       Indicator for messages sent on a weekend
hour                 num        Hour when the message was sent
hour_group           char       Hour group, split by 6 hours
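
Several of the derived variables can be computed from the message text and timestamp alone. The sketch below shows one possible way to do this. The function name `derive_features`, the link pattern, the whitespace-based word split, and the hour group labels such as "18-23" are assumptions for illustration, since the exact definitions used in the analysis are not given here. Emoji detection is left out because it needs a Unicode range table or a third-party package.

```python
import re
from datetime import date

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumed definition of a link: anything starting with http:// or https://
LINK_PATTERN = re.compile(r"https?://\S+")


def derive_features(message, sent_date, hour):
    """Compute a subset of the derived variables for one message."""
    links = LINK_PATTERN.findall(message)
    weekday_index = sent_date.weekday()  # Monday is 0, Sunday is 6
    group_start = (hour // 6) * 6        # 0, 6, 12 or 18
    return {
        "message_count": 1,
        "letter_count": sum(ch.isalpha() for ch in message),
        "word_count": len(message.split()),
        "link": links,
        "link_count": len(links),
        "message_with_link": bool(links),
        "weekday": WEEKDAYS[weekday_index],
        "weekend": weekday_index >= 5,
        "hour": hour,
        # Hypothetical label format for the 6-hour groups
        "hour_group": f"{group_start:02d}-{group_start + 5:02d}",
    }


features = derive_features("katso https://example.com nyt", date(2020, 11, 7), 18)
```

Note that this sketch counts the letters inside links toward `letter_count` and counts each link as one word; whether the original analysis does the same is not stated.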

Analysis

The analysis contains the following sections:

  • Messages by author
  • Messages by date and time
  • Emojis
  • Links
  • Message contents
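
Most of these sections come down to summing the per-message variables over some grouping, which is why `message_count` is always 1. A minimal sketch of the "Messages by author" aggregation, assuming each row is a dict keyed by the variable names listed in the Dataset section (the helper name `totals_by_author` and the sample rows are made up for this example):

```python
from collections import defaultdict


def totals_by_author(rows, fields=("message_count", "word_count")):
    """Sum the given numeric fields per author."""
    # Each new author starts with a fresh dict of zeros.
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for row in rows:
        for field in fields:
            totals[row["author"]][field] += row[field]
    return dict(totals)


sample_rows = [
    {"author": "Alice", "message_count": 1, "word_count": 4},
    {"author": "Bob", "message_count": 1, "word_count": 2},
    {"author": "Alice", "message_count": 1, "word_count": 7},
]
summary = totals_by_author(sample_rows)
```

The same pattern works for the date and time sections by grouping on `weekday` or `hour_group` instead of `author`.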

Text generation

The final goal is to build a chatbot that generates answers like the ones typically sent in our group chat.

The first attempt at text generation uses an LSTM recurrent neural network to generate text character by character based on an input of 100 characters. The output does not make much sense, but the model was able to learn some characteristics of spoken Finnish.

Seed: "n maksaa softapivitykset? 0e huollon yhteydess no sit 5/5 iha jees toki huolto oli 169e mut ei ny ka"
Result: "ikki tarvii kaikki tiet et se on koko tietoturvaittu kaikki tiet et se on kyll tiet et se on koko si"
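
A character-level model of this kind is usually trained on sliding windows over the text: each window of 100 characters is an input, and the character that follows it is the target. The sketch below shows only this data preparation step in plain Python and leaves out the network itself. The function name `make_training_windows` and the `step` parameter are assumptions for illustration, not the repository's code.

```python
def make_training_windows(text, window=100, step=1):
    """Cut text into (input window, next character) pairs encoded as integers."""
    # Map every distinct character to an integer index.
    chars = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(chars)}

    inputs, targets = [], []
    # Stop early enough that every window still has a following character.
    for start in range(0, len(text) - window, step):
        chunk = text[start:start + window]
        inputs.append([char_to_index[ch] for ch in chunk])
        targets.append(char_to_index[text[start + window]])
    return inputs, targets, char_to_index


# Tiny example with a window of 3 characters instead of 100.
inputs, targets, char_to_index = make_training_windows("abcabcabc", window=3)
```

At generation time the same idea runs in reverse: the model predicts one character from the last 100, the prediction is appended to the text, and the window moves forward by one.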

The second attempt uses Markov chains with back-off to generate text word by word. Word-by-word generation seems to work better, but the output is not much more coherent than what a 'bag of words' approach would produce. The model can only use words present in the training data.

Seed: "mitä tehdään lauantaina"
Result: "käydä kyselemässä että opin elämästä kun 2/9 * oblivion music television näky, mut ens viikon perjantaina 
saadaan ihmiset pelkää mikroja😃 Tää puukoriste on kai virallinen asiointikieli on semmonen autossa ja luot yhteyden 
ottamisest apreeseensissä eikä edes teamssii pyytäny jälkikäteen usb voi ydinvoimalais ni sitten just nyt luet siellä 
kermaviiliämpäreitä ?"
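
To make the back-off idea concrete, here is a minimal sketch of a word-level Markov chain generator. It first looks for words seen after the longest available context and backs off to shorter contexts when there is no match. The function names, the maximum order of 2, and the last-resort choice of a random training word are assumptions for this example and may differ from the repository's implementation.

```python
import random
from collections import defaultdict


def build_model(messages, max_order=2):
    """Map every context of 1..max_order words to the words seen after it."""
    model = defaultdict(list)
    for message in messages:
        words = message.split()
        for i in range(1, len(words)):
            for order in range(1, max_order + 1):
                if i - order >= 0:
                    model[tuple(words[i - order:i])].append(words[i])
    return model


def generate(model, seed, length=20, max_order=2, rng=None):
    """Generate `length` words that continue the seed text."""
    rng = rng or random.Random()
    # All words that occur in the stored contexts or their followers.
    vocabulary = sorted({word for context, followers in model.items()
                         for word in (*context, *followers)})
    output = seed.split()
    generated = []
    for _ in range(length):
        next_word = None
        # Back-off: try the longest context first, then shorter ones.
        for order in range(min(max_order, len(output)), 0, -1):
            followers = model.get(tuple(output[-order:]))
            if followers:
                next_word = rng.choice(followers)
                break
        if next_word is None:
            # No context matched: fall back to any training word.
            if not vocabulary:
                break
            next_word = rng.choice(vocabulary)
        output.append(next_word)
        generated.append(next_word)
    return " ".join(generated)


model = build_model(["x y z"])
continuation = generate(model, "x", length=2, rng=random.Random(0))
```

Because followers are stored with repetition, frequent continuations are sampled more often. Every generated word comes from the training messages, which matches the limitation noted above.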

Ideas for the next steps

  • Using a pretrained neural network (maybe FinBERT if possible)
  • Utilizing the chat format of the training data (as the final goal is to generate WhatsApp-style messages)
