The scripts on this repository can be used to extract statistics from Whatsapp group chats as well generate text based on Whatsapp messages. The main application was to analyze messages to celebrate our group chat's ("Tuuli Nousee") 4th anniversary.
Data comprise of a CSV file of messages exported from Whatsapp. Each row represents one message. Messages are written primarily in Finnish with some English here and there.
The data contains the following fields:
Variable | Description |
---|---|
Datetime | Date and time in the following format: D.M.YYYY klo hh.mm |
Author | The name of the author |
Message | The contents of the message or the phrase "<Media jätettiin pois>" in case of a media message |
The following variables were derived from the above source data:
Variable | Type | Description |
---|---|---|
date | datetime | Message date |
time | char | Message time |
author | char | Message author |
message | char | Message contents |
message_count | num | Always 1 because each row contains one message. Used for aggregation. |
media_count | num | Number of media attached to the message |
letter_count | num | Number of letters in the message |
word_count | num | Number of words in the message |
message_with_emoji | bool | Indicator for messages with at least one emoji |
link | list | List of links in the message |
link_count | num | Number of links in the message |
message_with_link | bool | Indicator for messages with at least one link |
weekday | char | Weekday of the message in English |
weekend | bool | Indicator for messages sent on a weekend |
hour | num | Hour when the message was sent |
hour_group | char | Hour group, split by 6 hours |
The analysis contains the following sections:
- Messages by author
- Messages by date and time
- Emojis
- Links
- Message contents
The final goal is to build a chatbot generating answers like the ones typically sent in out group chat.
The first attempt at text generation uses an LSTM recurrent neural network to generate text character-by-characted based on an input of 100 characters. The output does not make much sense but the model was able to learn some characteristics of spoken Finnish.
Seed: "n maksaa softapivitykset? 0e huollon yhteydess no sit 5/5 iha jees toki huolto oli 169e mut ei ny ka"
Result: "ikki tarvii kaikki tiet et se on koko tietoturvaittu kaikki tiet et se on kyll tiet et se on koko si"
The second attempt uses Markov chains with back-off to generate text word-by-word. Word-by-word generation works seemingly better but is not much better than a 'bag of words' approach. The model can only use words present in the training data.
Seed: "mitä tehdään lauantaina"
Result: "käydä kyselemässä että opin elämästä kun 2/9 * oblivion music television näky, mut ens viikon perjantaina
saadaan ihmiset pelkää mikroja😃 Tää puukoriste on kai virallinen asiointikieli on semmonen autossa ja luot yhteyden
ottamisest apreeseensissä eikä edes teamssii pyytäny jälkikäteen usb voi ydinvoimalais ni sitten just nyt luet siellä
kermaviiliämpäreitä ?"
- Using a pretrained neural network (maybe FinBERT if possible)
- Utilizing the chat-format of the training data (as the final goal is to generate whatsapp message type texts)