Final Year Project submitted to the School of Computer Science and Engineering of Nanyang Technological University in 2024
- Table of contents
- About this project
- Getting started
- Directory structure
- Frontend (React JS)
- Backend (Python)
- Acknowledgements
The World Health Organisation estimates that hearing loss will affect 2.5 billion people by 2050, and that it will be disabling for 10% of the global population. The growing demand for assistive technology for hearing impairment can be attributed to a globally ageing population and unsafe listening practices among young adults.
“Dinner table syndrome” describes the isolation faced by many deaf people due to difficulties engaging in group conversations with multiple non-signing hearing people. Participating in such conversations can be difficult for those with hearing loss due to the time required to process auditory inputs, reluctance to ask for repetition, and missing common verbal cues for turn-taking such as intonational change and pauses. The need to compensate for missing or unclear speech may require additional cognitive effort, which can lead to excess fatigue. As such, the exclusion of deaf people from avenues for bonding and socialisation, such as the dinner table, can result in isolation and loneliness.
This project explores the design and implementation of an application that provides accessibility for hearing-impaired users, specifically in the context of conversations involving multiple speakers. The proposed application differentiates itself from existing automated captioning solutions by identifying the active speaker in addition to providing captions for what is being said. After uploading a video, selecting an annotation interface and specifying other settings, the user can download an annotated video with captions for each speaker, as well as a transcript. The application aims to reduce barriers to understanding group conversations for those with hearing loss, and could be used for recorded panel discussions, meetings, and interviews.
Lim, N. S. T. (2024). The augmented human — seeing sounds. Final Year Project (FYP), Nanyang Technological University, Singapore.
Port numbers are specified in port_config.json.
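For reference, a minimal sketch of reading these ports from port_config.json in Python; the key names shown (backend_port, frontend_port) are assumptions and may differ from the actual file.

```python
# Minimal sketch, assuming port_config.json holds a flat JSON object.
# The key names "backend_port" and "frontend_port" are assumptions, not
# necessarily the actual keys used by the application.
import json

with open("application/port_config.json") as f:
    ports = json.load(f)

backend_port = ports.get("backend_port", 8000)    # default matches the README
frontend_port = ports.get("frontend_port", 8001)  # default matches the README
print(f"backend: {backend_port}, frontend: {frontend_port}")
```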
cd application/fe
pnpm install
pnpm update
pip install fastapi
pip install python-multipart
pip install "uvicorn[standard]"
pip install websockets
brew install ffmpeg
pip install face-alignment
pip install facenet-pytorch
pip install h5py
pip install av
pip install -U openmim
mim install mmcv
Follow the instructions here to download the pretrained model weights into application/be/video_processing/VisualVoice/pretrained_models.
cd application/be/video_processing/VisualVoice/pretrained_models
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/facial_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/lipreading_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/unet_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/vocal_best.pth
pip install setuptools-rust
pip install -U openai-whisper
pip install python-docx
These commands were tested on a Mac zsh terminal.
sudo chmod +x application/run_application.sh
./application/run_application.sh
Run backend on http://localhost:8000
View Swagger UI on http://localhost:8000/docs
cd application/be/endpoints
uvicorn main:app --reload --port 8000
Run frontend on http://localhost:8001/
cd application/fe
pnpm run dev
.
├── application # Final application
│ ├── be # Backend
│ │ ├── endpoints # FastAPI endpoints
│ │ │ ├── ...
│ │ │ └── main.py # Backend server entry point
│ │ ├── utils
│ │ ├── video_processing # Logic for captioning + speaker identification
│ │ │ ├── annotation # Video annotation
│ │ │ ├── transcript # Transcript generation
│ │ │ ├── VisualVoice # Speech separation code from VisualVoice
│ │ │ ├── process_video.py # Video processing main function
│ │ │ ├── speech_recognition.py # Speech recognition with Whisper
│ │ │ ├── speech_separation.py # Speech separation with VisualVoice
│ │ │ ├── video_preprocessing.py # Video preprocessing: resampling, face detection
│ │ │ └── visual_voice_changes.md # Modifications to VisualVoice code
│ │ └── requirements.txt # Required packages for backend
│ ├── fe # Frontend
│ │ ├── ...
│ │ ├── src
│ │ │ ├── ...
│ │ │ ├── components # Custom components
│ │ │ ├── shadcn # shadcn/ui components
│ │ │ ├── App.tsx
│ │ │ └── main.tsx # Frontend entry point to React application
│ │ └── vite.config.ts # Frontend server config
│ ├── port_config.json # Application config for port numbers
│ └── run_application.sh # Bash script to start BE and FE servers
├── speaker_diarization # Initial implementations of alternative approaches
│ ├── mediapipe_config.py
│ ├── mediapipe_test.py # Active speaker detection with WebRTC VAD and MediaPipe facial landmarks
│ ├── pyannote_config.py # TODO: create this file and add HF_TOKEN
│ ├── pyannote_test.py # Speaker diarization
│ ├── README.md
│ └── requirements.txt # Required packages
├── .gitignore
└── README.md
see application/fe
Web page developed with React and Vite, using components from shadcn/ui.
see application/be
Backend built with FastAPI in Python 3.8.9. A list of packages and versions used can be found in requirements.txt.
Input: video file of multiple speakers
- Preprocessing of raw video input with FFmpeg
- Face detection with MTCNN
- Speech separation with VisualVoice (modifications are detailed in visual_voice_changes.md)
- Automated Speech Recognition (ASR) with Whisper
- Video annotation with MoviePy
- Transcript generation with python-docx
Output: annotated video + transcript with captions and speaker identification
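As a rough illustration of two of the stages above, the sketch below uses the same libraries (facenet-pytorch's MTCNN for face detection and Whisper for ASR). File paths are placeholders; the actual pipeline logic lives in video_preprocessing.py and speech_recognition.py.

```python
# Hedged sketch of the face detection and ASR stages; not the project's actual code.
from facenet_pytorch import MTCNN
from PIL import Image
import whisper

# Face detection on one extracted frame (placeholder path)
frame = Image.open("frame_0001.png")
mtcnn = MTCNN(keep_all=True)          # keep every detected face, not just the largest
boxes, probs = mtcnn.detect(frame)    # bounding boxes and confidences per face

# Speech recognition on one separated speaker track (placeholder path)
model = whisper.load_model("base.en")          # model size is a user-selectable setting
result = model.transcribe("speaker1_sep.wav")
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s-{seg['end']:.2f}s] {seg['text']}")
```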
- POST /upload_video
  - query parameters
    - video settings: annotation interface, caption text colour, caption font size, number of speakers, speaker colours
    - speech separation model settings: hop length, number of identity frames per speaker, VisualVoice visual feature type (lipmotion/identity/both)
    - speech recognition model settings: Whisper model size, whether to use the English-only model (not available for the large model)
  - query body: input video file (.mp4)
  - Expected response
    - 202 Accepted
    - returns request ID
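A hedged sketch of calling this endpoint from Python with requests is shown below; the query parameter names and the multipart field name are assumptions, so check the Swagger UI at /docs for the actual names.

```python
# Sketch only: parameter and field names below are assumptions.
import requests

with open("meeting.mp4", "rb") as f:  # placeholder input video
    resp = requests.post(
        "http://localhost:8000/upload_video",
        params={"num_speakers": 2, "whisper_model_size": "base"},  # assumed names
        files={"video": ("meeting.mp4", f, "video/mp4")},          # assumed field name
    )
resp.raise_for_status()                  # expecting 202 Accepted
request_id = resp.json()["request_id"]   # assumed response key
print("request ID:", request_id)
```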
- OPEN WEBSOCKET /ws/status/{request_id}
  - message format from BE:
    { "request_id": "request id", "status": "processing status", "message": "error message if any" }
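For example, a small asyncio client using the websockets package (installed above) could listen for status updates; the request ID below is a placeholder.

```python
# Sketch of a status listener, assuming the message format shown above.
import asyncio
import json
import websockets

async def watch_status(request_id: str) -> None:
    uri = f"ws://localhost:8000/ws/status/{request_id}"
    async with websockets.connect(uri) as ws:
        async for raw in ws:                     # iterate until the server closes
            msg = json.loads(raw)
            print(msg["status"], msg.get("message") or "")

asyncio.run(watch_status("YOUR_REQUEST_ID"))     # placeholder request ID
```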
- GET /download_annotated
  - query params: request ID
  - returns zip folder (.zip) with annotated video file (.mp4) and subfolder of speaker thumbnail images (.png)
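Sketch of downloading and unpacking the zip with requests; the query parameter name for the request ID is an assumption.

```python
# Sketch only: "request_id" as the query parameter name is an assumption.
import io
import zipfile
import requests

resp = requests.get(
    "http://localhost:8000/download_annotated",
    params={"request_id": "YOUR_REQUEST_ID"},   # placeholder request ID
)
resp.raise_for_status()
zipfile.ZipFile(io.BytesIO(resp.content)).extractall("annotated_output")
```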
- GET /download_transcript
  - query params
    - request ID
    - speaker names
    - speaker colours
  - returns annotated transcript (.docx)
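Similarly, a sketch of fetching the transcript; the parameter names for the speaker names and colours are assumptions (the backend may expect them as repeated or comma-separated values).

```python
# Sketch only: query parameter names below are assumptions.
import requests

resp = requests.get(
    "http://localhost:8000/download_transcript",
    params={
        "request_id": "YOUR_REQUEST_ID",             # placeholder request ID
        "speaker_names": ["Speaker 1", "Speaker 2"],
        "speaker_colours": ["#FF0000", "#0000FF"],
    },
)
resp.raise_for_status()
with open("transcript.docx", "wb") as f:
    f.write(resp.content)
```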
VisualVoice is licensed under CC BY-NC (Attribution-NonCommercial 4.0 International). Licensing information can be found here. Changes are documented in visual_voice_changes.md.