
SCSE23-0038 The Augmented Human -- Seeing Sounds

Final Year Project submitted to the School of Computer Science and Engineering of Nanyang Technological University in 2024

Table of contents

  • About this project
  • Video demo
  • Final report
  • Getting started
  • Directory structure
  • Frontend (React JS)
  • Backend (Python)
  • Video processing flow
  • API endpoints
  • Acknowledgements

About this project

The World Health Organisation estimates that hearing loss will affect 2.5 billion people by 2050, and that it will be disabling for 10% of the global population. The growing demand for assistive technology for hearing impairment can be attributed to a globally ageing population and unsafe listening practices among young adults.

“Dinner table syndrome” describes the isolation faced by many deaf people due to difficulties engaging in group conversations with multiple non-signing hearing people. Participating in such conversations can be difficult for those with hearing loss due to the time required to process auditory inputs, reluctance to ask for repetition, and missing common verbal cues for turn-taking such as intonational change and pauses. The need to compensate for missing or unclear speech may require additional cognitive effort, which can lead to excess fatigue. As such, the exclusion of deaf people from avenues for bonding and socialisation, such as the dinner table, can result in isolation and loneliness.

This project explores the design and implementation of an application that improves accessibility for hearing-impaired users, specifically in the context of conversations involving multiple speakers. The proposed application differentiates itself from existing automated captioning solutions by identifying the active speaker in addition to providing captions for what is being said. After uploading a video, selecting an annotation interface, and specifying other settings, the user can download an annotated video with captions for each speaker, as well as a transcript. The application aims to reduce barriers to understanding group conversations for those with hearing loss, and could be used for recorded panel discussions, meetings, and interviews.

Video demo

[Video demo thumbnail]

Final report

Lim, N. S. T. (2024). The augmented human — seeing sounds. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175150

Getting started

Port numbers are specified in port_config.json.
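
For reference, port_config.json maps each server to its port. A minimal sketch is shown below; the key names are illustrative assumptions, and the values match the ports used later in this README, so check the file itself for the exact schema.

{
  "backend_port": 8000,
  "frontend_port": 8001
}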

Install dependencies

Frontend packages

cd application/fe
pnpm install
pnpm update

FastAPI requirements

pip install fastapi
pip install python-multipart
pip install "uvicorn[standard]"
pip install websockets

VisualVoice speech separation requirements

brew install ffmpeg

pip install face-alignment 
pip install facenet-pytorch
pip install h5py
pip install av

pip install -U openmim
mim install mmcv

VisualVoice pretrained models

Download the pretrained model weights from the VisualVoice repository into application/be/video_processing/VisualVoice/pretrained_models:

cd application/be/video_processing/VisualVoice/pretrained_models
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/facial_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/lipreading_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/unet_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/vocal_best.pth

Whisper speech recognition requirements

pip install setuptools-rust
pip install -U openai-whisper
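
For reference, a minimal transcription call with the openai-whisper Python API looks roughly like the sketch below; the audio path is a placeholder, and the application selects the model size and English-only variant from the upload settings (see speech_recognition.py for the actual usage).

import whisper

# Load an English-only Whisper model and transcribe one audio file (path is a placeholder)
model = whisper.load_model("base.en")
result = model.transcribe("speaker1.wav")
print(result["text"])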

Transcript generation requirements

pip install python-docx
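
As a rough sketch of how python-docx can write a colour-coded transcript entry (the speaker label, text, and colour below are placeholders; the actual logic lives in application/be/video_processing/transcript):

from docx import Document
from docx.shared import RGBColor

# Create a document with one colour-coded caption line, then save it
doc = Document()
paragraph = doc.add_paragraph()
run = paragraph.add_run("Speaker 1: Hello, everyone.")
run.font.color.rgb = RGBColor(0x1E, 0x88, 0xE5)  # placeholder speaker colour
doc.save("transcript.docx")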

Run application

Run application using bash script

These commands were tested in zsh on macOS.

sudo chmod +x application/run_application.sh
./application/run_application.sh

Run application manually

Run backend on http://localhost:8000.
View Swagger UI at http://localhost:8000/docs.

cd application/be/endpoints
uvicorn main:app --reload --port 8000

Run frontend on http://localhost:8001/

cd application/fe
pnpm run dev

Directory structure

.
├── application                                     # Final application
│   ├── be                                          # Backend
│   │   ├── endpoints                               # FastAPI endpoints
│   │   │   ├── ...
│   │   │   └── main.py                             # Backend server entry point
│   │   ├── utils
│   │   ├── video_processing                        # Logic for captioning + speaker identification
│   │   │   ├── annotation                          # Video annotation 
│   │   │   ├── transcript                          # Transcript generation 
│   │   │   ├── VisualVoice                         # Speech separation code from VisualVoice
│   │   │   ├── process_video.py                    # Video processing main function
│   │   │   ├── speech_recognition.py               # Speech recognition with Whisper
│   │   │   ├── speech_separation.py                # Speech separation with VisualVoice
│   │   │   ├── video_preprocessing.py              # Video preprocessing: resampling, face detection
│   │   │   └── visual_voice_changes.md             # Modifications to VisualVoice code
│   │   └── requirements.txt                        # Required packages for backend
│   ├── fe                                          # Frontend
│   │   ├── ...
│   │   ├── src
│   │   │   ├── ...
│   │   │   ├── components                          # Custom components 
│   │   │   ├── shadcn                              # shadcn/ui components
│   │   │   ├── App.tsx
│   │   │   └── main.tsx                            # Frontend entry point to React application
│   │   └── vite.config.ts                          # Frontend server config
│   ├── port_config.json                            # Application config for port numbers
│   └── run_application.sh                          # Bash script to start BE and FE servers
├── speaker_diarization                             # Initial implementations of alternative approaches
│   ├── mediapipe_config.py
│   ├── mediapipe_test.py                           # Active speaker detection with WebRTC VAD and MediaPipe facial landmarks
│   ├── pyannote_config.py                          # TODO: create this file and add HF_TOKEN
│   ├── pyannote_test.py                            # Speaker diarization
│   ├── README.md
│   └── requirements.txt                            # Required packages
├── .gitignore
└── README.md

Frontend (React JS)

see application/fe

Web page developed with React and Vite, using components from shadcn/ui.

Backend (Python)

see application/be

Backend built with FastAPI in Python 3.8.9. A list of packages and versions used can be found in requirements.txt.

Video processing flow

Input: video file of multiple speakers

  1. Preprocessing of raw video input with FFmpeg
  2. Face detection with MTCNN
  3. Speech separation with VisualVoice (modifications are detailed in visual_voice_changes.md)
  4. Automated Speech Recognition (ASR) with Whisper
  5. Video annotation with MoviePy
  6. Transcript generation with python-docx

Output: annotated video + transcript with captions and speaker identification
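
As an illustration of step 2, face detection with MTCNN is available through the facenet-pytorch package installed above. A minimal sketch is shown below; the frame path is a placeholder, and the actual preprocessing and face tracking live in video_preprocessing.py.

from facenet_pytorch import MTCNN
from PIL import Image

# Detect every face in a single extracted video frame (path is a placeholder)
mtcnn = MTCNN(keep_all=True)
frame = Image.open("frame_0001.png")
boxes, probs = mtcnn.detect(frame)  # bounding boxes and per-face confidence scores
print(boxes, probs)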

API endpoints

  1. POST /upload_video
    • Expected response
      • 202 Accepted
      • returns request ID
    • query parameters
      • video settings: annotation interface, caption text colour, caption font size, number of speakers, speaker colours
      • speech separation model settings: hop length, number of identity frames per speaker, VisualVoice visual feature type (lipmotion/identity/both)
      • speech recognition model settings: Whisper model size, whether to use the English-only model (not available for the large model)
    • request body: input video file (mp4)
  2. OPEN WEBSOCKET /ws/status/{request_id}
    • message format from BE
    {
     "request_id": "request id",
     "status": "processing status",
     "message": "error message if any"
    }
  3. GET /download_annotated
    • query params: request ID
    • returns a zip archive (.zip) containing the annotated video file (.mp4) and a subfolder of speaker thumbnail images (.png)
  4. GET /download_transcript
    • query params
      • request ID
      • speaker names
      • speaker colours
    • returns annotated transcript (.docx)
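
A rough end-to-end client sketch using the requests library is shown below; the query parameter names, file field name, and response fields are assumptions for illustration, so check the Swagger UI at http://localhost:8000/docs for the exact schema.

import requests

BASE = "http://localhost:8000"

# 1. Upload a video for processing (parameter and field names are assumptions)
with open("panel_discussion.mp4", "rb") as f:
    response = requests.post(
        f"{BASE}/upload_video",
        params={"num_speakers": 2},
        files={"video": f},
    )
request_id = response.json()["request_id"]  # assumes the 202 response returns the request ID here

# 2. Wait for a completion message on ws://localhost:8000/ws/status/<request_id>

# 3. Download the annotated video (zip) and the transcript (docx)
annotated = requests.get(f"{BASE}/download_annotated", params={"request_id": request_id})
with open("annotated.zip", "wb") as f:
    f.write(annotated.content)

transcript = requests.get(f"{BASE}/download_transcript", params={"request_id": request_id})
with open("transcript.docx", "wb") as f:
    f.write(transcript.content)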

Acknowledgements

VisualVoice is licensed under CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International); licensing information can be found in the VisualVoice repository. Changes are documented in visual_voice_changes.md.
