StalkerShurik/Source-Separation

Source Separation

About

This repository contains implementations of two deep learning models for audio-visual source separation: TASnet and RTFS-Net.

Installation

  1. (Optional) Create and activate a new environment using conda or venv (+pyenv).

    a. conda version:

    # create env
    conda create -n project_env python=PYTHON_VERSION
    
    # activate env
    conda activate project_env

    b. venv (+pyenv) version:

    # create env
    ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env
    
    # alternatively, using default python version
    python3 -m venv project_env
    
    # activate env
    source project_env/bin/activate

  2. Install all required packages:

    pip install -r requirements.txt

How To Use

Data preparation:

  1. First, prepare your dataset with this structure:
NameOfTheDirectoryWithUtterances
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       ...
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
  2. Generate video embeddings
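
Before generating embeddings, it may help to sanity-check the directory layout from step 1. The following is a minimal sketch, not part of this repository: the function name `check_dataset` is hypothetical, and it assumes that the two speaker IDs in a mix file name are joined by a single underscore and that ground-truth files in s1 and s2 share the mix file's name.

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def check_dataset(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    mix_dir = root / "audio" / "mix"
    if not mix_dir.is_dir():
        return [f"missing directory: {mix_dir}"]

    mixes = sorted(p for p in mix_dir.iterdir() if p.suffix.lower() in AUDIO_EXTS)
    if not mixes:
        problems.append(f"no audio files in {mix_dir}")

    for mix in mixes:
        # Assumption: the two speaker IDs are joined by a single underscore.
        first, sep, second = mix.stem.partition("_")
        if not sep or not first or not second:
            problems.append(f"unexpected mix name: {mix.name}")
            continue
        # Ground truth is optional, so s1/s2 are checked only when the folder exists.
        for part in ("s1", "s2"):
            gt_dir = root / "audio" / part
            if gt_dir.is_dir() and not (gt_dir / mix.name).is_file():
                problems.append(f"{part} is missing {mix.name}")
        # Each speaker in the mix needs a mouth-crop file.
        for speaker in (first, second):
            if not (root / "mouths" / f"{speaker}.npz").is_file():
                problems.append(f"mouths is missing {speaker}.npz")
    return problems
```

Calling `check_dataset("data/NameOfTheDirectoryWithUtterances")` returns an empty list when nothing is missing; otherwise each entry names one missing file or malformed name.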

Embeddings

We used an open-source project to generate video embeddings: repo. The original repo had some problems, so we forked it and fixed them: forked repo.

Guide for embedding extraction:

Preparation:

git clone https://github.com/dikirillov/Lipreading_using_Temporal_Convolutional_Networks/
pip install -r Lipreading_using_Temporal_Convolutional_Networks/requirements.txt

Extraction:

Download the model (model url), then run embedding extraction:

python3 Lipreading_using_Temporal_Convolutional_Networks/main.py --modality video \
        --extract-feats \
        --config-path 'Lipreading_using_Temporal_Convolutional_Networks/configs/lrw_resnet18_dctcn_boundary.json' \
        --model-path <PATH-TO-DOWNLOADED-MODEL> \
        --mouth-patch-path <MOUTH-PATCH-PATH>
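
To see what a mouth-crop `.npz` file contains before passing it to the command above, the array names, dtypes and shapes can be read from the file headers. This is a sketch under stated assumptions: `npz_summary` is a hypothetical helper, not part of either repository, and the array names stored in your files are not documented here, so the function simply lists whatever it finds. It relies only on the fact that a `.npz` file is a zip archive of `.npy` members, so it runs with the standard library alone.

```python
import ast
import struct
import zipfile


def npz_summary(path):
    """Map each array name in a .npz file to (dtype string, shape), using headers only."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as f:
                if f.read(6) != b"\x93NUMPY":
                    raise ValueError(f"{member} is not a .npy file")
                major, _minor = f.read(2)
                # Format 1.0 stores the header length in 2 bytes, later versions in 4.
                if major == 1:
                    (header_len,) = struct.unpack("<H", f.read(2))
                else:
                    (header_len,) = struct.unpack("<I", f.read(4))
                # The header is a Python dict literal with 'descr' and 'shape' keys.
                header = ast.literal_eval(f.read(header_len).decode("latin1"))
            summary[member[:-4]] = (header["descr"], header["shape"])
    return summary
```

With NumPy installed, `numpy.load(path)` exposes the same array names through its `.files` attribute and additionally loads the data.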

Training

To retrain a model, run the train script with your config:

python3 train.py -cn=CONFIG_NAME HYDRA_CONFIG_ARGUMENTS

For example, to retrain the RTFS model on your own dataset, edit src/configs/rtfs.yaml (set the correct dataset dir) and then run:

python3 train.py dataset

Inference

You can use inference.py to generate separated audio from your dataset:

HYDRA_FULL_ERROR=1 python inference.py \
 datasets.inference.part=null +datasets.inference.dataset_dir='PATH_TO_YOUR_DATASET' \
 inferencer.save_path='PATH_TO_SAVE'

PATH_TO_YOUR_DATASET - path to your dataset folder. It should be located in the data folder and contain the audio, mouth and mouth_embeds folders.

PATH_TO_SAVE - path where you want to save your files (they will be located in 'data/saved/PATH_TO_SAVE').

If you also want to calculate metrics, add the s1 and s2 folders to your audio folder and then run:

HYDRA_FULL_ERROR=1 python inference.py \
 datasets.inference.part=null +datasets.inference.dataset_dir='PATH_TO_YOUR_DATASET' \
 inferencer.save_path='PATH_TO_SAVE' \
 metrics=inference_metrics
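
When running inference over several datasets, the two commands above can be assembled from Python. This is only a sketch: the helper names `inference_command` and `run_inference` are hypothetical, while the overrides are copied from the commands above. It assumes the script is started from the repository root, and it passes paths unquoted, so a path containing characters that are special in Hydra's override syntax may need quoting.

```python
import os
import subprocess
import sys


def inference_command(dataset_dir, save_path, with_metrics=False):
    """Build the argv list for inference.py using the overrides shown above."""
    cmd = [
        sys.executable,
        "inference.py",
        "datasets.inference.part=null",
        f"+datasets.inference.dataset_dir={dataset_dir}",
        f"inferencer.save_path={save_path}",
    ]
    if with_metrics:
        # Requires the s1 and s2 ground-truth folders in the audio folder.
        cmd.append("metrics=inference_metrics")
    return cmd


def run_inference(dataset_dir, save_path, with_metrics=False):
    """Run inference.py with HYDRA_FULL_ERROR=1; raises if the script fails."""
    env = dict(os.environ, HYDRA_FULL_ERROR="1")
    cmd = inference_command(dataset_dir, save_path, with_metrics)
    return subprocess.run(cmd, env=env, check=True)
```

For example, `run_inference("PATH_TO_YOUR_DATASET", "PATH_TO_SAVE", with_metrics=True)` corresponds to the second command above.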
