Note: kokoro 0.8.4 does not have the pickle issue. Install it with: pip install kokoro==0.8.4
A custom node collection for ComfyUI that integrates the Kokoro TTS (Text-to-Speech) system with voice modification capabilities. It generates natural-sounding speech and applies voice effects within ComfyUI workflows.

- Multiple Language Support: English (US and UK) voices
- Voice Selection: 27+ voices to choose from (male and female options)
- Voice Blending: Combine two different voices with adjustable blend ratio
- Speed Control: Adjust speech rate from 0.5x to 2.0x
- GPU Acceleration: Utilize GPU for faster generation (with fallback to CPU)
- Voice Morphing: Transform voices into different characters (Child, Teen, Elder, etc.)
- Pitch and Formant Control: Adjust pitch and formant independently
- Effects Processing: Apply various audio effects (reverb, echo, distortion, etc.)
- Presets System: One-click voice transformations with predefined settings
- Character Effects: Special voice effects like Robot, Telephone, Megaphone, etc.
- ComfyUI installed and working
- Python 3.8 or newer
- PyTorch 2.0+ (already included with ComfyUI)
- Clone this repository into your ComfyUI custom_nodes directory:
cd ComfyUI/custom_nodes
git clone https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS.git
- Run the installer script:
cd ComfyUI-Geeky-Kokoro-TTS
python install.py
The installer will:
- Detect your ComfyUI installation
- Install required dependencies (including audio processing libraries)
- Download the Kokoro model files (if needed)
- Set up both the TTS and Voice Mod nodes
- Clone this repository into your ComfyUI custom_nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS.git geeky_kokoro_tts
- Install the required dependencies:
cd geeky_kokoro_tts
pip install -r requirements.txt
# Optional but recommended for better audio processing:
pip install resampy==0.4.2
pip install "librosa>=0.10.0"
- Download the required model files manually:
mkdir -p models
cd models
# Download the model file (about 83 MB)
wget https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/kokoro-v1.0.onnx
# Download the voices file (about 1.3 MB)
wget https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/voices-v1.0.bin
- Restart ComfyUI
- resampy installation failures: If resampy installation fails, try installing numba first:
pip install numba
pip install resampy==0.4.2
- librosa issues: If librosa fails to install, the Voice Mod node falls back to basic implementations. To retry the install with pinned versions:
pip install llvmlite==0.39.0
pip install numba==0.56.4
pip install librosa==0.10.0
If model download fails, you can manually download from these URLs:
- Kokoro model: https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/kokoro-v1.0.onnx
- Voices file: https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/voices-v1.0.bin
Place these files in the ComfyUI/custom_nodes/geeky_kokoro_tts/models/ directory.
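To confirm that a manual download landed in the right place, a small check like the one below can help. It is a minimal sketch and not part of the package: the function name `missing_model_files` is hypothetical, and the file names and directory are taken from the URLs and path given above.

```python
from pathlib import Path

# File names taken from the download URLs above.
REQUIRED_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def missing_model_files(models_dir):
    """Return the required file names that are absent or empty in models_dir."""
    models_dir = Path(models_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = models_dir / name
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path assumed from the instructions above; adjust it to your install.
    print(missing_model_files("ComfyUI/custom_nodes/geeky_kokoro_tts/models"))
```

An empty list means both files were found.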
- Add the "🔊 Geeky Kokoro TTS" node to your workflow
- Connect inputs:
- text: Enter the text you want to convert to speech
- voice: Select a voice from the dropdown
- speed: Adjust the speech rate (0.5 to 2.0)
- use_gpu: Enable GPU acceleration (if available)
- Optional parameters:
- enable_blending: Turn on voice blending
- second_voice: Select a second voice for blending
- blend_ratio: Adjust the mix between primary and secondary voices
- Outputs:
- audio: Connect to audio playback or save nodes
- text_processed: The processed text after normalization
To create a custom voice blend:
- Enable the "enable_blending" toggle
- Select a primary voice (e.g., "🇺🇸 🚺 Heart ❤️")
- Choose a secondary voice (e.g., "🇬🇧 🚹 George")
- Adjust the blend ratio (0.0 to 1.0):
- 1.0 = 100% primary voice
- 0.5 = 50% primary + 50% secondary
- 0.0 = 100% secondary voice
This can create unique voice combinations that aren't available as standard voices.
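The blend ratio semantics listed above match a plain linear interpolation. The sketch below illustrates that reading under the assumption that each voice is represented as a numeric vector of the same shape; the function name `blend_voices` is hypothetical, and the node's actual blending code may differ.

```python
import numpy as np


def blend_voices(primary, secondary, blend_ratio):
    """Linearly interpolate two voice vectors.

    blend_ratio = 1.0 gives the primary voice only,
    blend_ratio = 0.0 gives the secondary voice only.
    """
    if not 0.0 <= blend_ratio <= 1.0:
        raise ValueError("blend_ratio must be between 0.0 and 1.0")
    primary = np.asarray(primary, dtype=np.float32)
    secondary = np.asarray(secondary, dtype=np.float32)
    if primary.shape != secondary.shape:
        raise ValueError("voice vectors must have the same shape")
    return blend_ratio * primary + (1.0 - blend_ratio) * secondary
```

With toy vectors `[1.0, 0.0]` and `[0.0, 1.0]`, a ratio of 0.5 yields an equal mix of both.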
⚠️ Note: The Voice Mod node is currently in beta. Some effects may not work as expected and may be subject to change.
- Add the "🔊 Geeky Kokoro Advanced Voice" node to your workflow
- Connect the audio input (typically from the TTS node)
- Choose between:
- Presets: Quick voice transformations (e.g., "Chipmunk", "Robot Voice", "Podcast")
- Custom Settings: Manually configure individual effects
The Voice Mod node organizes effects into logical groups that can be enabled independently:
- Voice Morphing: Transform voice character (Child, Masculine, Elder, etc.)
- Pitch & Formant: Adjust pitch, formant shift, and auto-tune
- Time Effects: Change playback speed or add vibrato
- Spatial Effects: Add reverb and echo
- Tone Controls: Adjust EQ bands (bass, mids, treble) and add harmonics
- Effects: Apply distortion, tremolo, bitcrush, and noise reduction
- Dynamics: Compression and analog warmth simulation
- Character Effects: Special transformations like Robot, Telephone, Whisper, etc.
For quick voice transformations, use the preset system:
- Select a preset from the dropdown (e.g., "Robot Voice", "Chipmunk", "Deep Voice")
- Adjust the preset_strength parameter (0.0 to 1.0) to control intensity
- Set effect_blend (0.0 to 1.0) to mix with the original voice
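One common way to implement a control like effect_blend is a dry/wet crossfade between the unprocessed and processed audio. The sketch below assumes that 0.0 means original only and 1.0 means fully processed; the source does not state the direction, and the function name `mix_dry_wet` is hypothetical.

```python
import numpy as np


def mix_dry_wet(original, processed, effect_blend):
    """Crossfade between the original (dry) and processed (wet) signals.

    Assumed convention: 0.0 returns the original audio,
    1.0 returns the fully processed audio.
    """
    effect_blend = float(np.clip(effect_blend, 0.0, 1.0))
    original = np.asarray(original, dtype=np.float32)
    processed = np.asarray(processed, dtype=np.float32)
    # Effects can change the length slightly, so mix over the shared span.
    n = min(len(original), len(processed))
    return (1.0 - effect_blend) * original[:n] + effect_blend * processed[:n]
```

Trimming to the shorter signal is a design choice for this sketch; padding the shorter one would preserve effect tails such as reverb.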
The Voice Mod node uses a multi-layered approach to audio processing:
- Primary Implementation: Uses librosa and resampy for high-quality processing
- Fallback Layer 1: Uses scipy-based algorithms when resampy is unavailable
- Fallback Layer 2: Uses numpy-only implementations when scipy is unavailable
This ensures the node can function even when optional dependencies are missing, but with potentially reduced quality.
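The three tiers described above can be selected with a simple availability check. This is a minimal sketch of the idea, not the node's actual code: the function name `pick_audio_backend` and the returned labels are hypothetical.

```python
import importlib.util


def _has_module(name):
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_audio_backend(is_available=_has_module):
    """Return a label for the best processing tier that can run here."""
    if is_available("librosa") and is_available("resampy"):
        return "librosa+resampy"  # primary implementation
    if is_available("scipy"):
        return "scipy"  # fallback layer 1
    return "numpy"  # fallback layer 2


if __name__ == "__main__":
    print(f"Voice Mod backend: {pick_audio_backend()}")
```

Passing the availability check as a parameter makes each tier easy to exercise in tests without uninstalling anything.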
The audio_utils.py file contains fallback implementations for when specialized audio libraries aren't available:
# Example of the phase vocoder fallback when resampy isn't available
import numpy as np
# stft and istft are assumed to be STFT helpers defined elsewhere in audio_utils.py
def stft_phase_vocoder(audio, sr, n_steps, bins_per_octave=12):
    """
    Phase vocoder pitch shifting using STFT, more advanced than simple resampling
    """
    # Negligible shift: return the input unchanged
    if abs(n_steps) < 0.01:
        return audio
    # Convert steps to rate (positive steps give a rate below 1.0)
    rate = 2.0 ** (-n_steps / bins_per_octave)
    # STFT parameters
    n_fft = 2048
    hop_length = n_fft // 4
    # Compute STFT
    D = stft(audio, n_fft=n_fft, hop_length=hop_length)
    # Create new spectrogram with adjusted phase progression
    time_steps = D.shape[1]
    new_time_steps = int(time_steps / rate)
    # Phase advance
    phase_adv = np.linspace(0, np.pi * rate, D.shape[0])[:, np.newaxis]
    # Time-stretch and phase manipulation logic... (elided here; produces D_stretch)
    # Invert STFT
    y_shift = istft(D_stretch, hop_length=hop_length, length=len(audio))
    return y_shift
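The step-to-rate conversion in that function can be checked in isolation. The sketch below copies the two formulas from the excerpt into standalone helpers; the helper names `steps_to_rate` and `stretched_frames` are hypothetical.

```python
def steps_to_rate(n_steps, bins_per_octave=12):
    """Convert a pitch shift in steps to the stretch rate used in the excerpt."""
    return 2.0 ** (-n_steps / bins_per_octave)


def stretched_frames(time_steps, n_steps, bins_per_octave=12):
    """Number of STFT frames after stretching, as computed in the excerpt."""
    return int(time_steps / steps_to_rate(n_steps, bins_per_octave))


if __name__ == "__main__":
    for steps in (-12, 0, 12):
        print(steps, steps_to_rate(steps), stretched_frames(100, steps))
```

Shifting up one octave (12 steps) gives a rate of 0.5, so the spectrogram is stretched to twice as many frames; shifting down one octave gives a rate of 2.0 and half as many frames.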
The voice morphing system uses a combination of effects to create realistic voice transformations:
# Simplified example of voice morphing parameters
morph_params = {
    "Child": {
        "pitch_shift": 4.0,
        "formant_shift": 2.0,
        "brightness": 0.4,
        "breathiness": 0.3,
        "bass_boost": -0.3,
        "mid_boost": 0.3,
        "compression": 0.2
    },
    "Elder": {
        "pitch_shift": -1.0,
        "formant_shift": -0.5,
        "brightness": -0.2,
        "breathiness": 0.4,
        "bass_boost": 0.2,
        "mid_boost": -0.2,
        "compression": 0.0,
        "tremolo": 0.2
    },
    # Other voice types...
}
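Not every voice type defines every parameter (for example, "Child" has no "tremolo" entry), so a consumer of this table needs defaults. The sketch below shows one way to resolve a morph type into a complete parameter set and scale it. The `intensity` argument, the zero defaults, and the function name `resolve_morph` are assumptions for illustration, not the node's confirmed behavior.

```python
# Values copied from the simplified table above.
MORPH_PARAMS = {
    "Child": {"pitch_shift": 4.0, "formant_shift": 2.0, "brightness": 0.4,
              "breathiness": 0.3, "bass_boost": -0.3, "mid_boost": 0.3,
              "compression": 0.2},
    "Elder": {"pitch_shift": -1.0, "formant_shift": -0.5, "brightness": -0.2,
              "breathiness": 0.4, "bass_boost": 0.2, "mid_boost": -0.2,
              "compression": 0.0, "tremolo": 0.2},
}

# Assumed neutral value for any parameter a voice type does not set.
PARAM_NAMES = ("pitch_shift", "formant_shift", "brightness", "breathiness",
               "bass_boost", "mid_boost", "compression", "tremolo")


def resolve_morph(morph_type, intensity=1.0):
    """Return a full parameter dict for morph_type, scaled by intensity (0 to 1)."""
    if morph_type not in MORPH_PARAMS:
        raise KeyError(f"unknown morph type: {morph_type}")
    intensity = max(0.0, min(1.0, intensity))
    params = {name: 0.0 for name in PARAM_NAMES}
    params.update(MORPH_PARAMS[morph_type])
    return {name: value * intensity for name, value in params.items()}
```

For example, `resolve_morph("Child", 0.5)` halves every value, giving a pitch shift of 2.0 and a tremolo of 0.0.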
The TTS node leverages GPU acceleration for the Kokoro model when available:
# GPU loading in TTS node with fallback
if use_gpu and True not in self.MODEL and torch.cuda.is_available():
    try:
        with self.MODEL_LOCK:
            self.MODEL[True] = KModel().to('cuda').eval()
    except Exception as e:
        print(f"GPU load failed: {e}. Using CPU.")
        use_gpu = False
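The pattern in that excerpt (lazy loading, a per-device cache keyed by True/False, a lock, and a CPU fallback) can be sketched without PyTorch by passing the loaders in as callables. The class name `ModelCache` and its constructor arguments are hypothetical; in the real node the loaders would correspond to calls such as `KModel().to('cuda').eval()` and the availability check to `torch.cuda.is_available`.

```python
import threading


class ModelCache:
    """Lazily load one model per device and fall back to CPU on GPU failure."""

    def __init__(self, load_cpu, load_gpu, gpu_available):
        self._loaders = {False: load_cpu, True: load_gpu}
        self._gpu_available = gpu_available
        self._models = {}  # keyed by use_gpu, like self.MODEL in the excerpt
        self._lock = threading.Lock()

    def get(self, use_gpu):
        """Return (model, used_gpu) for the requested device."""
        use_gpu = bool(use_gpu) and self._gpu_available()
        with self._lock:
            if use_gpu and True not in self._models:
                try:
                    self._models[True] = self._loaders[True]()
                except Exception as exc:
                    print(f"GPU load failed: {exc}. Using CPU.")
                    use_gpu = False
            if not use_gpu and False not in self._models:
                self._models[False] = self._loaders[False]()
            return self._models[use_gpu], use_gpu
```

Returning the device actually used lets the caller move input tensors to the matching device after a fallback.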
Voice Name | Description |
---|---|
🇺🇸 🚺 Heart ❤️ | Female US English voice |
🇺🇸 🚺 Bella 🔥 | Female US English voice |
🇺🇸 🚺 Nicole 🎧 | Female US English voice |
🇺🇸 🚺 Aoede | Female US English voice |
🇺🇸 🚺 Kore | Female US English voice |
🇺🇸 🚺 Sarah | Female US English voice |
🇺🇸 🚺 Nova | Female US English voice |
🇺🇸 🚺 Sky | Female US English voice |
🇺🇸 🚺 Alloy | Female US English voice |
🇺🇸 🚺 Jessica | Female US English voice |
🇺🇸 🚺 River | Female US English voice |
🇺🇸 🚹 Michael | Male US English voice |
🇺🇸 🚹 Fenrir | Male US English voice |
🇺🇸 🚹 Puck | Male US English voice |
🇺🇸 🚹 Echo | Male US English voice |
🇺🇸 🚹 Eric | Male US English voice |
🇺🇸 🚹 Liam | Male US English voice |
🇺🇸 🚹 Onyx | Male US English voice |
🇺🇸 🚹 Adam | Male US English voice |
Voice Name | Description |
---|---|
🇬🇧 🚺 Emma | Female UK English voice |
🇬🇧 🚺 Isabella | Female UK English voice |
🇬🇧 🚺 Alice | Female UK English voice |
🇬🇧 🚺 Lily | Female UK English voice |
🇬🇧 🚹 George | Male UK English voice |
🇬🇧 🚹 Fable | Male UK English voice |
🇬🇧 🚹 Lewis | Male UK English voice |
🇬🇧 🚹 Daniel | Male UK English voice |
The Voice Mod node includes several presets for common voice transformations:
Preset | Description |
---|---|
Chipmunk | High-pitched, child-like voice |
Deep Voice | Low-pitched, authoritative voice |
Robot Voice | Mechanical, synthesized voice |
Phone Call | Classic telephone audio quality |
Elder Voice | Aged voice with characteristic tremolo |
Ethereal | Dreamlike, reverb-heavy voice |
Monster | Deep, distorted, threatening voice |
Ghost | Eerie, spectral voice with reverb |
Podcast | Optimized for clarity and warmth (like professional audio) |
Movie Trailer | Deep, compressed voice for dramatic announcements |
- Auto-tune effect: Requires librosa; falls back to a basic chorus effect when unavailable
- Formant shifting: Most effective with librosa installed
- High-quality reverb: Best results with scipy installed
- Voice morphing: Some combinations of effects may produce unexpected results
- Processing time: Some effects (especially reverb and auto-tune) can be CPU-intensive
- Memory Optimization: Process shorter text segments when working with complex Voice Mod effects
- Voice Consistency: Use the same voice and settings for multiple text segments to maintain consistency
- Custom Voices: Try different blend ratios between voices to create unique combinations
- Pitch Effects: Subtle pitch adjustments (+/- 1.0) often sound more natural than extreme values
- GPU Acceleration: Use GPU for faster TTS processing, especially with longer texts
- Fallback Quality: Install resampy and librosa for the best audio quality in Voice Mod effects
- GPU Issues: If you encounter GPU-related errors, try switching to CPU mode by unchecking the "use_gpu" option
- Memory Errors: If you run into memory issues, try processing shorter text segments
- Audio Distortion: For distorted output, try reducing effect intensities or disabling some effect groups
- Missing Dependencies: Check the console for warnings about missing libraries (librosa, resampy, etc.)
- Model Load Errors: Ensure the model files are correctly installed in the models directory
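For the advice above about processing shorter text segments, one simple approach is to split the input at sentence boundaries before sending each piece to the TTS node. This is a minimal sketch: the function name `split_text` is hypothetical and the 300-character default is an arbitrary assumption, not a limit documented by the package.

```python
import re


def split_text(text, max_chars=300):
    """Split text at sentence boundaries into chunks of roughly max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized with the same voice and settings, which also follows the voice consistency tip.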
Contributions are welcome! Feel free to submit issues or pull requests if you have improvements or bug fixes.
This project is licensed under the MIT License - see the LICENSE file for details.
- Kokoro TTS Project - The foundation of the TTS engine
- ComfyUI - The UI framework this node integrates with
- librosa - Audio processing library used for high-quality effects
- scipy - Scientific computing library used for audio signal processing
- resampy - High-quality audio resampling library