2. Project Architecture

High-Level Overview

The system consists of several components that together process user input, retrieve relevant context, generate a response, and return audio output. The architecture is modular, so each component (speech processing, context retrieval, content generation, and audio synthesis) can be maintained and updated independently.

Components and Workflow

Flask Web Application:

Endpoints:
- /: Serves the main interface.
- /process_text: Receives JSON requests containing text queries.
- /process_voice: Handles file uploads for voice queries.
Role: Acts as the central hub that orchestrates incoming requests, processes input, and returns responses.
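
The three endpoints above can be sketched as a minimal Flask application. This is an illustration under stated assumptions, not the project's actual code: the JSON key `text`, the upload field name `audio`, and the helpers `answer_query` and `transcribe` are hypothetical placeholders for the downstream modules.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def transcribe(audio_bytes: bytes) -> str:
    """Hypothetical stand-in for the speech recognition module."""
    return "placeholder transcript"


def answer_query(text: str) -> dict:
    """Hypothetical stand-in for retrieval, generation, and TTS."""
    return {"query": text, "response": f"(answer for: {text})"}


@app.route("/")
def index():
    # The real application serves its main interface here.
    return "main interface placeholder"


@app.route("/process_text", methods=["POST"])
def process_text():
    # silent=True yields None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True) or {}
    text = (payload.get("text") or "").strip()  # "text" key is an assumption
    if not text:
        return jsonify({"error": "No text provided"}), 400
    return jsonify(answer_query(text))


@app.route("/process_voice", methods=["POST"])
def process_voice():
    upload = request.files.get("audio")  # "audio" field name is an assumption
    if upload is None:
        return jsonify({"error": "No audio file uploaded"}), 400
    return jsonify(answer_query(transcribe(upload.read())))
```

Keeping the route handlers thin and delegating to helper functions matches the modular design described in the overview: each helper can be replaced without touching the routing.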

Speech Recognition Module:

Service: Google Cloud Speech-to-Text.
Functionality: Converts uploaded audio files (in WAV format) into text, which is then used for further processing.
Error Handling: If the transcription fails or returns an empty result, the system provides a fallback message indicating that the speech was not recognized.
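
The fallback behavior can be sketched independently of the cloud client. In the sketch below, the `transcribe` callable stands in for the Google Cloud Speech-to-Text call, and both the function name and the fallback wording are assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical wording; the project's actual fallback message is not shown.
FALLBACK_MESSAGE = "Sorry, the speech was not recognized."


def transcribe_with_fallback(
    audio: bytes, transcribe: Callable[[bytes], str]
) -> Tuple[str, bool]:
    """Return (text, recognized). Falls back on errors or empty transcripts."""
    try:
        text = (transcribe(audio) or "").strip()
    except Exception:
        # Treat any transcription failure the same as an empty result.
        text = ""
    if not text:
        return FALLBACK_MESSAGE, False
    return text, True
```

Returning a flag alongside the text lets the caller skip retrieval and generation when nothing was recognized.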

Embedding and Retrieval Module:

Embedding Generation:

Uses Google’s Generative AI API to convert the user’s query into an embedding vector.

Context Retrieval:

Queries a BigQuery table that stores precomputed context embeddings.
Computes the cosine similarity between the query embedding and each stored embedding, then returns the highest-scoring context chunks.
Purpose: To supply the generative AI with additional context for generating informed and relevant answers.
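
The ranking step can be illustrated in plain Python. The text does not say whether the project computes similarity in BigQuery SQL or in application code, so this sketch assumes the rows have already been fetched as `(chunk_text, embedding)` pairs; the function names and the default `k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_embedding: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every stored chunk against the query and keep the k best."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For example, with the query `[1.0, 0.0]` and stored rows `("a", [1.0, 0.0])`, `("b", [0.0, 1.0])`, and `("c", [1.0, 1.0])`, calling `top_k_chunks(..., k=2)` ranks chunk `a` first and chunk `c` second.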

Content Generation Module:

Service: Google Generative AI.
Process:
- Constructs a prompt that includes both the retrieved context chunks and the original user query.
- Invokes the generative model to produce a detailed, context-aware response.
Prompt Engineering: The prompt instructs the model to base its answer on the provided context and, if none of the context is relevant, to return a fallback response rather than fabricate details.
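
A prompt of this shape might be assembled as follows. The exact wording the project uses is not shown, so the instruction text, the fallback sentence, and the name `build_prompt` are all assumptions for illustration.

```python
from typing import Sequence

# Hypothetical wording; the project's actual fallback response is not shown.
FALLBACK_RESPONSE = "I don't have enough information to answer that."


def build_prompt(query: str, context_chunks: Sequence[str]) -> str:
    """Combine retrieved context chunks and the user query into one prompt."""
    if context_chunks:
        context = "\n\n".join(
            f"[{i}] {chunk.strip()}"
            for i, chunk in enumerate(context_chunks, start=1)
        )
    else:
        context = "(no context retrieved)"
    return (
        "Answer the question using only the context below.\n"
        f'If the context is not relevant, reply exactly: "{FALLBACK_RESPONSE}"\n'
        "Do not invent details.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query.strip()}\n"
    )
```

Numbering the chunks keeps them visually separate in the prompt, and stating the fallback sentence verbatim gives the model an explicit alternative to guessing.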

Text-to-Speech (TTS) Module:

Service: Google Cloud Text-to-Speech.
Functionality: Converts the generated text response into audio. The resulting audio bytes are base64-encoded so they can be embedded in a JSON response.
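
The base64 step can be sketched with the standard library alone. The payload keys `response` and `audio_base64` and both function names are assumptions; the synthesis call itself is omitted, so `audio_bytes` stands for whatever the TTS service returned.

```python
import base64
import json


def build_audio_payload(response_text: str, audio_bytes: bytes) -> str:
    """Serialize the text answer and its audio into a JSON string."""
    # JSON cannot carry raw bytes, so the audio is encoded as ASCII text.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"response": response_text, "audio_base64": encoded})


def decode_audio_payload(payload: str) -> bytes:
    """Recover the original audio bytes on the receiving side."""
    return base64.b64decode(json.loads(payload)["audio_base64"])
```

Base64 increases the payload size by roughly a third, which is usually acceptable for short spoken answers.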
Customization: Voice parameters such as pitch, speaking rate, and sample rate are configurable.

Environment and Credential Management:

Configuration: Uses environment variables (loaded via dotenv) to securely manage API keys and credentials.
Cloud Integration: Supports both environment-based credentials and service account JSON files for Google Cloud integration.
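
One way to choose between the two credential sources is sketched below. `GOOGLE_APPLICATION_CREDENTIALS` is the standard variable Google client libraries read for a service account key file; the variable name `GOOGLE_API_KEY`, the function name, and the precedence order are assumptions. The sketch reads from a mapping so it works with `os.environ` after dotenv has loaded a `.env` file.

```python
import os
from typing import Mapping, Optional


def resolve_google_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Describe which credential source is configured in the environment."""
    env = os.environ if env is None else env
    # Standard variable pointing at a service account JSON key file.
    key_file = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file:
        return {"mode": "service_account_file", "path": key_file}
    # Assumed variable name for an API key supplied directly.
    api_key = env.get("GOOGLE_API_KEY")
    if api_key:
        return {"mode": "api_key", "api_key": api_key}
    raise RuntimeError("No Google credentials found in the environment")
```

Failing at startup when neither source is set surfaces configuration mistakes before the first request reaches a cloud service.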
