2. Project Architecture
Muhammad Waleed Hassan edited this page Feb 10, 2025 · 1 revision
The system is composed of several key components that work together to process user input, retrieve relevant context, generate a response, and return audio output. The architecture is modular, so each component (speech processing, context retrieval, content generation, and audio synthesis) can be maintained and updated independently.
- Endpoints:
  - /: Serves the main interface.
  - /process_text: Receives JSON requests containing text queries.
  - /process_voice: Handles file uploads for voice queries.
- Role: Acts as the central hub that orchestrates incoming requests, processes input, and returns responses.
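
The three endpoints above can be sketched as a small routing table. This is a framework-agnostic illustration, not the project's actual server code: the HTTP methods, the `query` field name, the handler names, and the response shapes are all assumptions made for the example.

```python
import json
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]  # (HTTP status, response body)


def handle_index(_body: bytes) -> Response:
    # Placeholder for serving the main interface.
    return 200, "<html>main interface</html>"


def handle_process_text(body: bytes) -> Response:
    # Expects a JSON object; the "query" field name is an assumption.
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "invalid JSON"})
    query = payload.get("query", "") if isinstance(payload, dict) else ""
    if not isinstance(query, str) or not query.strip():
        return 400, json.dumps({"error": "missing 'query'"})
    return 200, json.dumps({"query": query.strip()})


def handle_process_voice(body: bytes) -> Response:
    # Expects the raw bytes of the uploaded audio file.
    if not body:
        return 400, json.dumps({"error": "no audio uploaded"})
    return 200, json.dumps({"received_bytes": len(body)})


# Assumed method/path pairs; the page lists only the paths.
ROUTES: Dict[Tuple[str, str], Callable[[bytes], Response]] = {
    ("GET", "/"): handle_index,
    ("POST", "/process_text"): handle_process_text,
    ("POST", "/process_voice"): handle_process_voice,
}


def dispatch(method: str, path: str, body: bytes = b"") -> Response:
    # Central hub: look up the handler and delegate to it.
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    return handler(body)
```

In a real deployment the handlers would pass the query on to the retrieval, generation, and TTS steps described below rather than echoing the input.
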
- Service: Google Cloud Speech-to-Text.
- Functionality: Converts uploaded audio files (in WAV format) into text, which is then used for further processing.
- Error Handling: If the transcription fails or returns an empty result, the system provides a fallback message indicating that the speech was not recognized.
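
The error-handling rule above might look roughly like the following. The speech client is passed in as a plain callable so the sketch stays independent of any particular SDK; `transcribe_or_fallback`, `recognize`, and the fallback wording are hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical wording; the page does not give the exact fallback message.
FALLBACK_MESSAGE = "Sorry, the speech was not recognized."


def transcribe_or_fallback(audio: bytes, recognize: Callable[[bytes], str]) -> str:
    """Return the transcript, or a fallback message on failure or empty output.

    `recognize` stands in for whatever function wraps the speech service.
    """
    try:
        transcript = recognize(audio)
    except Exception:
        # Any transcription failure is mapped to the fallback message.
        return FALLBACK_MESSAGE
    transcript = (transcript or "").strip()
    return transcript if transcript else FALLBACK_MESSAGE
```

Injecting the recognizer also makes the fallback path easy to exercise in tests without calling the cloud service.
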
#### Embedding Generation:
- Uses Google’s Generative AI API to convert the user’s query into an embedding vector.
#### Context Retrieval:
- Queries a BigQuery table that stores precomputed context embeddings.
- Calculates cosine similarity between the user embedding and stored embeddings to fetch the top relevant context chunks.
- Purpose: To supply the generative AI with additional context for generating informed and relevant answers.
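
As a minimal sketch of the ranking step, the snippet below scores stored chunks against a query embedding by cosine similarity and keeps the top matches. It assumes the rows have already been fetched as `(text, embedding)` pairs; whether the project computes similarity in Python or inside the BigQuery query is not stated on this page, and `top_k_chunks` and the default `k=3` are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_embedding: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For large tables, pushing the similarity computation into the database avoids transferring every embedding to the application; the in-memory version here is only meant to show the arithmetic.
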
#### Content Generation Module:
- Service: Google Generative AI.
- Process:
  - Constructs a prompt that includes both the retrieved context chunks and the original user query.
  - Invokes the generative model to produce a detailed, context-aware response.
- Prompt Engineering: The prompt is carefully designed to instruct the model to base its answer on the provided context and, if no context is relevant, to return a fallback response without fabricating details.
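
One way the prompt described above could be assembled is shown below. The instruction wording, the fallback sentence, and the `build_prompt` helper are assumptions for illustration; the project's actual prompt text is not reproduced on this page.

```python
from typing import Sequence

# Hypothetical fallback sentence the model is told to return verbatim.
FALLBACK_ANSWER = "I don't have enough information to answer that."


def build_prompt(query: str, context_chunks: Sequence[str]) -> str:
    """Combine retrieved context and the user query into a single prompt."""
    # Drop blank chunks first so the numbering has no gaps.
    chunks = [c.strip() for c in context_chunks if c.strip()]
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    if not context:
        context = "(no context retrieved)"
    return (
        "Answer the question using only the context below.\n"
        f"If the context is not relevant, reply exactly: {FALLBACK_ANSWER}\n"
        "Do not invent details.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query.strip()}\n"
    )
```

Numbering the chunks keeps them visually separate in the prompt and makes it easier to trace an answer back to a specific chunk when debugging.
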
#### Text-to-Speech (TTS) Module:
- Service: Google Cloud Text-to-Speech.
- Functionality: Converts the generated text response into audio. The resulting audio is encoded in base64 so it can be easily transmitted over JSON.
- Customization: Voice parameters such as pitch, speaking rate, and sample rate are configurable.
#### Environment and Credential Management:
- Configuration: Uses environment variables (loaded via dotenv) to securely manage API keys and credentials.
- Cloud Integration: Supports both environment-based credentials and service account JSON files for Google Cloud integration.
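
Two details from the TTS and credential sections can be sketched with the standard library alone: base64-encoding synthesized audio so it fits in a JSON response, and deciding which credential source is configured. The JSON field names, the `GOOGLE_API_KEY` variable, and both helper names are assumptions for illustration; `GOOGLE_APPLICATION_CREDENTIALS` is the variable Google Cloud client libraries conventionally read for a service account file path.

```python
import base64
import json
import os
from typing import Mapping


def audio_response_json(text: str, audio: bytes) -> str:
    """Package the text answer and its audio as a JSON string.

    Raw bytes are not valid JSON, so the audio is base64-encoded first.
    """
    encoded = base64.b64encode(audio).decode("ascii")
    return json.dumps({"text": text, "audio_base64": encoded})


def credential_source(env: Mapping[str, str] = os.environ) -> str:
    """Report which credential source is configured in the environment."""
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "service_account_file"
    if env.get("GOOGLE_API_KEY"):  # hypothetical variable name
        return "api_key"
    return "unconfigured"
```

A client would reverse the encoding with `base64.b64decode` (or the browser equivalent) before playing the audio. When a `.env` file is used, it would be loaded into the environment before `credential_source` is called.
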