-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stabilize the connector and add monitoring #70
Conversation
Please mark whether you used Copilot to assist coding in this PR
|
@@ -65,18 +69,25 @@ def create_thread_and_run(self): | |||
return self.thread | |||
|
|||
def run(self): | |||
# Start the micro monitoring thread | |||
monitoring_thread = threading.Thread(target=self.run_micro_monitoring) | |||
monitoring_thread.start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls set all thread to daemon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled daemon for threads.
a2befd4
to
264fce2
Compare
…78/add_monitoring
SonarQube Quality Gate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has the new changes been tested with cognitive-mesh?
Tested by a few agents. A comprehensive test may be required. Also, the CM configuration can be updated to take advantages of the new changes such as log rotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please ensure you can run cognitive-mesh locally smoothly with all your changes and test all the functionalities before merging the PR.
SonarQube Quality Gate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Remove dependency to Langchain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Add log rotation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Split a long payload to small chunks rather than getting an array of chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Remove dependency to langchain and litellm. Upgrade the pubsub library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Support log rotation and backup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: remove payload from logs due to security.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Add daemon to threads for terminating them gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Gracefully terminate AI Connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal: Set daemon for threads to terminate them gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goal:
- Gracefully terminate the connector.
- Configure log rotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested! ✅
LGTM
What is the purpose of this change?
This change covers a couple of issues and features:
1. Implement a monitoring system to periodically check the status of workflows.
2. Control log file sizes and archive older logs to prevent the AI Connector from breaking due to large log files.
3. Fix the issue where the connector aggressively logs errors in an infinite loop when the broker stops. This happens because the error log is written continuously without a delay until the broker reconnects.
4. Ensure the connector shuts down gracefully by addressing two main issues:
- The connector does not handle SIGINT and SIGTERM signals properly.
- Threads in sleep mode cannot be terminated by signals.
5. Remove dependencies on Langchain and Litellm from the connector’s core, as these libraries are not used directly by the connector’s core functionality.
6. Prevent logs from recording confidential information, such as keys and user payloads, to address security concerns.
7. Logging the connection/reconnection attempts. The PubSub API does not log attempts.
8. Getting the connection status using an API
How is this accomplished?
1. Monitoring:
1.1. Added a monitoring singleton class that is instantiated once the connector is created.
1.2. Updated the component base to run the monitoring per component and periodically get metrics of the component. Each metric is a dictionary that contains [flow name, flow index, component name, component index, time stamp, metric].
1.3. Each component defines the actual metrics. Herein, updated the broker base component to get the ACK metric. We do not need other component's metrics at this moment.
1.4. Updated the SolaceMessage and ComponentBase classes to manage the state of connection. Then, the componentBase pulls the connection status of each broker component. Finally, the monitoring module aggregates all brokers' status and result the final connection status of the AI connector. The AI Connector is connected when all broker modules are connected.
1.5 For logging connection/reconnection attempt, we added some threads to the connection/reconnection handler to log attempts.
2. Log files: Updated the logger in the log.py to support the below features:
2.1. Enable the log rotating. It gets the maximum size of logs and the number of archived files. It configures the logger to consider these parameters.
2.1. Set a pattern for the archived files (e.g., ..). Note that the main log should be the log file name rather than the pattern.
3. Aggressive log: Added the grow and reset sleep time methods. These methods exponentially increase the sleep time when they fail to connect to the broker and then reset the sleep time when it can connect.
4. Graceful shutdown: Handled SIGINT and SIGTERM in the main.py and replaced sleep.time methods with the stop_signal.wait, which breaks threads while they are in the sleep mode.
5. Langchain and Litellm dependencies: Removed these packages from the component-> init.py, requirements.txt and pyproject.toml files.
6. Log security: Removed payloads from logs.
Anything reviews should focus on/be aware of?