[PaddlePaddle Multimodal Large Model Suite PaddleMIX Development Competition] RFC: enhance paddlemix llava data processing #905
This pull request introduces three new classes (DataAnalyzer, DataAugmentor, and DataCleaner) to the paddlemix/datacopilot/example/data_insight module, enhancing the functionalities for data analysis, augmentation, and cleaning. The most important changes include the implementation of methods for analyzing datasets, augmenting data, and cleaning datasets by handling anomalies, detecting duplicates, and filtering by quality.

Data Analysis Enhancements:
paddlemix/datacopilot/example/data_insight/data_analysis.py: Introduced the DataAnalyzer class, which includes methods for analyzing dataset statistics, image quality, text quality, and image-text matching using the CLIP model.

Data Augmentation Enhancements:
paddlemix/datacopilot/example/data_insight/data_augmentation.py: Added the DataAugmentor class, which provides methods for augmenting images and text, as well as combining original and augmented samples.

Data Cleaning Enhancements:
paddlemix/datacopilot/example/data_insight/data_cleaning.py: Introduced the DataCleaner class, which includes methods for handling anomalies, detecting duplicates, and filtering samples based on quality metrics.
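
The analysis and cleaning steps described above can be illustrated with a small standalone sketch. This is not the code in this pull request. It assumes LLaVA-style samples, meaning dicts with an "image" path and a "conversations" list of {"from", "value"} turns. The function names and the min_words threshold are hypothetical, and a plain word count stands in for whatever quality metrics DataCleaner actually uses.

```python
import hashlib


def sample_text(sample):
    """Join all conversation turns of one sample into a single string."""
    return " ".join(turn.get("value", "") for turn in sample.get("conversations", []))


def analyze(samples):
    """Collect simple dataset statistics (a stand-in for the analysis step)."""
    lengths = [len(sample_text(s).split()) for s in samples]
    return {
        "num_samples": len(samples),
        "missing_image": sum(1 for s in samples if not s.get("image")),
        "avg_words": sum(lengths) / len(lengths) if lengths else 0.0,
    }


def drop_anomalies(samples):
    """Remove samples with no image path or no non-empty conversation text."""
    return [s for s in samples if s.get("image") and sample_text(s).strip()]


def drop_duplicates(samples):
    """Keep the first sample for each (image, normalized text) fingerprint."""
    seen, kept = set(), []
    for s in samples:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(sample_text(s).lower().split())
        raw_key = f"{s.get('image')}|{normalized}".encode("utf-8")
        key = hashlib.sha256(raw_key).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def filter_by_quality(samples, min_words=3):
    """Hypothetical quality filter: keep samples with at least min_words words."""
    return [s for s in samples if len(sample_text(s).split()) >= min_words]


def clean(samples, min_words=3):
    """Run the three cleaning steps in order: anomalies, duplicates, quality."""
    return filter_by_quality(drop_duplicates(drop_anomalies(samples)), min_words)
```

Calling analyze(samples) before and after clean(samples) gives a quick view of how many samples each step removed. Exact-match hashing only catches copies that are identical after normalization; near-duplicate detection would need a similarity measure instead.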