[飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing #905

GreatV · 2024-12-23T11:43:12Z

This pull request introduces three new classes (DataAnalyzer, DataAugmentor, and DataCleaner) to the paddlemix/datacopilot/example/data_insight module, enhancing the functionalities for data analysis, augmentation, and cleaning. The most important changes include the implementation of methods for analyzing datasets, augmenting data, and cleaning datasets by handling anomalies, detecting duplicates, and filtering by quality.

Data Analysis Enhancements:

paddlemix/datacopilot/example/data_insight/data_analysis.py: Introduced the DataAnalyzer class, which includes methods for analyzing dataset statistics, image quality, text quality, and image-text matching using the CLIP model.

Data Augmentation Enhancements:

paddlemix/datacopilot/example/data_insight/data_augmentation.py: Added the DataAugmentor class, which provides methods for augmenting images and text, as well as combining original and augmented samples.

Data Cleaning Enhancements:

paddlemix/datacopilot/example/data_insight/data_cleaning.py: Introduced the DataCleaner class, which includes methods for handling anomalies, detecting duplicates, and filtering samples based on quality metrics.

paddle-bot · 2024-12-23T11:43:16Z

Thanks for your contribution!

add enhance_paddlemix_llava

f8aaff9

paddle-bot bot added the contributor label Dec 23, 2024

paddle-bot bot assigned lyuwenyu Dec 23, 2024

GreatV changed the title ~~[飞桨多模态大模型套件PaddleMIX开发大赛] add enhance_paddlemix_llava~~ [飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing Dec 23, 2024

luotao1 added the HappyOpenSource Pro 快乐开源issue与PR，更具挑战的任务 label Dec 24, 2024

luotao1 self-assigned this Dec 24, 2024

GreatV added 4 commits January 6, 2025 07:13

add impl for data enhance

b37cc6a

update data analysis

c29874c

update test_pipeline

436e71f

add more comments

fcc6252

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing #905

[飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing #905

GreatV commented Dec 23, 2024 •

edited

Loading

paddle-bot bot commented Dec 23, 2024

[飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing #905

Are you sure you want to change the base?

[飞桨多模态大模型套件PaddleMIX开发大赛] RFC: enhance paddlemix llava data processing #905

Conversation

GreatV commented Dec 23, 2024 • edited Loading

Data Analysis Enhancements:

Data Augmentation Enhancements:

Data Cleaning Enhancements:

paddle-bot bot commented Dec 23, 2024

GreatV commented Dec 23, 2024 •

edited

Loading