Design user experience of subset selection #510

Open · bbrowning opened this issue Jan 27, 2025 · 3 comments

@bbrowning (Contributor)
How will we expose subset selection to users? Is it part of data mixing and the Recipe YAML files? Is it a separate step outside of data mixing? Will we add new top-level Python APIs to expose it, or change existing ones such as mix_datasets?

The purpose of this issue is to figure these things out and gather feedback on the resulting design.
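
For discussion's sake, here is a minimal sketch of one shape a new top-level API could take. The name subset_datasets and its signature are hypothetical, purely to make the question concrete:

```python
# Purely illustrative: one possible shape for a new top-level entry point.
# `subset_datasets` and its parameters are hypothetical names for discussion;
# the real design may instead extend mix_datasets or the Recipe YAML format.
from typing import Union


def subset_datasets(
    input_files: list[str],
    subset_sizes: list[Union[int, float]],  # ints: absolute counts; floats: percentages
    output_dir: str = "output",
) -> None:
    """Select representative subsets of the inputs and write them to output_dir."""
    raise NotImplementedError
```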

@eshwarprasadS (Contributor) commented Jan 30, 2025

Thanks for this issue, @bbrowning

Part of this issue is also covered by #528.

@eshwarprasadS (Contributor)

Design for Exposing Subset Selection

Below are some design decisions and open questions regarding how to expose subset selection to users:

  • Entry Point to the Algorithm:

  • Parameter Configuration via YAML:

    • All parameters required for subset selection will be configurable through a YAML file (see the sketch after this list).
    • Required Parameters:
      • input_files: List of input files to process.
      • subset_sizes: List of subset sizes (can be integers for absolute counts or floats for percentages).
    • Basic Parameters (optional, but worth adjusting for your use case):
      • output_dir: Directory to save output files (default: "output").
      • batch_size: Size of batches for processing (default: 100000).
      • num_folds: Number of folds for subset selection (default: 50).
      • combine_files: Whether to combine input files before processing (default: False).
    • Advanced Parameters (for advanced users):
      • instruction: Instruction for the encoder.
      • query_description: Description for queries.
      • templates: Dictionary of templates for formatting text.
      • template_name: Name of the template to use.
      • num_gpus: Number of GPUs to use.
      • seed: Random seed.
      • max_retries: Maximum number of retries for failed operations.
      • retry_delay: Delay between retries in seconds.
      • encoder_type: Type of encoder to use.
      • encoder_model: Specific model to use for encoding.
    • Note: It's recommended that users only modify the basic parameters unless they have advanced needs and are familiar with the underlying mechanics.
  • Prompt Templates and Model Instructions:

    • Currently, prompt templates and model instructions for embedding generation are hardcoded in subset_selection.py.
    • Advanced users can override them via the YAML configuration, but modifying them is discouraged unless you are very familiar with how they work.
  • Saving Artifacts:

    • Embeddings and other intermediate artifacts (e.g., embedding batches) will be saved to disk.
    • For now, all output is stored under the provided output_dir (open question: what should the default location be?).
    • Open Question: Should we consider a more flexible storage strategy for these artifacts, or is the current approach sufficient?
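
To make the parameter list above concrete, here is a minimal sketch of loading and validating such a YAML configuration. The file contents, the example values, and the load_subset_config helper are illustrative assumptions rather than a final schema:

```python
# Minimal sketch: parse a subset-selection YAML config and check required keys.
# Parameter names mirror the list above; the schema itself is not final.
import yaml  # requires PyYAML

EXAMPLE_CONFIG = """
# Required parameters
input_files:
  - data/dataset_a.jsonl
  - data/dataset_b.jsonl
subset_sizes:
  - 1000   # integer: an absolute sample count
  - 0.05   # float: a percentage of the dataset

# Basic parameters (defaults from the list above)
output_dir: output
batch_size: 100000
num_folds: 50
combine_files: false

# Advanced parameters (values here are only examples)
seed: 42
num_gpus: 1
"""

def load_subset_config(text: str) -> dict:
    """Parse YAML text and verify the required parameters are present."""
    config = yaml.safe_load(text)
    for key in ("input_files", "subset_sizes"):
        if key not in config:
            raise ValueError(f"missing required parameter: {key}")
    return config

config = load_subset_config(EXAMPLE_CONFIG)
print(config["subset_sizes"])  # [1000, 0.05]
```

Centralizing required-key validation like this would let a CLI entry point and any Python API share the same config handling.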

I'd appreciate feedback on these decisions, especially on the integration approach with data mixing and on whether this meets the needs for exposing subset selection. Any thoughts on the parameter configuration or artifact storage strategy would also be very helpful.

@eshwarprasadS (Contributor)

cc @khaledsulayman: you might want to check this out and see whether it aligns with your approach to resolving #528.
