Design user experience of subset selection #510

Open · bbrowning opened this issue Jan 27, 2025 · 3 comments

@bbrowning (Contributor)
How will we expose subset selection to users? Is it part of data mixing and the Recipe YAML files? Is it a separate step outside of data mixing? Will we add new top-level Python APIs to expose it, or change existing ones such as mix_datasets?

The purpose of this issue is to figure these things out and gather feedback on the resulting design.
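
For discussion's sake, here is a minimal sketch of one shape a new top-level API could take. The name subset_datasets and its signature are hypothetical, purely to make the question concrete:

```python
# Purely illustrative: one possible shape for a new top-level entry point.
# `subset_datasets` and its parameters are hypothetical names for discussion;
# the real design may instead extend mix_datasets or the Recipe YAML format.
from typing import Union


def subset_datasets(
    input_files: list[str],
    subset_sizes: list[Union[int, float]],  # ints: absolute counts; floats: percentages
    output_dir: str = "output",
) -> None:
    """Select representative subsets of the inputs and write them to output_dir."""
    raise NotImplementedError
```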

@eshwarprasadS (Contributor) commented Jan 30, 2025

Thanks for this issue, @bbrowning

Part of this issue is also covered by #528.

@eshwarprasadS (Contributor)

Design for Exposing Subset Selection

Below are some design decisions and open questions regarding how to expose subset selection to users:

  • Entry Point to the Algorithm:

  • Parameter Configuration via YAML:

    • All parameters required for subset selection will be configurable through a YAML file (see the sketch after this list).
    • Required Parameters:
      • input_files: List of input files to process.
      • subset_sizes: List of subset sizes (can be integers for absolute counts or floats for percentages).
    • Basic Parameters (optional, but worth adjusting for your use case):
      • output_dir: Directory to save output files (default: "output").
      • batch_size: Size of batches for processing (default: 100000).
      • num_folds: Number of folds for subset selection (default: 50).
      • combine_files: Whether to combine input files before processing (default: False).
    • Advanced Parameters (for advanced users):
      • instruction: Instruction for the encoder.
      • query_description: Description for queries.
      • templates: Dictionary of templates for formatting text.
      • template_name: Name of the template to use.
      • num_gpus: Number of GPUs to use.
      • seed: Random seed.
      • max_retries: Maximum number of retries for failed operations.
      • retry_delay: Delay between retries in seconds.
      • encoder_type: Type of encoder to use.
      • encoder_model: Specific model to use for encoding.
    • Note: It's recommended that users only modify the basic parameters unless they have advanced needs and are familiar with the underlying mechanics.
  • Prompt Templates and Model Instructions:

    • Currently, prompt templates and model instructions for embedding generation are hardcoded in subset_selection.py.
    • Advanced users can override them via the YAML configuration, but modifying them is discouraged unless you are very familiar with how they work.
  • Saving Artifacts:

    • Embeddings and other intermediate artifacts (e.g., embedding batches) will be saved to disk.
    • For now, all output is stored under the provided output_dir (open question: what should the default location be?).
    • Open Question: Should we consider a more flexible storage strategy for these artifacts, or is the current approach sufficient?
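
To make the parameter list above concrete, here is a minimal sketch of loading and validating such a YAML configuration. The file contents, the example values, and the load_subset_config helper are illustrative assumptions rather than a final schema:

```python
# Minimal sketch: parse a subset-selection YAML config and check required keys.
# Parameter names mirror the list above; the schema itself is not final.
import yaml  # requires PyYAML

EXAMPLE_CONFIG = """
# Required parameters
input_files:
  - data/dataset_a.jsonl
  - data/dataset_b.jsonl
subset_sizes:
  - 1000   # integer: an absolute sample count
  - 0.05   # float: a percentage of the dataset

# Basic parameters (defaults from the list above)
output_dir: output
batch_size: 100000
num_folds: 50
combine_files: false

# Advanced parameters (values here are only examples)
seed: 42
num_gpus: 1
"""

def load_subset_config(text: str) -> dict:
    """Parse YAML text and verify the required parameters are present."""
    config = yaml.safe_load(text)
    for key in ("input_files", "subset_sizes"):
        if key not in config:
            raise ValueError(f"missing required parameter: {key}")
    return config

config = load_subset_config(EXAMPLE_CONFIG)
print(config["subset_sizes"])  # [1000, 0.05]
```

Centralizing required-key validation like this would let a CLI entry point and any Python API share the same config handling.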

I'd appreciate feedback on these decisions, especially on the integration approach with data mixing and on whether this meets the needs for exposing subset selection. Any thoughts on the parameter configuration or artifact storage strategy would also be very helpful.

@eshwarprasadS (Contributor)

cc @khaledsulayman: you might want to check this out and see whether it aligns with your approach to resolving #528.
