
Ability to parallelize between GPUs #30

Merged
31 commits merged into main on Jun 6, 2024
Conversation

@matsen (Contributor) commented Jun 6, 2024

  • factoring apart load_and_add_shm...
  • finding least used gpu and using it, with a default
  • branch lengths are tensors
  • simplifying device handling of crepes

matsen added 23 commits June 5, 2024 07:55
These changes update the `pick_device` function in `common.py` to improve CUDA device selection for a given job. The print statement now includes the job ID, providing more informative output.
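As a rough illustration of "least used GPU" selection, here is a minimal pure-Python sketch. The function name and signature are hypothetical; the repo's actual `pick_device` logic in `common.py` may differ (e.g. it may query `torch.cuda` or `nvidia-smi` directly).

```python
def pick_least_used_gpu(used_memory_per_gpu, default=0):
    """Return the index of the GPU with the least memory in use.

    `used_memory_per_gpu` is a list of per-device memory-used figures
    (e.g. gathered via `torch.cuda.memory_allocated` per device).
    Falls back to `default` when no GPUs are visible.
    """
    if not used_memory_per_gpu:
        return default
    # argmin over device indices by reported memory usage
    return min(range(len(used_memory_per_gpu)),
               key=used_memory_per_gpu.__getitem__)
```

The default fallback mirrors the "with a default" note in the change list above.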

@matsen (Contributor, Author) commented Jun 6, 2024

Dropped:

    def model_and_optimizer_to(self, device):
        """Move the model, optimizer state, and datasets to `device`."""
        self.model.to(device)
        # Optimizer state tensors (e.g. Adam moment estimates) are not
        # moved by `model.to`, so move them explicitly.
        for state in self.optimizer.state.values():
            for k, v in state.items():
                if isinstance(v, torch.Tensor):
                    state[k] = v.to(device)
        for dataset in [self.train_dataset, self.val_dataset]:
            if dataset is not None:
                dataset.to(device)

@matsen (Contributor, Author) commented Jun 6, 2024

Also deferred to a future issue:

    def dataset_of_pcp_df(pcp_df, branch_length_multiplier=5.0):
        return DNSMDataset.from_data(
            pcp_df["parents"],
            pcp_df["children"],
            pcp_df["rates"],
            pcp_df["subs_probs"],
            branch_length_multiplier=branch_length_multiplier,
        )


    def train_val_datasets_of_pcp_df(pcp_df, branch_length_multiplier=5.0):
        """Perform a train-val split based on an "in_train" column."""
        train_df = pcp_df[pcp_df["in_train"]].reset_index(drop=True)
        val_df = pcp_df[~pcp_df["in_train"]].reset_index(drop=True)

        val_dataset = dataset_of_pcp_df(
            val_df,
            branch_length_multiplier=branch_length_multiplier,
        )
        # With no training rows, return only the validation dataset.
        if len(train_df) == 0:
            return None, val_dataset
        train_dataset = dataset_of_pcp_df(
            train_df,
            branch_length_multiplier=branch_length_multiplier,
        )
        return train_dataset, val_dataset
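For intuition, the split logic above can be sketched with plain Python rows standing in for a pandas DataFrame. The helper name is hypothetical; only the `in_train` column comes from the snippet.

```python
def split_rows(rows):
    """Split rows on the boolean "in_train" field; return (train, val).

    Mirrors the snippet's behavior: when there are no training rows,
    the train half is None and only the validation rows are returned.
    """
    train = [r for r in rows if r["in_train"]]
    val = [r for r in rows if not r["in_train"]]
    return (train or None), val
```

Returning `None` for an empty training split lets callers skip training entirely (e.g. evaluation-only runs) without constructing an empty dataset.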

@matsen matsen marked this pull request as ready for review June 6, 2024 17:32
@matsen matsen merged commit c8d6ef4 into main Jun 6, 2024
1 check passed