Swarm balancing logic issues #389

Open
fadenb opened this issue Jul 20, 2023 · 17 comments

@fadenb

fadenb commented Jul 20, 2023

Hey 👋,

I am opening this issue to discuss the current swarm balancing approach.

Recently I have seen that the public swarm hosting enoch/llama-65b-hf is unbalanced.
This by itself is neither a surprise nor a problem. The issue is then remediated by the server loading other blocks. All good so far.

Today I noticed that my server is loading the same blocks it had before. As the loading process is quite slow (often around 10 minutes), this basically takes away the compute capacity of that server from the swarm for 10 minutes without providing any benefit.

A log excerpt might explain the situation better:
Notice that [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] is loaded initially, and that the exact same [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] is loaded again when the server rebalances.

Jul 20 12:08:47.880 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Jul 20 12:08:47.880 [INFO] Using DHT prefix: llama-65b-hf
Jul 20 12:08:57.909 [INFO] This server is accessible directly
Jul 20 12:09:02.623 [INFO] Connecting to the public swarm
Jul 20 12:09:02.624 [INFO] Running a server on ['/ip4/172.17.0.2/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/127.0.0.1/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/147.189.193.61/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf']
Jul 20 12:09:02.646 [INFO] Model weights are loaded in float16, quantized to nf4 format
Jul 20 12:09:02.647 [INFO] Attention cache for all blocks will consume up to 1.25 GiB
Jul 20 12:09:02.648 [INFO] Loading throughput info
Jul 20 12:09:02.684 [INFO] Reporting throughput: 2203.3 RPS for 20 blocks
Jul 20 12:09:04.430 [INFO] Reachability service started
Jul 20 12:09:08.345 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:09:15.051 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Downloading (…)/adapter_config.json: 100%|██████████| 425/425 [00:00<00:00, 2.09MB/s]
Downloading (…)er_model.safetensors: 100%|██████████| 3.20G/3.20G [00:54<00:00, 58.3MB/s]
Jul 20 12:10:36.878 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:10:37.081 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:10:44.745 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:11:08.242 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:08.441 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:11:16.205 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:11:38.475 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:38.669 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:11:45.308 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:12:08.372 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:08.595 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:12:17.520 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:12:40.703 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:41.066 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:12:48.411 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:12:59.529 [INFO] reachability.rpc_check(remote_peer=...ZFKwzs, check_peer=...ZFKwzs) -> False
Jul 20 12:13:11.434 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:11.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 65
Jul 20 12:13:19.812 [INFO] Loaded enoch/llama-65b-hf block 66, <All keys matched successfully>
Jul 20 12:13:43.257 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:43.823 [INFO] Loaded adapter timdettmers/guanaco-65b for block 66
Jul 20 12:13:51.392 [INFO] Loaded enoch/llama-65b-hf block 67, <All keys matched successfully>
Jul 20 12:14:16.225 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:16.776 [INFO] Loaded adapter timdettmers/guanaco-65b for block 67
Jul 20 12:14:25.466 [INFO] Loaded enoch/llama-65b-hf block 68, <All keys matched successfully>
Jul 20 12:14:49.068 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:49.586 [INFO] Loaded adapter timdettmers/guanaco-65b for block 68
Jul 20 12:14:57.751 [INFO] Loaded enoch/llama-65b-hf block 69, <All keys matched successfully>
Jul 20 12:15:20.843 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:21.370 [INFO] Loaded adapter timdettmers/guanaco-65b for block 69
Jul 20 12:15:34.991 [INFO] Loaded enoch/llama-65b-hf block 70, <All keys matched successfully>
Jul 20 12:15:57.221 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:57.713 [INFO] Loaded adapter timdettmers/guanaco-65b for block 70
Jul 20 12:16:08.368 [INFO] Loaded enoch/llama-65b-hf block 71, <All keys matched successfully>
Jul 20 12:16:29.393 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:29.884 [INFO] Loaded adapter timdettmers/guanaco-65b for block 71
Jul 20 12:16:36.503 [INFO] Loaded enoch/llama-65b-hf block 72, <All keys matched successfully>
Jul 20 12:16:57.748 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:58.263 [INFO] Loaded adapter timdettmers/guanaco-65b for block 72
Jul 20 12:17:05.251 [INFO] Loaded enoch/llama-65b-hf block 73, <All keys matched successfully>
Jul 20 12:17:26.114 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:26.605 [INFO] Loaded adapter timdettmers/guanaco-65b for block 73
Jul 20 12:17:33.660 [INFO] Loaded enoch/llama-65b-hf block 74, <All keys matched successfully>
Jul 20 12:17:54.764 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:55.280 [INFO] Loaded adapter timdettmers/guanaco-65b for block 74
Jul 20 12:18:02.302 [INFO] Loaded enoch/llama-65b-hf block 75, <All keys matched successfully>
Jul 20 12:18:23.076 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:23.551 [INFO] Loaded adapter timdettmers/guanaco-65b for block 75
Jul 20 12:18:30.137 [INFO] Loaded enoch/llama-65b-hf block 76, <All keys matched successfully>
Jul 20 12:18:50.908 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:51.420 [INFO] Loaded adapter timdettmers/guanaco-65b for block 76
Jul 20 12:18:57.203 [INFO] Loaded enoch/llama-65b-hf block 77, <All keys matched successfully>
Jul 20 12:19:17.972 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:18.472 [INFO] Loaded adapter timdettmers/guanaco-65b for block 77
Jul 20 12:19:23.977 [INFO] Loaded enoch/llama-65b-hf block 78, <All keys matched successfully>
Jul 20 12:19:44.690 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:45.199 [INFO] Loaded adapter timdettmers/guanaco-65b for block 78
Jul 20 12:19:50.305 [INFO] Loaded enoch/llama-65b-hf block 79, <All keys matched successfully>
Jul 20 12:20:11.381 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:20:11.894 [INFO] Loaded adapter timdettmers/guanaco-65b for block 79
Jul 20 12:20:11.962 [WARN] [petals.server.reachability.validate_reachability:40] Skipping reachability check because health.petals.ml is down: ConnectionError(MaxRetryError("HTTPConnectionPool(host='health.petals.ml', port=80): Max retries exceeded with url: /api/v1/is_reachable/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbf09f084f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
Jul 20 12:20:14.168 [INFO] Started
Jul 20 12:26:02.132 [INFO] Swarm balance quality: 65.3%
Jul 20 12:26:02.133 [INFO] Swarm is imbalanced, server will load other blocks
Jul 20 12:26:03.947 [INFO] Announced that blocks ['llama-65b-hf.60', 'llama-65b-hf.61', 'llama-65b-hf.62', 'llama-65b-hf.63', 'llama-65b-hf.64', 'llama-65b-hf.65', 'llama-65b-hf.66', 'llama-65b-hf.67', 'llama-65b-hf.68', 'llama-65b-hf.69', 'llama-65b-hf.70', 'llama-65b-hf.71', 'llama-65b-hf.72', 'llama-65b-hf.73', 'llama-65b-hf.74', 'llama-65b-hf.75', 'llama-65b-hf.76', 'llama-65b-hf.77', 'llama-65b-hf.78', 'llama-65b-hf.79'] are offline
Jul 20 12:26:06.251 [INFO] Shutting down
Jul 20 12:26:06.266 [INFO] Module container shut down successfully
Jul 20 12:26:06.492 [INFO] Cleaning up, left 0.3 GiB allocated memory, 6.3 GiB reserved memory
Jul 20 12:26:12.177 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:26:19.559 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Jul 20 12:26:41.387 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:26:41.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:26:49.273 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:27:13.392 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:13.971 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:27:21.899 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:27:43.149 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:43.671 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:27:50.241 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:28:11.106 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:11.609 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:28:18.728 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:28:40.008 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:40.396 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:28:48.484 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:29:09.470 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout

While this is an extreme example of the problem, I have seen (more often) that parts of the block lists overlap. In such cases, the overlapping blocks are still loaded from scratch instead of being reused.

Are there any obvious fixes for this behavior besides adjusting the --balance_quality setting or pinning blocks?
Should we reorder the actions so that the new blocks will be selected before the decision is made to unload the blocks?
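The reordering suggested above could be sketched roughly as follows: compute the proposed span first, compare it with what is already loaded, and only restart if at least one block would actually change. The names `plan_block_reload` and `is_restart_useful` are hypothetical and are not part of Petals; this is only an illustration of the idea.

```python
from typing import Dict, List


def plan_block_reload(current: range, proposed: range) -> Dict[str, List[int]]:
    """Compare the currently hosted span with the proposed one."""
    cur, new = set(current), set(proposed)
    return {
        "keep": sorted(cur & new),    # already in memory, could be reused
        "load": sorted(new - cur),    # would have to be fetched and loaded
        "unload": sorted(cur - new),  # no longer needed
    }


def is_restart_useful(current: range, proposed: range) -> bool:
    """A restart only helps if at least one new block would be loaded."""
    return bool(plan_block_reload(current, proposed)["load"])


# The case from the log: the proposed span equals the current one.
print(is_restart_useful(range(60, 80), range(60, 80)))  # False

# A partial overlap: blocks 10..32 could be kept instead of reloaded.
partial = plan_block_reload(range(0, 33), range(10, 43))
print(len(partial["keep"]), len(partial["load"]))  # 23 10
```

With a check like this in front of the unload step, the identical-span case in the log above would be skipped, and the partial-overlap case would show how many blocks could in principle be reused.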

@borzunov
Collaborator

Hi @fadenb,

What you're saying is 100% reasonable; we just haven't had time to do that, since it would require additional complexity on the server side. If you can help with this feature, let us know - we'd be happy to have such a pull request.

@iateadonut

iateadonut commented Jul 24, 2023

mine is doing the same thing:

Jul 24 18:26:43 danserver petals[1297]: Jul 24 18:26:43.749 [INFO] Swarm balance quality: 62.8%
Jul 24 18:26:43 danserver petals[1297]: Jul 24 18:26:43.749 [INFO] Swarm is imbalanced, server will load other blocks
Jul 24 18:26:46 danserver petals[1297]: Jul 24 18:26:46.507 [INFO] Announced that blocks ['llama-65b-hf.0', 'llama-65b-hf.1', 'llama-65b-hf.2', 'llama-65b-hf.3', 'llama-65b-hf.4', 'llama-65b-hf.5', 'llama-65b-hf.6', 'llama-65b-hf.7', 'llama-65b-hf.8', 'llama-65b-hf.9', 'llama-65b-hf.10', 'llama-65b-hf.11', 'llama-65b-hf.12', 'llama-65b-hf.13', 'llama-65b-hf.14', 'llama-65b-hf.15', 'llama-65b-hf.16', 'llama-65b-hf.17', 'llama-65b-hf.18', 'llama-65b-hf.19', 'llama-65b-hf.20', 'llama-65b-hf.21', 'llama-65b-hf.22', 'llama-65b-hf.23', 'llama-65b-hf.24', 'llama-65b-hf.25', 'llama-65b-hf.26', 'llama-65b-hf.27', 'llama-65b-hf.28', 'llama-65b-hf.29', 'llama-65b-hf.30', 'llama-65b-hf.31', 'llama-65b-hf.32'] are offline
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.787 [INFO] Shutting down
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.820 [INFO] Module container shut down successfully
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.959 [INFO] Cleaning up, left 0.5 GiB allocated memory, 11.8 GiB reserved memory
Jul 24 18:27:01 danserver petals[1297]: Jul 24 18:27:01.164 [INFO] Announced that blocks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] are joining

i'm fixing this with these arguments: --block_indices 28:60 --balance_quality 0.0

@borzunov in creating a test to develop a better algorithm for choosing other blocks, are there any examples in tests/ of setting up several mock CPU servers that can talk to each other in a test swarm with mock blocks? should the method that chooses blocks always return sequential blocks?

@borzunov
Collaborator

Hi @iateadonut,

Yes, a server should host a set of sequential blocks. Re mock CPU servers, you can create a private swarm with a really small model like bigscience/bloom-560m and CPU-only servers, like we do in CI tests.

@iateadonut

is dht_utils.get_remote_module_infos() supposed to return only information about remote servers? when running several CPU servers on my localhost, it also returns my own server's information.

i ask because block_selection._choose_best_start and block_selection.should_choose_other_blocks use throughputs derived from get_remote_module_infos(). if get_remote_module_infos() returns a throughput that includes the local server's own blocks, there are bound to be some problems.

second, i'm writing unit tests for some of the block selection functions, including _choose_best_start and should_choose_other_blocks. i did not see tests for either of those in the test suite and will add more as necessary while i work to figure this out.

@borzunov
Collaborator

borzunov commented Jul 30, 2023

Hi @iateadonut,

dht_utils.get_remote_module_infos() returns information about all servers (remote and your own ones). Note that:

  • You need to be connected to the public swarm to see servers hosted by other people (as in https://health.petals.dev). In case of a server, you shouldn't provide manually set --initial_peers (it connects to the public swarm by default). In case of creating the DHT client manually, you should use hivemind.DHT(initial_peers=petals.constants.PUBLIC_INITIAL_PEERS, ...).

  • You need to use the correct DHT prefix. DHT prefixes are based on the model name but have some quirks for backward compatibility with older Petals versions. For example, in case of bigscience/bloom-560m, the server will actually use bigscience/bloom-560m-petals and notify you about that in the logs. This means that you'll have to query module UIDs like bigscience/bloom-560m-petals.0, bigscience/bloom-560m-petals.1, and so on to see the server infos.
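The UID scheme described in the second bullet can be illustrated with a small helper. `build_module_uids` is a hypothetical name used only for this sketch; the `prefix.index` format is taken from the examples above and from the `llama-65b-hf.60` entries in the logs.

```python
from typing import List


def build_module_uids(dht_prefix: str, block_indices: range) -> List[str]:
    """Join the DHT prefix and each block index with a dot."""
    return [f"{dht_prefix}.{i}" for i in block_indices]


# Note the "-petals" suffix that the server adds for bigscience/bloom-560m.
print(build_module_uids("bigscience/bloom-560m-petals", range(3)))
# ['bigscience/bloom-560m-petals.0', 'bigscience/bloom-560m-petals.1', 'bigscience/bloom-560m-petals.2']
```

Querying with the plain model name instead of the prefix reported in the server logs would produce UIDs that no server has announced.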

A good example for using this function is the source code of https://health.petals.dev - see the place where get_remote_module_infos() is called.

Re tests for swarm balancing, they are indeed missing at the moment - I'd appreciate it if you added them in some form.

Please note that our CI doesn't connect to the public swarm and launches a tiny isolated swarm with BLOOM-560m instead - you'd have to write your tests with this constraint in mind.

@iateadonut

Thanks. Is there a method that gets only 'remote' module infos?

@borzunov
Collaborator

@iateadonut No, but you can filter out your local peer_id to keep only remote infos, like we do in should_choose_other_blocks().
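Filtering out the local peer could look roughly like the sketch below. `BlockInfo` is a simplified, hypothetical stand-in (one record per block, mapping peer IDs to throughput) and is not the structure returned by get_remote_module_infos(); the helper names are made up for illustration as well.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockInfo:
    """Hypothetical per-block record: which peers serve it and how fast."""
    uid: str
    servers: Dict[str, float]  # peer_id -> throughput


def remote_only(infos: List[BlockInfo], local_peer_id: str) -> List[BlockInfo]:
    """Return copies of the infos with the local peer removed."""
    return [
        BlockInfo(info.uid, {p: t for p, t in info.servers.items() if p != local_peer_id})
        for info in infos
    ]


def per_block_throughput(infos: List[BlockInfo]) -> List[float]:
    """Sum the throughput of all peers serving each block."""
    return [sum(info.servers.values()) for info in infos]


infos = [
    BlockInfo("model.0", {"me": 1.0, "other": 2.0}),
    BlockInfo("model.1", {"me": 1.0}),
]
print(per_block_throughput(infos))                     # [3.0, 1.0]
print(per_block_throughput(remote_only(infos, "me")))  # [2.0, 0]
```

The second list shows what the swarm would look like without the local server, which is the view needed when deciding where that server should go.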

@borzunov
Collaborator

borzunov commented Aug 5, 2023

@fadenb @iateadonut For the record, another reason why downloading blocks is slow is that StableBeluga2 weights are distributed in float32 and Llama weights are distributed in float16, while we host them in 4-bit (nf4). This means that we download 8x/4x more data than necessary (same for disk space and disk reading time).

So an alternative is to implement functionality that allows downloading (or loading from disk) the model in nf4 right away. @mryab was working on this functionality for int8 in #273, we may need to revive this PR and prioritize this feature.
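The 8x/4x figures follow directly from the bit widths. A minimal check, ignoring any extra metadata that a quantized format stores alongside the weights (`download_overhead` is a hypothetical helper name):

```python
HOSTED_BITS = 4  # nf4 uses 4 bits per weight


def download_overhead(source_bits: int, hosted_bits: int = HOSTED_BITS) -> float:
    """How many times more data is downloaded than is kept after quantization."""
    return source_bits / hosted_bits


print(download_overhead(32))  # float32 checkpoints: 8.0
print(download_overhead(16))  # float16 checkpoints: 4.0
```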

@iateadonut

i'm working now on creating a test for block_selection:
https://github.com/iateadonut/petals/blob/danO/tests/test_block_selection.py

the test above works: for the simple mock of 2 servers both running blocks 1-16 of a 24 block model, it passes. I'm going to work on getting the current module_infos from the live server so I can mock its setup and see if I can find the problem.

do you think we should move this block in https://github.com/bigscience-workshop/petals/blob/main/src/petals/server/block_selection.py to its own function (if necessary) for easier testing? if so, should it be called _new_throughput()?:

    moved = True
    while moved:
        servers = list(spans.keys())
        np.random.shuffle(servers)

        moved = False
        for peer_id in servers:
            span = spans[peer_id]
            throughputs[span.start : span.end] -= span.throughput * (1 + eps)

            new_start = _choose_best_start(throughputs, span.length)

            throughputs[span.start : span.end] += span.throughput * eps
            if span.start != new_start:
                span.move_to(new_start)
                moved = True
            throughputs[span.start : span.end] += span.throughput

    new_throughput = throughputs.min()

@borzunov
Collaborator

borzunov commented Aug 6, 2023

@iateadonut Yes, you can extract it into a separate function if it's useful.

@iateadonut

iateadonut commented Aug 6, 2023

i have a set of module_infos made up of 80 block-server info dumps; it is used as the mock data in this test: https://github.com/iateadonut/petals/blob/danO/tests/test_block_selection.py#L18

the throughput of the server looks like this:
[3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759
3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759
3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759
3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759
1459.4677943 4613.53125511 1459.4677943 4613.53125511 4613.53125511
1459.4677943 1459.4677943 4613.53125511 4613.53125511 4613.53125511
1459.4677943 1459.4677943 1459.4677943 3850.59180199 3850.59180199
696.52834117 696.52834117 3850.59180199 696.52834117 696.52834117
2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181
2899.81165181 2899.81165181 4743.68463907 4743.68463907 4743.68463907
4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907
4743.68463907 4743.68463907 2899.81165181 2899.81165181 2899.81165181
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014]

the throughput of the server minus the local server looks like this:
[2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501
2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501
2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501
2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501
695.76540172 3849.82886253 695.76540172 3849.82886253 3849.82886253
695.76540172 695.76540172 3849.82886253 3849.82886253 3849.82886253
695.76540172 695.76540172 695.76540172 3850.59180199 3850.59180199
696.52834117 696.52834117 3850.59180199 696.52834117 696.52834117
2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181
2899.81165181 2899.81165181 4743.68463907 4743.68463907 4743.68463907
4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907
4743.68463907 4743.68463907 2899.81165181 2899.81165181 2899.81165181
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014
2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014]

it yields - Swarm balance quality: 47.7% - then it restarts the service, which holds 33 blocks, and starts again at the same place it started last time, at block 1.

I will do some more work on this this week. I wanted to share the throughput and modified throughput in case anything from those points to a solution I might not see so easily.

@iateadonut

@borzunov Can you explain this:

https://github.com/bigscience-workshop/petals/blame/063e94b4c8027e1e8d47061681007e9db292734f/src/petals/server/block_selection.py#L94

It looks like you're trying to check the new throughput on the swarm if the local server changes the blocks served AND all other servers change their blocks served as well. Is that correct?

If that's the case, I wonder if this can work well in a live environment, where you have at least a few minutes between each time each server runs should_choose_other_blocks.

What do you think? Should we figure out a different way to find swarm balance quality? Any ideas?

@borzunov
Collaborator

@iateadonut, in this code, a server simulates what others would do if it moves. This is necessary so that we can know the final throughput it is possible to reach after moving.

For example, imagine that we have 30 blocks and 3 servers hosting blocks 0:10. The total throughput is zero since nobody hosts blocks 20:30.

If we only consider the throughput after the current server moves, then no server will ever move (since if anyone moves to 10:20, the total throughput will still be zero).

So the servers simulate that if they move to 10:20, some other server is likely to move to 20:30, and we'll have non-zero throughput in the end. Then they can decide that moving is actually worth it.

Please refer to a draft of our new paper to find details of how it works: https://openreview.net/pdf?id=HLQyRgRnoXo (pages 19-20, Appendices D-E)
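The 30-block example can be checked numerically. The sketch below takes the swarm throughput to be the minimum per-block throughput (as in `throughputs.min()` in the snippet quoted earlier in the thread) and gives every server a throughput of 1.0, which is an arbitrary value chosen for illustration; `swarm_throughput` is a hypothetical helper.

```python
from typing import List, Tuple


def swarm_throughput(spans: List[Tuple[int, int]], num_blocks: int, per_server: float = 1.0) -> float:
    """Bottleneck throughput: the least-served block limits the whole chain."""
    per_block = [0.0] * num_blocks
    for start, end in spans:
        for i in range(start, end):
            per_block[i] += per_server
    return min(per_block)


all_at_start = [(0, 10), (0, 10), (0, 10)]  # nobody hosts 10:30
one_moved = [(10, 20), (0, 10), (0, 10)]    # 20:30 is still empty
spread_out = [(10, 20), (20, 30), (0, 10)]  # every block is covered

print(swarm_throughput(all_at_start, 30))  # 0.0
print(swarm_throughput(one_moved, 30))     # 0.0
print(swarm_throughput(spread_out, 30))    # 1.0
```

A single move leaves the bottleneck at zero, so a server that only looks one step ahead sees no benefit. Only after simulating the follow-up move does the gain become visible.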

@iateadonut

I'm running some tests and here's one thing I found - these are only a few minutes apart:

These are logs from should_choose_other_blocks, at the point where it compares local_span.start == new_start at https://github.com/bigscience-workshop/petals/blame/063e94b4c8027e1e8d47061681007e9db292734f/src/petals/server/block_selection.py#L87 :
'-- new_start and current start'
1692537985.0252135
'22:26:25'
'65 2'
'-- new_start and current start'
1692538243.7270544
'22:30:43'
'2 2'

These logs are just a few minutes apart. I'm running more tests now so I can get timestamped module_infos logs to investigate further.

My suspicion is that, when a single server decides to choose new blocks, by the time it does, the start block is different.

I'll be working to get real time module_infos data to mock and test.

@iateadonut

iateadonut commented Aug 27, 2023

i think an easy way to solve this might be to recalculate 'throughputs' 2 times after new_start = _choose_best_start() in a loop waiting 1 minute between each calculation. return False if new_start isn't the same after each calculation.

I have a feeling there may be some problems with this, though. If the problem is two servers colliding, would each go through the process at the same time and turn out to have the same problem anyway?

I'm testing this now on the live swarm to see if the bug crops up while running the server this way:

def _should_choose_other_blocks(self) -> bool:
    # Needs `import random`, `import time` and `from pprint import pprint` in server.py.
    if self.strict_block_indices is not None:
        return False

    module_infos = get_remote_module_infos(self.dht, self.module_uids, latest=True)
    should_choose = block_selection.should_choose_other_blocks(self.dht.peer_id, module_infos, self.balance_quality)
    if not should_choose:
        return False

    # Re-check twice after a randomized delay; rebalance only if every check agrees.
    for _ in range(2):
        wait_time = 90 + random.randint(-30, 10)
        time.sleep(wait_time)

        module_infos = get_remote_module_infos(self.dht, self.module_uids, latest=True)
        pprint('--retrying should_choose_other_blocks')
        should_choose = block_selection.should_choose_other_blocks(self.dht.peer_id, module_infos, self.balance_quality)
        if not should_choose:
            return False

    # All three checks said the server should move.
    return True
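The same confirm-before-moving pattern can be isolated from the server so that it is testable without real delays. In this sketch `confirm_rebalance`, `check`, `sleep` and `rng` are hypothetical names, with `check` standing in for the call to should_choose_other_blocks on fresh module infos. Like the method above, it only compares the boolean decision, not whether new_start stayed the same between checks.

```python
import random
import time
from typing import Callable, Optional


def confirm_rebalance(
    check: Callable[[], bool],
    retries: int = 2,
    base_wait: float = 90.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Return True only if `check` keeps saying "move" across all re-checks."""
    if not check():
        return False
    rng = rng or random.Random()
    for _ in range(retries):
        # Jitter the delay so that servers are less likely to re-check in lockstep.
        sleep(base_wait + rng.randint(-30, 10))
        if not check():
            return False
    return True


# Demo with a fake sleep: the second re-check changes its mind, so no restart.
answers = iter([True, True, False])
print(confirm_rebalance(lambda: next(answers), sleep=lambda seconds: None))  # False
```

Injecting `sleep` keeps a unit test instantaneous, and the jitter is the only defence here against the collision concern raised above, so it may not be sufficient when two servers start their checks at nearly the same moment.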

@iateadonut

These are some logs I've taken from running the above within server.py:

'-- start new_start'
'0 0'
'-- start new_start'
'0 1'
'--retrying should_choose_other_blocks'
'-- start new_start'
'0 0'
'-- start new_start'
'0 0'
...
'0 0'
'-- start new_start'
'0 40'
'--retrying should_choose_other_blocks'
'-- start new_start'
'0 0'
'-- start new_start'
'0 0'
...
'-- start new_start'
'18 30'
'-- start new_start'
'18 33'
'--retrying should_choose_other_blocks'
'-- start new_start'
'18 30'
'-- start new_start'
'18 30'
...

You can see here that it has been working well to make sure unnecessary restarts do not happen.

Each '-- start new_start' line in the logs comes from should_choose_other_blocks and shows the current start followed by the suggested new start.

It did fail a rebalancing here:
'-- start new_start'
'40 30'
'-- start new_start'
'40 30'
'--retrying should_choose_other_blocks'
'-- start new_start'
'40 30'
'--retrying should_choose_other_blocks'
'-- start new_start'
'40 30'
'-- choose_best_blocks; used when restarting'
'-- start new_start'
'30 18'
'--retrying should_choose_other_blocks'
'-- start new_start'
'30 18'
'--retrying should_choose_other_blocks'
'-- start new_start'
'30 18'
'-- choose_best_blocks; used when restarting'
'-- start new_start'
'18 18'
'-- start new_start'
'18 18'

It ended up rebalancing twice there. I don't know why that happened, but otherwise, this small change prevented unnecessary rebalancing at least 15 times in a few days.

I'll continue to use this in the newest versions on my server and keep logs with time stamps moving forward.

I've created a pull request:
#493

Let me know if there should be any changes or other ways to move forward.

@iateadonut

iateadonut commented Sep 5, 2023

just updating with some more logs:

$ grep -E '--retry|choose_best' -B5 -A10 ./log-1693523246

'-- choose_best_blocks; used when restarting'
'2023-09-01 08:43:59'
'-- start new_start'
'36 36'
'2023-09-01 08:45:35'
'-- start new_start'
'36 36'
'2023-09-01 08:46:45'
'-- start new_start'
'36 36'
'2023-09-01 08:47:54'
--
'-- start new_start'
'36 0'
'2023-09-03 12:05:22'
'-- start new_start'
'36 15'
'--retrying should_choose_other_blocks'
'2023-09-03 12:07:07'
'-- start new_start'
'36 0'
'2023-09-03 12:07:46'
'-- start new_start'
'36 0'
'2023-09-03 12:08:28'
'-- start new_start'
'36 0'
'2023-09-03 12:09:31'
--
'-- start new_start'
'36 36'
'2023-09-04 22:06:40'
'-- start new_start'
'36 13'
'--retrying should_choose_other_blocks'
'2023-09-04 22:08:20'
'-- start new_start'
'36 36'
'2023-09-04 22:09:19'
'-- start new_start'
'36 36'
'2023-09-04 22:10:32'
'-- start new_start'
'36 36'
'2023-09-04 22:10:36'
--
'-- start new_start'
'36 36'
'2023-09-05 10:32:37'
'-- start new_start'
'36 4'
'--retrying should_choose_other_blocks'
'2023-09-05 10:34:17'
'-- start new_start'
'36 14'
'2023-09-05 10:34:58'
'-- start new_start'
'36 14'
'2023-09-05 10:36:12'
'-- start new_start'
'36 14'
'2023-09-05 10:36:25'
--
'-- start new_start'
'36 36'
'2023-09-05 16:18:01'
'-- start new_start'
'36 0'
'--retrying should_choose_other_blocks'
'2023-09-05 16:19:37'
'-- start new_start'
'36 0'
'--retrying should_choose_other_blocks'
'2023-09-05 16:20:56'
'-- start new_start'
'36 30'
'2023-09-05 16:21:34'
'-- start new_start'
'36 30'
'2023-09-05 16:22:55'
'-- start new_start'
'36 30'
'2023-09-05 16:24:30'

The '-- choose_best_blocks' line in the log marks when choose_best_blocks is run, i.e. when the blocks are reloaded.

As you can see, over 5 days continually online, this edit has stopped this server from unnecessarily reloading. The last time this happened, that was probably because the swarm balance had already improved.
