
Use consistent hash to manage the topic #681

Conversation

ThisIsClark
Collaborator

What this PR does / why we need it:
Use consistent hashing to manage the topic used to distribute jobs

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Release note:

Use case:

  1. When the metric task service launches, add it to the partitioner group
  2. The leader node watches the group for changes periodically
    • If a new node is added to the group, re-scan all existing jobs to check which executor each of them should have
      • If the new executor of a job is the same as the old executor, do nothing
      • If the new executor is different from the old executor, remove the job from the old executor and add it to the new one
    • If an existing node is removed from the group, get all the jobs whose executor is the node that went down, and re-distribute them
  3. When a distributor distributes a job, get the executor from the partitioner and use it as the topic (see the sketch after this list)
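
A minimal sketch of steps 1 and 3, built on tooz's partitioned groups (which the ConsistentHashing helper in this PR appears to wrap); the backend URL, member id, group name and task id below are illustrative placeholders, not delfin's actual configuration:

from tooz import coordination

# Start a coordinator for this node (placeholder backend and member id).
coordinator = coordination.get_coordinator(
    'memcached://127.0.0.1:11211', b'node-1')
coordinator.start(start_heart=True)

# Step 1: join the partitioned group; tooz keeps a consistent-hash ring
# over the current group members.
partitioner = coordinator.join_partitioned_group(b'partitioner_group')

# Step 3: map a task id onto a member of the ring. The returned set holds
# a single member id (as bytes), which is then used as the topic.
members = partitioner.members_for_object('task-42')
topic = next(iter(members)).decode('utf-8')
print('task-42 should be executed by %s' % topic)

coordinator.stop()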

Proposal:

  1. When the MetricsTaskManager initializes, add it to the group
  2. When a node becomes the leader, in the on_leading_callback schedule_boot_jobs, start an apscheduler job that watches for group changes periodically (see the sketch after this list)
    • On on_node_join, re-scan all existing jobs and handle each one according to whether its new executor is the same as the old executor
    • On on_node_leave, re-distribute all the jobs whose executor is the node that went down
  3. In TaskDistributor.distribute_new_job, get the executor from the group and use it as the RabbitMQ topic, so the job is distributed to the specific node
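
A minimal sketch of step 2, assuming the tooz coordinator is already started on the node that just became leader; the group name, poll interval and callback bodies are illustrative:

from apscheduler.schedulers.background import BackgroundScheduler

GROUP = b'partitioner_group'  # illustrative group name


def on_node_join(event):
    # Re-scan existing jobs and move the ones whose new executor differs
    # from the recorded one (details elided in this sketch).
    pass


def on_node_leave(event):
    # Re-distribute the jobs whose executor is the departed member.
    pass


def watch_group_change(coordinator):
    # Register the membership callbacks on the leader node.
    coordinator.watch_join_group(GROUP, on_node_join)
    coordinator.watch_leave_group(GROUP, on_node_leave)

    # Poll periodically; run_watchers() fires the callbacks registered
    # above whenever a member joins or leaves the group.
    scheduler = BackgroundScheduler()
    scheduler.add_job(coordinator.run_watchers, 'interval', seconds=10)
    scheduler.start()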

Test case:

  1. Every time a node comes up, a member is added to the group
  2. Every time a node comes up, the watcher on the leader node gets the event
  3. When a node-join event is observed, re-scan all the jobs and re-distribute the affected jobs to the new node
  4. Every time a node goes down, its member is removed from the group
  5. Every time a node goes down, the watcher on the leader node gets the event
  6. When a node-leave event is observed, re-distribute all the jobs whose executor is the node that went down
  7. When distributing a new job, the topic should differ when the task metadata differs (see the sketch after this list)
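
Test case 7 can be exercised against tooz's hash ring directly, without a live coordinator; the member names below are illustrative. The mapping for a given task id is deterministic, while two different ids usually, though not necessarily, land on different members:

from tooz import hashring

ring = hashring.HashRing([b'node-1', b'node-2'])

# Deterministic: the same task id always resolves to the same member,
# so the derived topic is stable across calls.
assert ring.get_nodes(b'task-a') == ring.get_nodes(b'task-a')

# Different task metadata generally spreads over the members, although
# two ids can still hash onto the same member.
print(ring.get_nodes(b'task-a'), ring.get_nodes(b'task-b'))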

        except tooz.coordination.MemberAlreadyExist:
            LOG.info('Member %s already in partitioner_group' % CONF.host)

    def belong_to_host(self, task_id):
Collaborator

Suggest changing the name, maybe to get_executor

Collaborator Author

Modified

        members = part.members_for_object(task_id)
        for member in members:
            LOG.info('For task id %s, host should be %s' % (task_id, member))
            return member
Collaborator

In our case, do we get more than one member in any scenario?

Collaborator Author
@ThisIsClark Sep 2, 2021

No, but it returned a set with one item

Collaborator

ok

@@ -320,3 +322,35 @@ def _get_redis_backend_url():
        .format(backend_type=CONF.coordination.backend_type,
                server=CONF.coordination.backend_server)
    return backend_url


def on_node_join(event):
Collaborator

Where do we need to register these callbacks?

Collaborator Author

Code updated; we will register them on the leader node

        executor = CONF.host
        partitioner = ConsistentHashing()
        partitioner.start()
        executor = partitioner.belong_to_host(task_id)
Member

partitioner.belong_to_host returns a bytes object; we need to convert it to a string, maybe with
executor = partitioner.belong_to_host(task_id).decode('utf-8')

Collaborator Author

ok, thanks

@ThisIsClark ThisIsClark changed the title [WIP] Use consistent hash to manage the topic Use consistent hash to manage the topic Sep 2, 2021
@codecov

codecov bot commented Sep 2, 2021

Codecov Report

Merging #681 (314a7be) into perf_coll_fw_enhance (a88486e) will increase coverage by 0.03%.
The diff coverage is 77.27%.

@@                   Coverage Diff                    @@
##           perf_coll_fw_enhance     #681      +/-   ##
========================================================
+ Coverage                 70.86%   70.90%   +0.03%     
========================================================
  Files                       161      161              
  Lines                     15231    15292      +61     
  Branches                   1867     1872       +5     
========================================================
+ Hits                      10794    10843      +49     
- Misses                     3819     3830      +11     
- Partials                    618      619       +1     
Impacted Files Coverage Δ
delfin/task_manager/metrics_manager.py 0.00% <0.00%> (ø)
...ager/scheduler/schedulers/telemetry/job_handler.py 76.55% <ø> (-0.32%) ⬇️
delfin/coordination.py 64.58% <68.18%> (+0.64%) ⬆️
delfin/task_manager/scheduler/schedule_manager.py 71.76% <88.57%> (+17.91%) ⬆️
...in/leader_election/distributor/task_distributor.py 67.56% <100.00%> (+3.93%) ⬆️
delfin/drivers/fake_storage/__init__.py 94.85% <0.00%> (-0.29%) ⬇️

@ThisIsClark force-pushed the consistent_hashing branch 5 times, most recently from 8b615c7 to 258cf05 on September 2, 2021 14:02
            if new_executor != origin_executor:
                LOG.info('Re-distribute job %s from %s to %s' %
                         (task['id'], origin_executor, new_executor))
                self.task_rpcapi.remove_job(self.ctx, task['id'],
Member

Removal of the job would remove the task entry from the DB! Maybe we need to split remove_task and remove_job into separate handlers!

Collaborator Author

OK, we need a remove_task which just removes the task from the scheduler and does not remove it from the DB.
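
A minimal sketch of that split, assuming an APScheduler scheduler holds the running jobs; db.task_delete stands in for whatever helper actually deletes the DB entry and is named here only for illustration:

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.start()


def remove_task(task_id):
    # Stop executing the task locally; the DB entry is left untouched,
    # so the job can be re-scheduled on another executor.
    if scheduler.get_job(task_id):
        scheduler.remove_job(task_id)


def remove_job(ctx, task_id, db):
    # Full removal: stop the local execution and delete the DB entry.
    remove_task(task_id)
    db.task_delete(ctx, task_id)  # illustrative DB helper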

@@ -90,6 +131,13 @@ def schedule_boot_jobs(self):
            'PerfJobManager',
            coordination=True)
        service.serve(job_generator)
        partitioner = ConsistentHashing()
Member

Generic comment: the SCHEDULER_BOOT_JOBS loop can be removed now.

Collaborator Author

Yes, I'll find a place to handle the job removal, and the loop will be removed.

@@ -47,7 +48,10 @@ def __call__(self):
                      six.text_type(e))

    def distribute_new_job(self, task_id):
Member

We need to remove def __call__, and we should have one method to handle distribute_delete().

Collaborator Author

Yes

                         (task['id'], origin_executor, new_executor))
                self.task_rpcapi.remove_job(self.ctx, task['id'],
                                            task['executor'])
            distributor.distribute_new_job(task['id'])
Member

This needs to be inside the if block, as you want to distribute the job only if the executor is different.

Collaborator Author

The scenario I considered is that our DB records the job as running on executor A, but it is not actually running on executor A.
So, IMO, this should stay outside the if block: even when the executor is the same as before, we need to confirm that the executor is really running the task, so we can send the job again. If the executor already has the task, the duplicate is ignored and has no side effect.
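
A sketch of the resulting loop, under the assumption above that distribute_new_job is idempotent on the receiving executor; the names follow the snippets in this PR, but the surrounding function is illustrative:

def redistribute_jobs(ctx, tasks, partitioner, task_rpcapi, distributor):
    for task in tasks:
        origin_executor = task['executor']
        new_executor = partitioner.belong_to_host(task['id']).decode('utf-8')
        if new_executor != origin_executor:
            # The executor changed: take the job away from the old node.
            task_rpcapi.remove_job(ctx, task['id'], origin_executor)
        # Always (re)send the job: if the executor already runs it, the
        # duplicate is ignored and there is no side effect.
        distributor.distribute_new_job(task['id'])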

Member
@NajmudheenCT left a comment

LGTM

@sushanthakumar
Collaborator

LGTM

@NajmudheenCT NajmudheenCT merged commit be004ce into sodafoundation:perf_coll_fw_enhance Sep 4, 2021
kumarashit added a commit that referenced this pull request Sep 14, 2021
* Make job scheduler local to task process (#674)

* Make job scheduler local to task process

* Notify distributor when a new task added (#678)

* Remove db-scan for new task creation (#680)

* Use consistent hash to manage the topic (#681)

* Remove the periodically call from task distributor (#686)

* Start one historic collection immediate when a job is rescheduled (#685)

* Start one historic collection immediate when a job is rescheduled

* Remove failed task distributor (#687)

* Improving Failed job handling and telemetry job removal (#689)

Co-authored-by: ThisIsClark <[email protected]>
Co-authored-by: Ashit Kumar <[email protected]>