client: retry allocate segment until success #338

Wine93 · 2021-04-29T09:46:13Z

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What is changed and how it works?

What's Changed:

How it Works:

Side effects(Breaking backward compatibility? Performance regression?):

Check List

Relevant documentation/comments is changed or added
I acknowledge that all my contributions will be made under the project's license

cw123 · 2021-05-06T02:50:50Z

curve-ansible/roles/generate_config/defaults/main.yml

@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
 client_metacache_get_leader_timeout_ms: 500
 client_metacache_get_leader_retry: 5
 client_metacache_rpc_retry_interval_us: 100000
+client_mds_normal_retry_times_before_trigger_wait: 3
+client_mds_max_wait_ms: 86400000


The value of mds max wait ms here differs from the one in client.conf.

The config in client.conf is for the robot test. If we use the real value (like 86400000), it will retry for a long time, and the test case will time out and fail.

Almost all items in client.conf are set to recommended values, so we leave them unchanged. For unit tests, you can refer to this

curve/test/client/client_unittest_main.cpp

Line 37 in aa0e97b

const std::vector<std::string> clientConf {

to modify the value of those items.

Yep, that is useful for unit tests, but I don't think it helps for robot tests such as curve_robot_test.

I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123

Which test case causes robot test timeout?

The case where the read/write offset is greater than the file size; see the test detail and log detail.

If so, you should distinguish whether the IO request exceeded the file length or the segment allocation failed because no space is left in the logical pool.

Yep, that was my oversight; I should ignore all errors other than the NO SPACE error.
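The distinction discussed above can be sketched roughly as follows. This is only an illustration: `AllocOutcome` and `ShouldRetryUntilSuccess` are hypothetical names invented for this sketch, not identifiers from the Curve code base.

```cpp
// Hypothetical outcomes of a GetOrAllocateSegment-style call. The names are
// illustrative only and are not the real Curve status codes.
enum class AllocOutcome {
    kOk,
    kOffsetBeyondFileLength,  // the request itself is invalid; retrying cannot help
    kNoSpaceInLogicalPool,    // may succeed later, once space becomes available
    kOtherFailure,
};

// Returns true only for the one outcome that is worth retrying until success.
constexpr bool ShouldRetryUntilSuccess(AllocOutcome outcome) {
    return outcome == AllocOutcome::kNoSpaceInLogicalPool;
}
```

With a check like this, a read or write beyond the file size fails immediately instead of waiting for the whole retry budget, which is the behavior the robot test case expects.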

cw123 · 2021-05-06T02:52:07Z

include/client/libcurve.h

-    UNKNOWN                 = 100
+    UNKNOWN                 = 100,
+    // You must retry it until success
+    RETRY_UNTIL_SUCCESS = 200,


Why not use 30?

Done.
P.S. In my original design, I treated RETRY_UNTIL_SUCCESS as a different kind of error, so I used code 200 to set it apart from the other error codes.
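As a rough sketch of how a caller might react to such a code: `SketchError` below is a simplified stand-in for LIBCURVE_ERROR that lists only a few values, the value 30 assumes the reviewer's suggestion was adopted as stated, and `CallUntilSettled` is a hypothetical helper that does not exist in Curve.

```cpp
#include <functional>

// Simplified stand-in for LIBCURVE_ERROR. Only a few values are listed, and
// 30 assumes the value suggested in the review was the one adopted.
enum class SketchError {
    OK = 0,
    RETRY_UNTIL_SUCCESS = 30,
    UNKNOWN = 100,
};

// Calls `op` again for as long as it reports RETRY_UNTIL_SUCCESS and returns
// the first other result. If `attempts` is non-null it receives the number
// of calls that were made.
inline SketchError CallUntilSettled(const std::function<SketchError()>& op,
                                    int* attempts = nullptr) {
    int calls = 0;
    SketchError rc = SketchError::UNKNOWN;
    do {
        rc = op();
        ++calls;
    } while (rc == SketchError::RETRY_UNTIL_SUCCESS);
    if (attempts != nullptr) {
        *attempts = calls;
    }
    return rc;
}
```

A real caller would normally pause between attempts; the sketch leaves that out to keep the control flow visible.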

cw123 · 2021-05-06T03:03:10Z

src/client/mds_client.cpp

@@ -114,6 +114,9 @@ LIBCURVE_ERROR MDSClient::MDSRPCExcutor::DoRPCTask(RPCFunc rpctask,
    // RPC timeout
    uint64_t rpcTimeOutMS = metaServerOpt_.mdsRPCTimeoutMs;

+    // The count of normal retry
+    uint64_t nNormalRetry = 0;


"n"NormalRetry: why do you put the letter n in front of this parameter name?

The prefix n denotes a count: the number of normal retries.

cw123 · 2021-05-06T03:26:44Z

src/client/mds_client.cpp

@@ -934,7 +946,8 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,
        int chunksNum = pfs.chunks_size();
        if (allocate && chunksNum <= 0) {


If GetOrAllocateSegment fails to get a segment because there is not enough space, the program may never reach this point.

bool ChunkSegmentAllocatorImpl::AllocateChunkSegment(FileType type,
        SegmentSizeType segmentSize, ChunkSizeType chunkSize,
        offset_t offset, PageFileSegment *segment) {
    ..........
    segment->set_chunksize(chunkSize);
    segment->set_segmentsize(segmentSize);
    segment->set_startoffset(offset);
    // allocate chunks
    uint32_t chunkNum = segmentSize/chunkSize;
    std::vector<CopysetIdInfo> copysets;
    if (!topologyChunkAllocator_->
            AllocateChunkRoundRobinInSingleLogicalPool(
                type, chunkNum, chunkSize, &copysets)) {
        LOG(ERROR) << "AllocateChunkRoundRobinInSingleLogicalPool error";
        return false;
    }

message PageFileSegment {
    required uint32 logicalPoolID = 1;
    required uint32 segmentSize = 3;
    required uint32 chunkSize = 4;
    required uint64 startOffset = 2;
    repeated PageFileChunkInfo chunks = 5;
}

The logicalPoolID field is required, so the RPC may fail here:
if (cntl->Failed()) {

Please check this problem.

The PageFileSegment field is cleared when segment allocation fails, so it will not trigger an RPC EREQUEST error.

wu-hanqing · 2021-05-06T08:23:27Z

curve-ansible/roles/generate_config/defaults/main.yml

@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
 client_metacache_get_leader_timeout_ms: 500
 client_metacache_get_leader_retry: 5
 client_metacache_rpc_retry_interval_us: 100000
+client_mds_normal_retry_times_before_trigger_wait: 3
+client_mds_max_wait_ms: 86400000


Almost all items in client.conf are set to recommended values, so we leave them unchanged. For unit tests, you can refer to this

curve/test/client/client_unittest_main.cpp

Line 37 in aa0e97b

const std::vector<std::string> clientConf {

to modify the value of those items.

wu-hanqing · 2021-05-06T08:26:35Z

conf/client.conf

@@ -26,6 +26,15 @@ mds.refreshTimesPerLease=4
 # Sleep for a while before each retry of an MDS RPC
 mds.rpcRetryIntervalUS=100000

+# The normal retry times for trigger wait strategy


There are three more configuration files: cs_client.conf, py_client.conf and snap_client.conf. You need to add these items to those files as well.

wu-hanqing · 2021-05-06T09:00:49Z

src/client/mds_client.cpp

@@ -945,7 +958,7 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,
        }
        return LIBCURVE_ERROR::OK;
    };
-    return rpcExcutor.DoRPCTask(task, IOPathMaxRetryMS);
+    return rpcExcutor.DoRPCTask(task, metaServerOpt_.mdsMaxWaitMs);


I think the previous name IOPathMaxRetryMS is more meaningful, so you can rename this to metaServerOpt_.maxRetryMsInIOPath and rename the corresponding item in the configuration file.
Also, replace all IOPathMaxRetryMS with metaServerOpt_.maxRetryMsInIOPath.
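For readers who want to see how the two new options could interact, here is a minimal sketch of a retry loop that makes a few normal retries, then switches to a longer wait, and gives up once the total budget is used. It is not the code from this PR: `RetryOptions`, `RetryWithWaitStrategy`, `retryIntervalMs` and `waitSleepMs` are assumptions made for the sketch, the sleep function is injected so the loop can be exercised without real delays, and only time spent sleeping is counted against the budget.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical option bundle; the first two fields mirror the config items
// discussed in this review, the last two are assumptions for the sketch.
struct RetryOptions {
    uint64_t normalRetryTimesBeforeTriggerWait = 3;
    uint64_t maxRetryMsInIOPath = 86400000;  // 1 day
    uint64_t retryIntervalMs = 100;          // short pause for normal retries
    uint64_t waitSleepMs = 10000;            // long pause once waiting starts
};

// Runs `task` until it returns true or the sleep budget is exhausted.
// `sleepMs` is injected (for example a wrapper around a real sleep call).
inline bool RetryWithWaitStrategy(
        const std::function<bool()>& task,
        const RetryOptions& opt,
        const std::function<void(uint64_t)>& sleepMs) {
    uint64_t sleptMs = 0;
    uint64_t normalRetries = 0;
    while (true) {
        if (task()) {
            return true;
        }
        if (sleptMs >= opt.maxRetryMsInIOPath) {
            return false;  // retry budget exhausted
        }
        // Short pauses first, long pauses after the normal retries are used up.
        uint64_t pauseMs = opt.waitSleepMs;
        if (normalRetries < opt.normalRetryTimesBeforeTriggerWait) {
            pauseMs = opt.retryIntervalMs;
            ++normalRetries;
        }
        // Never sleep past the budget, and always make progress.
        pauseMs = std::min(pauseMs, opt.maxRetryMsInIOPath - sleptMs);
        pauseMs = std::max<uint64_t>(pauseMs, 1);
        sleepMs(pauseMs);
        sleptMs += pauseMs;
    }
}
```

Injecting the sleep function also makes it cheap to use a small budget in tests, which is the concern raised earlier about robot tests timing out with the production value.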

cw123 · 2021-05-07T02:56:34Z

curve-ansible/roles/generate_config/defaults/main.yml

@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
 client_metacache_get_leader_timeout_ms: 500
 client_metacache_get_leader_retry: 5
 client_metacache_rpc_retry_interval_us: 100000
+client_mds_normal_retry_times_before_trigger_wait: 3
+client_mds_max_retry_ms_in_io_path: 86400000


This parameter still differs from the one in client.conf.

Done, the parameter has been unified.

Wine93 · 2021-05-07T09:13:08Z

All the review suggestions have been adopted. @cw123 @wu-hanqing

cw123 · 2021-05-08T09:07:58Z

src/client/config_info.h

+     * it will trigger wait strategy, and sleep long time before retry
+     */
+    uint64_t mdsNormalRetryTimesBeforeTriggerWait = 3;  // 3 times
+    uint64_t mdsMaxRetryMsInIOPath = 1 * 3600 * 1000;  // 1 hour


Why is this default value different from the one in the conf file?

done.

wu-hanqing · 2021-05-08T09:16:12Z

src/client/mds_client.cpp

@@ -913,14 +923,18 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,

        auto statuscode = response.statuscode();
        switch (statuscode) {
+            case StatusCode::kParaError:


Please recheck this part of the code; the modification here may cause an error at:

curve/src/client/splitor.cpp

Line 201 in aa0e97b

if (errCode == LIBCURVE_ERROR::FAILED ||

Yep, the function should return LIBCURVE_ERROR::FAILED instead of LIBCURVE_ERROR::NOTEXIST and LIBCURVE_ERROR::PARAM_ERROR.
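The agreed mapping can be pictured with a small sketch. Everything here is a simplified stand-in: `MdsStatusSketch` and `ClientErrorSketch` list only a few illustrative values (kParaError, FAILED, NOTEXIST and PARAM_ERROR are the names mentioned in this thread; the rest are assumptions), and `ToClientError` is a hypothetical helper, not the switch statement in mds_client.cpp.

```cpp
// Simplified stand-ins for the MDS status code and the client error code.
enum class MdsStatusSketch { kOK, kParaError, kNotExist, kOther };
enum class ClientErrorSketch { OK, FAILED, NOTEXIST, PARAM_ERROR, UNKNOWN };

// Maps parameter errors and not-found results to FAILED, so that a caller
// which tests the result against FAILED keeps treating them as failures.
inline ClientErrorSketch ToClientError(MdsStatusSketch status) {
    switch (status) {
        case MdsStatusSketch::kOK:
            return ClientErrorSketch::OK;
        case MdsStatusSketch::kParaError:
        case MdsStatusSketch::kNotExist:
            return ClientErrorSketch::FAILED;
        default:
            return ClientErrorSketch::UNKNOWN;
    }
}
```

Returning a more specific code such as NOTEXIST or PARAM_ERROR would only be safe if every caller that compares against FAILED were updated at the same time, which is the risk the reviewer pointed out.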

cw123 requested changes May 6, 2021

Wine93 requested a review from cw123 May 6, 2021 07:58

wu-hanqing reviewed May 6, 2021

cw123 requested changes May 7, 2021

Wine93 requested review from wu-hanqing and cw123 May 7, 2021 09:10

cw123 requested changes May 8, 2021

wu-hanqing requested changes May 8, 2021

client: retry allocate segment until success

1476375

wu-hanqing approved these changes May 10, 2021

cw123 approved these changes May 10, 2021

ilixiaocui approved these changes May 10, 2021

ilixiaocui merged commit f455caf into opencurve:master May 10, 2021

ilixiaocui pushed a commit to ilixiaocui/curve that referenced this pull request Feb 6, 2023

Merge pull request opencurve#338 from ksatchit/litmuschaos

0fab839

add litmuschaos project idea for Q1-2021
