Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: retry allocate segment until success #338

Merged
merged 1 commit into from
May 10, 2021
Merged

client: retry allocate segment until success #338

merged 1 commit into from
May 10, 2021

Conversation

Wine93
Copy link
Contributor

@Wine93 Wine93 commented Apr 29, 2021

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What is changed and how it works?

What's Changed:

How it Works:

Side effects(Breaking backward compatibility? Performance regression?):

Check List

  • Relevant documentation/comments is changed or added
  • I acknowledge that all my contributions will be made under the project's license

@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
client_metacache_get_leader_timeout_ms: 500
client_metacache_get_leader_retry: 5
client_metacache_rpc_retry_interval_us: 100000
client_mds_normal_retry_times_before_trigger_wait: 3
client_mds_max_wait_ms: 86400000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of mds max wait ms is different between clent.conf and this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of mds max wait ms is different between clent.conf and this.

The config in client.conf is for robot test, if we use the real value (like 86400000), it will retry for long time, and the test case will timeout and failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all items in client.conf are recommended, so we leave it unchanged. And for unit tests, you can reference this

const std::vector<std::string> clientConf {
to modify the value of those items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, it's useful for unit test, but i think it's unuseful for robot test, like curve_robot_test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123

Which test case causes robot test timeout?

Copy link
Contributor Author

@Wine93 Wine93 May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123

Which test case causes robot test timeout?

The case for read write offset > size, click for test detail and log detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you, you should distinguish whether IO request is exceeded file length or allocate segment failed caused by no space left in the logical pool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap, it's my neglect, i should ignore other errors except NO SPACE error.

UNKNOWN = 100
UNKNOWN = 100,
// You must retry it until success
RETRY_UNTIL_SUCCESS = 200,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use 30 ?

Copy link
Contributor Author

@Wine93 Wine93 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use 30 ?

done.
ps: In my original design, I think the error RETRY_UNTIL_SUCCESS is an another type error, so i use code 200 to distinguish with other error codes.

@@ -114,6 +114,9 @@ LIBCURVE_ERROR MDSClient::MDSRPCExcutor::DoRPCTask(RPCFunc rpctask,
// rpc超时时间
uint64_t rpcTimeOutMS = metaServerOpt_.mdsRPCTimeoutMs;

// The count of normal retry
uint64_t nNormalRetry = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"n"NormalRetry, why doy you use a letter n before this param?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"n"NormalRetry, why doy you use a letter n before this param?

The prefix n represent the count of normal retry.

@@ -934,7 +946,8 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,
int chunksNum = pfs.chunks_size();
if (allocate && chunksNum <= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if GetOrAllocateSegment get segment fail, and the reason is no enough space, the program may not go to here.

bool ChunkSegmentAllocatorImpl::AllocateChunkSegment(FileType type,
        SegmentSizeType segmentSize, ChunkSizeType chunkSize,
        offset_t offset, PageFileSegment *segment)  {
        ..........
        segment->set_chunksize(chunkSize);
        segment->set_segmentsize(segmentSize);
        segment->set_startoffset(offset);

        // allocate chunks
        uint32_t chunkNum = segmentSize/chunkSize;
        std::vector<CopysetIdInfo> copysets;
        if (!topologyChunkAllocator_->
                AllocateChunkRoundRobinInSingleLogicalPool(
                type, chunkNum, chunkSize, &copysets)) {
            LOG(ERROR) << "AllocateChunkRoundRobinInSingleLogicalPool error";
            return false;
        }
message PageFileSegment {
    required uint32 logicalPoolID = 1;
    required uint32 segmentSize = 3;
    required uint32 chunkSize = 4;
    required uint64 startOffset = 2;
    repeated  PageFileChunkInfo chunks = 5;
}

The logicalPoolID is required. The RPC may fail here
if (cntl->Failed()) {

Check this problem please.

Copy link
Contributor Author

@Wine93 Wine93 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filed PageFileSegment is cleared when alloc segment failed, so it will not trigger RPC EREQUEST error.

@Wine93 Wine93 requested a review from cw123 May 6, 2021 07:58
@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
client_metacache_get_leader_timeout_ms: 500
client_metacache_get_leader_retry: 5
client_metacache_rpc_retry_interval_us: 100000
client_mds_normal_retry_times_before_trigger_wait: 3
client_mds_max_wait_ms: 86400000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all items in client.conf are recommended, so we leave it unchanged. And for unit tests, you can reference this

const std::vector<std::string> clientConf {
to modify the value of those items.

@@ -26,6 +26,15 @@ mds.refreshTimesPerLease=4
# mds RPC接口每次重试之前需要先睡眠一段时间
mds.rpcRetryIntervalUS=100000

# The normal retry times for trigger wait strategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also three configuration files cs_client.confpy_client.conf and snap_client.conf, you need add those items into those files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -945,7 +958,7 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,
}
return LIBCURVE_ERROR::OK;
};
return rpcExcutor.DoRPCTask(task, IOPathMaxRetryMS);
return rpcExcutor.DoRPCTask(task, metaServerOpt_.mdsMaxWaitMs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous IOPathMaxRetryMS is more meaningful, so you can rename this to metaServerOpt_.maxRetryMsInIOPath, and rename the corresponding item in configure file.
Also, replace all IOPathMaxRetryMS with metaServerOpt_.maxRetryMsInIOPath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000
client_metacache_get_leader_timeout_ms: 500
client_metacache_get_leader_retry: 5
client_metacache_rpc_retry_interval_us: 100000
client_mds_normal_retry_times_before_trigger_wait: 3
client_mds_max_retry_ms_in_io_path: 86400000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this param is still different of client.conf

Copy link
Contributor Author

@Wine93 Wine93 May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this param is still different of client.conf

done, the param has been unified.

@Wine93 Wine93 requested review from wu-hanqing and cw123 May 7, 2021 09:10
@Wine93
Copy link
Contributor Author

Wine93 commented May 7, 2021

All the review suggestion has been adopted. @cw123 @wu-hanqing

* it will trigger wait strategy, and sleep long time before retry
*/
uint64_t mdsNormalRetryTimesBeforeTriggerWait = 3; // 3 times
uint64_t mdsMaxRetryMsInIOPath = 1 * 3600 * 1000; // 1 hour
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this default value is different from conf file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this default value is different from conf file?

done.

@@ -913,14 +923,18 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate,

auto statuscode = response.statuscode();
switch (statuscode) {
case StatusCode::kParaError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recheck this part of the code

if (errCode == LIBCURVE_ERROR::FAILED ||
, modification in here may cause an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recheck this part of the code

if (errCode == LIBCURVE_ERROR::FAILED ||

, modification in here may cause an error.

yeap, the function should return LIBCURVE_ERROR::FAILED instead of LIBCURVE_ERROR::NOTEXIST and LIBCURVE_ERROR::PARAM_ERROR.

@ilixiaocui ilixiaocui merged commit f455caf into opencurve:master May 10, 2021
ilixiaocui pushed a commit to ilixiaocui/curve that referenced this pull request Feb 6, 2023
add litmuschaos project idea for Q1-2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants