-
Notifications
You must be signed in to change notification settings - Fork 526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: retry allocate segment until success #338
Conversation
@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000 | |||
client_metacache_get_leader_timeout_ms: 500 | |||
client_metacache_get_leader_retry: 5 | |||
client_metacache_rpc_retry_interval_us: 100000 | |||
client_mds_normal_retry_times_before_trigger_wait: 3 | |||
client_mds_max_wait_ms: 86400000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the number of mds max wait ms is different between clent.conf and this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the number of mds max wait ms is different between clent.conf and this.
The config in client.conf
is for robot test, if we use the real value (like 86400000
), it will retry for long time, and the test case will timeout and failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost all items in client.conf
are recommended, so we leave it unchanged. And for unit tests, you can reference this
const std::vector<std::string> clientConf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, it's useful for unit test, but i think it's unuseful for robot test, like curve_robot_test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123
Which test case causes robot test timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have confirmed that the robot test will use the config file under the conf directory. @wu-hanqing @cw123
Which test case causes robot test timeout?
The case for read write offset > size
, click for test detail and log detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you, you should distinguish whether IO request is exceeded file length or allocate segment failed caused by no space left in the logical pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeap, it's my neglect, i should ignore other errors except NO SPACE error.
include/client/libcurve.h
Outdated
UNKNOWN = 100 | ||
UNKNOWN = 100, | ||
// You must retry it until success | ||
RETRY_UNTIL_SUCCESS = 200, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use 30 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use 30 ?
done.
ps: In my original design, I think the error RETRY_UNTIL_SUCCESS
is an another type error, so i use code 200
to distinguish with other error codes.
src/client/mds_client.cpp
Outdated
@@ -114,6 +114,9 @@ LIBCURVE_ERROR MDSClient::MDSRPCExcutor::DoRPCTask(RPCFunc rpctask, | |||
// rpc超时时间 | |||
uint64_t rpcTimeOutMS = metaServerOpt_.mdsRPCTimeoutMs; | |||
|
|||
// The count of normal retry | |||
uint64_t nNormalRetry = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n"NormalRetry, why doy you use a letter n before this param?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n"NormalRetry, why doy you use a letter n before this param?
The prefix n
represent the count of normal retry.
@@ -934,7 +946,8 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate, | |||
int chunksNum = pfs.chunks_size(); | |||
if (allocate && chunksNum <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if GetOrAllocateSegment get segment fail, and the reason is no enough space, the program may not go to here.
bool ChunkSegmentAllocatorImpl::AllocateChunkSegment(FileType type,
SegmentSizeType segmentSize, ChunkSizeType chunkSize,
offset_t offset, PageFileSegment *segment) {
..........
segment->set_chunksize(chunkSize);
segment->set_segmentsize(segmentSize);
segment->set_startoffset(offset);
// allocate chunks
uint32_t chunkNum = segmentSize/chunkSize;
std::vector<CopysetIdInfo> copysets;
if (!topologyChunkAllocator_->
AllocateChunkRoundRobinInSingleLogicalPool(
type, chunkNum, chunkSize, ©sets)) {
LOG(ERROR) << "AllocateChunkRoundRobinInSingleLogicalPool error";
return false;
}
message PageFileSegment {
required uint32 logicalPoolID = 1;
required uint32 segmentSize = 3;
required uint32 chunkSize = 4;
required uint64 startOffset = 2;
repeated PageFileChunkInfo chunks = 5;
}
The logicalPoolID is required. The RPC may fail here
if (cntl->Failed()) {
Check this problem please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filed PageFileSegment
is cleared when alloc segment failed, so it will not trigger RPC EREQUEST
error.
@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000 | |||
client_metacache_get_leader_timeout_ms: 500 | |||
client_metacache_get_leader_retry: 5 | |||
client_metacache_rpc_retry_interval_us: 100000 | |||
client_mds_normal_retry_times_before_trigger_wait: 3 | |||
client_mds_max_wait_ms: 86400000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost all items in client.conf
are recommended, so we leave it unchanged. And for unit tests, you can reference this
const std::vector<std::string> clientConf { |
@@ -26,6 +26,15 @@ mds.refreshTimesPerLease=4 | |||
# mds RPC接口每次重试之前需要先睡眠一段时间 | |||
mds.rpcRetryIntervalUS=100000 | |||
|
|||
# The normal retry times for trigger wait strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are also three configuration files cs_client.conf
、py_client.conf
and snap_client.conf
, you need add those items into those files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/client/mds_client.cpp
Outdated
@@ -945,7 +958,7 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate, | |||
} | |||
return LIBCURVE_ERROR::OK; | |||
}; | |||
return rpcExcutor.DoRPCTask(task, IOPathMaxRetryMS); | |||
return rpcExcutor.DoRPCTask(task, metaServerOpt_.mdsMaxWaitMs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous IOPathMaxRetryMS
is more meaningful, so you can rename this to metaServerOpt_.maxRetryMsInIOPath
, and rename the corresponding item in configure file.
Also, replace all IOPathMaxRetryMS
with metaServerOpt_.maxRetryMsInIOPath
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -190,6 +190,9 @@ client_mds_rpc_retry_interval_us: 100000 | |||
client_metacache_get_leader_timeout_ms: 500 | |||
client_metacache_get_leader_retry: 5 | |||
client_metacache_rpc_retry_interval_us: 100000 | |||
client_mds_normal_retry_times_before_trigger_wait: 3 | |||
client_mds_max_retry_ms_in_io_path: 86400000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this param is still different of client.conf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this param is still different of client.conf
done, the param has been unified.
All the review suggestion has been adopted. @cw123 @wu-hanqing |
src/client/config_info.h
Outdated
* it will trigger wait strategy, and sleep long time before retry | ||
*/ | ||
uint64_t mdsNormalRetryTimesBeforeTriggerWait = 3; // 3 times | ||
uint64_t mdsMaxRetryMsInIOPath = 1 * 3600 * 1000; // 1 hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this default value is different from conf file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this default value is different from conf file?
done.
@@ -913,14 +923,18 @@ LIBCURVE_ERROR MDSClient::GetOrAllocateSegment(bool allocate, | |||
|
|||
auto statuscode = response.statuscode(); | |||
switch (statuscode) { | |||
case StatusCode::kParaError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recheck this part of the code
Line 201 in aa0e97b
if (errCode == LIBCURVE_ERROR::FAILED || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recheck this part of the code
Line 201 in aa0e97b
if (errCode == LIBCURVE_ERROR::FAILED || , modification in here may cause an error.
yeap, the function should return LIBCURVE_ERROR::FAILED
instead of LIBCURVE_ERROR::NOTEXIST
and LIBCURVE_ERROR::PARAM_ERROR
.
add litmuschaos project idea for Q1-2021
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
What is changed and how it works?
What's Changed:
How it Works:
Side effects(Breaking backward compatibility? Performance regression?):
Check List