-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Statistics Tech Debt #55043
Comments
NCRMTA1: Surprise analyze-partition-concurrency-quota
The whole document is about how we can improve in the future. So please do not take it as criticism, even though I call it that, nobody can really master TiDB analyze. BackgroundIn this PR, #55046, I aimed to delete some duplicated code related to saving stats results. Concurrently saving analysis resultsWhen analyzing partitioned tables, it is important to concurrently analyze their partitions instead of analyzing them one by one. As you can see, we will spawn some workers to analyze these partitions concurrently, then we will collect all analysis results in the result handler. As you can see, we spawn some saving workers to save the result table by table. This is where the analyze-partition-concurrency-quotaThis is a configuration for the TiDB server. Its default value is 16. tidb_servers:
- host: 10.0.1.14
config:
performance.analyze-partition-concurrency-quota: 16 The name is somewhat misleading, as it suggests that this configuration controls the concurrency of partition analysis. In reality, it only governs a portion of the process, specifically the saving phase. analyzeConcurrencyQuota := int(config.GetGlobalConfig().Performance.AnalyzePartitionConcurrencyQuota)
analyzeCtxs, err := createSessions(store, analyzeConcurrencyQuota)
if err != nil {
return nil, err
}
subCtxs2 := make([]sessionctx.Context, analyzeConcurrencyQuota)
for i := 0; i < analyzeConcurrencyQuota; i++ {
subCtxs2[i] = analyzeCtxs[i]
}
dom.SetupAnalyzeExec(subCtxs2) While bootstrapping the domain, we read this configuration and initialize dedicated sessions for statistics collection. analyzeMu struct {
sync.Mutex
sctxs map[sessionctx.Context]bool
} // FetchAnalyzeExec gets needed exec for analyze
func (do *Domain) FetchAnalyzeExec(need int) []sessionctx.Context {
if need < 1 {
return nil
}
count := 0
r := make([]sessionctx.Context, 0)
do.analyzeMu.Lock()
defer do.analyzeMu.Unlock()
for sctx, used := range do.analyzeMu.sctxs {
if used {
continue
}
r = append(r, sctx)
do.analyzeMu.sctxs[sctx] = true
count++
if count >= need {
break
}
}
return r
} If no sessions are available, we return an empty slice. The next question is: what happens when no sessions are available? if partitionStatsConcurrency > 1 {
subSctxs := dom.FetchAnalyzeExec(partitionStatsConcurrency)
...
if len(subSctxs) > 0 {
sessionCount := len(subSctxs)
logutil.BgLogger().Info("use multiple sessions to save analyze results", zap.Int("sessionCount", sessionCount))
defer func() {
dom.ReleaseAnalyzeExec(subSctxs)
}()
return e.handleResultsErrorWithConcurrency(internalCtx, concurrency, needGlobalStats, subSctxs, globalStatsMap, resultsCh)
}
}
logutil.BgLogger().Info("use single session to save analyze results")
failpoint.Inject("handleResultsErrorSingleThreadPanic", nil)
subSctxs := []sessionctx.Context{e.Ctx()}
return e.handleResultsErrorWithConcurrency(internalCtx, concurrency, needGlobalStats, subSctxs, globalStatsMap, resultsCh) Before I demonstrate the problem here, let's look at another session variable first. tidb_analyze_partition_concurrencyThis is a session variable. As you can see, its name is very similar to the above configuration. The same confusing name, the same function. partitionStatsConcurrency := e.Ctx().GetSessionVars().AnalyzePartitionConcurrency
// the concurrency of handleResultsError cannot be more than partitionStatsConcurrency
partitionStatsConcurrency = min(taskNum, partitionStatsConcurrency)
if partitionStatsConcurrency > 1 {
subSctxs := dom.FetchAnalyzeExec(partitionStatsConcurrency)
...
if len(subSctxs) > 0 {
sessionCount := len(subSctxs)
logutil.BgLogger().Info("use multiple sessions to save analyze results", zap.Int("sessionCount", sessionCount))
defer func() {
dom.ReleaseAnalyzeExec(subSctxs)
}()
return e.handleResultsErrorWithConcurrency(internalCtx, concurrency, needGlobalStats, subSctxs, globalStatsMap, resultsCh)
}
}
logutil.BgLogger().Info("use single session to save analyze results")
failpoint.Inject("handleResultsErrorSingleThreadPanic", nil)
subSctxs := []sessionctx.Context{e.Ctx()}
return e.handleResultsErrorWithConcurrency(internalCtx, concurrency, needGlobalStats, subSctxs, globalStatsMap, resultsCh) So basically, we think tidb_analyze_partition_concurrency has a higher priority, and we will use it to determine if we need to spawn multiple workers to save the analysis results. What is the problem?
{
Scope: ScopeGlobal | ScopeSession, Name: TiDBAnalyzePartitionConcurrency, Value: strconv.FormatInt(DefTiDBAnalyzePartitionConcurrency, 10),
MinValue: 1, MaxValue: uint64(config.GetGlobalConfig().Performance.AnalyzePartitionConcurrencyQuota), SetSession: func(s *SessionVars, val string) error {
s.AnalyzePartitionConcurrency = int(TidbOptInt64(val, DefTiDBAnalyzePartitionConcurrency))
return nil
},
},
// ValidateFromType provides automatic validation based on the SysVar's type
func (sv *SysVar) ValidateFromType(vars *SessionVars, value string, scope ScopeFlag) (string, error) {
// Some sysvars in TiDB have a special behavior where the empty string means
// "use the config file value". This needs to be cleaned up once the behavior
// for instance variables is determined.
if value == "" && ((sv.AllowEmpty && scope == ScopeSession) || sv.AllowEmptyAll) {
return value, nil
}
// Provide validation using the SysVar struct
switch sv.Type {
case TypeUnsigned:
return sv.checkUInt64SystemVar(value, vars)
case TypeInt:
return sv.checkInt64SystemVar(value, vars)
case TypeBool:
return sv.checkBoolSystemVar(value, vars)
case TypeFloat:
return sv.checkFloatSystemVar(value, vars)
case TypeEnum:
return sv.checkEnumSystemVar(value, vars)
case TypeTime:
return sv.checkTimeSystemVar(value, vars)
case TypeDuration:
return sv.checkDurationSystemVar(value, vars)
}
return value, nil // typeString
} What can we learn from it?
What is worse?
How to get rid of it?The reason I believe we should delete it is that it makes tidb_analyze_partition_concurrency useless. If you really want to improve the analysis performance for partitioned tables, you would need to change this configuration and restart the cluster. This is unacceptable for some users. A value of 16 is quite small compared to the number of partitions in a partitioned table. |
Tech Debt
In this issue, I will record all tech debts I found in the TiDB statistics module.
Code duplication
The worst and most serious technical debt in the statistics module is code redundancy, and the following implementation pattern is used almost everywhere there is concurrent processing:
This pattern creates a lot of problems, we have a few issues that need to be fixed twice, and code redundancy is serious. Theoretically, we just need to think of single-threaded as a special case of a multi-threaded implementation, and we shouldn't be copying and pasting code.
Here are the relevant modules that have this problem:
and so on...
Too many variables
We have a lot of variables related to the collection of statistics, most of which were introduced when concurrency support was introduced, and they have a variety of names that are very difficult to understand. It is also not clear from the documentation how these variables affect the system.
Even some of these variables are actually related to each other, which makes it very challenging for users to work with statistical information.
We need to:
The text was updated successfully, but these errors were encountered: