Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[opt](s3) auto retry when meeting 429 error #35396

Merged
merged 3 commits into from
Jun 27, 2024
Merged

Conversation

morningman
Copy link
Contributor

@morningman morningman commented May 26, 2024

Proposed changes

Sometime the s3 sdk will return error like:

QpsLimitExceeded Unable to parse ExceptionName: QpsLimitExceeded Message: Please reduce your request rate. code=429 type=100, request_id=66516C288EC49

We should slowdown the request rate by sleeping for a while.

  • Add 2 new BE config

    • s3_read_base_wait_time_ms and s3_read_max_wait_time_ms

      When meet s3 429 error, the "get" request will
      sleep s3_read_base_wait_time_ms (*1, *2, *3, *4) ms get try again.
      The max sleep time is s3_read_max_wait_time_ms
      and the max retry time is max_s3_client_retry

  • Add more metrics for s3 file reader

    • s3_file_reader_too_many_request: counter of 429 error.

    • s3_file_reader_s3_get_request: the QPS of s3 get request.

    • TotalGetRequest: Get request counter in profile

    • TooManyRequestErr: 429 error counter in profile

    • TooManyRequestSleepTime: Sum of sleep time after 429 error in profile

    • TotalBytesRead: Total bytes read from s3 in profile

image

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman
Copy link
Contributor Author

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TPC-H: Total hot run time: 41184 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b5bbb082354a4c0df8e742d4f916f220609be4c3, data reload: false

------ Round 1 ----------------------------------
q1	17964	4586	4393	4393
q2	2532	201	203	201
q3	10925	1260	1233	1233
q4	11241	803	746	746
q5	7746	2769	2631	2631
q6	218	136	137	136
q7	969	621	609	609
q8	9409	2118	2126	2118
q9	9157	6505	6509	6505
q10	8873	3735	3753	3735
q11	463	243	237	237
q12	479	225	222	222
q13	18576	2980	2980	2980
q14	265	218	218	218
q15	508	471	468	468
q16	538	389	383	383
q17	1005	726	727	726
q18	8166	7551	7642	7551
q19	6453	1564	1535	1535
q20	656	315	318	315
q21	5010	3964	4029	3964
q22	351	278	280	278
Total cold run time: 121504 ms
Total hot run time: 41184 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4384	4242	4235	4235
q2	382	278	263	263
q3	3000	2802	2769	2769
q4	1873	1633	1591	1591
q5	5290	5328	5281	5281
q6	216	127	129	127
q7	2114	1737	1762	1737
q8	3240	3364	3285	3285
q9	8395	8374	8377	8374
q10	3927	3732	3741	3732
q11	605	485	493	485
q12	756	613	580	580
q13	17385	2983	2962	2962
q14	291	264	270	264
q15	513	471	484	471
q16	481	417	425	417
q17	1764	1494	1482	1482
q18	7660	7591	7503	7503
q19	1667	1562	1619	1562
q20	1997	1757	1779	1757
q21	4821	4743	4676	4676
q22	605	496	491	491
Total cold run time: 71366 ms
Total hot run time: 54044 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 169215 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b5bbb082354a4c0df8e742d4f916f220609be4c3, data reload: false

query1	924	394	431	394
query2	6996	2406	2441	2406
query3	6649	212	210	210
query4	19637	17350	17508	17350
query5	4181	445	435	435
query6	254	158	154	154
query7	4599	293	295	293
query8	237	191	190	190
query9	8428	2473	2465	2465
query10	472	286	278	278
query11	10493	10235	10033	10033
query12	137	93	92	92
query13	1667	385	372	372
query14	10297	7817	6339	6339
query15	203	169	166	166
query16	7553	271	268	268
query17	1768	547	515	515
query18	1339	283	277	277
query19	202	151	162	151
query20	101	86	87	86
query21	200	136	132	132
query22	4196	3966	3899	3899
query23	33633	33030	33107	33030
query24	12012	2891	2821	2821
query25	656	348	360	348
query26	1671	165	162	162
query27	2790	322	335	322
query28	7145	2083	2104	2083
query29	1026	619	616	616
query30	275	150	148	148
query31	954	758	759	758
query32	93	59	61	59
query33	774	275	265	265
query34	985	494	502	494
query35	736	613	591	591
query36	1060	919	917	917
query37	176	67	68	67
query38	2889	2785	2763	2763
query39	824	780	798	780
query40	276	131	125	125
query41	49	46	43	43
query42	105	101	130	101
query43	587	558	558	558
query44	1222	729	746	729
query45	175	165	160	160
query46	1084	728	716	716
query47	1843	1761	1777	1761
query48	367	297	308	297
query49	1194	392	386	386
query50	785	408	394	394
query51	6752	6683	6538	6538
query52	100	93	91	91
query53	357	290	289	289
query54	1031	446	442	442
query55	71	74	70	70
query56	268	254	253	253
query57	1120	1050	1057	1050
query58	263	211	211	211
query59	3504	3149	3151	3149
query60	272	262	256	256
query61	88	86	87	86
query62	656	449	455	449
query63	313	285	291	285
query64	9787	2283	1792	1792
query65	3183	3100	3101	3100
query66	1267	337	332	332
query67	15164	14864	15030	14864
query68	5522	576	552	552
query69	488	274	279	274
query70	1115	1157	1146	1146
query71	449	279	274	274
query72	7227	3695	2561	2561
query73	774	332	331	331
query74	5971	5646	5562	5562
query75	3624	2651	2621	2621
query76	3308	1038	1009	1009
query77	657	273	326	273
query78	10213	9780	9695	9695
query79	2543	520	525	520
query80	1299	438	439	438
query81	502	224	224	224
query82	821	94	91	91
query83	204	182	175	175
query84	271	93	88	88
query85	1391	274	261	261
query86	443	326	294	294
query87	3334	3140	3097	3097
query88	4370	2463	2441	2441
query89	492	405	385	385
query90	1947	200	197	197
query91	128	150	96	96
query92	58	50	49	49
query93	3580	529	519	519
query94	1130	191	190	190
query95	415	326	318	318
query96	601	284	270	270
query97	3201	3057	2977	2977
query98	260	214	211	211
query99	1197	869	820	820
Total cold run time: 277650 ms
Total hot run time: 169215 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.55 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b5bbb082354a4c0df8e742d4f916f220609be4c3, data reload: false

query1	0.04	0.04	0.02
query2	0.08	0.04	0.04
query3	0.23	0.04	0.05
query4	1.67	0.07	0.08
query5	0.50	0.48	0.50
query6	1.12	0.72	0.72
query7	0.03	0.02	0.01
query8	0.06	0.04	0.04
query9	0.53	0.48	0.48
query10	0.54	0.56	0.55
query11	0.15	0.11	0.12
query12	0.14	0.12	0.12
query13	0.59	0.58	0.59
query14	0.77	0.77	0.78
query15	0.83	0.81	0.81
query16	0.37	0.37	0.36
query17	1.01	1.02	0.98
query18	0.20	0.26	0.23
query19	1.78	1.74	1.77
query20	0.01	0.01	0.01
query21	15.64	0.66	0.64
query22	4.41	7.06	1.91
query23	18.29	1.33	1.32
query24	2.07	0.21	0.22
query25	0.14	0.08	0.08
query26	0.27	0.17	0.17
query27	0.08	0.08	0.08
query28	13.26	1.01	0.99
query29	13.18	3.28	3.29
query30	0.25	0.06	0.06
query31	2.85	0.38	0.38
query32	3.26	0.47	0.48
query33	2.91	2.88	2.89
query34	17.28	4.40	4.42
query35	4.49	4.49	4.50
query36	0.65	0.46	0.46
query37	0.19	0.15	0.16
query38	0.17	0.15	0.14
query39	0.05	0.03	0.04
query40	0.15	0.14	0.14
query41	0.09	0.05	0.04
query42	0.06	0.05	0.04
query43	0.03	0.03	0.04
Total cold run time: 110.42 s
Total hot run time: 30.55 s

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.78% (9011/25185)
Line Coverage: 27.39% (74588/272333)
Region Coverage: 26.61% (38600/145047)
Branch Coverage: 23.48% (19680/83802)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b5bbb082354a4c0df8e742d4f916f220609be4c3_b5bbb082354a4c0df8e742d4f916f220609be4c3/report/index.html

kaka11chen
kaka11chen previously approved these changes May 30, 2024
Copy link
Contributor

@kaka11chen kaka11chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 30, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@AshinGau
Copy link
Member

LGTM

gavinchou
gavinchou previously approved these changes Jun 3, 2024
bvar::LatencyRecorder s3_bytes_per_read("s3_file_reader", "bytes_per_read"); // also QPS
bvar::PerSecond<bvar::Adder<uint64_t>> s3_read_througthput("s3_file_reader", "s3_read_throughput",
&s3_bytes_read_total);
// Although we can get QPS from s3_bytes_per_read, but s3_bytes_per_read only
// record successfull request, and s3_get_request_qps will record all request.
bvar::PerSecond<bvar::Adder<uint64_t>> s3_get_request_qps("s3_file_reader", "s3_get_request",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3_bvar::s3_get_latency can get qps/P99 latency altogether. It seems no need to add this bvar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3_get_latency can only save "success" request.
s3_get_request_qps saves both "success" and "failed request"

@morningman
Copy link
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Jun 24, 2024
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TPC-H: Total hot run time: 39925 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 881b681ddbb72dfc52409a40f85182c751f69285, data reload: false

------ Round 1 ----------------------------------
q1	17590	4353	4281	4281
q2	2017	200	194	194
q3	10449	1095	1048	1048
q4	10194	751	810	751
q5	7537	2682	2679	2679
q6	222	134	136	134
q7	960	611	636	611
q8	9218	2079	2056	2056
q9	9104	6512	6408	6408
q10	8872	3735	3715	3715
q11	451	246	240	240
q12	472	233	227	227
q13	18891	3010	3003	3003
q14	274	228	224	224
q15	515	477	482	477
q16	521	386	389	386
q17	981	661	738	661
q18	8082	7707	7425	7425
q19	8200	1430	1498	1430
q20	663	326	318	318
q21	4872	3326	3891	3326
q22	413	331	341	331
Total cold run time: 120498 ms
Total hot run time: 39925 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4423	4361	4239	4239
q2	370	261	267	261
q3	3171	2953	2940	2940
q4	2007	1788	1706	1706
q5	5470	5465	5499	5465
q6	244	134	136	134
q7	2307	1845	1822	1822
q8	3298	3448	3423	3423
q9	8709	8756	8725	8725
q10	4019	3749	3885	3749
q11	617	497	511	497
q12	794	598	659	598
q13	16002	3140	3157	3140
q14	303	274	277	274
q15	539	480	487	480
q16	508	440	431	431
q17	1835	1519	1520	1519
q18	8129	7838	7849	7838
q19	3796	1539	1640	1539
q20	3069	1900	1845	1845
q21	5209	4939	4847	4847
q22	651	577	602	577
Total cold run time: 75470 ms
Total hot run time: 56049 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 175217 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 881b681ddbb72dfc52409a40f85182c751f69285, data reload: false

query1	930	394	380	380
query2	6434	2357	2340	2340
query3	6626	199	205	199
query4	18726	17561	17457	17457
query5	3577	473	476	473
query6	260	160	158	158
query7	4594	303	305	303
query8	324	315	319	315
query9	8519	2492	2480	2480
query10	582	314	296	296
query11	10480	9895	10012	9895
query12	110	85	85	85
query13	1630	382	365	365
query14	8662	7111	7148	7111
query15	239	189	191	189
query16	7570	278	278	278
query17	1381	576	544	544
query18	1927	303	283	283
query19	209	166	165	165
query20	90	85	87	85
query21	210	134	126	126
query22	4335	4123	4137	4123
query23	33791	33763	33792	33763
query24	11269	2887	2885	2885
query25	633	412	398	398
query26	722	162	168	162
query27	2252	334	336	334
query28	6091	2181	2191	2181
query29	912	660	670	660
query30	262	160	154	154
query31	964	773	782	773
query32	105	56	58	56
query33	776	309	308	308
query34	1038	497	483	483
query35	761	679	722	679
query36	1134	997	989	989
query37	144	72	80	72
query38	2897	2835	2820	2820
query39	902	835	857	835
query40	209	131	133	131
query41	56	56	54	54
query42	116	114	102	102
query43	608	567	549	549
query44	1180	737	722	722
query45	194	162	166	162
query46	1073	726	716	716
query47	1838	1779	1767	1767
query48	381	310	314	310
query49	820	449	426	426
query50	779	403	391	391
query51	6867	6801	6796	6796
query52	110	92	95	92
query53	353	292	296	292
query54	882	457	448	448
query55	74	75	73	73
query56	283	261	282	261
query57	1134	1065	1045	1045
query58	247	236	242	236
query59	3383	3236	3361	3236
query60	310	272	277	272
query61	94	92	89	89
query62	594	454	450	450
query63	321	298	297	297
query64	8953	2275	1754	1754
query65	3144	3148	3113	3113
query66	762	322	321	321
query67	15476	15028	14854	14854
query68	7754	556	543	543
query69	712	467	458	458
query70	1234	1117	1120	1117
query71	489	320	277	277
query72	8914	5450	5649	5450
query73	823	330	324	324
query74	5836	5457	5504	5457
query75	4779	2672	2702	2672
query76	4436	976	990	976
query77	813	310	312	310
query78	10430	9939	9731	9731
query79	5448	524	509	509
query80	970	471	483	471
query81	588	223	220	220
query82	565	112	105	105
query83	337	180	178	178
query84	281	93	87	87
query85	1057	324	272	272
query86	406	301	330	301
query87	3301	3077	3147	3077
query88	3994	2451	2439	2439
query89	494	396	388	388
query90	1893	198	193	193
query91	130	106	101	101
query92	63	51	48	48
query93	4909	509	519	509
query94	1146	193	193	193
query95	401	316	318	316
query96	621	273	276	273
query97	3216	3044	3076	3044
query98	216	218	195	195
query99	1144	822	830	822
Total cold run time: 279709 ms
Total hot run time: 175217 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 881b681ddbb72dfc52409a40f85182c751f69285, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.03	0.04
query3	0.22	0.05	0.05
query4	1.69	0.07	0.08
query5	0.49	0.48	0.48
query6	1.14	0.73	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.56	0.50	0.50
query10	0.55	0.56	0.54
query11	0.15	0.12	0.12
query12	0.15	0.12	0.12
query13	0.59	0.59	0.59
query14	0.77	0.80	0.78
query15	0.84	0.82	0.82
query16	0.40	0.37	0.36
query17	1.05	1.05	0.97
query18	0.23	0.25	0.22
query19	1.90	1.79	1.76
query20	0.01	0.01	0.01
query21	15.43	0.67	0.65
query22	4.45	6.74	2.29
query23	18.27	1.41	1.32
query24	2.10	0.24	0.23
query25	0.16	0.08	0.08
query26	0.28	0.18	0.18
query27	0.09	0.09	0.07
query28	13.28	1.04	1.01
query29	12.62	3.29	3.25
query30	0.26	0.07	0.05
query31	2.85	0.39	0.39
query32	3.28	0.47	0.48
query33	2.96	2.95	2.98
query34	17.12	4.43	4.40
query35	4.57	4.49	4.47
query36	0.66	0.47	0.48
query37	0.18	0.17	0.16
query38	0.17	0.15	0.15
query39	0.05	0.03	0.03
query40	0.17	0.14	0.14
query41	0.09	0.04	0.04
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 110.08 s
Total hot run time: 31.11 s

while (retry_count <= max_retries) {
s3_file_reader_read_counter << 1;
// clang-format off
auto resp = client->get_object( { .bucket = _bucket, .key = _key, },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto resp = client->get_object( { .bucket = _bucket, .key = _key, },
{
SCOPED_BVAR_LATENCY(s3_bvar::s3_get_latency);
auto resp = client->get_object( { .bucket = _bucket, .key = _key, },
}

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 27, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

@kaka11chen kaka11chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gavinchou gavinchou merged commit 493dc26 into apache:master Jun 27, 2024
34 of 40 checks passed
dataroaring pushed a commit that referenced this pull request Jun 28, 2024
Sometime the s3 sdk will return error like:
```
QpsLimitExceeded Unable to parse ExceptionName: QpsLimitExceeded Message: Please reduce your request rate. code=429 type=100, request_id=66516C288EC49
```
We should slowdown the request rate by sleeping for a while.

- Add 2 new BE config

	- `s3_read_base_wait_time_ms` and `s3_read_max_wait_time_ms`

		When meet s3 429 error, the "get" request will
		sleep `s3_read_base_wait_time_ms (*1, *2, *3, *4)` ms get try again.
		The max sleep time is s3_read_max_wait_time_ms
		and the max retry time is max_s3_client_retry
		
- Add more metrics for s3 file reader

	- `s3_file_reader_too_many_request`: counter of 429 error.
	- `s3_file_reader_s3_get_request`: the QPS of s3 get request.

	- `TotalGetRequest`: Get request counter in profile
	- `TooManyRequestErr`: 429 error counter in profile
- `TooManyRequestSleepTime`: Sum of sleep time after 429 error in
profile
	- `TotalBytesRead`: Total bytes read from s3 in profile


![image](https://github.com/apache/doris/assets/2899462/2e8f5837-270b-48c7-9397-160aeac143eb)
@xiaokang xiaokang mentioned this pull request Aug 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants