
[Improve](StreamingJob) add more metrics to observe the streaming job #60493

Open
JNSimba wants to merge 5 commits into apache:master from JNSimba:add_streamingjob_metric
@JNSimba
Member

@JNSimba JNSimba commented Feb 4, 2026

What problem does this PR solve?

Issue Number: close #xxx

Add more metrics to observe the streaming job:

| Metrics | Module | Description |
| --- | --- | --- |
| streaming_job_get_meta_latency | FE | Time spent fetching source metadata for streaming jobs |
| streaming_job_get_meta_count | FE | Number of times source metadata is fetched for streaming jobs |
| streaming_job_get_meta_fail_count | FE | Number of failures when fetching source metadata for streaming jobs |
| streaming_job_task_execute_time | FE | Total execution time of streaming job tasks |
| streaming_job_task_execute_count | FE | Total number of executed streaming job tasks |
| streaming_job_task_failed_count | FE | Total number of failed streaming job tasks |
| streaming_job_total_rows | FE | Total number of rows processed by streaming jobs |
| streaming_job_filter_rows | FE | Total number of rows filtered out by streaming jobs |
| streaming_job_load_bytes | FE | Total data volume loaded by streaming jobs (in bytes) |
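For readers who want to see how cumulative counters like these are usually combined, here is a minimal sketch. It does not use the Doris `MetricRepo` API: the class `StreamingJobMetricsSketch`, its fields, and its methods are hypothetical, the latency unit is assumed to be milliseconds, and failed fetches are assumed to count toward both the fetch count and the latency total.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for three of the counters listed above.
// This is NOT the Doris MetricRepo API; it only illustrates the counting pattern.
final class StreamingJobMetricsSketch {
    // Assumption: latency is accumulated in milliseconds.
    private final LongAdder getMetaLatencyMs = new LongAdder();
    private final LongAdder getMetaCount = new LongAdder();
    private final LongAdder getMetaFailCount = new LongAdder();

    // Record one metadata fetch; failed fetches still count as fetches here (assumption).
    void recordMetaFetch(long elapsedMs, boolean success) {
        getMetaCount.increment();
        getMetaLatencyMs.add(elapsedMs);
        if (!success) {
            getMetaFailCount.increment();
        }
    }

    // Average latency per fetch, derived from the two cumulative counters.
    double avgMetaLatencyMs() {
        long count = getMetaCount.sum();
        return count == 0 ? 0.0 : (double) getMetaLatencyMs.sum() / count;
    }

    // Fraction of fetches that failed.
    double metaFailRatio() {
        long count = getMetaCount.sum();
        return count == 0 ? 0.0 : (double) getMetaFailCount.sum() / count;
    }

    public static void main(String[] args) {
        StreamingJobMetricsSketch metrics = new StreamingJobMetricsSketch();
        metrics.recordMetaFetch(30, true);
        metrics.recordMetaFetch(50, false);
        System.out.println("avg latency ms = " + metrics.avgMetaLatencyMs());
        System.out.println("fail ratio = " + metrics.metaFailRatio());
    }
}
```

Keeping a latency total next to a count means the average can be computed at query time, so the exporter only has to publish monotonically increasing values.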

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Feb 4, 2026

run buildall

Copilot AI left a comment

Pull request overview

Adds new FE-side metrics to improve observability of streaming insert jobs, along with a regression test that validates the metrics are exposed via the FE /metrics endpoint.

Changes:

  • Registers new streaming job counter metrics in MetricRepo and adds streaming job state gauges.
  • Increments the new counters from StreamingInsertJob lifecycle points (meta fetch, task success/failure, offset commit).
  • Adds a MySQL CDC regression test that polls FE metrics until all expected streaming-job metrics are present.
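The polling check described in the last bullet can be sketched roughly as follows. This is not the Groovy regression test from the PR: the class `MetricsPresenceSketch` and its methods are hypothetical, the metrics text is assumed to use the Prometheus text format, and matching by name suffix is an assumption made so that an exporter prefix does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper; it only illustrates "poll until all expected metric names appear".
final class MetricsPresenceSketch {

    // Return the expected names that do not appear in the metrics text.
    static List<String> missingMetrics(String metricsText, List<String> expected) {
        List<String> present = new ArrayList<>();
        for (String line : metricsText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            // The metric name ends at the first '{' (labels) or whitespace (value).
            int end = 0;
            while (end < trimmed.length()
                    && trimmed.charAt(end) != '{'
                    && !Character.isWhitespace(trimmed.charAt(end))) {
                end++;
            }
            present.add(trimmed.substring(0, end));
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            boolean found = false;
            for (String metricName : present) {
                // Suffix match tolerates an exporter prefix (assumption).
                if (metricName.endsWith(name)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Poll the supplier until nothing is missing or the attempts run out.
    static boolean waitForMetrics(Supplier<String> fetch, List<String> expected,
                                  int maxAttempts, long sleepMs) {
        for (int i = 0; i < maxAttempts; i++) {
            if (missingMetrics(fetch.get(), expected).isEmpty()) {
                return true;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample text, not real FE output.
        String sample = "# TYPE example counter\n"
                + "doris_fe_streaming_job_total_rows 42\n"
                + "doris_fe_streaming_job_load_bytes{job=\"demo\"} 7\n";
        List<String> expected = List.of("streaming_job_total_rows",
                "streaming_job_load_bytes", "streaming_job_filter_rows");
        System.out.println("missing: " + missingMetrics(sample, expected));
    }
}
```

In a real test the supplier would wrap an HTTP GET against the FE /metrics endpoint; it is left abstract here so the sketch has no network dependency.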

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
regression-test/suites/job_p0/streaming_job/cdc/test_streaming_mysql_job_metrics.groovy New regression test validating FE exports the expected streaming job metrics.
fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java Registers streaming job counters and adds streaming job state gauge metrics.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java Emits streaming job metric increments during meta fetch, task completion, and offset/stat updates.

@JNSimba
Member Author

JNSimba commented Feb 5, 2026

run buildall

@doris-robot
TPC-H: Total hot run time: 31401 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2e260d1c5e99a99bd35c76185fb8945060717a76, data reload: false

------ Round 1 ----------------------------------
q1	17643	4450	4288	4288
q2	1998	347	235	235
q3	10159	1259	711	711
q4	10218	891	317	317
q5	7540	2146	1930	1930
q6	196	175	143	143
q7	854	743	609	609
q8	9261	1384	1098	1098
q9	5061	4854	4812	4812
q10	6832	1938	1539	1539
q11	491	302	285	285
q12	337	380	236	236
q13	17786	4015	3250	3250
q14	232	240	221	221
q15	877	799	810	799
q16	686	675	617	617
q17	632	776	485	485
q18	6745	6467	6935	6467
q19	1379	1023	681	681
q20	411	396	282	282
q21	2936	2271	2108	2108
q22	382	333	288	288
Total cold run time: 102656 ms
Total hot run time: 31401 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4654	4475	4556	4475
q2	273	338	268	268
q3	2388	2884	2514	2514
q4	1433	1893	1407	1407
q5	4750	4547	4683	4547
q6	246	190	150	150
q7	2077	1935	1757	1757
q8	2575	2460	2426	2426
q9	7744	7456	7322	7322
q10	2860	3085	2562	2562
q11	544	481	481	481
q12	732	712	566	566
q13	3599	4017	3206	3206
q14	266	277	254	254
q15	818	785	785	785
q16	651	684	641	641
q17	1070	1242	1291	1242
q18	7477	7235	7225	7225
q19	853	806	802	802
q20	1955	2042	1868	1868
q21	4528	4220	4145	4145
q22	577	544	508	508
Total cold run time: 52070 ms
Total hot run time: 49151 ms
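
The bot output does not label its columns. Reading them as one cold run, two hot runs, and the faster of the two hot runs is an inference from the numbers: the fourth column equals the smaller of the second and third in every row, and the column sums reproduce the reported totals. A minimal sketch of that aggregation, with a hypothetical class name and three rows copied from Round 1:

```java
// Hypothetical helper showing how the reported totals appear to be derived.
final class BenchTotalsSketch {

    // Each row is {cold, hot1, hot2} in milliseconds.
    static long totalCold(long[][] rows) {
        long sum = 0;
        for (long[] row : rows) {
            sum += row[0];
        }
        return sum;
    }

    // The hot total sums the faster of the two hot runs per query.
    static long totalHot(long[][] rows) {
        long sum = 0;
        for (long[] row : rows) {
            sum += Math.min(row[1], row[2]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // q1, q15 and q18 from Round 1 above.
        long[][] rows = {
            {17643, 4450, 4288},
            {877, 799, 810},
            {6745, 6467, 6935},
        };
        System.out.println("cold = " + totalCold(rows) + " ms"); // 25265
        System.out.println("hot  = " + totalHot(rows) + " ms");  // 11554
    }
}
```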

@doris-robot
ClickBench: Total hot run time: 29.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2e260d1c5e99a99bd35c76185fb8945060717a76, data reload: false

query1	0.05	0.05	0.06
query2	0.09	0.04	0.05
query3	0.26	0.08	0.08
query4	1.60	0.11	0.11
query5	0.27	0.25	0.27
query6	1.16	0.66	0.66
query7	0.03	0.03	0.02
query8	0.05	0.04	0.04
query9	0.55	0.52	0.47
query10	0.56	0.56	0.53
query11	0.14	0.10	0.09
query12	0.14	0.11	0.11
query13	0.63	0.61	0.63
query14	1.07	1.06	1.05
query15	0.87	0.85	0.88
query16	0.41	0.40	0.43
query17	1.10	1.08	1.11
query18	0.21	0.22	0.21
query19	2.03	1.97	2.02
query20	0.02	0.01	0.01
query21	15.42	0.28	0.15
query22	5.11	0.05	0.06
query23	15.93	0.28	0.11
query24	0.99	1.25	1.07
query25	0.13	0.18	0.20
query26	0.14	0.14	0.14
query27	0.07	0.06	0.05
query28	5.39	1.12	0.96
query29	12.55	3.94	3.15
query30	0.28	0.13	0.11
query31	2.82	0.66	0.40
query32	3.23	0.60	0.49
query33	3.23	3.29	3.24
query34	15.86	5.40	4.75
query35	4.86	4.74	4.86
query36	0.65	0.50	0.49
query37	0.11	0.07	0.07
query38	0.08	0.04	0.04
query39	0.05	0.03	0.03
query40	0.19	0.16	0.16
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 98.51 s
Total hot run time: 29.11 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 53.33% (32/60) 🎉
Increment coverage report
Complete coverage report

3 participants