Doc load opt #60511

Open
eldenmoon wants to merge 2 commits into apache:master from eldenmoon:doc-load-opt

Conversation

@eldenmoon (Member)

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings February 4, 2026 17:11
@eldenmoon (Member Author)

run buildall

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was modified, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.

Copilot AI left a comment

Pull request overview

This PR focuses on optimizing VARIANT “doc mode” ingestion/compaction paths by reducing overhead in doc-value serialization/materialization and by switching several hot-path maps/sets to phmap for better performance, with accompanying new/expanded tests.

Changes:

  • Optimize doc-value column handling (avoid sorting, faster serialization path, phmap usage in several structures).
  • Adjust variant compaction validation logic to tolerate doc/sparse extracted columns in output schemas.
  • Add/extend unit tests, including a new doc-mode compaction test and additional doc-value serialization coverage.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Summary per file:

  • be/test/olap/variant_doc_mode_compaction_test.cpp — Adds a new end-to-end compaction test for VARIANT doc mode (currently heavy and non-hermetic).
  • be/test/olap/rowset/segment_v2/variant_util_test.cpp — Adds coverage for OnlyDocValueColumn serialization with mixed types.
  • be/src/vec/common/variant_util.h — Changes the materialize_docs_to_subcolumns_map return type to phmap::flat_hash_map.
  • be/src/vec/common/variant_util.cpp — Implements a faster doc-value serialization path and updates the doc materialization map implementation.
  • be/src/vec/columns/subcolumn_tree.h — Switches the node children map to phmap::flat_hash_map.
  • be/src/vec/columns/column_variant.h / be/src/vec/columns/column_variant.cpp — Removes sort_doc_value_column support.
  • be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.cpp — Reworks sparse/doc bucket writing/statistics with phmap; introduces offset-iteration changes.


for (int64_t i = 0; i < num_rows; ++i) {
size_t start = column_offsets[i - 1];
size_t end = column_offsets[i];
const size_t start = column_offsets[i - 1];

Copilot AI Feb 4, 2026


In the for (i = 0; i < num_rows; ++i) loop, column_offsets[i - 1] is used when i == 0, which underflows and can read invalid memory. Initialize start to 0 for the first row (e.g., start = (i == 0) ? 0 : column_offsets[i - 1]).

Suggested change:
- const size_t start = column_offsets[i - 1];
+ const size_t start = (i == 0) ? 0 : column_offsets[i - 1];

Comment on lines +108 to +114
config::enable_ordered_data_compaction = false;
config::segments_key_bounds_truncation_threshold = -1;
config::enable_vertical_compact_variant_subcolumns = true;
config::enable_compaction_checksum = false;
config::enable_stacktrace = true;
}

Copilot AI Feb 4, 2026


This fixture mutates multiple global config:: flags. Since doris_be_test runs all unit tests in one binary, these values will leak into other tests unless restored. Consider saving previous values in SetUp() and restoring them in TearDown().

Comment on lines +115 to +120
void TearDown() override {
// ASSERT_TRUE(io::global_local_filesystem()->delete_directory(absolute_dir).ok());
engine_ref = nullptr;
ExecEnv::GetInstance()->set_storage_engine(nullptr);
}

Copilot AI Feb 4, 2026


TearDown() does not clean absolute_dir, cache_dir, or tmp_dir (and SetUp() reuses existing dirs). This makes the test non-hermetic and can leave large artifacts under ut_dir, affecting subsequent local/CI runs. Prefer deleting these directories in TearDown() and/or recreating them from a clean state in SetUp().

Comment on lines +72 to +73
static constexpr int32_t kRowsPerSegment = 100000;


Copilot AI Feb 4, 2026


The test builds 10 segments of 100,000 rows each (1,000,000 rows total) and generates fairly large random JSON per row. This is likely to be too slow/heavy for a unit test binary. Consider reducing kRowsPerSegment and the materialization threshold proportionally, or gating this as a performance/benchmark-style test (e.g., DISABLED_ or env-guarded).

Comment on lines +293 to +297
static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
if (!cache_exists) {
static_cast<void>(fs->link_file(seg_path, cache_seg_path));

Copilot AI Feb 4, 2026


Filesystem calls here ignore returned Status (both exists and link_file). If either fails, the test will silently proceed with an unexpected state and fail later in a harder-to-debug way. Please check the returned status and assert/return on failure.

Suggested change:
- static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
- if (!cache_exists) {
-     static_cast<void>(fs->link_file(seg_path, cache_seg_path));
+ Status cache_exist_st = fs->exists(cache_seg_path, &cache_exists);
+ ASSERT_TRUE(cache_exist_st.ok()) << cache_exist_st.to_json();
+ if (!cache_exists) {
+     Status link_st = fs->link_file(seg_path, cache_seg_path);
+     ASSERT_TRUE(link_st.ok()) << link_st.to_json();

bucket_offsets[b] = &m.get_offsets();
}
for (size_t row = 0; row < num_rows; ++row) {
const size_t start = offsets[row - 1];

Copilot AI Feb 4, 2026


offsets[row - 1] is accessed when row == 0, which underflows and results in an out-of-bounds read. Use the same pattern as VariantDocWriter::append_data below (start = 0 for the first row).

Suggested change:
- const size_t start = offsets[row - 1];
+ const size_t start = (row == 0 ? 0 : offsets[row - 1]);

@eldenmoon (Member Author)

run buildall

@eldenmoon (Member Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 31662 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f5ce85ae8d81b365a6bd810ea9834b2b7f79251c, data reload: false

------ Round 1 ----------------------------------
q1	17667	5265	5100	5100
q2	2062	310	195	195
q3	10213	1380	736	736
q4	10233	939	322	322
q5	7887	2186	1900	1900
q6	211	186	153	153
q7	901	720	605	605
q8	9258	1431	1088	1088
q9	5518	4986	4820	4820
q10	6869	1934	1570	1570
q11	517	307	285	285
q12	368	375	221	221
q13	17820	4058	3228	3228
q14	242	247	223	223
q15	934	832	803	803
q16	676	672	616	616
q17	641	820	477	477
q18	6710	6514	6369	6369
q19	1546	986	615	615
q20	389	352	230	230
q21	2647	2114	1836	1836
q22	348	314	270	270
Total cold run time: 103657 ms
Total hot run time: 31662 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5454	5329	5296	5296
q2	258	349	254	254
q3	2161	2695	2286	2286
q4	1345	1723	1320	1320
q5	4293	4179	4314	4179
q6	211	187	173	173
q7	2299	2089	1870	1870
q8	2567	2558	2452	2452
q9	7458	7624	7398	7398
q10	2765	3112	2532	2532
q11	544	484	448	448
q12	657	760	675	675
q13	3929	4525	3601	3601
q14	343	331	287	287
q15	871	804	811	804
q16	671	774	681	681
q17	1191	1337	1319	1319
q18	8058	7957	8031	7957
q19	909	847	866	847
q20	2063	2126	2009	2009
q21	4483	4255	4142	4142
q22	597	550	511	511
Total cold run time: 53127 ms
Total hot run time: 51041 ms

@doris-robot

ClickBench: Total hot run time: 28.31 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f5ce85ae8d81b365a6bd810ea9834b2b7f79251c, data reload: false

query1	0.05	0.04	0.05
query2	0.09	0.04	0.04
query3	0.26	0.08	0.09
query4	1.61	0.11	0.11
query5	0.27	0.25	0.25
query6	1.16	0.68	0.68
query7	0.03	0.02	0.02
query8	0.05	0.03	0.03
query9	0.58	0.51	0.50
query10	0.55	0.55	0.54
query11	0.14	0.09	0.09
query12	0.14	0.10	0.11
query13	0.64	0.62	0.63
query14	1.07	1.08	1.04
query15	0.87	0.86	0.87
query16	0.38	0.42	0.42
query17	1.20	1.17	1.10
query18	0.23	0.21	0.20
query19	2.07	1.95	2.07
query20	0.02	0.01	0.01
query21	15.40	0.30	0.14
query22	5.25	0.05	0.05
query23	16.00	0.30	0.10
query24	1.94	0.26	0.20
query25	0.09	0.08	0.06
query26	0.15	0.15	0.13
query27	0.05	0.05	0.08
query28	3.89	1.16	0.97
query29	12.60	3.95	3.18
query30	0.28	0.14	0.11
query31	2.82	0.63	0.41
query32	3.23	0.60	0.49
query33	3.19	3.31	3.36
query34	16.09	5.38	4.72
query35	4.78	4.83	4.84
query36	0.64	0.51	0.49
query37	0.12	0.07	0.07
query38	0.08	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.16	0.16
query41	0.09	0.03	0.03
query42	0.05	0.03	0.03
query43	0.05	0.03	0.04
Total cold run time: 98.43 s
Total hot run time: 28.31 s

