Doc load opt #60511

Open
eldenmoon wants to merge 2 commits into apache:master from eldenmoon:doc-load-opt

Conversation

@eldenmoon (Member)

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings February 4, 2026 17:11
@eldenmoon (Member Author)

run buildall

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was modified, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.

Copilot AI left a comment

Pull request overview

This PR focuses on optimizing VARIANT “doc mode” ingestion/compaction paths by reducing overhead in doc-value serialization/materialization and by switching several hot-path maps/sets to phmap for better performance, with accompanying new/expanded tests.

Changes:

  • Optimize doc-value column handling (avoid sorting, faster serialization path, phmap usage in several structures).
  • Adjust variant compaction validation logic to tolerate doc/sparse extracted columns in output schemas.
  • Add/extend unit tests, including a new doc-mode compaction test and additional doc-value serialization coverage.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Summary per file:

  • be/test/olap/variant_doc_mode_compaction_test.cpp — Adds a new end-to-end compaction test for VARIANT doc mode (currently heavy and non-hermetic).
  • be/test/olap/rowset/segment_v2/variant_util_test.cpp — Adds coverage for OnlyDocValueColumn serialization with mixed types.
  • be/src/vec/common/variant_util.h — Changes the materialize_docs_to_subcolumns_map return type to phmap::flat_hash_map.
  • be/src/vec/common/variant_util.cpp — Implements a faster doc-value serialization path and updates the doc materialization map implementation.
  • be/src/vec/columns/subcolumn_tree.h — Switches the node children map to phmap::flat_hash_map.
  • be/src/vec/columns/column_variant.h / be/src/vec/columns/column_variant.cpp — Removes sort_doc_value_column support.
  • be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.cpp — Reworks sparse/doc bucket writing/statistics with phmap; introduces offset-iteration changes.


for (int64_t i = 0; i < num_rows; ++i) {
size_t start = column_offsets[i - 1];
size_t end = column_offsets[i];
const size_t start = column_offsets[i - 1];

Copilot AI Feb 4, 2026


In the for (i = 0; i < num_rows; ++i) loop, column_offsets[i - 1] is used when i == 0, which underflows and can read invalid memory. Initialize start to 0 for the first row (e.g., start = (i == 0) ? 0 : column_offsets[i - 1]).

Suggested change:
- const size_t start = column_offsets[i - 1];
+ const size_t start = (i == 0) ? 0 : column_offsets[i - 1];

Comment on lines +108 to +114
config::enable_ordered_data_compaction = false;
config::segments_key_bounds_truncation_threshold = -1;
config::enable_vertical_compact_variant_subcolumns = true;
config::enable_compaction_checksum = false;
config::enable_stacktrace = true;
}

Copilot AI Feb 4, 2026


This fixture mutates multiple global config:: flags. Since doris_be_test runs all unit tests in one binary, these values will leak into other tests unless restored. Consider saving previous values in SetUp() and restoring them in TearDown().

Comment on lines +115 to +120
void TearDown() override {
// ASSERT_TRUE(io::global_local_filesystem()->delete_directory(absolute_dir).ok());
engine_ref = nullptr;
ExecEnv::GetInstance()->set_storage_engine(nullptr);
}

Copilot AI Feb 4, 2026


TearDown() does not clean absolute_dir, cache_dir, or tmp_dir (and SetUp() reuses existing dirs). This makes the test non-hermetic and can leave large artifacts under ut_dir, affecting subsequent local/CI runs. Prefer deleting these directories in TearDown() and/or recreating them from a clean state in SetUp().

Comment on lines +72 to +73
static constexpr int32_t kRowsPerSegment = 100000;


Copilot AI Feb 4, 2026


The test builds 10 segments of 100,000 rows each (1,000,000 rows total) and generates fairly large random JSON per row. This is likely to be too slow/heavy for a unit test binary. Consider reducing kRowsPerSegment and the materialization threshold proportionally, or gating this as a performance/benchmark-style test (e.g., DISABLED_ or env-guarded).

Comment on lines +293 to +297
static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
if (!cache_exists) {
static_cast<void>(fs->link_file(seg_path, cache_seg_path));

Copilot AI Feb 4, 2026


Filesystem calls here ignore returned Status (both exists and link_file). If either fails, the test will silently proceed with an unexpected state and fail later in a harder-to-debug way. Please check the returned status and assert/return on failure.

Suggested change:
- static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
- if (!cache_exists) {
-     static_cast<void>(fs->link_file(seg_path, cache_seg_path));
+ Status cache_exist_st = fs->exists(cache_seg_path, &cache_exists);
+ ASSERT_TRUE(cache_exist_st.ok()) << cache_exist_st.to_json();
+ if (!cache_exists) {
+     Status link_st = fs->link_file(seg_path, cache_seg_path);
+     ASSERT_TRUE(link_st.ok()) << link_st.to_json();

bucket_offsets[b] = &m.get_offsets();
}
for (size_t row = 0; row < num_rows; ++row) {
const size_t start = offsets[row - 1];

Copilot AI Feb 4, 2026


offsets[row - 1] is accessed when row == 0, which underflows and results in an out-of-bounds read. Use the same pattern as VariantDocWriter::append_data below (start = 0 for the first row).

Suggested change:
- const size_t start = offsets[row - 1];
+ const size_t start = (row == 0 ? 0 : offsets[row - 1]);

@eldenmoon (Member Author)

run buildall

@eldenmoon (Member Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 31662 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f5ce85ae8d81b365a6bd810ea9834b2b7f79251c, data reload: false

------ Round 1 ----------------------------------
q1	17667	5265	5100	5100
q2	2062	310	195	195
q3	10213	1380	736	736
q4	10233	939	322	322
q5	7887	2186	1900	1900
q6	211	186	153	153
q7	901	720	605	605
q8	9258	1431	1088	1088
q9	5518	4986	4820	4820
q10	6869	1934	1570	1570
q11	517	307	285	285
q12	368	375	221	221
q13	17820	4058	3228	3228
q14	242	247	223	223
q15	934	832	803	803
q16	676	672	616	616
q17	641	820	477	477
q18	6710	6514	6369	6369
q19	1546	986	615	615
q20	389	352	230	230
q21	2647	2114	1836	1836
q22	348	314	270	270
Total cold run time: 103657 ms
Total hot run time: 31662 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5454	5329	5296	5296
q2	258	349	254	254
q3	2161	2695	2286	2286
q4	1345	1723	1320	1320
q5	4293	4179	4314	4179
q6	211	187	173	173
q7	2299	2089	1870	1870
q8	2567	2558	2452	2452
q9	7458	7624	7398	7398
q10	2765	3112	2532	2532
q11	544	484	448	448
q12	657	760	675	675
q13	3929	4525	3601	3601
q14	343	331	287	287
q15	871	804	811	804
q16	671	774	681	681
q17	1191	1337	1319	1319
q18	8058	7957	8031	7957
q19	909	847	866	847
q20	2063	2126	2009	2009
q21	4483	4255	4142	4142
q22	597	550	511	511
Total cold run time: 53127 ms
Total hot run time: 51041 ms

@doris-robot

ClickBench: Total hot run time: 28.31 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f5ce85ae8d81b365a6bd810ea9834b2b7f79251c, data reload: false

query1	0.05	0.04	0.05
query2	0.09	0.04	0.04
query3	0.26	0.08	0.09
query4	1.61	0.11	0.11
query5	0.27	0.25	0.25
query6	1.16	0.68	0.68
query7	0.03	0.02	0.02
query8	0.05	0.03	0.03
query9	0.58	0.51	0.50
query10	0.55	0.55	0.54
query11	0.14	0.09	0.09
query12	0.14	0.10	0.11
query13	0.64	0.62	0.63
query14	1.07	1.08	1.04
query15	0.87	0.86	0.87
query16	0.38	0.42	0.42
query17	1.20	1.17	1.10
query18	0.23	0.21	0.20
query19	2.07	1.95	2.07
query20	0.02	0.01	0.01
query21	15.40	0.30	0.14
query22	5.25	0.05	0.05
query23	16.00	0.30	0.10
query24	1.94	0.26	0.20
query25	0.09	0.08	0.06
query26	0.15	0.15	0.13
query27	0.05	0.05	0.08
query28	3.89	1.16	0.97
query29	12.60	3.95	3.18
query30	0.28	0.14	0.11
query31	2.82	0.63	0.41
query32	3.23	0.60	0.49
query33	3.19	3.31	3.36
query34	16.09	5.38	4.72
query35	4.78	4.83	4.84
query36	0.64	0.51	0.49
query37	0.12	0.07	0.07
query38	0.08	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.16	0.16
query41	0.09	0.03	0.03
query42	0.05	0.03	0.03
query43	0.05	0.03	0.04
Total cold run time: 98.43 s
Total hot run time: 28.31 s

