Conversation
run buildall
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Pull request overview
This PR focuses on optimizing VARIANT “doc mode” ingestion/compaction paths by reducing overhead in doc-value serialization/materialization and by switching several hot-path maps/sets to phmap for better performance, with accompanying new/expanded tests.
Changes:
- Optimize doc-value column handling (avoid sorting, faster serialization path, phmap usage in several structures).
- Adjust variant compaction validation logic to tolerate doc/sparse extracted columns in output schemas.
- Add/extend unit tests, including a new doc-mode compaction test and additional doc-value serialization coverage.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| be/test/olap/variant_doc_mode_compaction_test.cpp | Adds a new end-to-end compaction test for VARIANT doc mode (currently heavy and non-hermetic). |
| be/test/olap/rowset/segment_v2/variant_util_test.cpp | Adds coverage for OnlyDocValueColumn serialization with mixed types. |
| be/src/vec/common/variant_util.h | Changes materialize_docs_to_subcolumns_map return type to phmap::flat_hash_map. |
| be/src/vec/common/variant_util.cpp | Implements faster doc-value serialization and updates doc materialization map implementation. |
| be/src/vec/columns/subcolumn_tree.h | Switches node children map to phmap::flat_hash_map. |
| be/src/vec/columns/column_variant.h / be/src/vec/columns/column_variant.cpp | Removes sort_doc_value_column support. |
| be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.cpp | Reworks sparse/doc bucket writing/statistics with phmap; introduces offset-iteration changes. |
```cpp
for (int64_t i = 0; i < num_rows; ++i) {
    const size_t start = column_offsets[i - 1];
    const size_t end = column_offsets[i];
```
There was a problem hiding this comment.
In the `for (i = 0; i < num_rows; ++i)` loop, `column_offsets[i - 1]` is read when `i == 0`, which underflows the index and can read invalid memory. Initialize `start` to 0 for the first row (e.g., `start = (i == 0) ? 0 : column_offsets[i - 1]`).
Suggested change:

```diff
-    const size_t start = column_offsets[i - 1];
+    const size_t start = (i == 0) ? 0 : column_offsets[i - 1];
```
```cpp
    config::enable_ordered_data_compaction = false;
    config::segments_key_bounds_truncation_threshold = -1;
    config::enable_vertical_compact_variant_subcolumns = true;
    config::enable_compaction_checksum = false;
    config::enable_stacktrace = true;
}
```
This fixture mutates multiple global `config::` flags. Since `doris_be_test` runs all unit tests in one binary, these values will leak into other tests unless restored. Consider saving the previous values in `SetUp()` and restoring them in `TearDown()`.
```cpp
void TearDown() override {
    // ASSERT_TRUE(io::global_local_filesystem()->delete_directory(absolute_dir).ok());
    engine_ref = nullptr;
    ExecEnv::GetInstance()->set_storage_engine(nullptr);
}
```
`TearDown()` does not clean `absolute_dir`, `cache_dir`, or `tmp_dir` (and `SetUp()` reuses existing dirs). This makes the test non-hermetic and can leave large artifacts under `ut_dir`, affecting subsequent local/CI runs. Prefer deleting these directories in `TearDown()` and/or recreating them from a clean state in `SetUp()`.
```cpp
static constexpr int32_t kRowsPerSegment = 100000;
```
The test builds 10 segments of 100,000 rows each (1,000,000 rows total) and generates fairly large random JSON per row. This is likely too slow/heavy for a unit test binary. Consider reducing `kRowsPerSegment` and the materialization threshold proportionally, or gating this as a performance/benchmark-style test (e.g., `DISABLED_` or env-guarded).
```cpp
static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
if (!cache_exists) {
    static_cast<void>(fs->link_file(seg_path, cache_seg_path));
```
Filesystem calls here ignore the returned `Status` (both `exists` and `link_file`). If either fails, the test will silently proceed in an unexpected state and fail later in a harder-to-debug way. Please check the returned status and assert/return on failure.
Suggested change:

```diff
-    static_cast<void>(fs->exists(cache_seg_path, &cache_exists));
-    if (!cache_exists) {
-        static_cast<void>(fs->link_file(seg_path, cache_seg_path));
+    Status cache_exist_st = fs->exists(cache_seg_path, &cache_exists);
+    ASSERT_TRUE(cache_exist_st.ok()) << cache_exist_st.to_json();
+    if (!cache_exists) {
+        Status link_st = fs->link_file(seg_path, cache_seg_path);
+        ASSERT_TRUE(link_st.ok()) << link_st.to_json();
```
```cpp
    bucket_offsets[b] = &m.get_offsets();
}
for (size_t row = 0; row < num_rows; ++row) {
    const size_t start = offsets[row - 1];
```
`offsets[row - 1]` is accessed when `row == 0`, which underflows and results in an out-of-bounds read. Use the same pattern as `VariantDocWriter::append_data` below (`start = 0` for the first row).
Suggested change:

```diff
-    const size_t start = offsets[row - 1];
+    const size_t start = (row == 0 ? 0 : offsets[row - 1]);
```
Force-pushed 03ef62f to 065fe1f.
run buildall
Force-pushed 065fe1f to f5ce85a.
run buildall
TPC-H: Total hot run time: 31662 ms
ClickBench: Total hot run time: 28.31 s
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)