[SPARK-54796][CORE] Fix NPE caused by race condition between Executor initialization and shuffle migration by ivoson · Pull Request #54136 · apache/spark

ivoson · 2026-02-04T09:22:28Z

What changes were proposed in this pull request?

This PR fixes a race condition between executor initialization and shuffle migration. When starting an Executor, Spark will:

Initialize the blockManager. After this, other nodes can detect the block manager as a peer and start migrating shuffle blocks to it.
Then initialize the shuffleManager.

A shuffle migration request can therefore be received before the shuffle manager is initialized. In that case putBlockDataAsStream is invoked and shuffleManager is initialized as null:

private lazy val shuffleManager = Option(_shuffleManager).getOrElse(SparkEnv.get.shuffleManager)

Then all the later operations depending on shuffleManager will fail with NPE.
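
The failure mode can be reproduced outside Spark. The following is a minimal sketch, not Spark code: FakeShuffleManager, FakeEnv and FakeBlockManager are hypothetical stand-ins that only mirror the shape of the lazy val quoted above. It shows that a lazy val whose initializer returns null caches that null, so later accesses keep failing even after the environment becomes ready.

```scala
// Hypothetical stand-ins for illustration; these are not Spark classes.
class FakeShuffleManager { def name: String = "sort" }

object FakeEnv {
  // Starts as null and is assigned later, like a component initialized late in startup.
  @volatile var shuffleManager: FakeShuffleManager = null
}

class FakeBlockManager(injected: FakeShuffleManager) {
  // Same shape as the quoted line: evaluated once on first access, then cached.
  private lazy val shuffleManager: FakeShuffleManager =
    Option(injected).getOrElse(FakeEnv.shuffleManager)

  def managerName: String = shuffleManager.name
}

object LazyValNullDemo {
  def run(): (Boolean, Boolean) = {
    FakeEnv.shuffleManager = null
    val bm = new FakeBlockManager(null)
    // First access happens before the env is ready, so null is cached.
    val failedEarly = scala.util.Try(bm.managerName).isFailure
    // The env becomes ready, but the lazy val has already been evaluated.
    FakeEnv.shuffleManager = new FakeShuffleManager
    val stillFailing = scala.util.Try(bm.managerName).isFailure
    (failedEarly, stillFailing)
  }
}
```

Both flags come back true: the second access fails as well, because a lazy val is only re-evaluated when its initializer throws, not when it returns null.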

To fix the issue, this PR proposes to:

Check whether shuffleManager is initialized in putBlockDataAsStream, which can be called before the Executor is fully ready to handle migration requests.
If it is not ready, wait until it is initialized or the specified timeout elapses.
The BlockManagerDecommissioner will retry if a shuffle migration request times out waiting for shuffleManager to be initialized.
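
The wait-then-retry idea in the steps above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: InitGate, Receiver, Sender, the timeout value and the attempt count are all hypothetical names and parameters, and a CountDownLatch is just one way to implement the wait.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical readiness gate; not part of Spark.
class InitGate {
  private val ready = new CountDownLatch(1)
  def markReady(): Unit = ready.countDown()
  // Returns true if initialization completed within the timeout.
  def await(timeoutMs: Long): Boolean = ready.await(timeoutMs, TimeUnit.MILLISECONDS)
}

class Receiver(gate: InitGate, timeoutMs: Long) {
  // Receiving side: reject the block if the dependency is not ready in time.
  def putBlock(blockId: String): Either[String, String] =
    if (gate.await(timeoutMs)) Right(s"stored $blockId")
    else Left(s"timed out waiting for initialization while handling $blockId")
}

object Sender {
  // Sending side: retry a bounded number of times when the receiver rejects the block.
  @annotation.tailrec
  def migrate(r: Receiver, blockId: String, attemptsLeft: Int): Either[String, String] =
    r.putBlock(blockId) match {
      case ok @ Right(_) => ok
      case Left(_) if attemptsLeft > 1 => migrate(r, blockId, attemptsLeft - 1)
      case failed => failed
    }
}
```

With a gate that is never marked ready, Sender.migrate returns a Left after exhausting its attempts. Once markReady() has been called, the same call returns a Right immediately.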

Why are the changes needed?

This fixes the race condition that leads to the shuffleManager in BlockManager being initialized as null.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.4.23

github-actions · 2026-02-04T09:22:38Z

JIRA Issue Information

=== Bug SPARK-54796 ===
Summary: NPE caused by race condition between Executor initialization and shuffle migration
Assignee: None
Status: Open
Affected: ["4.1.0"]

This comment was automatically generated by GitHub Actions

ivoson · 2026-02-04T10:06:06Z

cc @Ngone51 @cloud-fan

Ngone51

Is it possible to switch the initialization order between BlockManager initialization and ShuffleManager initialization?

If impossible, I actually think we could add a new flag in BlockManagerInfo (or a pending list for those initial BlockManagers, for better memory utilization) to indicate whether the executor is ready for shuffle migration, set by sending an RPC signal (similar to LaunchedExecutor) after the ShuffleManager is initialized.

TBH, the current fix looks a bit complex to me.
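
The readiness-flag alternative suggested in this comment could look roughly like the sketch below. It is only an illustration: PeerInfo and PeerRegistry are hypothetical and do not reflect Spark's actual BlockManagerInfo or its RPC messages. The driver tracks whether each registered peer has signalled readiness and offers only ready peers as migration targets.

```scala
// Hypothetical driver-side bookkeeping; not Spark's BlockManagerInfo.
final case class PeerInfo(executorId: String, readyForMigration: Boolean)

class PeerRegistry {
  private var peers = Map.empty[String, PeerInfo]

  // Called when a block manager registers: the peer is known but not yet eligible.
  def register(executorId: String): Unit = synchronized {
    peers += executorId -> PeerInfo(executorId, readyForMigration = false)
  }

  // Called when the executor signals that its shuffle manager is initialized.
  def markReady(executorId: String): Unit = synchronized {
    peers.get(executorId).foreach { p =>
      peers += executorId -> p.copy(readyForMigration = true)
    }
  }

  // Only peers that have signalled readiness are offered as migration targets.
  def migrationTargets: Seq[String] = synchronized {
    peers.values.filter(_.readyForMigration).map(_.executorId).toSeq.sorted
  }
}
```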

ivoson · 2026-02-05T03:57:23Z

> Is it possible to switch the initialization order between BlockManager initialization and ShuffleManager initialization?
>
> If impossible, I actually think we could add a new flag in BlockManagerInfo (or a pending list for those initial BlockManagers in terms of better memory utilization) to indicate whether the executor is ready for shuffle migration by sending a RPC signal (similar to LaunchedExecutor) after ShuffleManager initialized.
>
> TBH, the current fix looks a bit complex to me.

I have thought about the other options:
It's hard to reorder the initialization steps, since we need to register the blockManager to get the blockManagerId before the executor heartbeat starts. Moving the executor heartbeat after shuffle manager initialization may cause heartbeat timeouts...

Adding a new flag in BlockManagerInfo would introduce a new stage and a new RPC protocol between driver and executor. I'd like to avoid that, since the issue only affects shuffle migration under a race condition that should be pretty rare. cc @Ngone51

Ngone51 · 2026-02-05T07:07:29Z

core/src/main/scala/org/apache/spark/storage/BlockManager.scala

+      // Wait for ShuffleManager to be initialized before handling shuffle migration requests.
+      // This can happen when an executor receives migration requests before its ShuffleManager
+      // is fully initialized.
+      waitForShuffleManagerInit(blockId)


Shall we maybe call waitForShuffleManagerInit before the BlockManager registers with the driver?

There might be other potential or future cases affected by this issue if we fix it the current way (for the shuffle migration case only). My point is still that we should ensure the BlockManager is really ready before the driver picks it up as a candidate for any work.

Fixing race condition initializing executor

7d469da

github-actions bot added the CORE label Feb 4, 2026

fix ut

3dfbd12

ivoson force-pushed the SPARK-54796 branch from 8068473 to 3dfbd12 Compare February 4, 2026 09:59

ivoson changed the title ~~[SPARK-54796] Fix NPE caused by race condition between Executor initialization and shuffle migration~~ [SPARK-54796][CORE] Fix NPE caused by race condition between Executor initialization and shuffle migration Feb 4, 2026

ivoson marked this pull request as ready for review February 4, 2026 10:05

Ngone51 reviewed Feb 4, 2026

Merge branch 'apache:master' into SPARK-54796

0340490

Ngone51 reviewed Feb 5, 2026

ivoson wants to merge 3 commits into apache:master from ivoson:SPARK-54796
