Skip to content

[SPARK-54796][CORE] Fix NPE caused by race condition between Executor initialization and shuffle migration#54136

Open
ivoson wants to merge 3 commits intoapache:masterfrom
ivoson:SPARK-54796
Open

[SPARK-54796][CORE] Fix NPE caused by race condition between Executor initialization and shuffle migration#54136
ivoson wants to merge 3 commits intoapache:masterfrom
ivoson:SPARK-54796

Conversation

@ivoson
Copy link
Contributor

@ivoson ivoson commented Feb 4, 2026

What changes were proposed in this pull request?

Fixing the race condition between executor initialization and shuffle migration. When starting an Executor, spark will:

Shuffle migration request could be received before shuffle manager is initialized, putBlockDataAsStream will be invoked and shuffleManager will be initialized as null.

private lazy val shuffleManager = Option(_shuffleManager).getOrElse(SparkEnv.get.shuffleManager)

Then all the later operations depending on shuffleManager will fail with NPE.

To fix the issue, this PR propose to:

  1. Check whether shuffleManager is initialized when in putBlockDataAsStream which could be called before the Executor is fully ready to handle migration requests.
  2. If not ready, will wait until the specified timeout exceeds.
  3. The BlockManagerDecommissioner will retry if the shuffle migration request hits the issue timeout waiting for shuffleManager to be initialized.

Why are the changes needed?

Fixing the race condition leading to the shuffleManager in BlockManager to be initialized as null.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.4.23

@github-actions
Copy link

github-actions bot commented Feb 4, 2026

JIRA Issue Information

=== Bug SPARK-54796 ===
Summary: NPE caused by race condition between Executor initialization and shuffle migration
Assignee: None
Status: Open
Affected: ["4.1.0"]


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the CORE label Feb 4, 2026
@ivoson ivoson changed the title [SPARK-54796] Fix NPE caused by race condition between Executor initialization and shuffle migration [SPARK-54796][CORE] Fix NPE caused by race condition between Executor initialization and shuffle migration Feb 4, 2026
@ivoson ivoson marked this pull request as ready for review February 4, 2026 10:05
@ivoson
Copy link
Contributor Author

ivoson commented Feb 4, 2026

cc @Ngone51 @cloud-fan

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to switch the initialization order between BlockManager initialization and ShuffleManager initialization?

If impossible, I actually think we could add a new flag in BlockManagerInfo (or a pending list for those inital BlockManagers in terms of better memory utilization) to indicate whether the executor is ready for shuffle migration by sending a RPC signal (similar to LaunchedExecutor) after ShuffleManager initialized.

TBH, the current fix looks a bit complex to me.

@ivoson
Copy link
Contributor Author

ivoson commented Feb 5, 2026

Is it possible to switch the initialization order between BlockManager initialization and ShuffleManager initialization?

If impossible, I actually think we could add a new flag in BlockManagerInfo (or a pending list for those inital BlockManagers in terms of better memory utilization) to indicate whether the executor is ready for shuffle migration by sending a RPC signal (similar to LaunchedExecutor) after ShuffleManager initialized.

TBH, the current fix looks a bit complex to me.

Have thought about the other options:
It's hard to reorder the initialization steps since we'll need to register blockManager to get the blockMangerId before executor heartbeat. And moving executor heartbeat after shuffle manager initialization may cause heartbeat timeout...

For adding a new flag in BlockManagerInfo will introduce a new stage and new RPC protocol between driver and executor, try to avoid that since the issue only affect shuffle migration with some race condition which should be pretty rare. cc @Ngone51

// Wait for ShuffleManager to be initialized before handling shuffle migration requests.
// This can happen when an executor receives migration requests before its ShuffleManager
// is fully initialized.
waitForShuffleManagerInit(blockId)
Copy link
Member

@Ngone51 Ngone51 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we maybe waitForShuffleManagerInit before BlockManager register with the driver?

There might be other potential cases or future cases affected by this issue if we fix in current way (for shuffle migration case only). My point is still that we should ensure the BlockManager is really ready before the driver picks it up as an candidate for some stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants