Skip to content

Conversation

@simvlad
Copy link
Contributor

@simvlad simvlad commented Dec 16, 2025

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

This PR adds external payload download stats into RespondWorkflowTaskCompletedRequest and RespondWorkflowTaskFailedRequest and corresponding history events.

We'd like to expose external payload download stats in Temporal UI, and to do so those stats needs to be recorded by SDK in RespondWorkflowTaskCompletedRequest and RespondWorkflowTaskFailedRequest and stored in corresponding history events WorkflowTaskCompletedEventAttributes and WorkflowTaskFailedEventAttributes.

Not breaking change - only new fields

@simvlad simvlad force-pushed the simvlad/external-payload-download-stats branch from eaf5fb7 to 319156e Compare December 16, 2025 23:45
@simvlad simvlad force-pushed the simvlad/external-payload-download-stats branch from 319156e to 14decb6 Compare December 16, 2025 23:47
@simvlad simvlad changed the title External payload download stats for Workflow Tasks Add external payload download stats to workflow completion tasks and corresponding history events Dec 17, 2025
@simvlad simvlad marked this pull request as ready for review December 17, 2025 19:39
@simvlad simvlad requested review from a team as code owners December 17, 2025 19:39
// Worker deployment options that user has set in the worker.
temporal.api.deployment.v1.WorkerDeploymentOptions deployment_options = 17;
// Stats about external payloads downloaded by SDK.
temporal.api.sdk.v1.ExternalPayloadDownloadStats external_payload_stats = 18;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only download and not upload? It could be argued upload is actually the bigger problem. If you have a task failure after having scheduled an activity, you're going to be repeatedly uploading (I don't remember what we discussed we'd do to mitigate this).

// Worker deployment options that user has set in the worker.
temporal.api.deployment.v1.WorkerDeploymentOptions deployment_options = 17;
// Stats about external payloads downloaded by SDK.
temporal.api.sdk.v1.ExternalPayloadDownloadStats external_payload_stats = 18;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only workflow tasks and not activity or Nexus tasks? And what about RespondQueryTaskCompleted? And what about all of the data uploaded/downloaded by clients?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention here is to display some warning when the workflow task is close to the timeout, since with the implicit object references it could do many downloads especially in the replay scenarios. I see your point about uploads. I'll check if we are planning to pass/display such information.

Copy link
Member

@cretz cretz Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we not display that warning where it actually occurs which is where the user code is running? What happens when server rejects task completion? The server is not the place downloads occur, it should probably not be the place warnings occur (any more than any other warnings that occur SDK side). If we need a concept of SDK warnings being shipped to server somehow, I think that can be discussed more generally and not specific to this.


// True if any portion of this workflow task processed history in replay mode
// (i.e., SDK had to rebuild state) and downloaded payloads during that period.
bool replay_involved = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to represent count, bytes, and time for replay vs non-replay. When processing a workflow task, only part of it may be replay. Other parts are new events, it's not an all or none.

I'd suggest something like:

message ExternalPayloadTransferStats {
  TransferStats uploaded = 1;
  TransferStats downloaded = 2;
  TransferStats downloaded_during_replay = 3;

  message TransferStats {
    int64 total_count = 1;
    int64 total_size_bytes = 2;
    google.protobuf.Timestamp total_time_spent = 3;
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I synced up with @Quinn-With-Two-Ns and @jmaeagle99 and we would be able to track the information that you mentioned. I'll make changes to make the structure look like what you suggested.

Copy link
Member

@cretz cretz Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, I think there's a discussion to be had on whether this information should be in worker heartbeating and not part of task completion. Task completion misses a lot of important data. We're going to have to tell users they can't trust what the server says about which data is downloaded/uploaded because it is not the canonical source of when downloads/uploads actually occur. And there are quirks (e.g. what happens in an unhandled command or workflow-already-terminated situation where task completion is rejected?).

Overall, we should see if we can provide users an accurate view of transfer information alongside where we provide accurate views of other SDK-side metrics. And if we need to ship those SDK side metrics up to the server, we should consider if we can do the same for all of them and not make these metrics special. Granted if this is just in addition to the proper metrics, we can do it, it just seems to be a bit hacked on (workflows only, not accounting for many many situations that exist, known to be inaccurate, etc). This payload metric concern similar to temporal_workflow_task_replay_latency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants