Conversation

@adamantboy

Description

This PR fixes a NaN issue in the FP8 padding/unpadding kernels when Fp8Padding/Fp8Unpadding forward handles very large input shapes. Because row and row_length are both int, the product row * row_length can exceed the maximum value of int; the resulting overflow leads to an illegal memory access on CUDA and eventually to NaN in the forward loss/grad norm, e.g.:
RuntimeError: Rank 62, device 6, iteration -1: Unexpected result nan (message='found NaN in local forward loss calculation').
CUDA Error: an illegal memory access was encountered.
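
For illustration, a minimal host-side sketch (not code from this PR; the shape values are hypothetical) of how the 32-bit int product wraps, while widening to size_t before the multiply keeps the offset correct:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical large shape: 70000 rows of 40000 elements each.
  int row = 70000;
  int row_length = 40000;

  // 70000 * 40000 = 2,800,000,000 > INT_MAX (2,147,483,647).
  // Signed overflow is undefined behavior; this unsigned round-trip is a
  // well-defined stand-in for the wraparound the kernel effectively saw.
  int bad_offset = static_cast<int>(static_cast<unsigned>(row) *
                                    static_cast<unsigned>(row_length));

  // The fix: widen one operand before the multiply so the whole product
  // is computed in 64 bits on typical platforms.
  size_t good_offset = static_cast<size_t>(row) * row_length;

  printf("int offset:    %d\n", bad_offset);    // negative garbage
  printf("size_t offset: %zu\n", good_offset);  // 2800000000
  return 0;
}
```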

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Modify multi_padding_kernel/multi_unpadding_kernel so that row * row_length is computed in size_t (cast before the multiplication); see the sketch in the description above

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Dec 30, 2025

Greptile Summary

Fixes a critical integer overflow bug in the FP8 padding/unpadding kernels that caused illegal memory accesses and NaN values when processing large tensors.

  • Changed row * row_length calculation from int to size_t type in both multi_padding_kernel and multi_unpadding_kernel
  • Introduced row_offset variable to store the result of static_cast<size_t>(row) * row_length
  • Applied fix to all memory access operations: input reads and output writes (including padding writes)
  • Prevents overflow when row * row_length exceeds INT_MAX (~2.1 billion), which occurs with large tensor dimensions (see the sketch after this list)
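
A hedged sketch of the pattern described above (the real multi_padding_kernel is tiled and vectorized, so the names and structure here are simplified assumptions, not the verbatim kernel):

```cuda
__global__ void pad_rows(const float *input, float *output,
                         int num_rows, int row_length) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) return;

  // Before the fix: `row * row_length` was a 32-bit int multiply that
  // could wrap past INT_MAX for large tensors, yielding a bogus address.
  // After the fix: widen `row` to size_t before multiplying.
  const size_t row_offset = static_cast<size_t>(row) * row_length;

  for (int col = 0; col < row_length; ++col) {
    // Every read and write indexes with the 64-bit offset.
    output[row_offset + col] = input[row_offset + col];
  }
}
```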

The fix is minimal, precise, and correctly addresses the root cause without changing the algorithm or affecting performance.

Confidence Score: 5/5

  • This PR is safe to merge with no identified risks
  • The fix is a textbook example of proper integer overflow prevention: casting to a larger type before multiplication. All affected locations are updated consistently, the change is minimal and surgical, and it directly addresses the reported issue of illegal memory access and NaN values with large tensors
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/padding.cu | Fixed integer overflow in the row offset calculation by casting to size_t before multiplication, preventing illegal memory access and NaN issues with large tensors |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant multi_padding/unpadding
    participant Kernel
    participant GPU_Memory

    Caller->>multi_padding/unpadding: Pass tensor list with large dimensions
    multi_padding/unpadding->>multi_padding/unpadding: Validate tensors
    multi_padding/unpadding->>multi_padding/unpadding: Calculate tiles and blocks
    multi_padding/unpadding->>Kernel: Launch multi_padding_kernel / multi_unpadding_kernel

    Note over Kernel: Each thread processes nvec x nvec subtiles
    Kernel->>Kernel: Calculate row index
    Kernel->>Kernel: Cast row to size_t and multiply by row_length
    Note over Kernel: Prevents int overflow for large row * row_length

    Kernel->>GPU_Memory: Read input[row_offset + col + j2]
    GPU_Memory-->>Kernel: Return data (no illegal access)
    Kernel->>Kernel: Process data in registers
    Kernel->>GPU_Memory: Write output[row_offset + col + j2]

    Kernel-->>multi_padding/unpadding: Kernel complete
    multi_padding/unpadding-->>Caller: Return (no NaN errors)
```

@greptile-apps
Contributor

greptile-apps bot commented Dec 30, 2025

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@BestJuly
Collaborator

BestJuly commented Dec 30, 2025

Please sign off the commit to pass the DCO check. Thanks.

Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>

@BestJuly
Collaborator

/te-ci pytorch
