Conversation

@adamantboy

Description

This PR fixes a NaN issue in the FP8 padding/unpadding kernels when Fp8Padding/Fp8Unpadding forward handles very large input shapes. Because row and row_length are both int, the product row * row_length can exceed the maximum value of int; the resulting overflow leads to an illegal memory access on CUDA and eventually to NaN in the forward loss/grad norm, e.g.:
RuntimeError: Rank 62, device 6, iteration -1: Unexpected result nan (message='found NaN in local forward loss calculation').
CUDA Error: an illegal memory access was encountered.
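
For illustration, a minimal host-side sketch (not code from this PR; the shape values are hypothetical) of how the 32-bit int product wraps, while widening to size_t before the multiply keeps the offset correct:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical large shape: 70000 rows of 40000 elements each.
  int row = 70000;
  int row_length = 40000;

  // 70000 * 40000 = 2,800,000,000 > INT_MAX (2,147,483,647).
  // Signed overflow is undefined behavior; this unsigned round-trip is a
  // well-defined stand-in for the wraparound the kernel effectively saw.
  int bad_offset = static_cast<int>(static_cast<unsigned>(row) *
                                    static_cast<unsigned>(row_length));

  // The fix: widen one operand before the multiply so the whole product
  // is computed in 64 bits on typical platforms.
  size_t good_offset = static_cast<size_t>(row) * row_length;

  printf("int offset:    %d\n", bad_offset);    // negative garbage
  printf("size_t offset: %zu\n", good_offset);  // 2800000000
  return 0;
}
```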

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Modify multi_padding_kernel/multi_unpadding_kernel so that row * row_length is computed in size_t (cast before the multiplication); see the sketch in the description above

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Dec 30, 2025

Greptile Summary

Fixes a critical integer overflow bug in the FP8 padding/unpadding kernels that caused illegal memory accesses and NaN values when processing large tensors.

  • Changed row * row_length calculation from int to size_t type in both multi_padding_kernel and multi_unpadding_kernel
  • Introduced row_offset variable to store the result of static_cast<size_t>(row) * row_length
  • Applied fix to all memory access operations: input reads and output writes (including padding writes)
  • Prevents overflow when row * row_length exceeds INT_MAX (~2.1 billion), which occurs with large tensor dimensions (see the sketch after this list)
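
A hedged sketch of the pattern described above (the real multi_padding_kernel is tiled and vectorized, so the names and structure here are simplified assumptions, not the verbatim kernel):

```cuda
__global__ void pad_rows(const float *input, float *output,
                         int num_rows, int row_length) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) return;

  // Before the fix: `row * row_length` was a 32-bit int multiply that
  // could wrap past INT_MAX for large tensors, yielding a bogus address.
  // After the fix: widen `row` to size_t before multiplying.
  const size_t row_offset = static_cast<size_t>(row) * row_length;

  for (int col = 0; col < row_length; ++col) {
    // Every read and write indexes with the 64-bit offset.
    output[row_offset + col] = input[row_offset + col];
  }
}
```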

The fix is minimal, precise, and correctly addresses the root cause without changing the algorithm or affecting performance.

Confidence Score: 5/5

  • This PR is safe to merge with no identified risks
  • The fix is a textbook example of proper integer overflow prevention: casting to a larger type before multiplication. All affected locations are updated consistently, the change is minimal and surgical, and it directly addresses the reported issue of illegal memory access and NaN values with large tensors
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/padding.cu | Fixed integer overflow in the row offset calculation by casting to size_t before multiplication, preventing illegal memory access and NaN issues with large tensors |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant multi_padding/unpadding
    participant Kernel
    participant GPU_Memory

    Caller->>multi_padding/unpadding: Pass tensor list with large dimensions
    multi_padding/unpadding->>multi_padding/unpadding: Validate tensors
    multi_padding/unpadding->>multi_padding/unpadding: Calculate tiles and blocks
    multi_padding/unpadding->>Kernel: Launch multi_padding_kernel / multi_unpadding_kernel

    Note over Kernel: Each thread processes nvec x nvec subtiles
    Kernel->>Kernel: Calculate row index
    Kernel->>Kernel: Cast row to size_t and multiply by row_length
    Note over Kernel: Prevents int overflow for large row * row_length

    Kernel->>GPU_Memory: Read input[row_offset + col + j2]
    GPU_Memory-->>Kernel: Return data (no illegal access)
    Kernel->>Kernel: Process data in registers
    Kernel->>GPU_Memory: Write output[row_offset + col + j2]

    Kernel-->>multi_padding/unpadding: Kernel complete
    multi_padding/unpadding-->>Caller: Return (no NaN errors)
```

@greptile-apps
Contributor

greptile-apps bot commented Dec 30, 2025

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@BestJuly
Collaborator

BestJuly commented Dec 30, 2025

Please sign off the commit to pass the DCO check. Thanks.

Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>

@BestJuly
Collaborator

/te-ci pytorch
