Skip to content

Conversation

@Pratham1812
Copy link
Contributor

@Pratham1812 Pratham1812 commented Dec 26, 2025

This PR introduces fuzzing in extension arrays solving #1588 . I have tried running the fuzzer locally and have tried to solve the possible edge cases I could identify.

Signed-off-by: Pratham Agarwal <agarwalpratham1812@gmail.com>
@Pratham1812 Pratham1812 changed the title feat: introduces fuzzing with extension arrays #1588 feat: introduces fuzzing with extension arrays Dec 26, 2025
@Pratham1812
Copy link
Contributor Author

@connortsui20 Please let me know your comments, I will try to resolve them at the earliest.
Thanks.

@connortsui20 connortsui20 self-requested a review December 26, 2025 23:01
@codecov
Copy link

codecov bot commented Dec 26, 2025

Codecov Report

❌ Patch coverage is 0% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.93%. Comparing base (c7b0ef9) to head (0e7ab69).

Files with missing lines Patch % Lines
vortex-array/src/compute/fill_null.rs 0.00% 25 Missing ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@connortsui20 connortsui20 added feature Release label indicating a new feature or request fuzzer labels Dec 26, 2025
@connortsui20
Copy link
Contributor

Running the fuzzer on this branch: https://github.com/vortex-data/vortex/actions/runs/20531444854

Copy link
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, this is a definitely good start!

The main concern here is what @robert3005 said in the original issue:

The difficult part is having to know what’s the valid domain of storage type for the extension type. Requires extension types to have validation function for a value

Basically, in addition to testing operations over the different kinds of extension arrays, we also want to validate that the array values are always in the domain of the given type. For example, we want to test that the temporal arrays like vortex.date or vortex.time and also make sure that their values do not overflow when calling .to_jiff().

I was thinking of something along the lines of this (note this is claude-assisted code):

type DomainValidator = Box<dyn Fn(&Scalar) -> bool + Send + Sync>;

fn validator_for_ext_type(ext_dtype: &ExtDType) -> DomainValidator {
    if is_temporal_ext_type(ext_dtype.id()) {
        let metadata = TemporalMetadata::try_from(ext_dtype).unwrap();
        Box::new(move |scalar| {
            scalar.is_null()
                || metadata
                    .to_jiff(scalar.as_extension().storage().as_i64().unwrap())
                    .is_ok()
        })
    } else if {
        // Logic for handling extension types that are specific to the fuzzer,
        // for example maybe an extension type that only accepts even i32 integers?
    } else {
        // Unknown extension type, so just accept everything?
        Box::new(|_| true)
    }
}

And then the array generation becomes easy:

fn random_extension(
    u: &mut Unstructured,
    ext_dtype: &Arc<ExtDType>,
    chunk_len: Option<usize>,
) -> Result<(ArrayRef, DomainValidator)> {
    let validator = validator_for_ext_type(ext_dtype);
    let array_length = chunk_len.unwrap_or(u.int_in_range(0..=20)?);

    let mut builder = ExtensionBuilder::with_capacity(ext_dtype.clone(), array_length);

    for _ in 0..array_length {
        // Generate storage values, retry if invalid.
        // Just make sure the selectivity isn't too small otherwise this will take forever...
        loop {
            let storage_scalar = random_scalar(u, ext_dtype.storage_dtype())?;
            let ext_scalar = Scalar::extension(ext_dtype.clone(), storage_scalar);

            if validator.validate_scalar(&ext_scalar) {
                builder.append_scalar(&ext_scalar).unwrap();
                break;
            }
            // else: retry with new random value.
        }
    }

    Ok((builder.finish(), validator))
}

You would probably need to also add the DomainValidator type as an optional field to FuzzArrayAction and maybe some other places so the fuzzer can access the validation closure by calling the validation function at the end of run_fuzz_action.


I had Claude tell me a rough "plan of attack" that might be helpful. Obviously it might be missing stuff, but it should be a very good starting point. Please let us know if you have any questions!

  ### Goal

  Add a validation closure that is generated alongside each extension array and called after
  every fuzz operation to verify domain invariants are preserved.

  ### Implementation

  1. **Define the validator type** (in `fuzz/src/array/mod.rs` or similar):
     - `pub type DomainValidator = Arc<dyn Fn(&dyn Array) -> bool + Send + Sync>;`

  2. **Add validator to `FuzzArrayAction`**:
     - Add field `pub validator: Option<DomainValidator>` to the struct

  3. **Create `validator_for_ext_type(ext_dtype: &ExtDType) -> DomainValidator`**:
     - For temporal types (`vortex.date`, `vortex.time`, `vortex.timestamp`): validate that
  `TemporalMetadata::to_jiff()` succeeds for all non-null values
     - For unknown extension types: return a validator that accepts everything

  4. **Modify `random_extension()` in `vortex-array/src/arrays/arbitrary.rs`**:
     - Get validator via `validator_for_ext_type()`
     - When generating values, retry if `!validator.validate_scalar(&scalar)`
     - Return both the array and the validator

  5. **Modify `run_fuzz_action()`**:
     - After each operation that produces an array, call `validator(&result)` if present
     - Panic/error if validation fails

  6. **Use real temporal extension types in `random_ext_dtype()`**
  (`vortex-dtype/src/arbitrary/mod.rs`):
     - Generate `vortex.date`, `vortex.time`, `vortex.timestamp` with valid `(TimeUnit, PType)`
  combinations and proper metadata
     - Can also keep synthetic types for stress testing, but they need validators too

@connortsui20
Copy link
Contributor

Also, you cannot use "fake" extension types without registering them (see how the fuzzer failed here), so I would try and stick to the existing extension types (temporal types) before creating custom ones for the fuzzer.

thread '<unnamed>' (10539) panicked at vortex-error/src/lib.rs:334:33:
compare operation should succeed in fuzz test:
  Unsupported extension type "example.ipv4"
Backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/fabece9e9491d0a3c3655dba488881968b7c5f7a/library/std/src/../../backtrace/src/backtrace/libunwind.rs:117:9
   1: std::backtrace_rs::backtrace::trace_unsynchronized::<<std::backtrace::Backtrace>::create::{closure#0}>
             at /rustc/fabece9e9491d0a3c3655dba488881968b7c5f7a/library/std/src/../../backtrace/src/backtrace/mod.rs:66:14
   2: <std::backtrace::Backtrace>::create
             at /rustc/fabece9e9491d0a3c3655dba488881968b7c5f7a/library/std/src/backtrace.rs:331:13
   3: to_arrow_dtype
             at ./vortex-dtype/src/arrow.rs:262:21
   4: to_arrow_dtype
             at ./vortex-dtype/src/arrow.rs:234:28
   5: {closure#0}
             at ./vortex-array/src/arrow/compute/to_arrow/canonical.rs:109:46
   6: unwrap_or_else<core::result::Result<arrow_schema::datatype::DataType, vortex_error::VortexError>, vortex_array::arrow::compute::to_arrow::canonical::{impl#0}::invoke::{closure_env#0}>
             at /rustc/fabece9e9491d0a3c3655dba488881968b7c5f7a/library/core/src/option.rs:1067:21
   7: invoke
             at ./vortex-array/src/arrow/compute/to_arrow/canonical.rs:109:14
   8: invoke
             at ./vortex-array/src/arrow/compute/to_arrow/mod.rs:122:42
   9: invoke
             at ./vortex-array/src/compute/mod.rs:149:34
  10: to_arrow_opts
             at ./vortex-array/src/arrow/compute/to_arrow/mod.rs:77:10
  11: into_arrow_preferred
             at ./vortex-array/src/arrow/mod.rs:41:9
  12: arrow_compare
             at ./vortex-array/src/compute/compare.rs:333:36
  13: invoke
             at ./vortex-array/src/compute/compare.rs:219:12
  14: invoke
             at ./vortex-array/src/compute/mod.rs:149:34
  15: compare
             at ./vortex-array/src/compute/compare.rs:59:10

@Pratham1812
Copy link
Contributor Author

Hi, got what you are saying. Should I also add a scope for retries incase the validator is unable to validate the scalar?

Pratham1812 and others added 2 commits December 27, 2025 22:44
@Pratham1812 Pratham1812 changed the title feat: introduces fuzzing with extension arrays Feature: introduces fuzzing with extension arrays Dec 27, 2025
@Pratham1812
Copy link
Contributor Author

I have tried to introduce the validator logic, I was not sure which crate should be the right place to introduce it so the code might need some refactoring as well. Please let me know your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Release label indicating a new feature or request fuzzer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants