-
Notifications
You must be signed in to change notification settings - Fork 108
Feature: introduces fuzzing with extension arrays #5819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Signed-off-by: Pratham Agarwal <agarwalpratham1812@gmail.com>
|
@connortsui20 Please let me know your comments, I will try to resolve them at the earliest. |
Codecov Report❌ Patch coverage is
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Running the fuzzer on this branch: https://github.com/vortex-data/vortex/actions/runs/20531444854 |
connortsui20
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, this is a definitely good start!
The main concern here is what @robert3005 said in the original issue:
The difficult part is having to know what’s the valid domain of storage type for the extension type. Requires extension types to have validation function for a value
Basically, in addition to testing operations over the different kinds of extension arrays, we also want to validate that the array values are always in the domain of the given type. For example, we want to test that the temporal arrays like vortex.date or vortex.time and also make sure that their values do not overflow when calling .to_jiff().
I was thinking of something along the lines of this (note this is claude-assisted code):
type DomainValidator = Box<dyn Fn(&Scalar) -> bool + Send + Sync>;
fn validator_for_ext_type(ext_dtype: &ExtDType) -> DomainValidator {
if is_temporal_ext_type(ext_dtype.id()) {
let metadata = TemporalMetadata::try_from(ext_dtype).unwrap();
Box::new(move |scalar| {
scalar.is_null()
|| metadata
.to_jiff(scalar.as_extension().storage().as_i64().unwrap())
.is_ok()
})
} else if {
// Logic for handling extension types that are specific to the fuzzer,
// for example maybe an extension type that only accepts even i32 integers?
} else {
// Unknown extension type, so just accept everything?
Box::new(|_| true)
}
}And then the array generation becomes easy:
fn random_extension(
u: &mut Unstructured,
ext_dtype: &Arc<ExtDType>,
chunk_len: Option<usize>,
) -> Result<(ArrayRef, DomainValidator)> {
let validator = validator_for_ext_type(ext_dtype);
let array_length = chunk_len.unwrap_or(u.int_in_range(0..=20)?);
let mut builder = ExtensionBuilder::with_capacity(ext_dtype.clone(), array_length);
for _ in 0..array_length {
// Generate storage values, retry if invalid.
// Just make sure the selectivity isn't too small otherwise this will take forever...
loop {
let storage_scalar = random_scalar(u, ext_dtype.storage_dtype())?;
let ext_scalar = Scalar::extension(ext_dtype.clone(), storage_scalar);
if validator.validate_scalar(&ext_scalar) {
builder.append_scalar(&ext_scalar).unwrap();
break;
}
// else: retry with new random value.
}
}
Ok((builder.finish(), validator))
}You would probably need to also add the DomainValidator type as an optional field to FuzzArrayAction and maybe some other places so the fuzzer can access the validation closure by calling the validation function at the end of run_fuzz_action.
I had Claude tell me a rough "plan of attack" that might be helpful. Obviously it might be missing stuff, but it should be a very good starting point. Please let us know if you have any questions!
### Goal
Add a validation closure that is generated alongside each extension array and called after
every fuzz operation to verify domain invariants are preserved.
### Implementation
1. **Define the validator type** (in `fuzz/src/array/mod.rs` or similar):
- `pub type DomainValidator = Arc<dyn Fn(&dyn Array) -> bool + Send + Sync>;`
2. **Add validator to `FuzzArrayAction`**:
- Add field `pub validator: Option<DomainValidator>` to the struct
3. **Create `validator_for_ext_type(ext_dtype: &ExtDType) -> DomainValidator`**:
- For temporal types (`vortex.date`, `vortex.time`, `vortex.timestamp`): validate that
`TemporalMetadata::to_jiff()` succeeds for all non-null values
- For unknown extension types: return a validator that accepts everything
4. **Modify `random_extension()` in `vortex-array/src/arrays/arbitrary.rs`**:
- Get validator via `validator_for_ext_type()`
- When generating values, retry if `!validator.validate_scalar(&scalar)`
- Return both the array and the validator
5. **Modify `run_fuzz_action()`**:
- After each operation that produces an array, call `validator(&result)` if present
- Panic/error if validation fails
6. **Use real temporal extension types in `random_ext_dtype()`**
(`vortex-dtype/src/arbitrary/mod.rs`):
- Generate `vortex.date`, `vortex.time`, `vortex.timestamp` with valid `(TimeUnit, PType)`
combinations and proper metadata
- Can also keep synthetic types for stress testing, but they need validators too
|
Also, you cannot use "fake" extension types without registering them (see how the fuzzer failed here), so I would try and stick to the existing extension types (temporal types) before creating custom ones for the fuzzer. |
|
Hi, got what you are saying. Should I also add a scope for retries incase the validator is unable to validate the scalar? |
Signed-off-by: Pratham Agarwal <agarwalpratham1812@gmail.com>
|
I have tried to introduce the validator logic, I was not sure which crate should be the right place to introduce it so the code might need some refactoring as well. Please let me know your comments. |
This PR introduces fuzzing in extension arrays solving #1588 . I have tried running the fuzzer locally and have tried to solve the possible edge cases I could identify.