This document is non-normative. All contracts are defined in `../specs/SPEC.md`.

# INTEROPERABILITY & BENCHMARKING CONTEXT

## PURPOSE

This document defines:

- external datasets to support
- competing / complementary methods
- data formats and canonical representation
- preprocessing invariants
- benchmarking and verification protocols

Goal:

> Make `pdelie` a **hub** that connects PDE data → symmetry generators → invariants → downstream methods.

NOT a silo.

---

# CORE DESIGN PRINCIPLE

## Canonical internal representation

ALL external data must be converted to a unified format:

```python
FieldBatch(
    values,          # array-like
    dims,            # ("batch", "time", spatial..., "var")
    coords,          # coordinate arrays (authoritative)
    var_names,       # ["u", "v", ...]
    metadata,        # structured metadata (see below)
    preprocess_log   # transformations applied
)
```

### FieldBatch contract (STRICT)

- dims MUST be explicit and authoritative
- spatial axes MUST be ordered and named (x, y, z)
- time is optional (stationary PDEs allowed)
- grids MUST be structured rectilinear in v0.x
- coordinates MUST specify:
  - node-centered vs cell-centered
  - domain bounds
- metadata MUST include:
  - PDE family (if known)
  - boundary conditions (periodic, Dirichlet, Neumann)
  - grid regularity (uniform/nonuniform)
  - parameter tags (per-trajectory coefficients)
- multivariate fields:
  - encoded via var axis (channel-last)
- missing data:
  - MUST be represented via masks or NaNs

---

## Canonical pipeline objects

ALL stages must produce structured outputs:

- FieldBatch
- DerivativeBatch
- ResidualEvaluator
- GeneratorFamily
- InvariantMap
- InvariantLibrary
- DiscoveryResult
- VerificationReport

These are stable contracts, not implementation details.
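The FieldBatch contract above lends itself to a mechanical layout check. The following is a minimal sketch, not the `pdelie` implementation: the function name `check_field_batch_layout`, the set of allowed dim names, and the rule that a 1D field names its single spatial axis `x` are assumptions made for illustration. The normative rules remain those in `../specs/SPEC.md`. The function works on a plain shape tuple (for a NumPy array this would be `values.shape`), so it needs only the standard library.

```python
from __future__ import annotations

from typing import Mapping, Sequence

# Assumed canonical names; the contract only says spatial axes are "ordered and named (x, y, z)".
SPATIAL_ORDER = ("x", "y", "z")
ALLOWED_DIMS = {"batch", "time", "var", *SPATIAL_ORDER}


def check_field_batch_layout(
    shape: Sequence[int],
    dims: Sequence[str],
    coords: Mapping[str, Sequence[float]],
    var_names: Sequence[str],
) -> list[str]:
    """Return a list of contract violations; an empty list means the layout passes."""
    problems: list[str] = []
    if len(shape) != len(dims):
        # Later checks pair each dim with an axis length, so stop here.
        return [f"dims has {len(dims)} entries but values has {len(shape)} axes"]
    size = dict(zip(dims, shape))

    unknown = [d for d in dims if d not in ALLOWED_DIMS]
    if unknown:
        problems.append(f"unknown dim names: {unknown}")

    # Channel-last: the "var" axis must be the final axis and match var_names.
    if not dims or dims[-1] != "var":
        problems.append('last dim must be "var" (channel-last)')
    elif size["var"] != len(var_names):
        problems.append("var axis length does not match len(var_names)")

    # Spatial axes must appear in the order x, then y, then z.
    spatial = tuple(d for d in dims if d in SPATIAL_ORDER)
    if spatial != SPATIAL_ORDER[: len(spatial)]:
        problems.append(f"spatial axes out of order: {spatial}")

    # Coordinates are authoritative: every time/spatial axis needs a matching array.
    for d in dims:
        if d in ("batch", "var"):
            continue
        if d not in coords:
            problems.append(f"missing coords for {d!r}")
        elif len(coords[d]) != size[d]:
            problems.append(f"coords[{d!r}] length does not match axis length")
    return problems


# Example: a batch of 4 trajectories, 10 time steps, 64 grid points, 2 variables.
print(check_field_batch_layout(
    shape=(4, 10, 64, 2),
    dims=("batch", "time", "x", "var"),
    coords={"time": list(range(10)), "x": list(range(64))},
    var_names=["u", "v"],
))  # → []
```

Returning a list of violations rather than raising on the first one is a design choice for this sketch: an adapter can report every problem with an external dataset in a single pass.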
---

## Residual abstraction (CORE)

All symmetry fitting MUST be defined relative to a residual:

```python
class ResidualEvaluator:
    def evaluate(self, field: FieldBatch) -> ResidualBatch:
        """Evaluate the residual on `field`."""
        ...
```

Supported residual types:

- analytic PDE residual
- weak-form residual
- learned surrogate residual
- operator pushforward residual

---

## SUPPORTED DATA FORMATS

### Tier 1 (MUST support)

- HDF5
- NumPy (.npz)
- in-memory NumPy arrays
- xarray Dataset / DataArray

---

### Tier 2 (SHOULD support)

- netCDF
- Zarr
- Mathematica HDF5 exports

---

### Tier 3 (DO NOT prioritize)

- custom solver-specific formats
- proprietary binary formats

---

## DATASET ADAPTERS

All adapters MUST convert external data into the canonical FieldBatch format.

### Required adapters

```python
from_hdf5_pdebench(...)
from_hdf5_thewell(...)
from_numpy(...)
from_xarray(...)
from_wolfram_hdf5(...)
from_sympy_expression(...)
```

### Export adapters

```python
to_xarray(...)
to_netcdf(...)
to_zarr(...)
to_pysindy_library(...)
to_neuraloperator_dataset(...)
to_json_report(...)
```

## KEY DATA SOURCES

### 1. PDEBench

- structured HDF5 PDE rollouts
- canonical benchmark

---

### 2. The Well

- large-scale multi-physics dataset
- stress testing only in v0.x

---

### 3. Wolfram / Mathematica

- ground-truth (GT) symmetry validation
- exact PDE control

---

### 4. SymPy

- symbolic validation only

---

### 5. RealPDEBench (future)

- real + simulated PDE data
- paired experiments

**Use later for:**

- robustness validation

---

## SCOPE (v0.x)

Stable:

- structured-grid PDE data
- Lie point symmetries only
- polynomial generator parameterisations
- small PDE set (heat, Burgers, wave)

Experimental:

- neural generators
- weak-form advanced variants
- operator symmetry

---

## IDENTIFIABILITY CONVENTIONS

Generators are not unique.

Therefore:

- generators MUST be normalised (e.g. unit norm)
- comparison MUST be via span, not coefficients
- closure MUST be evaluated via Lie bracket residual
- approximate symmetries MUST be labeled explicitly

---

## PREPROCESSING (CRITICAL INVARIANT)

Preprocessing is a transformation and MUST be tracked.

### Allowed before configured symmetry diagnostics

- dtype conversion
- coordinate harmonization
- mild denoising

### Restricted

- normalisation
- amplitude scaling
- aggressive smoothing

---

## REQUIRED: preprocessing log

```text
{
  "transform_type": "...",
  "parameters": {...},
  "invertible": true/false
}
```

## Preprocessing Modes

### Mode 1: `physical`

- minimal transforms

---

### Mode 2: `analysis`

- smoothing/interpolation

---

### Mode 3: `ml_standardized`

- normalization/batching

---

## DERIVATIVE PROVENANCE

Each DerivativeBatch MUST include:

- backend (spectral / finite diff / weak)
- smoothing parameters
- boundary assumptions
- stencil / spectral config

---

## VERIFICATION PROTOCOL (STRICT)

Every symmetry claim MUST report:

- norm used (L2 / relative / normalized)
- ε-range for finite transforms
- held-out initial conditions
- held-out parameter sets
- error vs ε curve
- residual error vs baseline

Verification must distinguish:

- exact symmetry
- approximate symmetry
- failure

---

## FAILURE MODES

- dataset symmetry ≠ PDE symmetry
- derivative noise
- overexpressive generators
- conditioning vs symmetry confusion

---

## LIBRARY POSITIONING

pdelie is:

A bridge from PDE data → symmetry → invariants → downstream methods.

---

## ROADMAP

### v0.1

- FieldBatch contract
- polynomial symmetry detection
- spectral derivatives
- PDEBench integration

### v0.2

- invariant coordinate pipeline
- weak-form derivatives

### v0.3

- NeuralOperator integration
- operator symmetry (experimental)

## FINAL INSTRUCTION FOR AGENT

When extending the code:

1. Respect canonical contracts
2. Track all transformations
3. Validate all results numerically
4. Use simplest correct implementation
5. Distinguish stable vs experimental code
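
As one illustration of instruction 3 under the identifiability conventions above (comparison via span, not coefficients), the sketch below checks whether two sets of generator coefficient vectors span the same subspace. It is a hedged example, not the `pdelie` API: the helper names (`orthonormal_basis`, `span_residual`, `same_span`), the tolerances, and the 3-term coefficient basis are all hypothetical. It uses plain Gram-Schmidt from the standard library and assumes every input vector is nonzero.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _project_out(v, basis):
    """Remove from v its components along each orthonormal basis vector."""
    r = list(v)
    for q in basis:
        c = _dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt; vectors that are numerically dependent are dropped."""
    basis = []
    for v in vectors:
        r = _project_out(v, basis)
        norm = math.sqrt(_dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis


def span_residual(candidates, reference):
    """Worst relative distance from any candidate vector to span(reference)."""
    basis = orthonormal_basis(reference)
    worst = 0.0
    for v in candidates:
        r = _project_out(v, basis)
        worst = max(worst, math.sqrt(_dot(r, r)) / math.sqrt(_dot(v, v)))
    return worst


def same_span(a, b, tol=1e-8):
    """True when span(a) and span(b) agree up to tol in both directions."""
    return span_residual(a, b) < tol and span_residual(b, a) < tol


# Coefficient vectors in a hypothetical 3-term generator basis.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
learned = [[0.6, 0.8, 0.0], [0.8, -0.6, 0.0]]  # unit norm, rotated within the same plane
other = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]     # shares only one direction

print(same_span(learned, reference))  # → True
print(same_span(other, reference))    # → False
```

The `learned` vectors differ from `reference` coefficient by coefficient yet pass the check, which is why the conventions require span-based comparison. A production version would more likely use an SVD or principal angles for numerical robustness.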