Skip to content

Building emulator training data: cumulative-threshold \(\xi_{AB}\)

This page documents the halocat.xi_AB_manager subsystem, which assembles cumulative-mass-threshold \(\xi_{hh}\) measurements across a suite of cosmologies into the input shape an emulator of

\[ \xi_{AB}(r; x_1, x_2 \mid \theta_{\mathrm{cosmo}}), \qquad x = \log_{10} M \]

expects:

\[ \xi_{AB}^{\mathrm{train}}\!\bigl[N_{\mathrm{cosmo}}, N_r, N_t, N_t\bigr], \quad \theta\!\bigl[N_{\mathrm{cosmo}}, n_{\mathrm{params}}\bigr], \quad n_h^{\mathrm{cum}}\!\bigl[N_{\mathrm{cosmo}}, N_t\bigr]. \]

It sits on top of the existing per-realisation halocat infrastructure — halocat.tpcf (pycorr / Corrfunc), halocat.io, halocat.pipeline — and adds three things:

  1. a threshold-shaped on-disk schema (xi_AB_thresh.hdf5);
  2. a threshold-grid pair-counting optimisationhalocat.tpcf._measure_xi_AB_threshold_grid selects positions once per threshold and runs only \(N_t (N_t+1)/2\) pycorr calls;
  3. an emulator-assembly entry point (assemble_emulator_training_set) that stacks per-realisation files into the cube + θ-matrix the emulator consumes.

Two entry points

Function Inputs Output
XiABLoader.get one (gravity, redshift, imodel, ibox) XiABRecord
assemble_emulator_training_set one gravity × one redshift × many (imodel, ibox) dict of stacked tensors

XiABLoader is load-or-measure (mirroring HMFLoader / XiHHLoader) — it reads xi_AB_thresh.hdf5 if a cache hit, else measures and writes atomically. assemble_emulator_training_set calls XiABLoader.get in a loop, computes the cosmology parameter matrix via get_cosmology, and returns plain numpy arrays — no halocat record types leak into the emulator training code.

\(\theta\) schema is gravity-conditional

from halocat import theta_keys
theta_keys("LCDM")   # ('Omega_m', 'h', 'n_s', 'S_8')
theta_keys("fRn1")   # ('Omega_m', 'h', 'n_s', 'S_8', 'logf_R0')

Two emulators: one for ΛCDM (4-parameter θ), one for f(R) (5-parameter θ). The manager always returns the appropriate schema for the requested gravity. The fiducial sentinels (imodel ∈ {-1, 0}) are hard-coded in halocat.cosmology.FIDUCIAL_THETA; the design rows (imodel ∈ {1..64}) are loaded from DESIGN_FILE_DEFAULT.

Default threshold grid

DEFAULT_THRESHOLDS = np.arange(12.55, 15.45 + 1e-9, 0.05)   # 59 thresholds
Quantity Value
Δ log M 0.05 dex
Range [12.55, 15.45]
\(N_t\) 59
pycorr calls per realisation \(N_t(N_t+1)/2 = 1\,770\)

The range is pinned within the HMF measurement support \([\mathrm{LOG10M\_MIN}, \mathrm{LOG10M\_MAX}] = [12.5, 15.5]\). Out-of-range thresholds raise ValueError — the simulation has essentially no haloes outside this band.

You can override the default at construction time:

loader = XiABLoader(thresholds=np.array([12.6, 12.8, 13.0, 13.2]))

Worked example

from halocat import assemble_emulator_training_set
import numpy as np

out = assemble_emulator_training_set(
    gravity="LCDM",
    redshift=0.25,
    imodels=list(range(1, 11)),     # 10 design cosmologies
    iboxes=[1, 2, 3, 4, 5],         # average over 5 box realisations
    avg_iboxes=True,
)

# Tensors ready for emulator training:
out["xi_AB"].shape         # (10, 60, 59, 59)
out["theta"].shape         # (10, 4)            -- LCDM = 4 params
out["theta_keys"]          # ('Omega_m', 'h', 'n_s', 'S_8')
out["n_h_cum"].shape       # (10, 59)           -- mean halo count per threshold
out["thresholds"].shape    # (59,)
out["r"].shape             # (60,)

For each (gravity, imodel, ibox) cache miss the manager invokes the threshold-grid pair counter and writes <OUTPUT_ROOT>/gravity_<g>/z_<z:.2f>/imodel_<m>/box_<b>/xi_AB_thresh.hdf5. Subsequent calls with the same spec read directly from disk.

What the cube looks like

The figures below are produced by scripts/plot_xi_AB.py from the fully-populated 64-imodel × 5-ibox suites at z = 0.25:

  • Left panel — fix one threshold pair (log10M_1, log10M_2) = (13.0, 13.0) and overlay \(r^2 \xi_{AB}(r)\) for every cosmology in the design, colored by a θ parameter. The clustering amplitude tracks Ω_m for ΛCDM and shows the f(R) modification on top of the base cosmology for the f(R) suite.
  • Right panel — fix one cosmology (imodel = 1) and overlay the auto-correlation \(r^2 \xi_{AB}(r; t, t)\) for a sequence of mass thresholds \(\log_{10} M \in \{12.6, 12.8, 13.0, 13.2, 13.4, 13.6\}\). Higher-mass haloes are more strongly clustered, as expected.

ΛCDM:

ξ_AB demo, LCDM

f(R) (note the extra logf_R0 dimension in θ shows up as a separate colour axis):

ξ_AB demo, fRn1

Caching and concurrency

Layer Mechanism Notes
Per-realisation file xi_AB_thresh.hdf5 next to xi_hh.hdf5. Cache key = blake2b-128 hex over (r-spec, thresholds, mass_definition, code_version, source_dat). Source of truth.
Atomic write Writes to …tmp.<host>.<pid> then os.replace(...) (POSIX-atomic on Lustre / GPFS). Soft warning on sibling tmp files; no lock file. Two SLURM jobs writing the same path produce one valid file (last writer wins) — both still do the redundant compute, but the file is never corrupt.
Suite snapshot Opt-in via materialise_snapshot=True. Single HDF5 file at <OUTPUT_ROOT>/snapshots/xi_AB_suite_<g>_z<z:.2f>_<suite_hash>.hdf5. Re-read directly when the suite hash matches.
In-process memo Module-level dict in xi_AB_manager, invalidated on max(mtime) of underlying realisation files. Drop with clear_in_process_cache.

On-disk schema

xi_AB_thresh.hdf5
  attrs:
    schema_version  = "1"
    cache_key       = <blake2b-128 hex>
    gravity, redshift, imodel, ibox, snapnum, box_size
    mass_definition = "Mtot"
    mass_units      = "Msun/h"
    code_version, source_dat, halo_file, measured_at
    estimator, pycorr_version, corrfunc_version

  /r_edges     float64  (N_r+1,)
  /thresholds  float64  (N_t,)
  /xi_AB       float64  (N_r, N_t, N_t)   -- symmetric in last two axes
  /n_h_cum     int64    (N_t,)            -- halo counts (count, NOT density)
  /n_pairs     int64    (N_t, N_t)        -- diagnostic pair counts

r is not stored — derive on read as \(r := \tfrac{1}{2}(r_{\mathrm{edges}}[:-1] + r_{\mathrm{edges}}[1:])\). Project rule (never pycorr.sepavg).

xi_AB is dense and symmetric (not pair-sparse). At the default \(N_t = 59\), the cube is \(\sim 1.7\) MB per realisation — tiny, no chunking / compression added.

Threshold-grid optimisation

The naive route (measure_xi_hh with MASS_BINS = [(t, +inf), …]) re-extracts the threshold sample for every pair, doing \(N_t^2\) selections. The threshold-grid optimisation halocat.tpcf._measure_xi_AB_threshold_grid sorts haloes by mass once and slices via searchsorted, dropping the selection cost to \(O(N_h \log N_h + N_t)\). Output is bit-identical to the naive route (same haloes, same pycorr engine) — verified by tests/test_xi_AB_manager.py::test_threshold_grid_optimisation_matches_naive_bit_identical.

Edge cases

Condition Policy
thresholds outside [LOG10M_MIN, LOG10M_MAX] ValueError at XiABLoader construction.
thresholds not strictly increasing ValueError.
mass_definition != "Mtot" ValueError (only Mtot supported initially).
halo.hdf5 missing Reformatted on demand from the source CatshortV.*.DAT via pipeline.run_single.
Cache file present but cache_key doesn't match Treated as miss; re-measure.
Cache file present but schema_version doesn't match read_xi_AB raises ValueError.
Two writers race Atomic rename guarantees no corruption.

See also