Building emulator training data: cumulative-threshold \(\xi_{AB}\)¶

This page documents the halocat.xi_AB_manager subsystem, which assembles cumulative-mass-threshold \(\xi_{hh}\) measurements across a suite of cosmologies into the input shape an emulator of

\[ \xi_{AB}(r; x_1, x_2 \mid \theta_{\mathrm{cosmo}}), \qquad x = \log_{10} M \]

expects:

\[ \xi_{AB}^{\mathrm{train}}\!\bigl[N_{\mathrm{cosmo}}, N_r, N_t, N_t\bigr], \quad \theta\!\bigl[N_{\mathrm{cosmo}}, n_{\mathrm{params}}\bigr], \quad n_h^{\mathrm{cum}}\!\bigl[N_{\mathrm{cosmo}}, N_t\bigr]. \]

It sits on top of the existing per-realisation halocat infrastructure — halocat.tpcf (pycorr / Corrfunc), halocat.io, halocat.pipeline — and adds three things:

a threshold-shaped on-disk schema (xi_AB_thresh.hdf5);
a threshold-grid pair-counting optimisation — halocat.tpcf._measure_xi_AB_threshold_grid selects positions once per threshold and runs only \(N_t (N_t+1)/2\) pycorr calls;
an emulator-assembly entry point (assemble_emulator_training_set) that stacks per-realisation files into the cube + θ-matrix the emulator consumes.

Two entry points¶

Function	Inputs	Output
`XiABLoader.get`	one `(gravity, redshift, imodel, ibox)`	`XiABRecord`
`assemble_emulator_training_set`	one gravity × one redshift × many `(imodel, ibox)`	`dict` of stacked tensors

XiABLoader is load-or-measure (mirroring HMFLoader / XiHHLoader) — it reads xi_AB_thresh.hdf5 if a cache hit, else measures and writes atomically. assemble_emulator_training_set calls XiABLoader.get in a loop, computes the cosmology parameter matrix via get_cosmology, and returns plain numpy arrays — no halocat record types leak into the emulator training code.

\(\theta\) schema is gravity-conditional¶

from halocat import theta_keys
theta_keys("LCDM")   # ('Omega_m', 'h', 'n_s', 'S_8')
theta_keys("fRn1")   # ('Omega_m', 'h', 'n_s', 'S_8', 'logf_R0')

Two emulators: one for ΛCDM (4-parameter θ), one for f(R) (5-parameter θ). The manager always returns the appropriate schema for the requested gravity. The fiducial sentinels (imodel ∈ {-1, 0}) are hard-coded in halocat.cosmology.FIDUCIAL_THETA; the design rows (imodel ∈ {1..64}) are loaded from DESIGN_FILE_DEFAULT.

Default threshold grid¶

DEFAULT_THRESHOLDS = np.arange(12.55, 15.45 + 1e-9, 0.05)   # 59 thresholds

Quantity	Value
Δ log M	0.05 dex
Range	[12.55, 15.45]
\(N_t\)	59
pycorr calls per realisation	\(N_t(N_t+1)/2 = 1\,770\)

The range is pinned within the HMF measurement support \([\mathrm{LOG10M\_MIN}, \mathrm{LOG10M\_MAX}] = [12.5, 15.5]\). Out-of-range thresholds raise ValueError — the simulation has essentially no haloes outside this band.

You can override the default at construction time:

loader = XiABLoader(thresholds=np.array([12.6, 12.8, 13.0, 13.2]))

Worked example¶

from halocat import assemble_emulator_training_set
import numpy as np

out = assemble_emulator_training_set(
    gravity="LCDM",
    redshift=0.25,
    imodels=list(range(1, 11)),     # 10 design cosmologies
    iboxes=[1, 2, 3, 4, 5],         # average over 5 box realisations
    avg_iboxes=True,
)

# Tensors ready for emulator training:
out["xi_AB"].shape         # (10, 60, 59, 59)
out["theta"].shape         # (10, 4)            -- LCDM = 4 params
out["theta_keys"]          # ('Omega_m', 'h', 'n_s', 'S_8')
out["n_h_cum"].shape       # (10, 59)           -- mean halo count per threshold
out["thresholds"].shape    # (59,)
out["r"].shape             # (60,)

For each (gravity, imodel, ibox) cache miss the manager invokes the threshold-grid pair counter and writes <OUTPUT_ROOT>/gravity_<g>/z_<z:.2f>/imodel_<m>/box_<b>/xi_AB_thresh.hdf5. Subsequent calls with the same spec read directly from disk.

What the cube looks like¶

The figures below are produced by scripts/plot_xi_AB.py from the fully-populated 64-imodel × 5-ibox suites at z = 0.25:

Left panel — fix one threshold pair (log10M_1, log10M_2) = (13.0, 13.0) and overlay \(r^2 \xi_{AB}(r)\) for every cosmology in the design, colored by a θ parameter. The clustering amplitude tracks Ω_m for ΛCDM and shows the f(R) modification on top of the base cosmology for the f(R) suite.
Right panel — fix one cosmology (imodel = 1) and overlay the auto-correlation \(r^2 \xi_{AB}(r; t, t)\) for a sequence of mass thresholds \(\log_{10} M \in \{12.6, 12.8, 13.0, 13.2, 13.4, 13.6\}\). Higher-mass haloes are more strongly clustered, as expected.

ΛCDM:

ξ_AB demo, LCDM

f(R) (note the extra logf_R0 dimension in θ shows up as a separate colour axis):

ξ_AB demo, fRn1

Caching and concurrency¶

Layer	Mechanism	Notes
Per-realisation file	`xi_AB_thresh.hdf5` next to `xi_hh.hdf5`. Cache key = blake2b-128 hex over `(r-spec, thresholds, mass_definition, code_version, source_dat)`.	Source of truth.
Atomic write	Writes to `…tmp.<host>.<pid>` then `os.replace(...)` (POSIX-atomic on Lustre / GPFS). Soft warning on sibling tmp files; no lock file.	Two SLURM jobs writing the same path produce one valid file (last writer wins) — both still do the redundant compute, but the file is never corrupt.
Suite snapshot	Opt-in via `materialise_snapshot=True`. Single HDF5 file at `<OUTPUT_ROOT>/snapshots/xi_AB_suite_<g>_z<z:.2f>_<suite_hash>.hdf5`.	Re-read directly when the suite hash matches.
In-process memo	Module-level dict in `xi_AB_manager`, invalidated on `max(mtime)` of underlying realisation files.	Drop with `clear_in_process_cache`.

On-disk schema¶

xi_AB_thresh.hdf5
  attrs:
    schema_version  = "1"
    cache_key       = <blake2b-128 hex>
    gravity, redshift, imodel, ibox, snapnum, box_size
    mass_definition = "Mtot"
    mass_units      = "Msun/h"
    code_version, source_dat, halo_file, measured_at
    estimator, pycorr_version, corrfunc_version

  /r_edges     float64  (N_r+1,)
  /thresholds  float64  (N_t,)
  /xi_AB       float64  (N_r, N_t, N_t)   -- symmetric in last two axes
  /n_h_cum     int64    (N_t,)            -- halo counts (count, NOT density)
  /n_pairs     int64    (N_t, N_t)        -- diagnostic pair counts

r is not stored — derive on read as \(r := \tfrac{1}{2}(r_{\mathrm{edges}}[:-1] + r_{\mathrm{edges}}[1:])\). Project rule (never pycorr.sepavg).

xi_AB is dense and symmetric (not pair-sparse). At the default \(N_t = 59\), the cube is \(\sim 1.7\) MB per realisation — tiny, no chunking / compression added.

Threshold-grid optimisation¶

The naive route (measure_xi_hh with MASS_BINS = [(t, +inf), …]) re-extracts the threshold sample for every pair, doing \(N_t^2\) selections. The threshold-grid optimisation halocat.tpcf._measure_xi_AB_threshold_grid sorts haloes by mass once and slices via searchsorted, dropping the selection cost to \(O(N_h \log N_h + N_t)\). Output is bit-identical to the naive route (same haloes, same pycorr engine) — verified by tests/test_xi_AB_manager.py::test_threshold_grid_optimisation_matches_naive_bit_identical.

Edge cases¶

Condition	Policy
`thresholds` outside [LOG10M_MIN, LOG10M_MAX]	`ValueError` at `XiABLoader` construction.
`thresholds` not strictly increasing	`ValueError`.
`mass_definition != "Mtot"`	`ValueError` (only Mtot supported initially).
`halo.hdf5` missing	Reformatted on demand from the source `CatshortV.*.DAT` via `pipeline.run_single`.
Cache file present but `cache_key` doesn't match	Treated as miss; re-measure.
Cache file present but `schema_version` doesn't match	`read_xi_AB` raises `ValueError`.
Two writers race	Atomic rename guarantees no corruption.