Building emulator training data: cumulative-threshold \(\xi_{AB}\)¶
This page documents the halocat.xi_AB_manager
subsystem, which assembles cumulative-mass-threshold \(\xi_{hh}\)
measurements across a suite of cosmologies into the input shape an
emulator of
expects:
It sits on top of the existing per-realisation halocat infrastructure —
halocat.tpcf (pycorr / Corrfunc),
halocat.io,
halocat.pipeline — and adds three things:
- a threshold-shaped on-disk schema (
xi_AB_thresh.hdf5); - a threshold-grid pair-counting optimisation —
halocat.tpcf._measure_xi_AB_threshold_gridselects positions once per threshold and runs only \(N_t (N_t+1)/2\) pycorr calls; - an emulator-assembly entry point
(
assemble_emulator_training_set) that stacks per-realisation files into the cube + θ-matrix the emulator consumes.
Two entry points¶
| Function | Inputs | Output |
|---|---|---|
XiABLoader.get |
one (gravity, redshift, imodel, ibox) |
XiABRecord |
assemble_emulator_training_set |
one gravity × one redshift × many (imodel, ibox) |
dict of stacked tensors |
XiABLoader is load-or-measure (mirroring
HMFLoader /
XiHHLoader) — it reads
xi_AB_thresh.hdf5 if a cache hit, else measures and writes
atomically. assemble_emulator_training_set calls XiABLoader.get
in a loop, computes the cosmology parameter matrix via
get_cosmology, and returns
plain numpy arrays — no halocat record types leak into the emulator
training code.
\(\theta\) schema is gravity-conditional¶
from halocat import theta_keys
theta_keys("LCDM") # ('Omega_m', 'h', 'n_s', 'S_8')
theta_keys("fRn1") # ('Omega_m', 'h', 'n_s', 'S_8', 'logf_R0')
Two emulators: one for ΛCDM (4-parameter θ), one for f(R) (5-parameter
θ). The manager always returns the appropriate schema for the
requested gravity. The fiducial sentinels (imodel ∈ {-1, 0}) are
hard-coded in halocat.cosmology.FIDUCIAL_THETA;
the design rows (imodel ∈ {1..64}) are loaded from
DESIGN_FILE_DEFAULT.
Default threshold grid¶
| Quantity | Value |
|---|---|
| Δ log M | 0.05 dex |
| Range | [12.55, 15.45] |
| \(N_t\) | 59 |
| pycorr calls per realisation | \(N_t(N_t+1)/2 = 1\,770\) |
The range is pinned within the HMF measurement support
\([\mathrm{LOG10M\_MIN}, \mathrm{LOG10M\_MAX}] = [12.5, 15.5]\). Out-of-range
thresholds raise ValueError — the simulation has essentially no
haloes outside this band.
You can override the default at construction time:
Worked example¶
from halocat import assemble_emulator_training_set
import numpy as np
out = assemble_emulator_training_set(
gravity="LCDM",
redshift=0.25,
imodels=list(range(1, 11)), # 10 design cosmologies
iboxes=[1, 2, 3, 4, 5], # average over 5 box realisations
avg_iboxes=True,
)
# Tensors ready for emulator training:
out["xi_AB"].shape # (10, 60, 59, 59)
out["theta"].shape # (10, 4) -- LCDM = 4 params
out["theta_keys"] # ('Omega_m', 'h', 'n_s', 'S_8')
out["n_h_cum"].shape # (10, 59) -- mean halo count per threshold
out["thresholds"].shape # (59,)
out["r"].shape # (60,)
For each (gravity, imodel, ibox) cache miss the manager invokes
the threshold-grid pair counter and writes
<OUTPUT_ROOT>/gravity_<g>/z_<z:.2f>/imodel_<m>/box_<b>/xi_AB_thresh.hdf5.
Subsequent calls with the same spec read directly from disk.
What the cube looks like¶
The figures below are produced by scripts/plot_xi_AB.py from the
fully-populated 64-imodel × 5-ibox suites at z = 0.25:
- Left panel — fix one threshold pair
(log10M_1, log10M_2) = (13.0, 13.0)and overlay \(r^2 \xi_{AB}(r)\) for every cosmology in the design, colored by a θ parameter. The clustering amplitude tracks Ω_m for ΛCDM and shows the f(R) modification on top of the base cosmology for the f(R) suite. - Right panel — fix one cosmology (imodel = 1) and overlay the auto-correlation \(r^2 \xi_{AB}(r; t, t)\) for a sequence of mass thresholds \(\log_{10} M \in \{12.6, 12.8, 13.0, 13.2, 13.4, 13.6\}\). Higher-mass haloes are more strongly clustered, as expected.
ΛCDM:

f(R) (note the extra logf_R0 dimension in θ shows up as a separate
colour axis):

Caching and concurrency¶
| Layer | Mechanism | Notes |
|---|---|---|
| Per-realisation file | xi_AB_thresh.hdf5 next to xi_hh.hdf5. Cache key = blake2b-128 hex over (r-spec, thresholds, mass_definition, code_version, source_dat). |
Source of truth. |
| Atomic write | Writes to …tmp.<host>.<pid> then os.replace(...) (POSIX-atomic on Lustre / GPFS). Soft warning on sibling tmp files; no lock file. |
Two SLURM jobs writing the same path produce one valid file (last writer wins) — both still do the redundant compute, but the file is never corrupt. |
| Suite snapshot | Opt-in via materialise_snapshot=True. Single HDF5 file at <OUTPUT_ROOT>/snapshots/xi_AB_suite_<g>_z<z:.2f>_<suite_hash>.hdf5. |
Re-read directly when the suite hash matches. |
| In-process memo | Module-level dict in xi_AB_manager, invalidated on max(mtime) of underlying realisation files. |
Drop with clear_in_process_cache. |
On-disk schema¶
xi_AB_thresh.hdf5
attrs:
schema_version = "1"
cache_key = <blake2b-128 hex>
gravity, redshift, imodel, ibox, snapnum, box_size
mass_definition = "Mtot"
mass_units = "Msun/h"
code_version, source_dat, halo_file, measured_at
estimator, pycorr_version, corrfunc_version
/r_edges float64 (N_r+1,)
/thresholds float64 (N_t,)
/xi_AB float64 (N_r, N_t, N_t) -- symmetric in last two axes
/n_h_cum int64 (N_t,) -- halo counts (count, NOT density)
/n_pairs int64 (N_t, N_t) -- diagnostic pair counts
r is not stored — derive on read as
\(r := \tfrac{1}{2}(r_{\mathrm{edges}}[:-1] + r_{\mathrm{edges}}[1:])\).
Project rule (never pycorr.sepavg).
xi_AB is dense and symmetric (not pair-sparse). At the default
\(N_t = 59\), the cube is \(\sim 1.7\) MB per realisation — tiny, no
chunking / compression added.
Threshold-grid optimisation¶
The naive route (measure_xi_hh with MASS_BINS = [(t, +inf), …])
re-extracts the threshold sample for every pair, doing \(N_t^2\)
selections. The threshold-grid optimisation
halocat.tpcf._measure_xi_AB_threshold_grid sorts haloes by mass
once and slices via searchsorted, dropping the selection cost to
\(O(N_h \log N_h + N_t)\). Output is bit-identical to the naive
route (same haloes, same pycorr engine) — verified by
tests/test_xi_AB_manager.py::test_threshold_grid_optimisation_matches_naive_bit_identical.
Edge cases¶
| Condition | Policy |
|---|---|
thresholds outside [LOG10M_MIN, LOG10M_MAX] |
ValueError at XiABLoader construction. |
thresholds not strictly increasing |
ValueError. |
mass_definition != "Mtot" |
ValueError (only Mtot supported initially). |
halo.hdf5 missing |
Reformatted on demand from the source CatshortV.*.DAT via pipeline.run_single. |
Cache file present but cache_key doesn't match |
Treated as miss; re-measure. |
Cache file present but schema_version doesn't match |
read_xi_AB raises ValueError. |
| Two writers race | Atomic rename guarantees no corruption. |
See also¶
- Threshold → bin rebinning — convert a
cumulative-threshold
XiABRecordinto a differentialXiHHRecordviaxi_AB_io.to_xi_hh_recordand feed the result torebin_xi_hh_record. - Halo–halo 2PCF — the underlying per-pair measurement.