Threshold → bin ξ_hh: closed-form rebinning¶
halocat.rebin_xi_hh_record converts a halo–halo 2PCF measured in
cumulative mass-threshold samples (each containing all haloes with
$\log_{10} M \ge \log_{10} M_{\rm thr}$) into a differential
bin-averaged ξ_hh^bin, without re-running pycorr.
Why bother? Measuring ξ_hh directly in a narrow bin uses only the haloes inside that bin. Threshold samples contain many more pairs, and the differential signal is recovered analytically from differences between thresholds. The result has lower variance — typically a ~2.5–3× equivalent N_box gain on this grid, largest for narrow / high-mass bins.
The identity. Integrating
$$ \bar n(x)\,\bar n(y)\,[1 + \xi_{hh}(r\,|\,x,y)] \;=\; \frac{\partial^2 F}{\partial x\,\partial y}, \qquad F(x,y;r) := n_h(\ge x)\,n_h(\ge y)\,\bigl[1 + \xi_{AB}(r;x,y)\bigr], $$
over the bin $[x_c \pm \delta/2,\; y_c \pm \delta/2]$ gives a closed-form mixed second difference:
$$ \Delta n(x_c)\,\Delta n(y_c)\,[1 + \xi_{hh}^{\rm bin}(r;x_c,y_c)] \;=\; F(x_-,y_-) - F(x_-,y_+) - F(x_+,y_-) + F(x_+,y_+), $$
where $x_\pm = x_c \pm \delta/2$ and $\Delta n(x_c) = n_h(\ge x_-) - n_h(\ge x_+)$. No derivative is approximated. Numerical error enters only through the interpolation of $F$ to the four bin edges.
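A quick standalone sanity check of the identity (a toy sketch, not halocat code): if the threshold 2PCF is independent of mass, the number-density factors in the mixed second difference cancel and the rebinned ξ must reproduce it exactly. The cumulative density `n_cum` below is a made-up power law for illustration.

```python
import numpy as np

def n_cum(logM):
    # Hypothetical cumulative number density, steeply falling with mass.
    return 10.0 ** (-3.0 - 1.5 * (logM - 12.0))

def F(x, y, xi_thr):
    # F(x, y; r) = n(>=x) n(>=y) [1 + xi_thr(r; x, y)] at fixed r.
    return n_cum(x) * n_cum(y) * (1.0 + xi_thr)

xc, yc, d = 12.7, 12.9, 0.1          # bin centres and full width (dex)
xm, xp = xc - d / 2, xc + d / 2
ym, yp = yc - d / 2, yc + d / 2

c = 2.5                               # mass-independent threshold 2PCF
num = F(xm, ym, c) - F(xm, yp, c) - F(xp, ym, c) + F(xp, yp, c)
dn_x = n_cum(xm) - n_cum(xp)          # differential counts in each bin
dn_y = n_cum(ym) - n_cum(yp)
xi_bin = num / (dn_x * dn_y) - 1.0
print(xi_bin)                         # recovers c up to roundoff
```

Because no derivative is approximated, the recovery is exact for any mass-independent ξ; a real ξ with mass dependence only picks up the interpolation error at the four corners.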
This walkthrough loads a cached 100-box measurement on the fiducial-LCDM
sentinel grid (gravity="LCDM", imodel=0, z=0.25,
ibox=1..100) and demonstrates:
- building a threshold-style `XiHHRecord`,
- running `rebin_xi_hh_record` on a single box,
- comparing against the direct differential measurement (one box),
- stacking across all 100 boxes to quantify the S/N gain,
- linear vs cubic interpolation.
import logging
import os
import numpy as np
import matplotlib.pyplot as plt
from halocat import rebin_xi_hh_record
from halocat.dataloader import XiHHRecord
# The rebin module emits an INFO log for every input record that has NaN
# entries (small-r empty pair bins). For a per-box loop across 100 boxes
# this is just noise — silence to WARNING so genuine issues still surface.
logging.getLogger("halocat.xi_hh_rebin").setLevel(logging.WARNING)
# Cache produced by `scripts/test_xi_hh_threshold_to_bin.py --n-boxes 100`.
# See `xi_hh_threshold_to_bin_results.md` (Run 5) for provenance.
CACHE = os.path.join(os.path.dirname(os.getcwd()),
"xi_thr2bin_100box_offgrid_cache.npz") \
if os.path.basename(os.getcwd()) == "notebooks" \
else "xi_thr2bin_100box_offgrid_cache.npz"
z = np.load(CACHE, allow_pickle=False)
print("cache contents:")
for k in z.files:
a = z[k]
print(f" {k:24s} shape={a.shape} dtype={a.dtype}")
print()
print("grid:")
print(f" gravity = {str(z['gravity'])}")
print(f" imodel = {int(z['imodel'])} (fiducial sentinel)")
print(f" redshift = {float(z['redshift'])}")
print(f" N_boxes = {z['iboxes'].size}")
print(f" thresholds = {z['thresholds']}")
print(f" logM_out = {z['logM_out']} (differential bin centres)")
print(f" dlogM = {float(z['dlogM'])} dex (bin full width)")
cache contents:
  xi_thr_per_box           shape=(100, 10, 60) dtype=float64
  n1_thr_per_box           shape=(100, 10) dtype=int64
  pair_indices_thr         shape=(10, 2) dtype=int64
  xi_dir_per_box           shape=(100, 6, 60) dtype=float64
  n1_dir_per_box           shape=(100, 6) dtype=int64
  pair_indices_dir         shape=(6, 2) dtype=int64
  thresholds               shape=(4,) dtype=float64
  direct_bins              shape=(3, 2) dtype=float64
  gravity                  shape=() dtype=<U4
  imodel                   shape=() dtype=int64
  redshift                 shape=() dtype=float64
  iboxes                   shape=(100,) dtype=int64
  r_edges                  shape=(61,) dtype=float64
  logM_out                 shape=(3,) dtype=float64
  dlogM                    shape=() dtype=float64

grid:
  gravity    = LCDM
  imodel     = 0 (fiducial sentinel)
  redshift   = 0.25
  N_boxes    = 100
  thresholds = [12.6 12.8 13.  13.2]
  logM_out   = [12.7 12.9 13.1] (differential bin centres)
  dlogM      = 0.1 dex (bin full width)
1. Pack one box into a threshold-style XiHHRecord¶
rebin_xi_hh_record expects an XiHHRecord whose mass_bins[:, 0]
are interpreted as thresholds (and whose upper edges should be +inf).
Each auto-pair M{i}_M{i} carries n1 = n_h(>= thresholds[i]) * V,
which the rebin uses to recover $n_h(\ge)$ at the bin edges via
log-linear interpolation.
The cache stores the upper triangle of pairs. We rebuild the full
XiHHRecord schema below.
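The log-linear recovery of $n_h(\ge)$ at off-grid bin edges can be sketched as follows. This is an illustrative standalone snippet, not the halocat implementation: the counts are the auto-pair `n1` values from the cache, and the helper `n_at` is hypothetical.

```python
import numpy as np

# Cumulative halo counts at the four threshold grid points (from the cache).
thresholds = np.array([12.6, 12.8, 13.0, 13.2])
n_thr = np.array([1121453.0, 677493.0, 412637.0, 245613.0])

def n_at(logM):
    # Log-linear interpolation: linear in log10 n vs log10 M, then exponentiate.
    return 10.0 ** np.interp(logM, thresholds, np.log10(n_thr))

edges = np.array([12.65, 12.75])        # edges of the first differential bin
dn = n_at(edges[0]) - n_at(edges[1])    # haloes with logM in [12.65, 12.75)
print(dn)
```

Interpolating in log n rather than n respects the near-exponential fall of the mass function between grid points, which keeps the edge counts positive and monotone.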
BOX_SIZE = 1024.0 # Mpc/h, halocat config default
def threshold_record(b: int) -> XiHHRecord:
# Pack the b-th cached box into a threshold-style XiHHRecord.
xi = z["xi_thr_per_box"][b]
n1 = z["n1_thr_per_box"][b]
pair_indices = z["pair_indices_thr"]
thresholds = z["thresholds"]
r_edges = z["r_edges"]
mass_bins = np.column_stack([thresholds,
np.full_like(thresholds, np.inf)])
pairs = [f"M{int(i)}_M{int(j)}" for (i, j) in pair_indices]
# n2 of pair (i, j) = n1 of auto pair (j, j).
n1_auto = {int(i): int(n1[p])
for p, (i, j) in enumerate(pair_indices) if i == j}
n2 = np.array([n1_auto[int(j)] for (_, j) in pair_indices], dtype=int)
log10M1 = np.array([mass_bins[int(i)] for (i, _) in pair_indices])
log10M2 = np.array([mass_bins[int(j)] for (_, j) in pair_indices])
r_centres = 0.5 * (r_edges[:-1] + r_edges[1:])
P = xi.shape[0]
r = np.broadcast_to(r_centres[None, :], (P, r_centres.size)).copy()
return XiHHRecord(
r_edges=r_edges, mass_bins=mass_bins, pairs=pairs,
pair_indices=np.asarray(pair_indices, dtype=int),
r=r, xi=xi.astype(np.float64),
n1=n1.astype(int), n2=n2,
log10M1=log10M1, log10M2=log10M2,
attrs={"box_size": BOX_SIZE,
"gravity": str(z["gravity"]),
"redshift": float(z["redshift"]),
"imodel": int(z["imodel"]),
"ibox": int(z["iboxes"][b])},
)
rec_thr = threshold_record(0)
print("threshold record (ibox =", rec_thr.attrs["ibox"], "):")
print(" mass_bins =\n", rec_thr.mass_bins)
print(" pairs =", rec_thr.pairs)
print(" xi.shape =", rec_thr.xi.shape, " (P_pairs, N_r)")
print(" auto n1 =",
[int(rec_thr.n1[p]) for p, (i, j) in enumerate(rec_thr.pair_indices)
if i == j])
threshold record (ibox = 1 ):
  mass_bins =
 [[12.6  inf]
 [12.8  inf]
 [13.   inf]
 [13.2  inf]]
  pairs = ['M0_M0', 'M0_M1', 'M0_M2', 'M0_M3', 'M1_M1', 'M1_M2', 'M1_M3', 'M2_M2', 'M2_M3', 'M3_M3']
  xi.shape = (10, 60)  (P_pairs, N_r)
  auto n1 = [1121453, 677493, 412637, 245613]
2. Rebin into differential bins¶
rebin_xi_hh_record returns a new XiHHRecord with the standard
pair-sparse layout — ready to drop into any downstream code that
already consumes XiHHRecord (loaders, plotters, analysis scripts).
The rebin never writes to disk and never modifies its input.
rec_bin = rebin_xi_hh_record(
rec_thr,
logM_out=z["logM_out"],
dlogM=float(z["dlogM"]),
interp_method="linear",
)
print("rebinned record:")
print(" mass_bins =\n", rec_bin.mass_bins)
print(" pairs =", rec_bin.pairs)
print(" xi.shape =", rec_bin.xi.shape)
print(" attrs (added by rebin):")
for k in ("rebin_method", "rebin_interp", "rebin_dlogM"):
print(f" {k:14s} = {rec_bin.attrs[k]!r}")
rebinned record:
mass_bins =
[[12.65 12.75]
[12.85 12.95]
[13.05 13.15]]
pairs = ['M0_M0', 'M0_M1', 'M0_M2', 'M1_M1', 'M1_M2', 'M2_M2']
xi.shape = (6, 60)
attrs (added by rebin):
rebin_method = 'cumulative_to_binned'
rebin_interp = 'linear'
rebin_dlogM = 0.1
3. Compare against direct measurement (one box)¶
The cache also stores a direct, finite-bin measurement of the same three differential mass bins. For a single box the two routes should agree on the signal but the threshold→bin route already has lower shot noise — visible at small $r$ in the high-mass auto-pair where the direct sample contains few pairs.
r = 0.5 * (z["r_edges"][:-1] + z["r_edges"][1:])
direct_bins = z["direct_bins"]
pair_indices_dir = z["pair_indices_dir"]
xi_dir_b0 = z["xi_dir_per_box"][0]
# Re-order rec_bin pairs to match the direct grid's pair_indices order.
key_to_idx = {k: i for i, k in enumerate(rec_bin.pairs)}
xi_bin_b0 = np.empty_like(xi_dir_b0)
for p, (i, j) in enumerate(pair_indices_dir):
xi_bin_b0[p] = rec_bin.xi[key_to_idx[f"M{int(i)}_M{int(j)}"]]
# Plot the three auto pairs side by side.
auto_p = [p for p, (i, j) in enumerate(pair_indices_dir) if i == j]
fig, axes = plt.subplots(1, len(auto_p), figsize=(4.0 * len(auto_p), 3.4),
sharey=False)
for ax, p in zip(np.atleast_1d(axes), auto_p):
i, _ = pair_indices_dir[p]
lo, hi = direct_bins[int(i)]
ax.plot(r, r ** 2 * xi_dir_b0[p], "o", color="C0", ms=3, label="direct")
ax.plot(r, r ** 2 * xi_bin_b0[p], "s", color="C3", ms=3, alpha=0.85,
label="threshold → bin")
ax.set(xscale="log",
xlabel=r"$r$ [Mpc/h]",
title=fr"$\log_{{10}}M \in [{lo:.2f}, {hi:.2f})$")
ax.grid(True, ls=":", alpha=0.5)
np.atleast_1d(axes)[0].set_ylabel(r"$r^2\,\xi_{hh}(r)$ [(Mpc/h)$^2$]")
np.atleast_1d(axes)[0].legend(fontsize=8, frameon=False)
fig.suptitle(f"Single-box comparison — ibox {int(z['iboxes'][0])}",
fontsize=11)
fig.tight_layout(rect=(0, 0, 1, 0.94))
plt.show()
4. Stack across 100 boxes — the S/N argument¶
The threshold→bin route's variance reduction is the whole point of the exercise. Loop over the 100 boxes, rebin each, then compare the SEM of the ensemble mean between the two routes.
Both routes target the same differential bins, so any difference in SEM is genuine signal-to-noise gain (or loss) of the threshold method relative to direct measurement.
# Loop the rebin across all 100 boxes. ~1 s total.
N = z["xi_thr_per_box"].shape[0]
xi_bin_per_box = np.empty_like(z["xi_dir_per_box"])
for b in range(N):
rec = rebin_xi_hh_record(
threshold_record(b),
logM_out=z["logM_out"], dlogM=float(z["dlogM"]),
interp_method="linear",
)
keymap = {k: i for i, k in enumerate(rec.pairs)}
for p, (i, j) in enumerate(pair_indices_dir):
xi_bin_per_box[b, p] = rec.xi[keymap[f"M{int(i)}_M{int(j)}"]]
with np.errstate(invalid="ignore"):
mean_d = np.nanmean(z["xi_dir_per_box"], axis=0)
mean_c = np.nanmean(xi_bin_per_box, axis=0)
sem_d = np.nanstd(z["xi_dir_per_box"], axis=0, ddof=1) / np.sqrt(N)
sem_c = np.nanstd(xi_bin_per_box, axis=0, ddof=1) / np.sqrt(N)
# SEM ratio at r > 1 Mpc/h (clean two-halo regime, no empty pair bins).
mask = r > 1.0
print("pair log10M1×log10M2 ⟨sem_c/sem_d⟩ equiv N_box gain")
print("-" * 78)
for p, (i, j) in enumerate(pair_indices_dir):
ratio = float(np.nanmean(sem_c[p, mask] / sem_d[p, mask]))
gain = ratio ** -2
lo1, hi1 = direct_bins[int(i)]
lo2, hi2 = direct_bins[int(j)]
label = f"M{int(i)}_M{int(j)} [{lo1:.2f},{hi1:.2f}) × [{lo2:.2f},{hi2:.2f})"
print(f"{label:40s} {ratio:11.3f} {gain:6.2f}×")
pair  log10M1×log10M2                     ⟨sem_c/sem_d⟩  equiv N_box gain
------------------------------------------------------------------------------
M0_M0 [12.65,12.75) × [12.65,12.75)            0.631          2.51×
M0_M1 [12.65,12.75) × [12.85,12.95)            0.651          2.36×
M0_M2 [12.65,12.75) × [13.05,13.15)            0.625          2.56×
M1_M1 [12.85,12.95) × [12.85,12.95)            0.609          2.69×
M1_M2 [12.85,12.95) × [13.05,13.15)            0.624          2.57×
M2_M2 [13.05,13.15) × [13.05,13.15)            0.572          3.05×
The "equiv. N_box gain" is (sem_c / sem_d)^{-2} — i.e. the factor by which you'd
have to multiply the number of direct-measurement boxes to match the
threshold→bin route's ensemble-mean precision. Roughly 2.5–3× more boxes at
this grid spacing, with the gain largest for the narrower / higher-mass bins,
where direct shot noise dominates.
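The conversion from SEM ratio to equivalent box count is just the inverse-square relation (variance of an ensemble mean scales as 1/N). Reading a few ratios off the printed table above:

```python
# Matching a SEM ratio s requires s**-2 times as many direct boxes,
# since SEM ~ sigma / sqrt(N). Ratios below are the printed auto-pair values.
ratios = {"M0_M0": 0.631, "M1_M1": 0.609, "M2_M2": 0.572}
gains = {pair: s ** -2 for pair, s in ratios.items()}
for pair, g in gains.items():
    print(f"{pair}: {g:.2f}x as many direct boxes needed")
```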
The figure below shows $r^2 \xi(r)$ with $\pm 1\sigma_{\rm SEM}$ errorbars for both routes (top), and the fractional residual (bottom).
n_pairs = mean_d.shape[0]
n_cols = min(n_pairs, 3)
n_rows = int(np.ceil(n_pairs / n_cols))
fig, axes = plt.subplots(2 * n_rows, n_cols,
figsize=(3.8 * n_cols, 2.4 * 2 * n_rows),
sharex=True)
axes = np.atleast_2d(axes).reshape(2 * n_rows, n_cols)
# Tiny horizontal offset so error caps don't overlap.
log_r = np.log10(r)
dx = 0.5 * np.median(np.diff(log_r))
r_d = 10.0 ** (log_r - 0.10 * dx)
r_c = 10.0 ** (log_r + 0.10 * dx)
with np.errstate(invalid="ignore"):
ratio = mean_c / mean_d - 1.0
sem_ratio = np.sqrt((sem_c / np.abs(mean_d)) ** 2
+ (sem_d * mean_c / mean_d ** 2) ** 2)
for p in range(n_pairs):
row, col = divmod(p, n_cols)
ax_top = axes[2 * row, col]
ax_bot = axes[2 * row + 1, col]
ax_top.errorbar(r_d, r ** 2 * mean_d[p], yerr=r ** 2 * sem_d[p],
fmt="o", color="C0", ms=3, lw=0.0, elinewidth=1.0,
capsize=2.0, label="direct")
ax_top.errorbar(r_c, r ** 2 * mean_c[p], yerr=r ** 2 * sem_c[p],
fmt="s", color="C3", ms=2.5, lw=0.0, elinewidth=1.0,
capsize=2.0, alpha=0.85, label="threshold → bin")
ax_top.set_xscale("log")
ax_top.grid(True, ls=":", alpha=0.5)
i, j = pair_indices_dir[p]
lo1, hi1 = direct_bins[int(i)]
lo2, hi2 = direct_bins[int(j)]
ax_top.set_title(f"[{lo1:.2f},{hi1:.2f}) × [{lo2:.2f},{hi2:.2f})",
fontsize=9)
if p == 0:
ax_top.set_ylabel(r"$r^2\,\xi(r)$ [(Mpc/h)$^2$]")
ax_top.legend(fontsize=8, frameon=False)
ax_bot.axhline(0.0, color="k", lw=0.7, alpha=0.6)
ax_bot.errorbar(r, ratio[p], yerr=sem_ratio[p],
fmt="s", color="C3", ms=2.5, lw=0.0, elinewidth=1.0,
capsize=2.0)
ax_bot.set_xscale("log")
ax_bot.set_ylim(-0.10, 0.10)
ax_bot.grid(True, ls=":", alpha=0.5)
if p == 0:
ax_bot.set_ylabel(r"$\xi_{\rm conv}/\xi_{\rm direct} - 1$")
for col in range(n_cols):
axes[-1, col].set_xlabel(r"$r$ [Mpc/h]")
fig.suptitle(f"100-box ensemble — direct vs threshold→bin "
f"(linear interp, dlogM={float(z['dlogM']):.2f})",
fontsize=11)
fig.tight_layout(rect=(0, 0, 1, 0.97))
plt.show()
5. Linear vs cubic interpolation¶
Off-grid bin edges introduce a small interpolation bias on $F$. Cubic
interpolation reduces this from $O(h^2)$ to $O(h^4)$ at no extra
measurement cost — the same threshold ξ_AB cube is reused. The S/N
gain (sem_c/sem_d) is essentially unchanged, because it reflects the
per-box variance ratio, not the interpolation residual.
Caveat. The cubic path uses scipy.interpolate.RegularGridInterpolator
with a tensor-product B-spline solve, which rejects NaN inputs. The
linear path tolerates NaN by simple propagation. The cached
xi_thr_per_box has NaN at a handful of small-$r$ bins where pair
counts are zero; we filter the input record's r-axis to the
all-finite range before running the cubic comparison.
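The $O(h^2)$ vs $O(h^4)$ behaviour is standard interpolation theory; here is a quick standalone illustration on a smooth stand-in function (nothing below touches halocat — the grid and function are made up, with the same $h = 0.2$ dex spacing as the thresholds):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(12.6, 13.2, 4)       # coarse grid, h = 0.2 dex
f = np.exp(-(x - 12.0))              # smooth, steeply decreasing stand-in
xq = 12.65                           # off-grid bin edge

lin = np.interp(xq, x, f)            # piecewise-linear, O(h^2) error
cub = float(CubicSpline(x, f)(xq))   # cubic spline, O(h^4) error
true = np.exp(-(xq - 12.0))
print(f"linear error {abs(lin - true):.2e}, cubic error {abs(cub - true):.2e}")
```

On the real data the improvement is smaller than this clean-function example suggests, because the measured $F$ carries per-box noise that neither scheme can remove.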
def _slice_record_r(rec, r_mask):
# Return a copy of rec restricted to r-bins where r_mask is True.
# Updates r_edges by carrying the union of edges of the surviving bins.
keep_idx = np.flatnonzero(r_mask)
new_r = rec.r[:, keep_idx]
new_xi = rec.xi[:, keep_idx]
# r_edges: keep edges that bound at least one surviving bin.
edge_keep = np.zeros(rec.r_edges.size, dtype=bool)
edge_keep[keep_idx] = True
edge_keep[keep_idx + 1] = True
new_r_edges = rec.r_edges[edge_keep]
return XiHHRecord(
r_edges=new_r_edges, mass_bins=rec.mass_bins, pairs=rec.pairs,
pair_indices=rec.pair_indices, r=new_r, xi=new_xi,
n1=rec.n1, n2=rec.n2,
log10M1=rec.log10M1, log10M2=rec.log10M2, attrs=dict(rec.attrs),
)
# r-bins where ξ_thr is finite for every (box, pair) in the cache.
finite_r = ~np.isnan(z["xi_thr_per_box"]).any(axis=(0, 1))
print(f"finite r-bins: {finite_r.sum()}/{finite_r.size} "
f"(r ∈ [{r[finite_r].min():.3f}, {r[finite_r].max():.3f}] Mpc/h)")
xi_bin_cubic = np.full_like(z["xi_dir_per_box"], np.nan)
xi_dir_finite = z["xi_dir_per_box"][:, :, finite_r]
for b in range(N):
rec = rebin_xi_hh_record(
_slice_record_r(threshold_record(b), finite_r),
logM_out=z["logM_out"], dlogM=float(z["dlogM"]),
interp_method="cubic",
)
keymap = {k: i for i, k in enumerate(rec.pairs)}
for p, (i, j) in enumerate(pair_indices_dir):
xi_bin_cubic[b, p, finite_r] = rec.xi[keymap[f"M{int(i)}_M{int(j)}"]]
with np.errstate(invalid="ignore"):
mean_c_cub = np.nanmean(xi_bin_cubic, axis=0)
sem_c_cub = np.nanstd(xi_bin_cubic, axis=0, ddof=1) / np.sqrt(N)
# Restrict bias / SEM-ratio reporting to r > 1 Mpc/h (clean two-halo regime).
print()
print("pair med |Δ| linear → cubic ⟨sem_c/sem_d⟩ linear → cubic")
print("-" * 70)
for p, (i, j) in enumerate(pair_indices_dir):
rel_lin = np.nanmedian(
np.abs(mean_c[p, mask] - mean_d[p, mask]) / np.abs(mean_d[p, mask])
)
rel_cub = np.nanmedian(
np.abs(mean_c_cub[p, mask] - mean_d[p, mask])
/ np.abs(mean_d[p, mask])
)
sr_lin = float(np.nanmean(sem_c[p, mask] / sem_d[p, mask]))
sr_cub = float(np.nanmean(sem_c_cub[p, mask] / sem_d[p, mask]))
print(f"M{int(i)}_M{int(j)} {rel_lin:.3e} → {rel_cub:.3e} "
f"{sr_lin:.3f} → {sr_cub:.3f}")
finite r-bins: 53/60 (r ∈ [0.106, 113.406] Mpc/h)
pair    med |Δ| linear → cubic    ⟨sem_c/sem_d⟩ linear → cubic
----------------------------------------------------------------------
M0_M0   3.202e-03 → 3.206e-03          0.631 → 0.602
M0_M1   1.948e-03 → 1.651e-03          0.651 → 0.669
M0_M2   2.389e-03 → 1.395e-03          0.625 → 0.618
M1_M1   4.211e-03 → 3.097e-03          0.609 → 0.662
M1_M2   3.995e-03 → 2.187e-03          0.624 → 0.662
M2_M2   4.955e-03 → 4.153e-03          0.572 → 0.606
6. Summary¶
- `rebin_xi_hh_record` is a one-call conversion: feed it a threshold-style `XiHHRecord` and it returns a new `XiHHRecord` with differential bins. The pair-sparse layout is preserved, so any downstream code that consumes `XiHHRecord` works unchanged.
- For `dlogM = 0.10` on the fiducial-LCDM 100-box grid, the threshold→bin route reaches direct-measurement precision with ~2.5–3× fewer boxes.
- Linear interpolation introduces an off-grid bias of ~0.2–0.5%, which `interp_method="cubic"` reduces further (by up to ~2× in these runs) at no extra measurement cost.
- The conversion does not interpolate in $r$ — the existing `r` axis of the input ξ_AB is carried through unchanged.
See also:

- `docs/guide/xi-hh-threshold-to-bin.md` — narrative and edge cases.
- `xi_hh_threshold_to_bin_results.md` — full validation runs.
- `halocat.cumulative_to_binned_xi_hh` — the underlying raw-array driver if you want to bypass the `XiHHRecord` wrapper.