L0 regularization paper (WIP)#685
Conversation
baogorek
left a comment
First round of feedback!
- I think Author ordering is perfect
- Paragraphs are one long line in a text editor. Hey, we've all got text wrap. But I would recommend having an AI split up the lines very loosely, to 100-120 character length. If you notice, all my comments show up at the beginning of the paragraph, because the paragraph is the line.
- Other methods you may want to consider for the background:
a) Covariate Balancing in Causal Inference
b) Small Area Estimation - We've got to figure out how to talk about states + DC and districts
- I'm less sure after reading that GREG and IPF are your "competitors." Let's chat.
| \begin{abstract} | ||
| Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets. | ||
|
|
||
| Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes. |
This is good. There is one thing that we do differently than Louizos et al. that I want to talk to you about in person.
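For readers skimming the thread, the gate mechanics in the quoted paragraph can be sketched in a few lines. This is an illustrative scalar version using only the standard library, not the actual `l0-python` implementation (which vectorizes everything in PyTorch); the function names are mine, and beta = 0.67 follows Louizos et al.'s setting mentioned later in this thread.

```python
import math
import random

GAMMA, ZETA = -0.1, 1.1  # stretch parameters from Louizos et al. (2018)

def hard_concrete_gate(log_alpha, beta=0.67, training=True, rng=random):
    """One gate value in [0, 1]; clipping the stretched sigmoid yields exact 0s and 1s."""
    if training:
        # Reparameterized sample: logistic noise pushed through a tempered sigmoid.
        u = min(max(rng.random(), 1e-6), 1 - 1e-6)
        s = 1 / (1 + math.exp(-((math.log(u) - math.log(1 - u) + log_alpha) / beta)))
    else:
        # Deterministic at inference time.
        s = 1 / (1 + math.exp(-log_alpha))
    return min(max(s * (ZETA - GAMMA) + GAMMA, 0.0), 1.0)

def gate_open_probability(log_alpha, beta=0.67):
    """P(gate > 0); summed over households this is the expected-L0 penalty."""
    return 1 / (1 + math.exp(-(log_alpha - beta * math.log(-GAMMA / ZETA))))
```

A strongly positive `log_alpha` gives a deterministic gate of exactly 1, a strongly negative one gives exactly 0, which is how the gate "collapses to a deterministic zero or one at inference time."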
|
|
||
| The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package. |
Paragraph feels less like an abstract and more like a body paragraph. I know it's still a WIP, but really the abstract needs to have cold, hard stats in it that show performance. (We'll get them.)
Uggh, one other thing we have to tackle: technically there are not 436 congressional districts. I was about to tell you to use 51 "state equivalents," but that's not exactly correct either. A TODO.
| \section{Introduction} | ||
| \label{sec:introduction} | ||
|
|
||
| Microsimulation models estimate the effects of tax and benefit policies on households by applying program rules to individual-level microdata. Most operational models---including those maintained by the Congressional Budget Office \citep{cbo2018}, the Joint Committee on Taxation \citep{jct2023}, and the Tax Policy Center \citep{tpc2024}---operate at the national level. They calibrate household survey weights to aggregate administrative totals such as total income tax revenue, program enrollment counts, and demographic benchmarks, then use the reweighted dataset to simulate policy reforms. |
"They calibrate" -> do we know that for sure?
|
|
||
| Subnational policy analysis introduces a fundamentally different calibration challenge. Rather than matching a single set of national aggregates, the microdata must simultaneously reproduce distributional statistics at multiple geographic levels: congressional districts, states, and the nation as a whole. A dataset calibrated for the state of California must match California-specific IRS income totals, SNAP participation counts, Medicaid enrollment, and age distributions, while remaining consistent with national budget projections from the CBO and tax expenditure estimates from the JCT. Across 436 congressional districts and 50 states, this produces approximately 37,800 simultaneous calibration targets. | ||
|
|
||
| Existing calibration methods scale poorly to this setting. Iterative proportional fitting \citep[IPF;][]{deming1940, ireland1968} adjusts weights along one dimension at a time, cycling through marginal constraints until convergence. IPF handles cross-classified tables but does not naturally accommodate hierarchical geographic constraints---district targets must sum to state targets, which must sum to national targets---without ad hoc post-processing. Generalized regression (GREG) estimators \citep{deville1992, sarndal2007} solve a constrained optimization problem that minimizes distance from initial weights subject to exact calibration constraints. GREG produces a closed-form solution for moderate numbers of constraints but becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands. |
IPF is also just for categorical variables (which you do indicate with "cross-classification tables," but it may make sense to call it out), and also in classical raking there is no "almost." You either get the calibration perfect or it failed to converge. You may want to comment that this is also called "raking." There is a YouTube video called something like "generalized raking" where the presenter talks about relaxing that.
My understanding is that GREG too seeks to exactly match the targets and it's a failure if it can't (not just positive loss), but check me on that. GREG does permit quantitative, in addition to qualitative variables. But it will happily use negative weights, which is annoying.
"becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands." A source would be helpful here. Is this because, as I suggested above, the procedure simply failing because it can't match the equations?
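To make the raking point above concrete: a minimal one-dimensional IPF/raking sketch over categorical margins, in no way the production implementation, just an illustration that classical raking either matches each margin exactly (up to tolerance) or fails to converge, e.g. when margin totals are incompatible.

```python
import numpy as np

def ipf(weights, margins, max_iter=100, tol=1e-8):
    """Classical raking: cycle over categorical margins, rescaling each
    category's weights to hit its target, until a full pass needs no change.
    margins: list of (category_labels, {category: target_total}) pairs."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iter):
        max_gap = 0.0
        for labels, targets in margins:
            labels = np.asarray(labels)
            for cat, target in targets.items():
                mask = labels == cat
                current = w[mask].sum()
                max_gap = max(max_gap, abs(current - target))
                if current > 0:
                    w[mask] *= target / current  # exact match for this margin
        if max_gap < tol:
            return w  # converged: all margins matched
    raise RuntimeError("IPF did not converge; margins may be incompatible")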
|
|
||
| Spatial microsimulation methods take a different approach, often distinguishing between reweighting methods and synthetic reconstruction methods for constructing small-area microdata \citep{tanton2014review}. Within this broader literature, researchers have used combinatorial optimization and simulated annealing \citep{williamson1998, huang2001, harland2012} as well as deterministic reweighting \citep{tanton2011, lovelace2016}. These methods typically operate at a single geographic level and require separate calibration runs for each area, making joint multi-level calibration difficult. |
So even though you have "different" in the first sentence, you've switched from classical statistics to Microsimulation literature, which could be jarring. I think flowing through the different conceptual frameworks is going to require some artistic thinking here.
| @@ -0,0 +1,158 @@ | |||
| \section{Methodology} | |||
| \label{sec:methodology} | |||
|
|
|||
Pure reader note: By the time we've gotten to methodology, there's been a lot of methodology!
|
|
||
| \subsubsection{Block sampling} | ||
|
|
||
| Census blocks are the finest geographic unit in the decennial census. Each block maps deterministically to a congressional district, county, tract, and state. The sampling distribution $P_{\text{pop}}(\text{block})$ is proportional to the block's share of the national population. Drawing blocks rather than congressional districts ensures fine-grained geographic variation within districts and enables derivation of county-level variables (Section~\ref{sec:stage4}). |
And of course, now we've got adjusted gross income in the formula
Ohhh, I see that's mentioned below. Hmm, I wonder if it's best to just show the formula all at once.
P_{AGI}(b) reads a bit strange since that would imply that AGI is the random variable. I think we want a multinomial distribution over all census blocks, and I do think it would be best to define that multinomial distribution directly.
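Defining the multinomial directly, as suggested, could look like this. A hedged sketch with illustrative names, assuming the plain population-share version of the distribution (the AGI adjustment discussed above would change the cell probabilities, not the sampling mechanism):

```python
import numpy as np

def sample_blocks(block_ids, block_populations, n_clones, rng=None):
    """Draw census-block assignments for household clones from a multinomial
    distribution whose cell probabilities are each block's population share:
    P(block b) = pop_b / sum_k pop_k."""
    rng = rng or np.random.default_rng(0)
    pops = np.asarray(block_populations, dtype=float)
    probs = pops / pops.sum()
    return rng.choice(np.asarray(block_ids), size=n_clones, p=probs)
```

Because every block maps deterministically to a district, county, tract, and state, one multinomial draw fixes all higher geographic levels at once.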
|
|
||
| \subsubsection{Per-state parallel simulation} | ||
|
|
||
| The matrix is populated by running each household through \policyengine{}'s tax-benefit microsimulation engine. Because many target variables depend on state-specific tax and benefit rules, a separate simulation is required for each state. A parallel dispatcher sends one job per unique state FIPS code to a pool of worker processes. Each worker creates a fresh \texttt{Microsimulation} instance, overwrites every household's \texttt{state\_fips} with the target state, invalidates cached downstream variables, and calculates all target variables at the household and person levels, accounting for differences in state legislation. |
ACA PTC has county specific rules, and eventually other policies are coming that are going to be at lower levels. Though admittedly, in the run I just ran, I'm only using the state level. A plug here for #598 which has been getting buried. We've got to figure out how to set the geographic level. Oh, that's right, Max had the idea for that and it's documented in that issue.
|
|
||
| \subsubsection{Hyperparameters} | ||
|
|
||
| Table~\ref{tab:hyperparameters} lists the optimization hyperparameters with their values and roles. The stretch parameters $\gamma = -0.1$ and $\zeta = 1.1$ follow the original Hard Concrete paper, placing approximately 9\% of the sigmoid's mass below 0 and above 1, which is what allows clipping to produce exact zeros and ones. |
Note that we've kept most of Louizos's parameters. I have been setting beta back to .67, which was what they used, but honestly I don't have a good reason, and .35 works as well. The initial probability is .999, and that's just from trial and error. I'd set it to 1 if I could (it blows up). It's intended to be the proportion of gates that are open at the start. I always saw worse performance when I started with some gates closed. If they start open, they will close (eventually) though it will take more epochs. I know this is a "trust me bro" style of argument.
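The stretch-and-clip mechanics behind the "approximately 9%" claim can be checked directly. A small sketch (illustrative names; the exact mass clipped depends on the distribution of the pre-stretch sigmoid value, so the thresholds below are the clean part of the claim):

```python
def stretch_clip(s, gamma=-0.1, zeta=1.1):
    """Map s in (0, 1) onto the stretched interval (gamma, zeta), then clip
    to [0, 1]. Values landing below 0 or above 1 become exact 0s and 1s."""
    return min(max(s * (zeta - gamma) + gamma, 0.0), 1.0)

# Thresholds at which the clipped gate becomes exactly 0 or exactly 1:
ZERO_BELOW = 0.1 / 1.2  # s <= -gamma / (zeta - gamma) ~ 0.083 -> gate == 0
ONE_ABOVE = 1.1 / 1.2   # s >= (1 - gamma) / (zeta - gamma) ~ 0.917 -> gate == 1
```

So with gamma = -0.1 and zeta = 1.1, the bottom and top ~8.3% of the sigmoid's range each map to an exact endpoint, which is what allows hard zeros and ones.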
|
|
||
| \subsubsection{Unachievable targets} | ||
|
|
||
| Of the approximately 37,800 targets, \tbc[count] are marked unachievable (row sum zero in the calibration matrix). These correspond to congressional districts where no clones carry nonzero values for the target variable. Increasing the clone count beyond 430 reduces the number of unachievable targets, at the cost of a larger calibration matrix. |
I know 37,800 is just a placeholder. This is inflated by the fact that multiple years of the same target can be in the database, etc.
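The row-sum-zero check in the quoted paragraph is a one-liner; a hedged sketch with illustrative names, where `M` is the (targets x clones) calibration matrix:

```python
import numpy as np

def unachievable_targets(M, target_ids):
    """Flag targets whose calibration-matrix row sums to zero: no clone
    carries a nonzero value for that variable in that geography, so no
    choice of nonnegative weights can reach a nonzero target."""
    row_sums = np.asarray(M, dtype=float).sum(axis=1)
    return [tid for tid, s in zip(target_ids, row_sums) if s == 0]
```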
Paper: "L0 regularization for subnational microsimulation calibration"
targeting the International Journal of Microsimulation.
Current state of the manuscript:
- Full paper structure: abstract, introduction, background, data,
methodology, results, discussion, conclusion, appendix
- Formal survey calibration problem definition with GREG and IPF
explained in depth, including benefits, drawbacks, and current
practice in operational models (CBO, JCT, TPC, EUROMOD, TAXSIM)
- Four-stage pipeline methodology (clone, matrix, L0 optimize,
assemble) documented against the pipeline source code
- Detailed appendix target tables populated from policy_data.db
(37,758 targets: 33,572 district, 4,080 state, 106 national)
- All writing in US English, citations linked via plainnat/natbib
Still TODO:
- [ ] Implement IPF and GREG baselines on the same calibration matrix
to populate the comparison table (tables/comparison.tex)
- [ ] Run calibration experiments and fill in all [TBC] placeholders
in the results section (accuracy, sparsity, convergence, ESS)
- [ ] Generate convergence curve figure from calibration_log.csv
- [ ] Select and run a subnational policy application example
(Section 5.5 — candidate: EITC expansion across CDs)
- [ ] Review pipeline methodology section against latest code for
accuracy (clone-and-assign, matrix builder, assembly steps)
- [ ] Review and deepen background section: verify claims about
GREG/IPF limitations, add any missing related work
- [ ] Resolve pre-existing overfull hbox warnings (long URLs in
conclusion, hyperparameters table width)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a paper benchmarking scaffold that compares L0 and GREG on the same exported calibration matrix while routing IPF through a separate automatic preprocessing step that reconstructs IPF-ready unit and target inputs from the saved package metadata. The scaffold includes two R runners, manifest-driven bundle export, common scoring against the shared matrix, environment setup helpers, and end-to-end tests for the runner schemas.
Replace the old order-dependent sequential IPF flow with a validated one-call path. The converter now keeps only closed categorical systems, derives complements only from authored parent totals, reports explicit drop reasons for non-runnable target families, and surfaces incompatible-total failures through structured diagnostics instead of silently chaining margin blocks. Update the benchmark harness to export retained-authored IPF scoring artifacts, validate the external-IPF input contract, and let L0/GREG opt into scoring on the same retained-authored subset for apples-to-apples comparisons. Refresh the walkthrough notebook, README, example manifest, and add regression coverage for closure logic, runner behavior, export validation, and the new scoring contract.
Methodological
--------------
- Add `--train-on {shared_requested,ipf_retained_authored}` to
benchmark_cli so L0 and GREG can be fit on the same target subset IPF
was given. Matched runs write `{method}_matched_summary.json` so they
don't overwrite the full-info `{method}_summary.json`. Both runs record
`training_target_set` and `scoring_target_set` in the summary.
- Add `seed` parameter to `fit_l0_weights` (seeds torch, CUDA when
present, numpy). Default None preserves prior non-deterministic
behavior. The benchmark CLI plumbs `seed` from `method_options.l0` so
paper benchmark runs are reproducible.
- Align IPF `bound` default in benchmark_cli to 10.0 to match the example
manifest; the prior 4.0 was too tight for real district-level fits.
Correctness
-----------
- Fix `is_authored` dedupe key in `_try_close_binary_subset` — collapsing
the dict on `(geo, cell)` is correct; the prior key allowed an authored
+ derived duplicate to slip through. Add an assertion for the genuine
bug case where the same cell would arrive with conflicting authoring.
- Fail loudly with `IPFConversionError` (and a diagnostics file) when the
package's `targets_df` lacks `target_id`. The previous silent skip left
downstream `--score-on ipf_retained_authored` to surface a misleading
FileNotFoundError.
- Wrap converter `ValueError` / `FileNotFoundError` raise sites in
`IPFConversionError` so `ipf_conversion_diagnostics.json` is always
written before export aborts.
Tests added
-----------
- ipf_conversion: `negative_derived_complement` guardrail,
`incompatible_totals` consistency check, `unsupported_partial_margin`
3+ category subset case.
- benchmarking_runners: seed wire-up (CLI → fit_l0_weights), `--train-on`
retained-subset routing + fail-when-subset-missing, one-call IPF on a
coherent person + household problem, derived-complement cell fitted
through surveysd::ipf.
- benchmark_export: missing-`target_id` fail-loud path with diagnostics.
Documentation
-------------
- README: matched two-fit recipe, L0 seed setting, GREG-linear's
negative-weight regime (with `negative_weight_share` as the surfaced
metric), strictness contract, open-question 1 resolution.
- 02_ipf_walkthrough.ipynb: new section 10g demonstrating binary-subset
closure via an authored parent total (the safe counterpart to the
dropped open-subset case); one-call statement tightened to "one
exported IPF run equals one coherent closed categorical system."
Still deferred
--------------
- Entity-aware extension for `tax_unit_count` / `spm_unit_count`.
- `survey:::grake` private-namespace migration / R package pinning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the deferred-future framing ("`tax_unit_count` and `spm_unit_count`
remain outside the core household/person IPF path *in this pass*") with the
methodological reason: `surveysd::ipf` supports exactly two entity levels
natively (`conP` for row-level / person-style constraints, `conH` for
constraints aggregated by `hid`), and there is no generalised mechanism
for additional counted entities such as `tax_unit` or `spm_unit`.
Producing a single weight vector that simultaneously satisfies targets at
three or more entity levels is not possible with classical IPF. Running
IPF separately per scope and aggregating would give it more degrees of
freedom than L0 / GREG (each pass solves a smaller subproblem with its
own freedom) and would not be a like-for-like comparison. The benchmark
therefore restricts IPF to `person_count` and `household_count` —
together or alone — and drops other count families at the count check.
Those targets remain in the shared sparse system that L0 and GREG fit,
so the cross-method comparison on the IPF-feasible subset stays
apples-to-apples via `--score-on ipf_retained_authored`.
Updated the README's methodology section and the IPF inputs subsection
to articulate the constraint, and tightened the `_target_scope` error
message in `ipf_conversion.py` to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor, fixes
This commit lands the production scaffolding called for by paper-l0/BENCHMARK_PLAN.md
so the L0 / GREG / IPF benchmark can be launched end-to-end against the saved
calibration package without further code changes.
Tier manifests (paper-l0/benchmarking/manifests/tier*.json)
All ten paper-reported manifests, each reading from one shared
calibration_package.pkl and differing only in target_filters:
tier1_mixed.json L0 + GREG on the full filtered slice (count +
dollar combined) over a five-state, ten-district
geography subset plus national targets.
tier1_ipf.json L0 + IPF on the same slice; IPF retains the
authored closed subset and the matched runs use
--train-on ipf_retained_authored.
tier2_scaling_250 ...
tier2_scaling_largest.json Scaling ladder (250 / 500 / 1k / 2.5k / 5k / 10k
/ largest coherent pre-production subset) over
the same package, growing geography coverage to
grow the target set instead of the unit universe.
tier3_production.json Least-filtered view; failures are reportable
results (status=failed with notes) rather than
missing rows.
All tier manifests opt into method_options.ipf.return_na = true so non-
convergence surfaces NaN and the runner converts that into a visible runtime
error instead of writing NaN-laden weights to disk.
Geographic ID format is the convention the saved package actually uses: state
FIPS without leading zeros for single-digit states (e.g. CA = "6", CA-01 =
"601"). The two pre-existing example manifests are updated to match.
Orchestrator (paper-l0/benchmarking/run_benchmark_suite.py)
Exports each manifest, runs every method declared in it, and (when IPF is in
the manifest) schedules matched-input rows for the other methods via
--train-on ipf_retained_authored --score-on ipf_retained_authored. Aggregates
per-manifest summaries into one tier_<n>_summary.csv per tier plus a unified
suite_summary.csv. Failures (export-time IPFConversionError, runner non-zero
exits, missing summary files) are recorded as status=failed rows with the
captured reason in notes; the orchestrator never aborts the suite, which is
load-bearing for Tier 3's "completed-or-failed" reporting story.
Tier 2 rungs are discovered and ordered by target_filters.max_targets
(smallest first; uncapped rungs sort last) so the summary reads top-to-bottom
in increasing target count regardless of filename sort order.
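That ordering rule is easy to get subtly wrong (uncapped rungs must sort last, not first); a minimal sketch of the sort key, with illustrative manifest dicts standing in for the loaded JSON:

```python
def order_tier2_rungs(manifests):
    """Sort scaling-ladder manifests by target_filters.max_targets,
    smallest first; rungs with no cap (missing/None) sort last."""
    def key(m):
        cap = m.get("target_filters", {}).get("max_targets")
        # Tuple key: capped rungs (False) before uncapped (True),
        # then ascending by the cap itself.
        return (cap is None, cap if cap is not None else 0)
    return sorted(manifests, key=key)
```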
Both run_benchmark_suite.py and benchmark_cli.py prepend the in-tree repo
root to sys.path before importing policyengine_us_data. This avoids being
shadowed by an editable install of the same package name pointing at a
sibling repo, which previously caused fit_l0_weights() to lose the seed
parameter at script-invocation time only (not when imported from another
module).
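The shadowing fix described above amounts to a path prepend before the first import. A hedged sketch (the helper name and the assumed script-one-level-below-root layout are illustrative, not the repo's actual code):

```python
import os
import sys

def prefer_in_tree_package(script_path):
    """Prepend the in-tree repo root to sys.path so that
    `import policyengine_us_data` resolves to this checkout rather than an
    editable install of the same package name pointing at a sibling repo.
    Assumes the calling script lives one directory below the repo root."""
    repo_root = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root
```

The key point from the commit message is that the shadowing only bites at script-invocation time (when `__main__`'s import resolution wins), not when the module is imported from elsewhere, which is why the fix lives in the two entry-point scripts.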
IPF runner: returnNA=FALSE design task closed
paper-l0/benchmarking/runners/ipf_runner.R now accepts an optional 11th
positional argument return_na (default TRUE for backwards compatibility).
When return_na is TRUE and surveysd::ipf returns any NaN weights -- the
silent-non-convergence path called out in the BENCHMARK_PLAN's "Immediate
next design tasks" -- the runner exits with a clear error rather than
writing NaN weights. The Python CLI plumbs method_options.ipf.return_na
through to the runner so manifests control the behavior declaratively.
Modal-artifact ingest (paper-l0/benchmarking/patch_package_paths.py)
The Modal-built calibration_package.pkl records absolute paths from the
build container (/pipeline/artifacts/<run_id>/...) for metadata.dataset_path
and metadata.db_path. The new helper rewrites those paths in-place to point
at local copies of the dataset H5 and policy_data.db so the rest of the
benchmark scaffold (and especially benchmark_export.build_ipf_inputs, which
has hard existence checks on both paths) runs unchanged on a local checkout.
Documentation (paper-l0/benchmarking/README.md)
New "Paper-reported tiers" section: which manifests belong to which tier,
what each method does at each rung, the run_benchmark_suite.py entry point
with example invocations, and the per-tier summary CSV layout.
Verification done before committing
Three end-to-end smoke runs were exercised against the production
calibration package (17,736 targets x 5,159,570 cloned units, n_clones=430)
to verify the full pipeline:
1. count-only smoke (29 targets, 100 epochs L0): all five rows green --
L0 + GREG full-info, IPF on retained-authored subset, plus matched L0
and GREG runs trained and scored on the IPF subset.
2. mixed-target smoke (count + dollar combined): L0 and GREG both
completed end-to-end; this exercises the Tier 1 / Tier 2 mixed-target
path that count_like_only=true cannot reach.
3. forced non-convergence (max_iter=5, epsP=1e-20): surveysd::ipf returned
NaN weights, the runner stopped with the new "did not converge"
message, and the orchestrator recorded status=failed with the full R
traceback in notes.
The smoke manifests, scratch run dirs, and download scratch were removed
from the repo before committing. The plans (paper-l0/BENCHMARK_PLAN.md and
outline.md) are intentionally not included in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>