feat(1491): free-text → release resolution cron + flowsheet_freetext_resolution table#1496

Merged
jakebromberg merged 4 commits into main from feature/1491-freetext-resolve
Jun 25, 2026

Conversation

@jakebromberg

What

Adds a recurring cron that resolves the ~43% of music plays with flowsheet.album_id IS NULL from their free-text (artist, album) to a Discogs release, caching verdicts in a new flowsheet_freetext_resolution table. This is Phase-2 Track 1 — the release leg of attribution-corrected catalog popularity.

  • Migration 0106: flowsheet_freetext_resolution, composite PK (norm_artist, norm_album), with discogs_release_id / discogs_master_id / match_confidence / match_source and attempt_at + resolved_at markers.
  • Normalizer normalize-album-title.ts: parallels normalize-artist-name.ts but additionally strips edition/format suffixes ((Remastered), - 2011 Remaster), featuring clauses, and collapses the self-titled family — so pressings of one logical album share a key.
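  • Job jobs/catalog-popularity-freetext-resolve/: enumerates DISTINCT normalized pairs, resolves via LML bulkLookupMetadata, UPSERTs. Cooperative pause via awaitQuietWindow; cron 45 4 * * * UTC (originally 30 4 * * *, moved in a follow-up commit on this PR because 04:30 collided with the rotation-artist-backfill cron).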

Attempt-at semantics

attempt_at is stamped on any responded outcome (match OR explicit no-match) so the TTL retry window arms; left NULL only on transient LML failure, keeping the row immediately retryable. The release_id == 0 streaming-only sentinel (BS#1185) collapses to a null release, distinct from a real release_id > 0.
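
A minimal sketch of the stamping rule described above, to make the three outcomes concrete. The `LookupOutcome` and `VerdictFields` shapes, the function names, and the `noMatchTtlMs` parameter are hypothetical; the real LML response type and job code are not shown in this PR. It also assumes a row with a resolved release is never retried.

```typescript
// Hypothetical outcome shape; the real LML response type is not shown here.
type LookupOutcome =
  | { kind: "match"; releaseId: number; confidence: number; source: string }
  | { kind: "no_match" }
  | { kind: "transient_failure" };

interface VerdictFields {
  discogsReleaseId: number | null;
  matchConfidence: number | null;
  matchSource: string | null;
  attemptAt: Date | null;
}

function toVerdictFields(outcome: LookupOutcome, now: Date): VerdictFields {
  switch (outcome.kind) {
    case "transient_failure":
      // LML never answered: leave attempt_at NULL so the row stays retryable.
      return { discogsReleaseId: null, matchConfidence: null, matchSource: null, attemptAt: null };
    case "no_match":
      // Explicit no-match is a responded outcome: stamp attempt_at to arm the TTL.
      return { discogsReleaseId: null, matchConfidence: null, matchSource: null, attemptAt: now };
    case "match":
      return {
        // release_id == 0 is the streaming-only sentinel; collapse it to null.
        discogsReleaseId: outcome.releaseId > 0 ? outcome.releaseId : null,
        matchConfidence: outcome.confidence,
        matchSource: outcome.source,
        attemptAt: now,
      };
  }
}

// Assumed retry rule: never-attempted rows are always eligible, resolved rows
// are never retried, and unresolved rows wait out the no-match TTL.
function isRetryEligible(
  row: Pick<VerdictFields, "attemptAt" | "discogsReleaseId">,
  now: Date,
  noMatchTtlMs: number,
): boolean {
  if (row.attemptAt === null) return true;
  if (row.discogsReleaseId !== null) return false;
  return now.getTime() - row.attemptAt.getTime() >= noMatchTtlMs;
}
```

Under this sketch a streaming-only verdict is stored with a null release but a stamped attempt_at, so it waits out the TTL like any other unresolved row; whether the real job treats it that way is not stated here.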

Track 0 coupling

discogs_master_id stays NULL for now and is deliberately omitted from the UPSERT's INSERT and set clauses, so a later Track-0-aware writer (once WXYC/library-metadata-lookup#688 surfaces master_id) is preserved, not clobbered. Track 1's release leg is independent and ships now.
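
To illustrate why omitting the column protects a later writer, here is a sketch of what such an UPSERT could look like. The statement text, the `UPSERT_VERDICT_SQL` name, and the parameter order are assumptions, not the job's actual query; how resolved_at is written is not described in this PR, so it is left out.

```typescript
// Hypothetical statement text. Column names come from the PR description;
// the parameter order ($1..$6) is an assumption.
const UPSERT_VERDICT_SQL = `
INSERT INTO flowsheet_freetext_resolution
  (norm_artist, norm_album, discogs_release_id, match_confidence, match_source, attempt_at)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (norm_artist, norm_album) DO UPDATE SET
  discogs_release_id = EXCLUDED.discogs_release_id,
  match_confidence   = EXCLUDED.match_confidence,
  match_source       = EXCLUDED.match_source,
  attempt_at         = EXCLUDED.attempt_at
`;

// The master-id column appears in neither the INSERT column list nor the SET
// clause, so this statement can never overwrite a value another writer stored.
const touchesMasterId: boolean = UPSERT_VERDICT_SQL.includes("discogs_master_id");
```

Because the column is absent from the SET clause, an ON CONFLICT update leaves whatever value is already in the row untouched; on a fresh insert it simply takes the column default (NULL).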

Tests

Unit: normalizer cases + job batching/retry (75 targeted). Integration (pg): LML round-trip + UPSERT idempotency, run green against real Postgres (migration 0106 applies cleanly). Full suite: 3489 unit. No parity guard — the table is Backend-internal, not in the api.yaml SSOT.

Closes #1491

…resolution table

Phase-2 Track 1 of the catalog-popularity work (BS#1486). Resolves the ~43%
of music plays that are free-text/unlinked (flowsheet.album_id IS NULL) to
Discogs releases so the Phase-2 popularity collapse (Track 2) can attribute
the top-of-distribution records the linked-only album_plays signal can't see.

- Migration 0106 creates flowsheet_freetext_resolution: composite PK
  (norm_artist, norm_album), nullable discogs_release_id / discogs_master_id /
  match_confidence / match_source, and the attempt_at responded-outcome marker
  (stamped on match OR no-match, left NULL on transient LML failure so the row
  stays retryable). discogs_master_id stays NULL until LML Track 0 ships; the
  release leg is independent and runs in parallel.

- normalizeAlbumTitle (shared/database): album-leg twin of normalizeArtistName,
  but strips edition/remaster/format suffixes, collapses the self-titled family,
  canonicalizes & → and, and expands vol./pt. abbreviations — so pressing
  variants of one logical album collapse to a single dedup key. Unit-tested over
  s/t, &/and, featuring, edition-parenthetical, and abbreviation cases.

- jobs/catalog-popularity-freetext-resolve: recurring daily cron (deploy-base
  registration via package.json cron-schedule). Enumerates DISTINCT free-text
  pairs, folds them into normalized keys in JS, resolves each batch via LML
  bulkLookupMetadata under the Semaphore(5)+TokenBucket(50/min) ceiling, and
  UPSERTs verdicts race-guarded on the composite PK. Retry policy is the
  attempt_at marker + a no-match TTL; cooperative pause yields to live DJ
  activity. Modeled on jobs/album-level-backfill.

- Tests: unit (normalizer cases + job batching/retry/UPSERT with mocked LML),
  integration (pg marker — LML-shaped UPSERT contract + idempotency + master-id
  preservation + retry-eligibility against a real Postgres). No api.yaml parity
  guard — the table is Backend-internal.
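
The "fold into normalized keys in JS" step from the job bullet above can be sketched roughly as follows. Everything here is illustrative: `RawPair`, `KeyedPair`, `foldPairs`, and the stand-in normalizers are hypothetical names, and lowercasing is an assumption about what the real normalizeArtistName / normalizeAlbumTitle do.

```typescript
interface RawPair { artist: string; album: string; plays: number }
interface KeyedPair { normArtist: string; normAlbum: string; plays: number }

// Collapse internal whitespace runs and trim both ends.
const collapseWhitespace = (s: string): string => s.replace(/\s+/g, " ").trim();

// Stand-ins for the real normalizers, which do considerably more than this.
const normArtistStandIn = (s: string): string => collapseWhitespace(s).toLowerCase();
const normAlbumStandIn = (s: string): string => collapseWhitespace(s).toLowerCase();

// Fold raw free-text pairs into one entry per normalized (artist, album) key,
// summing play counts so variants of one pair do not split the total.
function foldPairs(rows: RawPair[]): KeyedPair[] {
  const byKey = new Map<string, KeyedPair>();
  for (const row of rows) {
    const normArtist = normArtistStandIn(row.artist);
    const normAlbum = normAlbumStandIn(row.album);
    // JSON-encode the tuple so no separator character can cause key collisions.
    const key = JSON.stringify([normArtist, normAlbum]);
    const existing = byKey.get(key);
    if (existing) {
      existing.plays += row.plays;
    } else {
      byKey.set(key, { normArtist, normAlbum, plays: row.plays });
    }
  }
  return Array.from(byKey.values());
}
```

Folding before the lookup means each logical pair costs one LML request and one row, regardless of how many spelling variants appear in the flowsheet.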

CI change-detection already covers the new paths via the jobs/** and db-init
(migrations/** + schema.ts) filters.
@github-actions

Schema constraint shape report

data-shape report errored (exit 0): node:internal/modules/run_main:107 triggerUncaughtException( ^ Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'postgres' imported from /home/runner/work/Backend-Service/Backend-Service/scripts/schema-shape-report.mjs Did you mean to import "postgres/cjs/src/index.js"? at Object.getPackageJ; manual check required

Three correctness fixes in the free-text (artist, album) -> Discogs dedup
path, all surfaced by a max-effort review of the cron:

- normalize-album-title: anchor EDITION_CLAUSE with \b so edition keywords
  match as whole words. Without it 'ost'/'mono'/'stereo' matched inside
  lost/ghost/monologue/stereotype, wrongly stripping non-edition clauses
  ('(Lost Tapes)', '- Lost Sessions') and collapsing distinct albums into
  one dedup key.
- normalize-album-title: require a real featuring marker (full word
  "featuring" or the abbreviations with a period) plus a following guest, so
  a bare content word 'Ft'/'Feat' ('10 Ft Ganja Plant', 'A Great Feat') is
  no longer treated as a marker and the title truncated.
- job normalizePairs: apply the same whitespace collapse+trim to the artist
  leg that the album leg already gets, so 'J Dilla ' / 'J  Dilla' / 'J Dilla'
  share one whitespace-canonical dedup key instead of splitting into separate
  rows + duplicate LML lookups + a split play count.
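
A small reproduction of the word-boundary bug from the first bullet, as a sketch. The keyword list and both patterns are illustrative only (just 'ost', 'mono', and 'stereo' are named in this commit); the real EDITION_CLAUSE is not reproduced here, and only the parenthesized form is covered.

```typescript
// Illustrative keyword list; the real EDITION_CLAUSE keywords are an assumption
// beyond 'ost', 'mono', and 'stereo'.
const KEYWORDS = "ost|mono|stereo|remaster(?:ed)?|deluxe";

// Unanchored: a keyword may match anywhere inside a longer word.
const unanchored = new RegExp(`\\s*\\([^)]*?(?:${KEYWORDS})[^)]*\\)`, "gi");

// Anchored: \b on both sides forces the keyword to be a whole word.
const anchored = new RegExp(`\\s*\\([^)]*\\b(?:${KEYWORDS})\\b[^)]*\\)`, "gi");

const stripEdition = (title: string, pattern: RegExp): string =>
  title.replace(pattern, "").trim();

// 'ost' inside 'Lost' triggers the unanchored pattern and the clause is lost.
const buggy = stripEdition("Illmatic (Lost Tapes)", unanchored);
// With \b the clause survives, so the two albums keep distinct dedup keys.
const fixed = stripEdition("Illmatic (Lost Tapes)", anchored);
// Genuine edition clauses are still stripped.
const mono = stripEdition("Kind of Blue (Mono)", anchored);
const remaster = stripEdition("Loveless (2011 Remaster)", anchored);
```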

Also swaps a gratuitous mainstream featuring-guest placeholder for a
WXYC-representative name in the normalizer tests.
…tion-artist-backfill

Follow-up review pass:

- normalize-album-title: the tightened featuring regex required a space after
  the abbreviation period, regressing the no-space form 'feat.X'/'ft.X' that
  the original stripped (and which the parenthesized branch still strips).
  Allow an optional space after the period so 'Donuts feat.J Dilla' collapses
  to the same key as 'Donuts feat. J Dilla' while still leaving a bare content
  word like 'Ft' in '10 Ft Ganja Plant' intact.
- Move the cron from 04:30 to 04:45 UTC: 04:30 collided exactly with the
  LML-bounded rotation-artist-backfill cron, contradicting the README's
  de-confliction rationale. 04:45 is a free minute; README + description
  corrected to name the real overnight LML-bounded crons.
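
The featuring rule from the first bullet can be sketched as a single pattern. `FEATURING_CLAUSE` and `stripFeaturing` are hypothetical names and this is not the normalizer's actual regex; it covers only the unparenthesized branch.

```typescript
// A marker is either the full word "featuring" followed by whitespace, or an
// abbreviation WITH its period ("feat." / "ft.") followed by optional
// whitespace. Either way a guest (at least one non-space character) must follow.
const FEATURING_CLAUSE = /\s+(?:featuring\s+|(?:feat|ft)\.\s*)\S.*$/i;

const stripFeaturing = (title: string): string =>
  title.replace(FEATURING_CLAUSE, "").trim();

// All three marker forms collapse to the same key.
const spaced = stripFeaturing("Donuts feat. J Dilla");
const noSpace = stripFeaturing("Donuts feat.J Dilla");
const fullWord = stripFeaturing("Donuts featuring J Dilla");

// Bare content words without a period are not markers and stay intact.
const ganja = stripFeaturing("10 Ft Ganja Plant");
const feat = stripFeaturing("A Great Feat");
```

Requiring the period on the abbreviated forms is what separates a marker from a content word; making the following whitespace optional (`\s*`) restores the no-space form without loosening that requirement.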
…DE.md

CI `npm ci` failed at install because the @wxyc/catalog-popularity-freetext-resolve workspace was missing from package-lock.json (local node_modules had it symlinked, so local checks passed). Sync the lock, and add the job to the monorepo workspace table for parity with every other job.
jakebromberg merged commit a3da652 into main on Jun 25, 2026
6 checks passed
jakebromberg deleted the feature/1491-freetext-resolve branch on June 25, 2026 at 19:05
Development

Successfully merging this pull request may close these issues.

Free-text → release+master resolution cron + flowsheet_freetext_resolution table (Phase-2 Track 1)
