feat(1491): free-text → release resolution cron + flowsheet_freetext_resolution table#1496
Merged
…resolution table

Phase-2 Track 1 of the catalog-popularity work (BS#1486). Resolves the ~43% of music plays that are free-text/unlinked (flowsheet.album_id IS NULL) to Discogs releases so the Phase-2 popularity collapse (Track 2) can attribute the top-of-distribution records the linked-only album_plays signal can't see.
- Migration 0106 creates flowsheet_freetext_resolution: composite PK (norm_artist, norm_album), nullable discogs_release_id / discogs_master_id / match_confidence / match_source, and the attempt_at responded-outcome marker (stamped on match OR no-match, left NULL on transient LML failure so the row stays retryable). discogs_master_id stays NULL until LML Track 0 ships; the release leg is independent and runs in parallel.
- normalizeAlbumTitle (shared/database): album-leg twin of normalizeArtistName, but strips edition/remaster/format suffixes, collapses the self-titled family, canonicalizes & → and, and expands vol./pt. abbreviations, so pressing variants of one logical album collapse to a single dedup key. Unit-tested over s/t, &/and, featuring, edition-parenthetical, and abbreviation cases.
- jobs/catalog-popularity-freetext-resolve: recurring daily cron (deploy-base registration via package.json cron-schedule). Enumerates DISTINCT free-text pairs, folds them into normalized keys in JS, resolves each batch via LML bulkLookupMetadata under the Semaphore(5)+TokenBucket(50/min) ceiling, and UPSERTs verdicts race-guarded on the composite PK. Retry policy is the attempt_at marker + a no-match TTL; cooperative pause yields to live DJ activity. Modeled on jobs/album-level-backfill.
- Tests: unit (normalizer cases + job batching/retry/UPSERT with mocked LML), integration (pg marker: LML-shaped UPSERT contract + idempotency + master-id preservation + retry-eligibility against a real Postgres). No api.yaml parity guard: the table is Backend-internal.
CI change-detection already covers the new paths via the jobs/** and db-init (migrations/** + schema.ts) filters.
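
The retry policy described above (attempt_at marker + no-match TTL) could be sketched roughly as follows. This is an illustration only, not the job's actual code: the row shape, the function name isRetryEligible, and the 30-day TTL are assumptions, and it further assumes that a matched row is never retried.

```typescript
// Hypothetical row shape: only the column names come from the commit message.
interface ResolutionRow {
  discogs_release_id: number | null;
  attempt_at: Date | null; // NULL means LML never gave a responded outcome
}

// Assumed TTL; the real value is not stated in the source.
const NO_MATCH_TTL_MS = 30 * 24 * 60 * 60 * 1000;

function isRetryEligible(
  row: ResolutionRow | undefined,
  now: Date,
  ttlMs: number = NO_MATCH_TTL_MS,
): boolean {
  // Pair never seen before: resolve it.
  if (row === undefined) return true;
  // Transient LML failure left attempt_at NULL: retry immediately.
  if (row.attempt_at === null) return true;
  // Assumption: a responded match is final and is not re-resolved.
  if (row.discogs_release_id !== null) return false;
  // Responded no-match: retry only once the TTL window has elapsed.
  return now.getTime() - row.attempt_at.getTime() >= ttlMs;
}
```

Keeping attempt_at NULL on transient failures means an outage does not arm the TTL, so the affected rows are picked up again on the next run.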
Schema constraint shape report
data-shape report errored (exit 0):
node:internal/modules/run_main:107
    triggerUncaughtException(
    ^
Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'postgres' imported from /home/runner/work/Backend-Service/Backend-Service/scripts/schema-shape-report.mjs
Did you mean to import "postgres/cjs/src/index.js"?
    at Object.getPackageJ; manual check required
Three correctness fixes in the free-text (artist, album) -> Discogs dedup
path, all surfaced by a max-effort review of the cron:
- normalize-album-title: anchor EDITION_CLAUSE with \b so edition keywords
match as whole words. Without it 'ost'/'mono'/'stereo' matched inside
lost/ghost/monologue/stereotype, wrongly stripping non-edition clauses
('(Lost Tapes)', '- Lost Sessions') and collapsing distinct albums into
one dedup key.
- normalize-album-title: require a real featuring marker (full word
"featuring" or the abbreviations with a period) plus a following guest, so
a bare content word 'Ft'/'Feat' ('10 Ft Ganja Plant', 'A Great Feat') is
no longer treated as a marker and the title truncated.
- job normalizePairs: apply the same whitespace collapse+trim to the artist
leg that the album leg already gets, so 'J Dilla ' / 'J  Dilla' / 'J Dilla'
share one whitespace-canonical dedup key instead of splitting into separate
rows + duplicate LML lookups + a split play count.
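
To illustrate the first fix, here is a minimal sketch of the difference \b makes. The keyword list, the patterns, and the stripEdition helper are assumptions for illustration (the real EDITION_CLAUSE is not shown here), and only the trailing-parenthetical branch is modeled.

```typescript
// Illustrative keyword subset; the real EDITION_CLAUSE keyword list is not shown in the source.
const KEYWORDS = "ost|mono|stereo|remaster(?:ed)?";

// Unanchored: a keyword may match anywhere inside a longer word ("ost" inside "Lost").
const UNANCHORED = new RegExp(`\\s*\\([^)]*(?:${KEYWORDS})[^)]*\\)\\s*$`, "i");

// Anchored: \b on both sides forces the keyword to match as a whole word.
const ANCHORED = new RegExp(`\\s*\\([^)]*\\b(?:${KEYWORDS})\\b[^)]*\\)\\s*$`, "i");

// Hypothetical helper: drops a trailing parenthetical when the pattern matches it.
function stripEdition(title: string, pattern: RegExp): string {
  return title.replace(pattern, "").trim();
}

console.log(stripEdition("Demos (Lost Tapes)", UNANCHORED)); // clause wrongly stripped
console.log(stripEdition("Demos (Lost Tapes)", ANCHORED));   // clause kept
console.log(stripEdition("Album (2011 Remaster)", ANCHORED)); // real edition clause still stripped
```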
Also swaps a gratuitous mainstream featuring-guest placeholder for a
WXYC-representative name in the normalizer tests.
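
The whitespace fix in normalizePairs can be pictured with the sketch below. It shows only the collapse+trim step, not the full normalizers; collapseWhitespace, dedupKey, and countByKey are hypothetical names, and the NUL separator is an assumption chosen so the two legs cannot run together.

```typescript
// Collapse runs of whitespace to a single space and trim both ends.
function collapseWhitespace(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// Hypothetical key builder: both legs get the same whitespace treatment.
function dedupKey(artist: string, album: string): string {
  return `${collapseWhitespace(artist)}\u0000${collapseWhitespace(album)}`;
}

// Count plays per whitespace-canonical key.
function countByKey(pairs: Array<[string, string]>): Map<string, number> {
  const counts = new Map<string, number>();
  for (const [artist, album] of pairs) {
    const key = dedupKey(artist, album);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const counts = countByKey([
  ["J Dilla ", "Donuts"],
  ["J  Dilla", "Donuts"],
  ["J Dilla", " Donuts"],
]);
console.log(counts.size); // one key, so one LML lookup and one play count
```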
…tion-artist-backfill

Follow-up review pass:
- normalize-album-title: the tightened featuring regex required a space after the abbreviation period, regressing the no-space form 'feat.X'/'ft.X' that the original stripped (and which the parenthesized branch still strips). Allow an optional space after the period so 'Donuts feat.J Dilla' collapses to the same key as 'Donuts feat. J Dilla' while still leaving a bare content word like 'Ft' in '10 Ft Ganja Plant' intact.
- Move the cron from 04:30 to 04:45 UTC: 04:30 collided exactly with the LML-bounded rotation-artist-backfill cron, contradicting the README's de-confliction rationale. 04:45 is a free minute; README + description corrected to name the real overnight LML-bounded crons.
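
A rough sketch of a featuring pattern with these properties, for the unparenthesized branch only. The regex and the stripFeaturing helper are written for illustration and are not the ones in normalize-album-title.

```typescript
// A marker is either the full word "featuring" followed by whitespace, or
// "feat."/"ft." (period required) followed by optional whitespace. In both
// cases a guest (at least one non-space character) must follow.
const FEATURING = /\s+(?:featuring\s+|(?:feat|ft)\.\s*)\S.*$/i;

// Hypothetical helper: truncates the title at a real featuring marker.
function stripFeaturing(title: string): string {
  return title.replace(FEATURING, "").trim();
}

console.log(stripFeaturing("Donuts feat. J Dilla")); // marker with space
console.log(stripFeaturing("Donuts feat.J Dilla"));  // marker without space
console.log(stripFeaturing("10 Ft Ganja Plant"));    // bare content word, left intact
```

Requiring the period on the abbreviations is what keeps 'Ft' and 'Feat' usable as ordinary title words.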
…DE.md

CI `npm ci` failed at install because the @wxyc/catalog-popularity-freetext-resolve workspace was missing from package-lock.json (local node_modules had it symlinked, so local checks passed). Sync the lock, and add the job to the monorepo workspace table for parity with every other job.
What
Adds a recurring cron that resolves the ~43% of music plays with `flowsheet.album_id IS NULL` from their free-text `(artist, album)` to a Discogs release, caching verdicts in a new `flowsheet_freetext_resolution` table. This is Phase-2 Track 1: the release leg of attribution-corrected catalog popularity.
- `flowsheet_freetext_resolution`, composite PK `(norm_artist, norm_album)`, with `discogs_release_id` / `discogs_master_id` / `match_confidence` / `match_source` and `attempt_at` + `resolved_at` markers.
- `normalize-album-title.ts`: parallels `normalize-artist-name.ts` but additionally strips edition/format suffixes (`(Remastered)`, `- 2011 Remaster`), featuring clauses, and collapses the self-titled family, so pressings of one logical album share a key.
- `jobs/catalog-popularity-freetext-resolve/`: enumerates DISTINCT normalized pairs, resolves via LML `bulkLookupMetadata`, UPSERTs. Cooperative pause via `awaitQuietWindow`; cron `30 4 * * *` UTC (moved to 04:45 UTC by the follow-up review pass).

Attempt-at semantics
`attempt_at` is stamped on any responded outcome (match OR explicit no-match) so the TTL retry window arms; it is left NULL only on transient LML failure, keeping the row immediately retryable. The `release_id == 0` streaming-only sentinel (BS#1185) collapses to a null release, distinct from a real `release_id > 0`.

Track 0 coupling
`discogs_master_id` stays NULL for now and is deliberately omitted from the UPSERT's INSERT and `set` clauses, so a value written by a later Track-0-aware writer (once WXYC/library-metadata-lookup#688 surfaces `master_id`) is preserved, not clobbered. Track 1's release leg is independent and ships now.

Tests
Unit: normalizer cases + job batching/retry (75 targeted). Integration (`pg`): LML round-trip + UPSERT idempotency, run green against real Postgres (migration 0106 applies cleanly). Full suite: 3489 unit. No parity guard: the table is Backend-internal, not in the api.yaml SSOT.

Closes #1491
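
The attempt-at and Track 0 semantics above can be pictured with an in-memory stand-in for the UPSERT. This is a sketch under stated assumptions, not the job's SQL: the Map-backed table, the Verdict shape, and the upsertVerdict name are hypothetical, and only the column names come from this description.

```typescript
interface ResolutionRow {
  discogs_release_id: number | null;
  discogs_master_id: number | null;
  match_confidence: number | null;
  match_source: string | null;
  attempt_at: Date | null;
}

// Hypothetical verdict shape; the real LML response type is not shown here.
interface Verdict {
  releaseId: number | null;
  confidence: number | null;
  source: string | null;
}

const keyOf = (normArtist: string, normAlbum: string): string =>
  `${normArtist}\u0000${normAlbum}`;

// Called only for responded outcomes (match or explicit no-match). On a
// transient LML failure the caller skips this, so attempt_at stays NULL.
function upsertVerdict(
  table: Map<string, ResolutionRow>,
  normArtist: string,
  normAlbum: string,
  verdict: Verdict,
  now: Date,
): void {
  const key = keyOf(normArtist, normAlbum);
  const existing = table.get(key);
  // release_id == 0 is the streaming-only sentinel: store it as a null release.
  const releaseId =
    verdict.releaseId !== null && verdict.releaseId > 0 ? verdict.releaseId : null;
  table.set(key, {
    // Never written by this leg: whatever another writer stored is carried over.
    discogs_master_id: existing?.discogs_master_id ?? null,
    discogs_release_id: releaseId,
    match_confidence: verdict.confidence,
    match_source: verdict.source,
    attempt_at: now,
  });
}
```

Because the write depends only on the key, the verdict, and the timestamp, applying the same verdict twice leaves the row unchanged apart from attempt_at.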