discogs-cache

ETL pipeline for building a PostgreSQL cache of Discogs release data, filtered to artists in the WXYC radio library catalog. Reduces Discogs API calls by providing a local cache for album lookups, track validation, and artwork URLs.

Overview

The pipeline processes monthly Discogs data dumps (~63.3 GB XML) into a focused PostgreSQL database (~3 GB) containing only releases by artists in the WXYC library catalog. This provides:

Fast local lookups instead of rate-limited Discogs API calls
Accent-insensitive trigram fuzzy text search via pg_trgm + unaccent
Shared data resource for multiple consuming services

Prerequisites

Python 3.11+
PostgreSQL with the pg_trgm and unaccent extensions (or use Docker Compose)
discogs-xml-converter -- Rust binary for XML-to-CSV conversion (build from source or install on PATH)
Discogs monthly data dumps (XML) from https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html -- releases.xml.gz is required; artists.xml.gz and labels.xml.gz are optional for enhanced filtering and dedup
library.db -- SQLite export of the WXYC library catalog, generated by scripts/export_to_sqlite.py (requires SSH tunnel to WXYC MySQL)

Quick Start

uv sync

Generate library.db (requires SSH tunnel to WXYC MySQL):

python scripts/export_to_sqlite.py   # see script for required env vars

Run the pipeline:

python scripts/run_pipeline.py \
  --xml /path/to/xml_dumps/ \
  --library-db library.db \
  --database-url postgresql://localhost:5432/discogs

--xml accepts a single XML file or a directory. When given a directory, the converter auto-detects XML files by their root element and processes them in order: artists first (to build alias maps), then labels, then releases. See Artist Alias Filtering and Sublabel Hierarchy for details.

By default, discogs-xml-converter is expected on PATH. Use --converter to specify an alternate path. Use --library-artists to provide a pre-existing artist list instead of generating one from library.db.

Pipeline

All 8 steps are automated by run_pipeline.py. The script supports two modes: full pipeline from XML, or database build from pre-filtered CSVs.

Step	Script	Description
1. Generate artist list	`scripts/enrich_library_artists.py`	Extract artists from library.db, optionally enrich with cross-references
2. Convert + Filter	`discogs-xml-converter`	XML to CSV with optional artist filtering (alias-enhanced when artists.xml is present)
3. Create schema	`schema/create_database.sql`, `schema/create_functions.sql`	Set up tables, extensions, and functions
4. Import	`scripts/import_csv.py`	Bulk load CSVs via psycopg COPY
5. Create indexes	`schema/create_indexes.sql`	Accent-insensitive trigram GIN indexes for fuzzy search
6. Deduplicate	`scripts/dedup_releases.py`	Keep best release per master_id (label match, US, most tracks)
7. Prune/Copy	`scripts/verify_cache.py`	Remove non-library releases or copy matches to target DB
8. Vacuum	`VACUUM FULL`	Reclaim disk space

Full Pipeline (--xml)

Runs steps 1-8. --xml accepts either a single XML file or a directory containing XML dumps. When --library-db is provided, the pipeline generates library_artists.txt automatically and uses it to filter during XML conversion:

# Single file (releases only)
python scripts/run_pipeline.py \
  --xml /path/to/releases.xml.gz \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs

# Directory (auto-detects releases, artists, labels XMLs)
python scripts/run_pipeline.py \
  --xml /path/to/xml_dumps/ \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs

When the directory contains artists.xml.gz, alias-enhanced filtering is enabled automatically. When it contains labels.xml.gz, the converter produces label_hierarchy.csv which the pipeline uses for sublabel-aware dedup.

To enrich the artist list with alternate names from the WXYC catalog database, add --wxyc-db-url:

python scripts/run_pipeline.py \
  --xml /path/to/xml_dumps/ \
  --library-db /path/to/library.db \
  --wxyc-db-url mysql://user:pass@host:port/wxycmusic \
  --database-url postgresql://localhost:5432/discogs

Use --library-artists to provide a pre-existing artist list instead of generating one from library.db.

Database Build (--csv-dir)

Runs steps 3-8 from pre-filtered CSVs:

python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs

--library-db is optional; if omitted, the prune step is skipped
--library-labels accepts a pre-generated library_labels.csv for label-aware dedup
--database-url defaults to DATABASE_URL env var or postgresql://localhost:5432/discogs

Docker Compose

docker-compose.yml provides a self-contained environment with PostgreSQL and the pipeline. The Dockerfile compiles discogs-xml-converter from source in a Rust builder stage, so no local Rust toolchain is needed. Place XML dumps and library.db in a data/ directory and run docker compose up --build.

Label-Aware Dedup

By default, dedup keeps the release with the most tracks per master_id group. When WXYC label preferences are available, dedup instead prefers the release whose Discogs label matches WXYC's known pressing -- ensuring the cached edition matches what the station actually owns.

Label preferences come from WXYC's FLOWSHEET_ENTRY_PROD MySQL table (rotation play entries include LABEL_NAME). The extraction script scripts/extract_library_labels.py produces a library_labels.csv with (artist_name, release_title, label_name) triples.

There are two ways to enable label-aware dedup:

Automatic extraction (when --wxyc-db-url is provided): the pipeline extracts labels from MySQL before the dedup step.
Pre-generated CSV (when --library-labels is provided): the pipeline uses the CSV directly, no MySQL connection needed.

# Automatic: extract labels from WXYC MySQL and use for dedup
python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-db /path/to/library.db \
  --wxyc-db-url mysql://user:pass@host:port/wxycmusic \
  --database-url postgresql://localhost:5432/discogs

# Pre-generated: use an existing library_labels.csv
python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-labels /path/to/library_labels.csv \
  --database-url postgresql://localhost:5432/discogs

The ranking order is: label match (prefer WXYC's pressing) > US country (domestic pressing) > track count (quality tiebreaker) > release ID (deterministic fallback).

Artist Alias Filtering

When the converter processes a directory containing artists.xml.gz alongside releases.xml.gz, it uses Discogs artist aliases to improve filtering accuracy. For example, if WXYC's library has "Puff Daddy" but a Discogs release is credited to "P. Diddy" (an alias), alias-enhanced filtering catches the match using the artist_id as a join key.

The converter parses artists.xml first, building a map of artist_id -> [aliases, name_variations]. When filtering releases, each credited artist is checked by both canonical name and all known aliases. This is precise because artist_id links the release credit to the correct alias list, avoiding cross-artist false positives.

This is enabled automatically when --xml points to a directory containing artists.xml.gz and --library-artists is provided. No additional flags needed.

Sublabel Hierarchy

When labels.xml.gz is present in the XML directory, the converter produces label_hierarchy.csv mapping sublabels to parent labels (e.g., Parlophone -> EMI). The pipeline uses this during dedup for bidirectional sublabel resolution:

If WXYC says "EMI" but Discogs has "Parlophone" (a sublabel of EMI), the label match is recognized
If WXYC says "Parlophone" but Discogs has "EMI" (the parent), the label match is also recognized

This is one level of parent resolution. Discogs label hierarchies are generally shallow; multi-level resolution can be added later if needed.

For --csv-dir mode, you can pass a pre-generated hierarchy file via --label-hierarchy:

python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-labels /path/to/library_labels.csv \
  --label-hierarchy /path/to/label_hierarchy.csv \
  --database-url postgresql://localhost:5432/discogs

Copy to Target Database

Instead of pruning releases in place (which destroys the full imported dataset), you can copy only matched releases to a separate target database:

python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs \
  --target-db-url postgresql://localhost:5432/discogs_cache

This preserves the full discogs database and creates a lean discogs_cache database with only KEEP and REVIEW releases, complete with schema, FK constraints, and trigram indexes. The target database is created automatically if it doesn't exist.

You can also use --copy-to directly with verify_cache.py:

python scripts/verify_cache.py \
  --copy-to postgresql://localhost:5432/discogs_cache \
  /path/to/library.db \
  postgresql://localhost:5432/discogs

--copy-to and --prune are mutually exclusive.

Resuming a Failed Pipeline

If a pipeline run fails mid-way (e.g., disk full during index creation), you can resume from where it left off instead of restarting from scratch:

python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs \
  --resume

The pipeline tracks step completion in a JSON state file (default: .pipeline_state.json). On resume, completed steps are skipped. You can specify a custom state file path with --state-file:

python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --resume \
  --state-file /tmp/my_pipeline_state.json

If no state file exists when --resume is used, the pipeline infers completed steps from database state (e.g., schema exists, tables have rows, indexes present, master_id column dropped by dedup).

--resume is only valid with --csv-dir mode, not --xml mode.

Preserving Converted CSVs (--keep-csv)

In --xml mode, converted CSVs are written to a temporary directory that is deleted after the pipeline finishes (or on failure). If the pipeline fails after the 13-minute conversion step, you'd need to re-convert from scratch.

Use --keep-csv to write CSVs to a persistent directory instead:

python scripts/run_pipeline.py \
  --xml /path/to/xml_dumps/ \
  --library-db /path/to/library.db \
  --keep-csv /path/to/kept_csvs \
  --database-url postgresql://localhost:5432/discogs

If the pipeline fails, you can restart from the CSVs using --csv-dir:

python scripts/run_pipeline.py \
  --csv-dir /path/to/kept_csvs/csv \
  --library-db /path/to/library.db \
  --database-url postgresql://localhost:5432/discogs

--keep-csv is only meaningful in --xml mode.

Running Steps Manually

Individual steps can also be run directly:

# 1. Convert XML to CSV (with optional artist filtering)
# Single file (releases only)
discogs-xml-converter /path/to/releases.xml.gz --output-dir /path/to/csv/
# Directory (auto-detects releases, artists, labels XMLs)
discogs-xml-converter /path/to/xml_dumps/ --output-dir /path/to/csv/ \
  --library-artists /path/to/library_artists.txt

# 2. Create schema and functions
psql -d discogs -f schema/create_database.sql
psql -d discogs -f schema/create_functions.sql

# 3. Import CSVs
python scripts/import_csv.py /path/to/csv/ [database_url]

# 4. Create indexes (10-30 min on large datasets)
psql -d discogs -f schema/create_indexes.sql

# 5. Deduplicate (optionally with label matching and sublabel hierarchy)
python scripts/dedup_releases.py [database_url]
python scripts/dedup_releases.py --library-labels /path/to/library_labels.csv [database_url]
python scripts/dedup_releases.py --library-labels /path/to/library_labels.csv \
  --label-hierarchy /path/to/label_hierarchy.csv [database_url]

# 6. Prune (dry run first, then with --prune or --copy-to)
python scripts/verify_cache.py /path/to/library.db [database_url]
python scripts/verify_cache.py --prune /path/to/library.db [database_url]
# Or copy to a target database instead:
python scripts/verify_cache.py --copy-to postgresql:///discogs_cache /path/to/library.db [database_url]

# 7. Vacuum
psql -d discogs -c "VACUUM FULL;"

Database Schema

The schema files in schema/ define the shared contract between this ETL pipeline and all consumers.

Tables

Table	Description
`release`	Release metadata: id, title, release_year, country, artwork_url
`release_artist`	Artists on releases (main + extra credits), with optional Discogs `artist_id`
`release_label`	Label names per release (e.g., Parlophone, Factory Records)
`release_track`	Tracks on releases with position and duration
`release_track_artist`	Artists on specific tracks (for compilations)
`cache_metadata`	Data freshness tracking (cached_at, source)

Indexes

Foreign key indexes on all child tables
Accent-insensitive trigram GIN indexes (pg_trgm + unaccent) on title and artist_name columns for fuzzy text search. Uses an immutable f_unaccent() wrapper to enable index expressions with lower(f_unaccent(column)).
Cache metadata indexes for freshness queries

Consumer Integration

Consumers connect via the DATABASE_URL_DISCOGS environment variable:

DATABASE_URL_DISCOGS=postgresql://user:pass@host:5432/discogs

Current consumers:

library-metadata-lookup (discogs/cache_service.py) - Python/asyncpg
Backend-Service - TypeScript/Node.js (planned)

Testing

Tests are organized into three layers:

# Unit tests (no external dependencies, run by default)
pytest tests/unit/ -v

# Integration tests (needs PostgreSQL)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
  pytest -m postgres -v

# MySQL integration tests (needs WXYC MySQL on port 3307)
pytest -m mysql -v

# E2E tests (needs PostgreSQL, runs full pipeline as subprocess)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
  pytest -m e2e -v

Integration and E2E tests are excluded by default (pytest with no args runs only unit tests). Start the test database with:

docker compose up db -d

Migrations

The migrations/ directory contains historical one-time migrations:

01_optimize_schema.sql - Initial schema optimization (drops unused tables/columns, adds artwork_url and release_year, deduplicates by master_id). Already applied to the production database.

Documentation

Cache Technical Overview -- design rationale, benchmarks, and pipeline architecture details

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
docs		docs
hooks		hooks
lib		lib
migrations		migrations
schema		schema
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

discogs-cache

Overview

Prerequisites

Quick Start

Pipeline

Full Pipeline (--xml)

Database Build (--csv-dir)

Docker Compose

Label-Aware Dedup

Artist Alias Filtering

Sublabel Hierarchy

Copy to Target Database

Resuming a Failed Pipeline

Preserving Converted CSVs (--keep-csv)

Running Steps Manually

Database Schema

Tables

Indexes

Consumer Integration

Testing

Migrations

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

discogs-cache

Overview

Prerequisites

Quick Start

Pipeline

Full Pipeline (--xml)

Database Build (--csv-dir)

Docker Compose

Label-Aware Dedup

Artist Alias Filtering

Sublabel Hierarchy

Copy to Target Database

Resuming a Failed Pipeline

Preserving Converted CSVs (--keep-csv)

Running Steps Manually

Database Schema

Tables

Indexes

Consumer Integration

Testing

Migrations

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages