ETL pipeline for building a PostgreSQL cache of Discogs release data, filtered to artists in the WXYC radio library catalog. Reduces Discogs API calls by providing a local cache for album lookups, track validation, and artwork URLs.
The pipeline processes monthly Discogs data dumps (~63.3 GB XML) into a focused PostgreSQL database (~3 GB) containing only releases by artists in the WXYC library catalog. This provides:
- Fast local lookups instead of rate-limited Discogs API calls
- Accent-insensitive trigram fuzzy text search via pg_trgm + unaccent
- Shared data resource for multiple consuming services
- Python 3.11+
- PostgreSQL with the
pg_trgmandunaccentextensions (or use Docker Compose) - discogs-xml-converter -- Rust binary for XML-to-CSV conversion (build from source or install on PATH)
- Discogs monthly data dumps (XML) from https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html --
releases.xml.gzis required;artists.xml.gzandlabels.xml.gzare optional for enhanced filtering and dedup library.db-- SQLite export of the WXYC library catalog, generated byscripts/export_to_sqlite.py(requires SSH tunnel to WXYC MySQL)
uv syncGenerate library.db (requires SSH tunnel to WXYC MySQL):
python scripts/export_to_sqlite.py # see script for required env varsRun the pipeline:
python scripts/run_pipeline.py \
--xml /path/to/xml_dumps/ \
--library-db library.db \
--database-url postgresql://localhost:5432/discogs--xml accepts a single XML file or a directory. When given a directory, the converter auto-detects XML files by their root element and processes them in order: artists first (to build alias maps), then labels, then releases. See Artist Alias Filtering and Sublabel Hierarchy for details.
By default, discogs-xml-converter is expected on PATH. Use --converter to specify an alternate path. Use --library-artists to provide a pre-existing artist list instead of generating one from library.db.
All 8 steps are automated by run_pipeline.py. The script supports two modes: full pipeline from XML, or database build from pre-filtered CSVs.
| Step | Script | Description |
|---|---|---|
| 1. Generate artist list | scripts/enrich_library_artists.py |
Extract artists from library.db, optionally enrich with cross-references |
| 2. Convert + Filter | discogs-xml-converter |
XML to CSV with optional artist filtering (alias-enhanced when artists.xml is present) |
| 3. Create schema | schema/create_database.sql, schema/create_functions.sql |
Set up tables, extensions, and functions |
| 4. Import | scripts/import_csv.py |
Bulk load CSVs via psycopg COPY |
| 5. Create indexes | schema/create_indexes.sql |
Accent-insensitive trigram GIN indexes for fuzzy search |
| 6. Deduplicate | scripts/dedup_releases.py |
Keep best release per master_id (label match, US, most tracks) |
| 7. Prune/Copy | scripts/verify_cache.py |
Remove non-library releases or copy matches to target DB |
| 8. Vacuum | VACUUM FULL |
Reclaim disk space |
Runs steps 1-8. --xml accepts either a single XML file or a directory containing XML dumps. When --library-db is provided, the pipeline generates library_artists.txt automatically and uses it to filter during XML conversion:
# Single file (releases only)
python scripts/run_pipeline.py \
--xml /path/to/releases.xml.gz \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs
# Directory (auto-detects releases, artists, labels XMLs)
python scripts/run_pipeline.py \
--xml /path/to/xml_dumps/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogsWhen the directory contains artists.xml.gz, alias-enhanced filtering is enabled automatically. When it contains labels.xml.gz, the converter produces label_hierarchy.csv which the pipeline uses for sublabel-aware dedup.
To enrich the artist list with alternate names from the WXYC catalog database, add --wxyc-db-url:
python scripts/run_pipeline.py \
--xml /path/to/xml_dumps/ \
--library-db /path/to/library.db \
--wxyc-db-url mysql://user:pass@host:port/wxycmusic \
--database-url postgresql://localhost:5432/discogsUse --library-artists to provide a pre-existing artist list instead of generating one from library.db.
Runs steps 3-8 from pre-filtered CSVs:
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs--library-dbis optional; if omitted, the prune step is skipped--library-labelsaccepts a pre-generatedlibrary_labels.csvfor label-aware dedup--database-urldefaults toDATABASE_URLenv var orpostgresql://localhost:5432/discogs
docker-compose.yml provides a self-contained environment with PostgreSQL and the pipeline. The Dockerfile compiles discogs-xml-converter from source in a Rust builder stage, so no local Rust toolchain is needed. Place XML dumps and library.db in a data/ directory and run docker compose up --build.
By default, dedup keeps the release with the most tracks per master_id group. When WXYC label preferences are available, dedup instead prefers the release whose Discogs label matches WXYC's known pressing -- ensuring the cached edition matches what the station actually owns.
Label preferences come from WXYC's FLOWSHEET_ENTRY_PROD MySQL table (rotation play entries include LABEL_NAME). The extraction script scripts/extract_library_labels.py produces a library_labels.csv with (artist_name, release_title, label_name) triples.
There are two ways to enable label-aware dedup:
-
Automatic extraction (when
--wxyc-db-urlis provided): the pipeline extracts labels from MySQL before the dedup step. -
Pre-generated CSV (when
--library-labelsis provided): the pipeline uses the CSV directly, no MySQL connection needed.
# Automatic: extract labels from WXYC MySQL and use for dedup
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--wxyc-db-url mysql://user:pass@host:port/wxycmusic \
--database-url postgresql://localhost:5432/discogs
# Pre-generated: use an existing library_labels.csv
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-labels /path/to/library_labels.csv \
--database-url postgresql://localhost:5432/discogsThe ranking order is: label match (prefer WXYC's pressing) > US country (domestic pressing) > track count (quality tiebreaker) > release ID (deterministic fallback).
When the converter processes a directory containing artists.xml.gz alongside releases.xml.gz, it uses Discogs artist aliases to improve filtering accuracy. For example, if WXYC's library has "Puff Daddy" but a Discogs release is credited to "P. Diddy" (an alias), alias-enhanced filtering catches the match using the artist_id as a join key.
The converter parses artists.xml first, building a map of artist_id -> [aliases, name_variations]. When filtering releases, each credited artist is checked by both canonical name and all known aliases. This is precise because artist_id links the release credit to the correct alias list, avoiding cross-artist false positives.
This is enabled automatically when --xml points to a directory containing artists.xml.gz and --library-artists is provided. No additional flags needed.
When labels.xml.gz is present in the XML directory, the converter produces label_hierarchy.csv mapping sublabels to parent labels (e.g., Parlophone -> EMI). The pipeline uses this during dedup for bidirectional sublabel resolution:
- If WXYC says "EMI" but Discogs has "Parlophone" (a sublabel of EMI), the label match is recognized
- If WXYC says "Parlophone" but Discogs has "EMI" (the parent), the label match is also recognized
This is one level of parent resolution. Discogs label hierarchies are generally shallow; multi-level resolution can be added later if needed.
For --csv-dir mode, you can pass a pre-generated hierarchy file via --label-hierarchy:
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-labels /path/to/library_labels.csv \
--label-hierarchy /path/to/label_hierarchy.csv \
--database-url postgresql://localhost:5432/discogsInstead of pruning releases in place (which destroys the full imported dataset), you can copy only matched releases to a separate target database:
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs \
--target-db-url postgresql://localhost:5432/discogs_cacheThis preserves the full discogs database and creates a lean discogs_cache database with only KEEP and REVIEW releases, complete with schema, FK constraints, and trigram indexes. The target database is created automatically if it doesn't exist.
You can also use --copy-to directly with verify_cache.py:
python scripts/verify_cache.py \
--copy-to postgresql://localhost:5432/discogs_cache \
/path/to/library.db \
postgresql://localhost:5432/discogs--copy-to and --prune are mutually exclusive.
If a pipeline run fails mid-way (e.g., disk full during index creation), you can resume from where it left off instead of restarting from scratch:
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs \
--resumeThe pipeline tracks step completion in a JSON state file (default: .pipeline_state.json). On resume, completed steps are skipped. You can specify a custom state file path with --state-file:
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--resume \
--state-file /tmp/my_pipeline_state.jsonIf no state file exists when --resume is used, the pipeline infers completed steps from database state (e.g., schema exists, tables have rows, indexes present, master_id column dropped by dedup).
--resume is only valid with --csv-dir mode, not --xml mode.
In --xml mode, converted CSVs are written to a temporary directory that is deleted after the pipeline finishes (or on failure). If the pipeline fails after the 13-minute conversion step, you'd need to re-convert from scratch.
Use --keep-csv to write CSVs to a persistent directory instead:
python scripts/run_pipeline.py \
--xml /path/to/xml_dumps/ \
--library-db /path/to/library.db \
--keep-csv /path/to/kept_csvs \
--database-url postgresql://localhost:5432/discogsIf the pipeline fails, you can restart from the CSVs using --csv-dir:
python scripts/run_pipeline.py \
--csv-dir /path/to/kept_csvs/csv \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs--keep-csv is only meaningful in --xml mode.
Individual steps can also be run directly:
# 1. Convert XML to CSV (with optional artist filtering)
# Single file (releases only)
discogs-xml-converter /path/to/releases.xml.gz --output-dir /path/to/csv/
# Directory (auto-detects releases, artists, labels XMLs)
discogs-xml-converter /path/to/xml_dumps/ --output-dir /path/to/csv/ \
--library-artists /path/to/library_artists.txt
# 2. Create schema and functions
psql -d discogs -f schema/create_database.sql
psql -d discogs -f schema/create_functions.sql
# 3. Import CSVs
python scripts/import_csv.py /path/to/csv/ [database_url]
# 4. Create indexes (10-30 min on large datasets)
psql -d discogs -f schema/create_indexes.sql
# 5. Deduplicate (optionally with label matching and sublabel hierarchy)
python scripts/dedup_releases.py [database_url]
python scripts/dedup_releases.py --library-labels /path/to/library_labels.csv [database_url]
python scripts/dedup_releases.py --library-labels /path/to/library_labels.csv \
--label-hierarchy /path/to/label_hierarchy.csv [database_url]
# 6. Prune (dry run first, then with --prune or --copy-to)
python scripts/verify_cache.py /path/to/library.db [database_url]
python scripts/verify_cache.py --prune /path/to/library.db [database_url]
# Or copy to a target database instead:
python scripts/verify_cache.py --copy-to postgresql:///discogs_cache /path/to/library.db [database_url]
# 7. Vacuum
psql -d discogs -c "VACUUM FULL;"The schema files in schema/ define the shared contract between this ETL pipeline and all consumers.
| Table | Description |
|---|---|
release |
Release metadata: id, title, release_year, country, artwork_url |
release_artist |
Artists on releases (main + extra credits), with optional Discogs artist_id |
release_label |
Label names per release (e.g., Parlophone, Factory Records) |
release_track |
Tracks on releases with position and duration |
release_track_artist |
Artists on specific tracks (for compilations) |
cache_metadata |
Data freshness tracking (cached_at, source) |
- Foreign key indexes on all child tables
- Accent-insensitive trigram GIN indexes (
pg_trgm+unaccent) ontitleandartist_namecolumns for fuzzy text search. Uses an immutablef_unaccent()wrapper to enable index expressions withlower(f_unaccent(column)). - Cache metadata indexes for freshness queries
Consumers connect via the DATABASE_URL_DISCOGS environment variable:
DATABASE_URL_DISCOGS=postgresql://user:pass@host:5432/discogs
Current consumers:
- library-metadata-lookup (
discogs/cache_service.py) - Python/asyncpg - Backend-Service - TypeScript/Node.js (planned)
Tests are organized into three layers:
# Unit tests (no external dependencies, run by default)
pytest tests/unit/ -v
# Integration tests (needs PostgreSQL)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m postgres -v
# MySQL integration tests (needs WXYC MySQL on port 3307)
pytest -m mysql -v
# E2E tests (needs PostgreSQL, runs full pipeline as subprocess)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m e2e -vIntegration and E2E tests are excluded by default (pytest with no args runs only unit tests). Start the test database with:
docker compose up db -dThe migrations/ directory contains historical one-time migrations:
01_optimize_schema.sql- Initial schema optimization (drops unused tables/columns, adds artwork_url and release_year, deduplicates by master_id). Already applied to the production database.
- Cache Technical Overview -- design rationale, benchmarks, and pipeline architecture details