T-Y020B: MP3 Backfill to Postgres: A Step-by-Step Guide
Hey guys, let's dive into how to perform a one-time MP3 backfill to a PostgreSQL service. The goal is to migrate existing audio files so that all existing content has working audio playback. This guide walks through the objectives, script requirements, execution strategy, verification, error handling, and acceptance criteria. Buckle up; this is going to be a fun ride!
1. The Mission: Migrating MP3s
Our primary goal is to move all historical MP3 files from a NAS (Network Attached Storage) export directory to a PostgreSQL Render service. The purpose? To ensure that all existing content has working audio playback capabilities. We're essentially making sure that all the audio files are available and accessible for users to enjoy. Sounds good, right?
2. Source and Destination: Where the Magic Happens
Let's talk about where the files live and where they're going. The source data includes:
- NAS Audio Directory: This is where the MP3 files are currently stored: `/Volumes/Docker/YTV2/data/exports/`
- Local Mirror: A local copy of the files for testing and faster access: `/Users/markdarby/projects/YTV_temp_NAS_files/data/exports/`
- Target Service: The destination where the MP3s will be uploaded: `https://ytv2-dashboard-postgres.onrender.com/ingest/audio`

So, we're taking these files from their current location and moving them to our PostgreSQL service. Simple, right?
3. Script Requirements: The Heart of the Operation
Now, let's get into the core of the operation: the script. The script, named `backfill_mp3s.py`, is the workhorse that does all the heavy lifting. Here's a breakdown:
3.1. Backfill Script (`backfill_mp3s.py`)
This Python script uploads the MP3 files. Here's what it does:
- Initialization: The script takes the source directory, target URL, and an ingest token as input. It sets up a session for making HTTP requests and keeps track of progress, including processed files, successful uploads, failures, and skipped files.
- File Extraction: It extracts the `video_id` from the filename (e.g., `abc123.mp3` becomes `abc123`).
- Upload Function: The core function is `upload_mp3`, which uploads each MP3 file to the PostgreSQL service. It uses the `requests` library to send a POST request with the audio file and the `video_id`.
- Run Function: The `run_backfill` function orchestrates the entire process. It iterates through all the MP3 files in the source directory, uploads them, and provides real-time progress updates. It also implements batched progress reporting with a delay between batches (rate limiting) to avoid overwhelming the target service.
- Reporting: Finally, the script generates a detailed report showing the total files processed, successful uploads, failures, and success rate. It saves a CSV file (`failed_mp3_uploads.csv`) listing the failed uploads for easy retry.
```python
#!/usr/bin/env python3
"""
One-time MP3 backfill to PostgreSQL service
Usage: python backfill_mp3s.py --source /path/to/exports --target-url https://... --token <token>
"""
import csv
import time
from pathlib import Path

import requests


class MP3Backfiller:
    def __init__(self, source_dir: str, target_url: str, token: str):
        self.source_dir = Path(source_dir)
        self.target_url = target_url.rstrip('/')
        self.token = token
        self.session = requests.Session()
        self.session.headers.update({"X-INGEST-TOKEN": token})

        # Progress tracking
        self.processed = 0
        self.successful = 0
        self.failed = []
        self.skipped = []

    def extract_video_id(self, filename: str) -> str:
        """Extract video_id from filename (e.g., 'abc123.mp3' -> 'abc123')"""
        return Path(filename).stem

    def upload_mp3(self, mp3_path: Path) -> bool:
        """Upload single MP3 file to PostgreSQL service"""
        video_id = self.extract_video_id(mp3_path.name)
        try:
            with open(mp3_path, 'rb') as f:
                files = {"audio": (mp3_path.name, f, "audio/mpeg")}
                data = {"video_id": video_id}
                response = self.session.post(
                    f"{self.target_url}/ingest/audio",
                    files=files,
                    data=data,
                    timeout=60
                )
            if response.status_code == 200:
                return True
            else:
                print(f"❌ {video_id}: HTTP {response.status_code}")
                return False
        except Exception as e:
            print(f"❌ {video_id}: {str(e)}")
            return False

    def run_backfill(self, batch_size: int = 10, delay: float = 0.5):
        """Execute backfill with progress reporting"""
        mp3_files = list(self.source_dir.glob("*.mp3"))
        total_files = len(mp3_files)

        print(f"🎵 Found {total_files} MP3 files to backfill")
        print(f"📡 Target: {self.target_url}/ingest/audio")
        print(f"📦 Batch size: {batch_size}, Delay: {delay}s")
        print("-" * 60)

        for i, mp3_path in enumerate(mp3_files, 1):
            video_id = self.extract_video_id(mp3_path.name)
            print(f"[{i:3d}/{total_files}] {video_id}...", end=" ")

            if self.upload_mp3(mp3_path):
                print("✅")
                self.successful += 1
            else:
                print("❌")
                self.failed.append(video_id)

            self.processed += 1

            # Progress report every batch
            if i % batch_size == 0:
                success_rate = (self.successful / self.processed) * 100
                print(f"📊 Progress: {self.processed}/{total_files} ({success_rate:.1f}% success)")
                time.sleep(delay)  # Rate limiting

        self.generate_report()

    def generate_report(self):
        """Generate final backfill report"""
        success_rate = (self.successful / self.processed) * 100 if self.processed > 0 else 0

        print("\n" + "=" * 60)
        print("🎵 MP3 BACKFILL COMPLETE")
        print("=" * 60)
        print(f"📊 Total Processed: {self.processed}")
        print(f"✅ Successful: {self.successful}")
        print(f"❌ Failed: {len(self.failed)}")
        print(f"📈 Success Rate: {success_rate:.1f}%")

        if self.failed:
            print(f"\n❌ Failed uploads ({len(self.failed)}):")
            for video_id in self.failed[:10]:  # Show first 10
                print(f"  - {video_id}")
            if len(self.failed) > 10:
                print(f"  ... and {len(self.failed) - 10} more")

            # Save failure list for retry
            with open("failed_mp3_uploads.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["video_id", "filename"])
                for video_id in self.failed:
                    writer.writerow([video_id, f"{video_id}.mp3"])
            print("💾 Failed uploads saved to: failed_mp3_uploads.csv")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Backfill MP3s to PostgreSQL service")
    parser.add_argument("--source", required=True, help="Source directory with MP3 files")
    parser.add_argument("--target-url", required=True, help="Target PostgreSQL service URL")
    parser.add_argument("--token", required=True, help="INGEST_TOKEN for authentication")
    parser.add_argument("--batch-size", type=int, default=10, help="Batch size for progress reporting")
    parser.add_argument("--delay", type=float, default=0.5, help="Delay between batches (seconds)")

    args = parser.parse_args()

    backfiller = MP3Backfiller(args.source, args.target_url, args.token)
    backfiller.run_backfill(args.batch_size, args.delay)
```
3.2. Execution Strategy: How to Run the Script
There are two ways to run the script:
Option A - Local (Faster): If you have the NAS files downloaded on your laptop, this is the way to go. Run the script from your local machine:

```bash
# Run from laptop with NAS files downloaded
cd /Users/markdarby/projects/YTV_temp_NAS_files
python backfill_mp3s.py \
  --source ./data/exports \
  --target-url https://ytv2-dashboard-postgres.onrender.com \
  --token $INGEST_TOKEN \
  --batch-size 5 \
  --delay 1.0
```

Option B - NAS Direct: If you're running the script from the NAS itself, use this method:

```bash
# Run from NAS itself
cd /Volumes/Docker/YTV2
python backfill_mp3s.py \
  --source ./data/exports \
  --target-url https://ytv2-dashboard-postgres.onrender.com \
  --token $INGEST_TOKEN
```
4. Verification & Quality Control: Ensuring Success
Before we unleash the script, we need to make sure everything is in order. Here's a breakdown of what to check:
4.1. Pre-Flight Checks: These checks happen before the full run.
- Verify that the `/ingest/audio` endpoint can accept test uploads.
- Confirm that the `INGEST_TOKEN` authentication is working correctly.
- Test with a few sample files before running the script on the entire dataset; a single-file sketch follows this list.
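For the single-file test, something like this could work. It's a minimal sketch, not part of the delivered script: it mirrors the request format `backfill_mp3s.py` uses (same `/ingest/audio` endpoint, `X-INGEST-TOKEN` header, and multipart fields) and assumes you run it from the local mirror with `INGEST_TOKEN` exported:

```python
#!/usr/bin/env python3
"""Pre-flight check: upload one sample MP3 before the full run (sketch)."""
import os
import sys
from pathlib import Path

import requests

TARGET_URL = "https://ytv2-dashboard-postgres.onrender.com"
SOURCE_DIR = Path("./data/exports")  # local mirror of the NAS exports

# Pick the first MP3 we can find as the test sample
mp3_path = next(SOURCE_DIR.glob("*.mp3"), None)
if mp3_path is None:
    sys.exit(f"No MP3 files found in {SOURCE_DIR}")

with open(mp3_path, "rb") as f:
    response = requests.post(
        f"{TARGET_URL}/ingest/audio",
        headers={"X-INGEST-TOKEN": os.environ["INGEST_TOKEN"]},
        files={"audio": (mp3_path.name, f, "audio/mpeg")},
        data={"video_id": mp3_path.stem},  # filename stem, same rule as the script
        timeout=60,
    )
print(f"{mp3_path.stem}: HTTP {response.status_code}")
```

A 200 here confirms both the endpoint and the token before you commit to the full run.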
4.2. Progress Monitoring: Keep an eye on the progress.
- Real-time Success/Failure Reporting: The script provides immediate feedback on whether uploads are succeeding or failing.
- Progress Checkpoints: It reports progress in batches (e.g., every 10 uploads).
- Rate Limiting: To prevent the Render service from being overwhelmed, the script includes delays between batches.
- Failed Uploads CSV: A CSV file is generated to help with retrying failed uploads; a retry sketch follows this list.
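When a run leaves failures behind, a small helper can replay just those files. This is a minimal sketch that reuses `MP3Backfiller` from the script above; it assumes `failed_mp3_uploads.csv` sits in the working directory, the source MP3s are still available locally, and `backfill_mp3s.py` is importable:

```python
#!/usr/bin/env python3
"""Retry the uploads listed in failed_mp3_uploads.csv (sketch)."""
import csv
import os
from pathlib import Path

from backfill_mp3s import MP3Backfiller  # assumes the script is on the import path

SOURCE_DIR = Path("./data/exports")

backfiller = MP3Backfiller(
    str(SOURCE_DIR),
    "https://ytv2-dashboard-postgres.onrender.com",
    os.environ["INGEST_TOKEN"],
)

# The CSV has the columns the main script writes: video_id, filename
with open("failed_mp3_uploads.csv", newline="") as f:
    rows = list(csv.DictReader(f))

still_failing = []
for row in rows:
    mp3_path = SOURCE_DIR / row["filename"]
    if not mp3_path.exists():
        print(f"missing on disk: {mp3_path}")
        continue
    if backfiller.upload_mp3(mp3_path):
        print(f"retried OK: {row['video_id']}")
    else:
        still_failing.append(row["video_id"])

print(f"{len(still_failing)} uploads still failing")
```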
4.3. Post-Backfill Verification: After the script has run, it's time to verify that everything worked as expected.

Check whether sample uploads worked using this command:

```bash
curl -s "https://ytv2-dashboard-postgres.onrender.com/api/reports" | \
  jq '.reports[] | select(.media.audio_url != null) | {video_id, title, audio_url}'
```

Spot-check specific audio files:

```bash
curl -I "https://ytv2-dashboard-postgres.onrender.com/exports/audio/SAMPLE_VIDEO_ID.mp3"
```

Test the Dashboard UI: verify that audio players appear and function correctly.
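If you'd rather check in bulk than one curl at a time, a short loop can HEAD-request the audio URL for every local MP3. This is a minimal sketch, assuming the local mirror still holds the full set of files and that served URLs follow the `/exports/audio/<video_id>.mp3` pattern shown above:

```python
#!/usr/bin/env python3
"""Bulk spot-check: HEAD-request every backfilled audio URL (sketch)."""
from pathlib import Path

import requests

BASE_URL = "https://ytv2-dashboard-postgres.onrender.com"
SOURCE_DIR = Path("./data/exports")  # local mirror of the NAS exports

mp3_files = sorted(SOURCE_DIR.glob("*.mp3"))
missing = []
for mp3_path in mp3_files:
    video_id = mp3_path.stem
    response = requests.head(f"{BASE_URL}/exports/audio/{video_id}.mp3", timeout=30)
    if response.status_code != 200:
        missing.append((video_id, response.status_code))

print(f"checked {len(mp3_files)} files, {len(missing)} not reachable")
for video_id, status in missing:
    print(f"  {video_id}: HTTP {status}")
```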
5. Error Handling & Recovery: What Happens When Things Go Wrong
Even the best scripts can run into trouble. Here's how we're prepared:
- Network Failures: Transient network glitches surface as failed uploads; wrapping `upload_mp3` in a retry loop with exponential backoff makes the run more robust (a sketch follows this list).
- Rate Limiting: The script respects the Render service's limits with delays between batches.
- Partial Failures: If some uploads fail, the script continues processing the remaining files and logs the failures for a retry.
- Resume Capability: Because failures are written to `failed_mp3_uploads.csv`, an interrupted run can be resumed by retrying just that list; skipping files that are already live (e.g., via a HEAD request to their `/exports/audio/` URL) is a natural extension.
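As a rough illustration, here's what that retry wrapper could look like. This is a minimal sketch, not part of the current script; `upload_mp3` is the method defined above, and the attempt count and base delay are arbitrary assumptions:

```python
import time


def upload_with_retry(backfiller, mp3_path, max_attempts=3, base_delay=2.0):
    """Retry one upload with exponential backoff (sketch).

    Sleeps base_delay, then 2*base_delay, then 4*base_delay, ... between
    attempts, and returns True as soon as any attempt succeeds.
    """
    for attempt in range(max_attempts):
        if backfiller.upload_mp3(mp3_path):
            return True
        if attempt < max_attempts - 1:
            wait = base_delay * (2 ** attempt)
            print(f"  retrying {mp3_path.stem} in {wait:.1f}s...")
            time.sleep(wait)
    return False
```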
6. Expectations and Acceptance Criteria: What We Aim For
Here's what we expect and the criteria for success:
6.1. Expected Volumes:
- Total MP3s: Roughly 81 files (based on the content count).
- Processing Time: The run should complete in approximately 10-15 minutes, given the rate-limiting delays.
- Success Target: We're aiming for a success rate of at least 95%.
6.2. Acceptance Criteria:
- ✅ The script processes all MP3 files in the source directory.
- ✅ At least 95% of the MP3s are successfully uploaded to the PostgreSQL service.
- ✅ Failed uploads are logged in a CSV file for manual review and retry.
- ✅ The PostgreSQL `/api/reports` response includes `media.audio_url` for all uploaded files.
- ✅ Sample audio files are playable via `/exports/audio/<file>` URLs.
- ✅ The Dashboard UI correctly displays HTML5 audio players for the backfilled content.
- ✅ Spot-check verification confirms that the oldest, newest, and random sample files work.
7. Environment Setup: Get Ready to Run
Before you start, make sure you have the `INGEST_TOKEN` environment variable set. You'll need to replace `<postgres_service_token>` with the actual token value:

```bash
INGEST_TOKEN=<postgres_service_token>
```
8. File Locations: Where to Find Everything
- Script: The Python script is located at `/Users/markdarby/projects/YTV_temp_NAS_files/backfill_mp3s.py`.
- Source: The MP3 files are in `/Users/markdarby/projects/YTV_temp_NAS_files/data/exports/*.mp3`.
- Log Output: Any failed uploads will be saved to `failed_mp3_uploads.csv`.
9. Git Workflow: Keep Things Organized
Remember to commit your work. Use the following message:
- Commit: "T-Y020B: Add MP3 backfill script for PostgreSQL service migration"
- Location: The script was created in a temporary directory for local execution.
10. Dependencies and Next Steps
- Dependencies: This project relies on T-Y020C (ingest endpoints), which is marked as completed. ✅
- Next: We can now enable T-Y020D (audio URL wiring) for testing.