
T-Y020B: MP3 Backfill to Postgres: A Comprehensive Guide

Hey guys, let's dive into the details of how to perform a one-time MP3 backfill to a PostgreSQL service. The point of the exercise is to migrate existing audio files so that every piece of existing content has working audio playback. This comprehensive guide walks through the implementation: objectives, script requirements, execution strategy, verification, error handling, and acceptance criteria. Buckle up; this is going to be a fun ride!

1. The Mission: Migrating MP3s

Our primary goal is to move all historical MP3 files from a NAS (Network Attached Storage) export directory to a PostgreSQL Render service. The purpose? To ensure that all existing content has working audio playback capabilities. We're essentially making sure that all the audio files are available and accessible for users to enjoy. Sounds good, right?

2. Source and Destination: Where the Magic Happens

Let's talk about where the files live and where they're going. The source data includes:

  • NAS Audio Directory: This is where the MP3 files are currently stored: /Volumes/Docker/YTV2/data/exports/.
  • Local Mirror: A local copy of the files for testing and faster access: /Users/markdarby/projects/YTV_temp_NAS_files/data/exports/.
  • Target Service: The destination where the MP3s will be uploaded: https://ytv2-dashboard-postgres.onrender.com/ingest/audio. (The script takes the base service URL via --target-url and appends /ingest/audio itself.)

So, we're taking these files from their current location and moving them to our PostgreSQL service. Simple, right?

3. Script Requirements: The Heart of the Operation

Now, let's get into the core of the operation: the script. The script, named backfill_mp3s.py, is the workhorse that does all the heavy lifting. Here's a breakdown:

3.1. Backfill Script (backfill_mp3s.py)

This Python script is designed to upload the MP3 files. Here's what it does:

  • Initialization: The script takes the source directory, target URL, and an ingest token as input. It sets up a session for making HTTP requests and keeps track of progress, including processed files, successful uploads, failures, and skipped files.
  • File Extraction: It extracts the video_id from the filename (e.g., abc123.mp3 becomes abc123).
  • Upload Function: The core function is upload_mp3, which uploads each MP3 file to the PostgreSQL service. It uses the requests library to send a POST request with the audio file and the video_id.
  • Run Function: The run_backfill function orchestrates the entire process. It iterates through all the MP3 files in the source directory, uploads them one by one, and provides real-time progress updates. It also checkpoints progress in batches and sleeps between batches (rate limiting) to avoid overwhelming the target service.
  • Reporting: Finally, the script generates a detailed report showing the total files processed, successful uploads, failures, and success rate. It saves a CSV file (failed_mp3_uploads.csv) listing the failed uploads for easy retry.
#!/usr/bin/env python3
"""
One-time MP3 backfill to PostgreSQL service
Usage: python backfill_mp3s.py --source /path/to/exports --target-url https://... --token <token>
"""

import csv
import time
from pathlib import Path

import requests

class MP3Backfiller:
    def __init__(self, source_dir: str, target_url: str, token: str):
        self.source_dir = Path(source_dir)
        self.target_url = target_url.rstrip('/')
        self.token = token
        self.session = requests.Session()
        self.session.headers.update({"X-INGEST-TOKEN": token})
        
        # Progress tracking
        self.processed = 0
        self.successful = 0
        self.failed = []
        self.skipped = []
    
    def extract_video_id(self, filename: str) -> str:
        """Extract video_id from filename (e.g., 'abc123.mp3' -> 'abc123')"""
        return Path(filename).stem
    
    def upload_mp3(self, mp3_path: Path) -> bool:
        """Upload single MP3 file to PostgreSQL service"""
        video_id = self.extract_video_id(mp3_path.name)
        
        try:
            with open(mp3_path, 'rb') as f:
                files = {"audio": (mp3_path.name, f, "audio/mpeg")}
                data = {"video_id": video_id}
                
                response = self.session.post(
                    f"{self.target_url}/ingest/audio",
                    files=files,
                    data=data,
                    timeout=60
                )
                
                if response.status_code == 200:
                    return True
                else:
                    print(f"❌ {video_id}: HTTP {response.status_code}")
                    return False
                    
        except Exception as e:
            print(f"❌ {video_id}: {str(e)}")
            return False
    
    def run_backfill(self, batch_size: int = 10, delay: float = 0.5):
        """Execute backfill with progress reporting"""
        mp3_files = list(self.source_dir.glob("*.mp3"))
        total_files = len(mp3_files)
        
        print(f"🎡 Found {total_files} MP3 files to backfill")
        print(f"πŸ“‘ Target: {self.target_url}/ingest/audio")
        print(f"πŸ”„ Batch size: {batch_size}, Delay: {delay}s")
        print("-" * 60)
        
        for i, mp3_path in enumerate(mp3_files, 1):
            video_id = self.extract_video_id(mp3_path.name)
            
            print(f"[{i:3d}/{total_files}] {video_id}...", end=" ")
            
            if self.upload_mp3(mp3_path):
                print("βœ…")
                self.successful += 1
            else:
                print("❌")
                self.failed.append(video_id)
            
            self.processed += 1
            
            # Progress report every batch
            if i % batch_size == 0:
                success_rate = (self.successful / self.processed) * 100
                print(f"πŸ“Š Progress: {self.processed}/{total_files} ({success_rate:.1f}% success)")
                time.sleep(delay)  # Rate limiting
        
        self.generate_report()
    
    def generate_report(self):
        """Generate final backfill report"""
        success_rate = (self.successful / self.processed) * 100 if self.processed > 0 else 0
        
        print("\n" + "=" * 60)
        print("🎡 MP3 BACKFILL COMPLETE")
        print("=" * 60)
        print(f"πŸ“Š Total Processed: {self.processed}")
        print(f"βœ… Successful: {self.successful}")
        print(f"❌ Failed: {len(self.failed)}")
        print(f"πŸ“ˆ Success Rate: {success_rate:.1f}%")
        
        if self.failed:
            print(f"\n❌ Failed uploads ({len(self.failed)}):")
            for video_id in self.failed[:10]:  # Show first 10
                print(f"   - {video_id}")
            if len(self.failed) > 10:
                print(f"   ... and {len(self.failed) - 10} more")
        
        # Save failure list for retry
        if self.failed:
            with open("failed_mp3_uploads.csv", "w") as f:
                writer = csv.writer(f)
                writer.writerow(["video_id", "filename"])
                for video_id in self.failed:
                    writer.writerow([video_id, f"{video_id}.mp3"])
            print(f"πŸ’Ύ Failed uploads saved to: failed_mp3_uploads.csv")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Backfill MP3s to PostgreSQL service")
    parser.add_argument("--source", required=True, help="Source directory with MP3 files")
    parser.add_argument("--target-url", required=True, help="Target PostgreSQL service URL")
    parser.add_argument("--token", required=True, help="INGEST_TOKEN for authentication")
    parser.add_argument("--batch-size", type=int, default=10, help="Batch size for progress reporting")
    parser.add_argument("--delay", type=float, default=0.5, help="Delay between batches (seconds)")
    
    args = parser.parse_args()
    
    backfiller = MP3Backfiller(args.source, args.target_url, args.token)
    backfiller.run_backfill(args.batch_size, args.delay)

3.2. Execution Strategy: How to Run the Script

There are two ways to run the script:

  • Option A - Local (Faster): If you have the NAS files downloaded on your laptop, this is the way to go. You'd run the script from your local machine.

    # Run from laptop with NAS files downloaded
    cd /Users/markdarby/projects/YTV_temp_NAS_files
    python backfill_mp3s.py \
        --source ./data/exports \
        --target-url https://ytv2-dashboard-postgres.onrender.com \
        --token $INGEST_TOKEN \
        --batch-size 5 \
        --delay 1.0
    
  • Option B - NAS Direct: If you're running the script from the NAS itself, use this method.

    # Run from NAS itself
    cd /Volumes/Docker/YTV2
    python backfill_mp3s.py \
        --source ./data/exports \
        --target-url https://ytv2-dashboard-postgres.onrender.com \
        --token $INGEST_TOKEN
    

4. Verification & Quality Control: Ensuring Success

Before we unleash the script, we need to make sure everything is in order. Here's a breakdown of what to check:

4.1. Pre-Flight Checks: These checks happen before the full run.

  • Verify that the /ingest/audio endpoint can accept test uploads.
  • Confirm that the INGEST_TOKEN authentication is working correctly.
  • Test with a few sample files before running the script on the entire dataset; a minimal smoke test is sketched below.
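
For the first two checks, a single upload exercises the endpoint, the token, and the multipart request shape all at once. Here is a minimal smoke-test sketch that reuses the same request format as upload_mp3; the script name preflight_check.py is illustrative, and it assumes INGEST_TOKEN is exported as described in section 7:

#!/usr/bin/env python3
"""Pre-flight smoke test: push one sample MP3 through /ingest/audio."""
import os
import sys
from pathlib import Path

import requests

# Base service URL, as passed via --target-url in the execution examples
TARGET_URL = "https://ytv2-dashboard-postgres.onrender.com"

sample = Path(sys.argv[1])           # e.g. ./data/exports/abc123.mp3
token = os.environ["INGEST_TOKEN"]   # raises KeyError if the token is not exported

with open(sample, "rb") as f:
    response = requests.post(
        f"{TARGET_URL}/ingest/audio",
        headers={"X-INGEST-TOKEN": token},
        files={"audio": (sample.name, f, "audio/mpeg")},
        data={"video_id": sample.stem},
        timeout=60,
    )

print(f"HTTP {response.status_code}: {response.text[:200]}")

A 200 here means the endpoint and token are good; anything else is worth investigating before the full run.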

4.2. Progress Monitoring: Keep an eye on the progress.

  • Real-time Success/Failure Reporting: The script provides immediate feedback on whether uploads are succeeding or failing.
  • Progress Checkpoints: It reports progress in batches (e.g., every 10 uploads); see the sample output below.
  • Rate Limiting: To prevent the Render service from being overwhelmed, the script includes delays between batches.
  • Failed Uploads CSV: A CSV file is generated to help with retrying failed uploads.
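
With the format strings in run_backfill above and the default batch size of 10, a checkpoint looks like this (the video IDs are illustrative):

[  9/81] dQw4w9WgXcQ... ✅
[ 10/81] tAGnKpE4NCI... ✅
📊 Progress: 10/81 (100.0% success)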

4.3. Post-Backfill Verification: After the script has run, it's time to verify that everything worked as expected.

  • Check if sample uploads worked using this command:

    curl -s "https://ytv2-dashboard-postgres.onrender.com/api/reports" | \
        jq '.reports[] | select(.media.audio_url != null) | {video_id, title, audio_url: .media.audio_url}'
    
  • Spot-check specific audio files:

    curl -I "https://ytv2-dashboard-postgres.onrender.com/exports/audio/SAMPLE_VIDEO_ID.mp3"
    
  • Test the Dashboard UI: Verify that audio players appear and function correctly.

5. Error Handling & Recovery: What Happens When Things Go Wrong

Even the best scripts can run into trouble. Here's how we're prepared:

  • Network Failures: Each upload is wrapped in a try/except, so a network glitch fails that single file instead of aborting the run. Retries with exponential backoff are handled by a follow-up pass over the failure CSV, sketched after this list.
  • Rate Limiting: The script respects the Render service limits with a configurable delay between batches.
  • Partial Failures: If some uploads fail, the script continues processing the remaining files and logs the failures to failed_mp3_uploads.csv for a retry.
  • Resume Capability: An interrupted run can simply be re-run; re-uploading files that already made it across should be harmless as long as the /ingest/audio endpoint upserts by video_id.
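
Here is a minimal retry-pass sketch with exponential backoff. It assumes backfill_mp3s.py is importable from the working directory and that failed_mp3_uploads.csv has the video_id,filename layout written by generate_report; the script name retry_failed.py, the source path, and the attempt limit are illustrative:

#!/usr/bin/env python3
"""Retry pass: re-upload files listed in failed_mp3_uploads.csv with backoff."""
import csv
import os
import time
from pathlib import Path

from backfill_mp3s import MP3Backfiller  # the class defined in section 3.1

SOURCE_DIR = Path("./data/exports")  # same directory as --source in the main run
TARGET_URL = "https://ytv2-dashboard-postgres.onrender.com"

backfiller = MP3Backfiller(str(SOURCE_DIR), TARGET_URL, os.environ["INGEST_TOKEN"])

with open("failed_mp3_uploads.csv") as f:
    failed = [row["filename"] for row in csv.DictReader(f)]

for filename in failed:
    mp3_path = SOURCE_DIR / filename
    if not mp3_path.exists():
        print(f"⚠️ missing locally, skipping: {filename}")
        continue
    for attempt in range(4):
        if backfiller.upload_mp3(mp3_path):
            print(f"✅ {filename}")
            break
        if attempt < 3:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    else:
        print(f"❌ still failing after 4 attempts: {filename}")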

6. Expectations and Acceptance Criteria: What We Aim For

Here's what we expect and the criteria for success:

6.1. Expected Volumes:

  • Total MP3s: Roughly 81 files (based on the content count).
  • Processing Time: The script should complete within approximately 10-15 minutes, including the rate-limiting delays.
  • Success Target: We're aiming for a success rate of at least 95%.

6.2. Acceptance Criteria:

  • ✅ The script processes all MP3 files in the source directory.
  • ✅ At least 95% of the MP3s are successfully uploaded to the PostgreSQL service.
  • ✅ Failed uploads are logged in a CSV file for manual review and retry.
  • ✅ The PostgreSQL /api/reports includes media.audio_url for all uploaded files (a bulk check is sketched below).
  • ✅ Sample audio files are playable via /exports/audio/<file> URLs.
  • ✅ The Dashboard UI correctly displays HTML5 audio players for the backfilled content.
  • ✅ Spot-check verification confirms that the oldest, newest, and random sample files work.
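
The media.audio_url criterion can be checked in bulk with a short script against /api/reports. This sketch assumes the response shape used in section 4.3 (a top-level reports array whose entries carry media.audio_url and video_id fields):

#!/usr/bin/env python3
"""Bulk-check the audio_url acceptance criterion via /api/reports."""
import requests

resp = requests.get(
    "https://ytv2-dashboard-postgres.onrender.com/api/reports", timeout=30
)
resp.raise_for_status()
reports = resp.json()["reports"]

missing = [r for r in reports if not (r.get("media") or {}).get("audio_url")]
print(f"{len(reports) - len(missing)}/{len(reports)} reports have media.audio_url")
for r in missing:
    print(f"missing audio: {r.get('video_id')}")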

7. Environment Setup: Get Ready to Run

Before you start, make sure the INGEST_TOKEN environment variable is exported (the execution examples above reference it as $INGEST_TOKEN). You'll need to replace <postgres_service_token> with the actual token value.

export INGEST_TOKEN=<postgres_service_token>

8. File Locations: Where to Find Everything

  • Script: The Python script is located at /Users/markdarby/projects/YTV_temp_NAS_files/backfill_mp3s.py.
  • Source: The MP3 files are in /Users/markdarby/projects/YTV_temp_NAS_files/data/exports/*.mp3.
  • Log Output: Any failed uploads will be saved to failed_mp3_uploads.csv.

9. Git Workflow: Keep Things Organized

Remember to commit your work. Use the following message:

  • Commit "T-Y020B: Add MP3 backfill script for PostgreSQL service migration"

  • Location: The script was created in a temporary directory for local execution.

10. Dependencies and Next Steps

  • Dependencies: This project relies on T-Y020C (ingest endpoints), which is marked as completed. ✅
  • Next: We can now enable T-Y020D (audio URL wiring) for testing.