The comprehensive processor can extract YouTube video subtitles/transcripts for better topic analysis.
The YouTubeSubtitleExtractor tries multiple methods to get subtitles:
- youtube-transcript-api (Primary method)
  - Uses the community API to fetch transcripts
  - Works with auto-generated and manual captions
  - Supports multiple languages (prefers English)
  - No API key required
- YouTube Data API (If API key provided)
  - Official Google API
  - More reliable but requires API key
  - Subject to quota limits
- Page parsing (Fallback)
  - Parses YouTube page HTML
  - Limited implementation
  - Last resort method
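The fallback order described above can be sketched roughly as follows. This is an illustration of the control flow only, not the actual YouTubeSubtitleExtractor code: `extract_with_fallback`, the `Extractor` signature, and the three stand-in functions are all hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical signature: an extractor takes a video ID and returns text or None.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    video_id: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, extractor) pair in order; return (text, method name)."""
    for name, extractor in extractors:
        try:
            text = extractor(video_id)
        except Exception:
            # One failing method should not stop the chain.
            continue
        if text:
            return text, name
    return None, None


# Stand-in extractors that only illustrate the control flow.
def blocked_transcript_api(video_id: str) -> Optional[str]:
    raise RuntimeError("YouTube is blocking requests from your IP")


def data_api_without_key(video_id: str) -> Optional[str]:
    return None  # skipped: no API key was provided


def page_parser(video_id: str) -> Optional[str]:
    return "caption text recovered from the page"


result = extract_with_fallback(
    "abc123",
    [
        ("transcript_api", blocked_transcript_api),
        ("data_api", data_api_without_key),
        ("page_parsing", page_parser),
    ],
)
print(result)
```

Returning the method name alongside the text makes it easy to log which method succeeded for each video.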
HTTP 429 Error: Too many requests
- YouTube blocks IPs that make too many requests
- Cloud provider IPs (AWS, GCP, Azure) are often blocked by default
- Wait 10-60 minutes before retrying
- Consider using a YouTube API key for better reliability
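If you script your own retries, the waiting advice above might look something like this sketch. `RateLimitedError` is a hypothetical stand-in for whatever error your fetch function raises on HTTP 429, and the wait schedule (10, 30, then 60 minutes) is an assumption chosen to fit the 10-60 minute guidance.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimitedError(Exception):
    """Hypothetical stand-in for the error raised on an HTTP 429 response."""


def fetch_with_waits(
    fetch: Callable[[str], str],
    video_id: str,
    waits_seconds: Sequence[int] = (600, 1800, 3600),  # assumed schedule
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch once, then retry after each wait; return None if all fail."""
    for wait in (0, *waits_seconds):
        if wait:
            sleep(wait)
        try:
            return fetch(video_id)
        except RateLimitedError:
            continue
    return None


# Demonstration with a fake fetcher and a recording sleep (no real waiting).
attempts = []


def flaky_fetch(video_id: str) -> str:
    attempts.append(video_id)
    if len(attempts) < 3:
        raise RateLimitedError("Too many requests")
    return "transcript text"


slept = []
text = fetch_with_waits(flaky_fetch, "abc123", sleep=slept.append)
print(text, slept)
```

Passing `sleep` as a parameter keeps the helper testable without actually pausing.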
- Use YouTube Data API Key:
    python scripts/comprehensive_processor_cli.py /path/to/graph \
        --youtube-api-key YOUR_API_KEY
- Process in batches:
  - Don't process your entire library at once
  - Process 10-20 videos at a time
  - Wait between batches
- Working around IP bans:
  - Use a residential IP (not cloud/datacenter)
  - Use a VPN to change IP address
  - Add delays between requests
  - See: https://github.com/jdepoix/youtube-transcript-api#working-around-ip-bans
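For anyone driving the processing from their own script, batching with delays could be sketched as below. The function names and the default values (15 videos per batch, 2 seconds between requests, 5 minutes between batches) are assumptions for illustration; the processor's own batching behavior may differ.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_in_batches(
    video_ids: Sequence[str],
    process_one: Callable[[str], None],
    batch_size: int = 15,                 # assumed, within the 10-20 guidance
    delay_between_requests: float = 2.0,  # assumed value, in seconds
    delay_between_batches: float = 300.0, # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process videos batch by batch, pausing between requests and batches."""
    processed = 0
    batches = list(batched(video_ids, batch_size))
    for index, batch in enumerate(batches):
        for position, video_id in enumerate(batch):
            process_one(video_id)
            processed += 1
            if position < len(batch) - 1:
                sleep(delay_between_requests)
        if index < len(batches) - 1:
            sleep(delay_between_batches)
    return processed


# Demonstration with recording stand-ins instead of real work and real sleeps.
pauses = []
handled = []
total = process_in_batches(
    [f"video{i}" for i in range(25)], handled.append, batch_size=10, sleep=pauses.append
)
print(total, pauses.count(2.0), pauses.count(300.0))
```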
Only subtitles longer than the minimum length are used for analysis:
--min-subtitle-length 100  # Default: 100 characters
If subtitles aren't available or fail to extract:
- Topic analysis falls back to video title only
- Still creates useful topics based on title keywords
- Less comprehensive than with full transcripts
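The length check and title fallback can be pictured with a small sketch. `choose_analysis_text` is a hypothetical helper, and how the real processor combines title and transcript is not specified here; this version simply returns the transcript when it is usable and the title otherwise.

```python
from typing import Optional

# Mirrors the documented --min-subtitle-length default.
DEFAULT_MIN_SUBTITLE_LENGTH = 100


def choose_analysis_text(
    title: str,
    subtitles: Optional[str],
    min_length: int = DEFAULT_MIN_SUBTITLE_LENGTH,
) -> str:
    """Return the subtitles if they exceed min_length, else fall back to the title."""
    if subtitles is not None:
        cleaned = subtitles.strip()
        if len(cleaned) > min_length:
            return cleaned
    return title


print(choose_analysis_text("Intro to Graph Databases", None))
print(len(choose_analysis_text("Intro to Graph Databases", "word " * 50)))
```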
Check processing statistics to see subtitle extraction success:
stats = processor.run()
print(f"Subtitles extracted: {stats['stats']['subtitles_extracted']}")
print(f"Videos processed: {stats['stats']['videos_enhanced']}")
Currently uses youtube-transcript-api v1.2.3:
- Instance-based API (not class methods)
- Uses api.list(video_id) and transcript.fetch()
- Returns FetchedTranscriptSnippet objects with .text attribute
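A minimal sketch of that instance-based flow is shown below. It assumes the youtube-transcript-api 1.x interface; `fetch_transcript_text` and `join_snippets` are hypothetical helper names, and selecting the language with `find_transcript` is one plausible choice rather than a description of what the extractor actually does.

```python
from types import SimpleNamespace
from typing import Iterable, Sequence


def join_snippets(snippets: Iterable) -> str:
    """Join each snippet's .text into one whitespace-normalized string."""
    return " ".join(" ".join(s.text.split()) for s in snippets if s.text.strip())


def fetch_transcript_text(video_id: str, languages: Sequence[str] = ("en",)) -> str:
    """Fetch a transcript over the network and return it as plain text."""
    # Imported lazily so join_snippets works without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    api = YouTubeTranscriptApi()          # instance-based, not class methods
    transcript_list = api.list(video_id)  # transcripts available for the video
    transcript = transcript_list.find_transcript(list(languages))
    fetched = transcript.fetch()          # iterable of FetchedTranscriptSnippet
    return join_snippets(fetched)


# Offline demonstration using stand-in objects that expose a .text attribute.
demo = join_snippets(
    [
        SimpleNamespace(text="Hello\nworld"),
        SimpleNamespace(text="   "),
        SimpleNamespace(text="again"),
    ]
)
print(demo)
```

Calling `fetch_transcript_text("QdnxjYj1pS0")` performs a real request, so it is subject to the same rate limits described earlier.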
- Start small: Test with a few videos first
- Use API key: More reliable for large batches
- Monitor logs: Watch for rate limit warnings
- Be patient: Wait if you hit rate limits
- Process incrementally: The processor skips already-processed videos
Successful extraction:
INFO - Successfully extracted 23942 chars from transcript API
INFO - Extracted 23942 chars of subtitles
Rate limited:
WARNING - All subtitle extraction methods failed for video: QdnxjYj1pS0
DEBUG - Could not get transcripts: YouTube is blocking requests from your IP
No subtitles available:
DEBUG - No transcripts available for video abc123
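To get a quick overview from a log file, the messages above can be tallied with a sketch like this one. The matching is based only on the sample lines shown here, so treat the substrings and the category names as assumptions that may need adjusting to your actual log output.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_log_line(line: str) -> Optional[str]:
    """Map a log line to an outcome category, or None if it is not relevant."""
    if "Successfully extracted" in line:
        return "extracted"
    if "blocking requests from your IP" in line:
        return "rate_limited"
    if "No transcripts available" in line:
        return "no_subtitles"
    if "All subtitle extraction methods failed" in line:
        return "failed"
    return None


def tally_outcomes(lines: Iterable[str]) -> Counter:
    """Count how often each outcome category appears in the given lines."""
    outcomes = (classify_log_line(line) for line in lines)
    return Counter(outcome for outcome in outcomes if outcome is not None)


sample_lines = [
    "INFO - Successfully extracted 23942 chars from transcript API",
    "INFO - Extracted 23942 chars of subtitles",
    "WARNING - All subtitle extraction methods failed for video: QdnxjYj1pS0",
    "DEBUG - Could not get transcripts: YouTube is blocking requests from your IP",
    "DEBUG - No transcripts available for video abc123",
]
summary = tally_outcomes(sample_lines)
print(dict(summary))
```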
Test subtitle extraction on a single video:
python test_subtitle_extraction.py
Note: If you've been testing multiple times, you may be temporarily rate-limited.