# Real News Video Sourcing Guide

Tested 2026-05-11 on Ubuntu 22.04 VPS.

## Working Sources (Direct MP4 via curl)

### AP News (JWPlayer CDN)
- URL pattern: `https://cdn.jwplayer.com/videos/{id}.mp4`
- Resolution: 480x270, H.264, ~200-500 kbps
- Method: `curl -sL -H "Referer: https://apnews.com/" "<url>" -o clip.mp4`
- Scrape: `curl -sL 'https://apnews.com/hub/videos' | grep -oP 'jwplayer\.com/videos/[^\s"\x27<>]+\.mp4'` (`\x27` stands for a single quote, which cannot be escaped inside a single-quoted shell string)
- Reliability: HIGH — direct MP4 downloads work consistently
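
The scrape step above can be wrapped in a small function so that it also works on saved HTML. This is a minimal sketch: `extract_jw_mp4` is a hypothetical helper name, and matching the full `https://cdn.jwplayer.com` prefix is an assumption based on the URL pattern listed above, not a verified property of the page markup.

```bash
# Hypothetical helper: read HTML on stdin, print unique JWPlayer MP4 URLs.
extract_jw_mp4() {
  # \x27 is PCRE for a single quote, which keeps the whole pattern inside
  # one single-quoted shell string. Requires GNU grep for -P.
  grep -oP 'https://cdn\.jwplayer\.com/videos/[^\s"\x27<>]+\.mp4' | sort -u
}
```

Possible usage: `curl -sL 'https://apnews.com/hub/videos' | extract_jw_mp4`. Duplicate links collapse to one line each, so the output can be fed straight into a download loop.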

### India Today (tosshub CDN)
- URL pattern: `https://video-indiatoday.tosshub.com/indiatoday/video/YYYY_MM/tooltips/{filename}-preview.mp4`
- Resolution: 360x202, H.264, ~100-250 kbps (preview quality)
- Method: `curl -sL -H "Referer: https://www.indiatoday.in/" "<url>" -o clip.mp4`
- Scrape: `curl -sL 'https://www.indiatoday.in/videos' | grep -oP 'tosshub\.com[^\s"\x27<>]+\.mp4'` (`\x27` stands for a single quote, which cannot be escaped inside a single-quoted shell string)
- Reliability: MEDIUM — only preview clips available, full videos may need different URL pattern
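
When a file name is already known, the URL pattern above can be filled in with a small helper. This is only a sketch: `tosshub_preview_url` is a hypothetical name, `sample-clip` in the usage line is a made-up file name, and treating `YYYY_MM` as the clip's publication year and month is an assumption the pattern itself does not confirm.

```bash
# Hypothetical helper: build a preview URL from year, month, and file name,
# following the URL pattern listed above.
tosshub_preview_url() {
  tp_year="$1"; tp_month="$2"; tp_name="$3"
  # Zero-pad a single-digit month so 5 becomes 05.
  case "$tp_month" in
    ?) tp_month="0$tp_month" ;;
  esac
  printf 'https://video-indiatoday.tosshub.com/indiatoday/video/%s_%s/tooltips/%s-preview.mp4\n' \
    "$tp_year" "$tp_month" "$tp_name"
}
```

For example, `tosshub_preview_url 2026 5 sample-clip` prints a URL whose path contains `2026_05/tooltips/sample-clip-preview.mp4`.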

## Non-Working Sources

### YouTube (yt-dlp broken)
- System yt-dlp version 2022.04.08 is too old
- `yt-dlp -U` fails (apt-managed package)
- Error: "Unable to extract Initial JS player n function name"
- Workaround: Use browser to find CC-licensed videos, or use alternative download sites
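
A quick way to detect this situation on another machine is to compare the installed version against a cutoff date. The sketch below assumes yt-dlp's date-style version strings (`YYYY.MM.DD`, as in `2022.04.08`); `ytdlp_older_than` is a hypothetical helper name and the cutoff in the usage line is arbitrary.

```bash
# Hypothetical helper: succeed if version $1 is older than version $2.
# Assumes date-style versions (YYYY.MM.DD), as in 2022.04.08.
ytdlp_older_than() {
  # Keep the first three dot-separated fields, then drop the dots so the
  # versions compare as plain integers (2022.04.08 becomes 20220408).
  yo_have=$(printf '%s' "$1" | cut -d. -f1-3 | tr -d '.')
  yo_want=$(printf '%s' "$2" | cut -d. -f1-3 | tr -d '.')
  [ "$yo_have" -lt "$yo_want" ]
}
```

Possible usage: `ytdlp_older_than "$(yt-dlp --version)" 2024.01.01 && echo "yt-dlp is stale"`. One option that was not tested for this guide is a per-user install with `python3 -m pip install --user -U yt-dlp`, which leaves the apt package untouched.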

### African News Sources

African news video sources are extremely limited for programmatic download:

| Source | Result | Notes |
|--------|--------|-------|
| NTA Nigeria | JSON metadata only | URL pattern looks like MP4 but returns JSON |
| SABC News | YouTube embeds | No direct MP4, videos are on YouTube |
| Africanews | No direct MP4 | JS-rendered player |
| Al Jazeera Africa | No direct MP4 | Same as main Al Jazeera site |

**Recommendation for African footage:** Use AP News clips that cover African topics, or use a browser to manually find Creative Commons African news footage on YouTube. Reuters also covers African topics, but it has no direct MP4 links (see below). Alternatively, consider filming original B-roll if the channel has local contributors.

### BBC, Reuters, Al Jazeera, DW, France24
- All use JS-rendered video players
- No direct MP4 links in raw HTML
- Would require browser automation + network interception to extract

## Transcoding Pipeline

```bash
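# Normalize any downloaded clip to 1280x720 and keep only the first 10 seconds (-t 10).
# scale fits the frame inside 1280x720 without changing its aspect ratio;
# pad centers it on a 1280x720 canvas (black bars fill any remainder).
# The sources above are 480x270 or 360x202, so this upscales and adds no detail.
ffmpeg -y -i raw.mp4 -t 10 \
  -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" \
  -c:v libx264 -preset ultrafast -crf 23 -c:a aac -b:a 128k \
  processed.mp4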
```
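
To apply the same settings to several clips, the command can be wrapped in functions. This is a sketch under assumptions: `fit_filter` and `normalize_clip` are hypothetical helper names, and the `raw/` and `processed/` directories in the usage line are placeholders.

```bash
# Hypothetical helper: build the scale+pad filter for a target width and height.
fit_filter() {
  printf 'scale=%s:%s:force_original_aspect_ratio=decrease,pad=%s:%s:(ow-iw)/2:(oh-ih)/2' \
    "$1" "$2" "$1" "$2"
}

# Hypothetical wrapper: same encoder settings as the command above.
# $1 is the input file, $2 is the output file.
normalize_clip() {
  ffmpeg -y -i "$1" -t 10 -vf "$(fit_filter 1280 720)" \
    -c:v libx264 -preset ultrafast -crf 23 -c:a aac -b:a 128k "$2"
}
```

Possible usage: `for f in raw/*.mp4; do normalize_clip "$f" "processed/$(basename "$f")"; done`. Keeping the filter in its own function makes it easy to try another target size, such as `fit_filter 1920 1080`.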

## Verification

Always verify downloads:
```bash
ffprobe -v error -show_entries format=duration,size:stream=codec_name,width,height clip.mp4
```

Files under 1 KB are likely error pages (HTML or JSON responses), not video.
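
The size rule can be checked automatically before running ffprobe. The sketch below combines the 1 KB floor with a magic-byte test; `looks_like_mp4` is a hypothetical helper name, and expecting the string `ftyp` at byte offset 4 is an assumption that holds for typical MP4 files but is not guaranteed for every valid file.

```bash
# Hypothetical helper: succeed only if the file is at least 1024 bytes
# and carries the "ftyp" marker at byte offset 4 (typical for MP4).
looks_like_mp4() {
  lm_file="$1"
  [ -f "$lm_file" ] || return 1
  # Arithmetic expansion strips any padding that wc may print.
  lm_size=$(( $(wc -c < "$lm_file") ))
  [ "$lm_size" -ge 1024 ] || return 1
  # Read bytes 4..7; tr drops NUL bytes so the shell does not warn about them.
  lm_magic=$(dd if="$lm_file" bs=1 skip=4 count=4 2>/dev/null | tr -d '\000')
  [ "$lm_magic" = "ftyp" ]
}
```

Possible usage: `looks_like_mp4 clip.mp4 || echo "clip.mp4 is probably an error page"`. A file that passes should still go through the ffprobe check, since this test only looks at the size and the first few bytes.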
