# Football Data Sources & Web Scraping Methods

## ⚠️ Core Tool: lynx (2026-06-10 Breakthrough)

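lynx is a text-mode browser that runs on the server and dumps a page as plain text. It does not execute JavaScript, so it only helps where the server ships the data in the initial HTML; in practice that covers several sites that look JS-heavy in a normal browser (see the table below).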

```bash
# Basic usage
lynx -dump -nolist <url>
# Wide pages (odds tables, etc.)
lynx -dump -nolist -width=200 <url>
```
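
When these commands are called from a script, a thin wrapper keeps the flags consistent. The following is a minimal sketch, assuming lynx is installed and on `PATH`; `build_lynx_cmd` and `lynx_dump` are hypothetical helper names, not part of lynx itself.

```python
import subprocess

def build_lynx_cmd(url, width=None):
    """Return the argv list for a plain-text dump of `url`."""
    cmd = ["lynx", "-dump", "-nolist"]
    if width is not None:
        cmd.append(f"-width={width}")  # wider output keeps odds tables on one line
    cmd.append(url)
    return cmd

def lynx_dump(url, width=None, timeout=30):
    """Run lynx and return its text output, or None if it fails or times out."""
    try:
        proc = subprocess.run(
            build_lynx_cmd(url, width),
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None  # lynx is missing, or the site hung
    return proc.stdout if proc.returncode == 0 else None
```

Returning `None` on failure lets the caller move on to the next source instead of parsing an empty or partial dump.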

### lynx-Verified Sites (As of 2026-06-10)

| Site | lynx rating | Notes |
|------|-------------|-------|
| FlashScore (flashscore.com) | ★★★ | Match detail pages have returned errors since 2026-06-10 and the main page barely loads. **No longer reliable!** |
| LiveScore (livescore.com) | ★★★★ | Match list + JSON API readable |
| NowGoal (nowgoal.net) | ★★★★★ | Scores, odds tables, match status — all readable |
| Goal.com (goal.com) | ★★★★ | curl+bs4 can parse matches JSON (future matches only) |
| Wikipedia | ★★★★★ | Team history, FIFA rankings, recent results |
| OddsPortal | ☆☆☆☆☆ | Timeout + JS render; curl and lynx both fail |
| bet365 | ☆☆☆☆☆ | Cloudflare full block |
| Google/Bing | ★ | JS forced redirect |

### Jina AI Fallback

```bash
curl -s "https://r.jina.ai/https://TARGET_SITE" -H "User-Agent: Mozilla/5.0"
```

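Prefixing a URL with `https://r.jina.ai/` asks Jina AI's reader service to fetch the page and return it as text. This gets past some JS-only pages, but results vary by site.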
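
The same fallback can be called from Python. This is a minimal sketch that mirrors the curl command above; `jina_url` and `fetch_via_jina` are hypothetical helper names, and the timeout value is an assumption.

```python
import urllib.request

JINA_PREFIX = "https://r.jina.ai/"

def jina_url(target_url):
    """Prefix the target URL with the reader endpoint, as in the curl command."""
    return JINA_PREFIX + target_url

def fetch_via_jina(target_url, timeout=20):
    """Return the reader's text output, or None on any network or HTTP error."""
    req = urllib.request.Request(
        jina_url(target_url), headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return None
```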

## Quick Match Lookup (2026-06-10 Optimized)

**Today/future matches (fastest):**
```bash
# FlashScore has been unreliable since 2026-06-10; use the Goal.com line if it fails
lynx -dump -nolist -width=300 https://www.flashscore.com/
curl -s https://www.goal.com/en/live-scores -H "User-Agent: Mozilla/5.0"
```

**Completed match scores:**
```bash
lynx -dump -nolist https://www.nowgoal.net/oddscomp/{match_id}
```

**Team history:**
```bash
lynx -dump -nolist https://en.wikipedia.org/wiki/{team}_national_football_team
```

## Goal.com — JSON Match Data Extraction

```python
import json
import urllib.request

url = "https://www.goal.com/en/live-scores"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=10) as resp:
    html = resp.read().decode('utf-8', errors='replace')

# Locate the "matches" key, then the array that follows it.
key_pos = html.find('"matches"')
bracket_start = html.find('[', key_pos) if key_pos != -1 else -1
if bracket_start == -1:
    raise ValueError('no "matches" array found; the page layout may have changed')

# raw_decode parses exactly one JSON value starting at bracket_start. Unlike
# manual bracket counting, it is not thrown off by '[' or ']' inside strings.
matches, _end = json.JSONDecoder().raw_decode(html, bracket_start)
# Each match: startDate, status (RESULT/LIVE/''), teamA, teamB, score
```

**Important:** Goal.com only shows FUTURE matches, not completed ones. Use NowGoal for completed matches.
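
Once `matches` is parsed, grouping by status makes it easy to pick out what is still to be played. The sketch below assumes each match is a dict carrying the `status` and `startDate` fields listed in the comment above, and that an empty status means "not started"; the `UPCOMING` label, the `split_by_status` helper, and the sample records are hypothetical.

```python
def split_by_status(matches):
    """Group match dicts by their 'status' field ('' is treated as not started)."""
    groups = {"RESULT": [], "LIVE": [], "UPCOMING": []}
    for m in matches:
        status = m.get("status") or "UPCOMING"
        groups.setdefault(status, []).append(m)
    return groups

# Hypothetical sample records, for illustration only.
sample = [
    {"startDate": "2026-06-11T19:00:00Z", "status": "", "teamA": "A", "teamB": "B"},
    {"startDate": "2026-06-11T16:00:00Z", "status": "LIVE", "teamA": "C", "teamB": "D"},
]
groups = split_by_status(sample)
# ISO 8601 timestamps in the same zone sort correctly as plain strings.
upcoming = sorted(groups["UPCOMING"], key=lambda m: m["startDate"])
```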

## NowGoal Odds Pages

```bash
lynx -dump -nolist -width=200 https://www.nowgoal.net/oddscomp/{match_id}
```

lynx can read NowGoal's odds tables directly. If lynx shows "---" in place of odds numbers, the market for that match has not opened yet or has already closed.
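
The "---" check can be automated when scanning a dump line by line. This is a rough sketch, assuming each odds row is whitespace-separated text with decimal odds or "---" placeholders; the actual column layout of the NowGoal dump is not confirmed here, and `parse_odds_tokens`, `odds_open`, and the sample rows are hypothetical.

```python
import re

_NUM = re.compile(r"^\d+(?:\.\d+)?$")

def parse_odds_tokens(line):
    """Return numeric tokens as floats and '---' placeholders as None; skip the rest."""
    values = []
    for tok in line.split():
        if tok == "---":
            values.append(None)
        elif _NUM.match(tok):
            values.append(float(tok))
    return values

def odds_open(line):
    """True only if the row has at least one value and none is a '---' placeholder."""
    values = parse_odds_tokens(line)
    return bool(values) and all(v is not None for v in values)

# Hypothetical rows, for illustration only.
open_row = "BookmakerX 1.85 3.40 4.20"
closed_row = "BookmakerX --- --- ---"
```

A row that fails `odds_open` should be reported as unavailable rather than filled in from another source without saying so.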

## 500.com Correct Usage

| URL | Access | Content |
|-----|--------|---------|
| `www.500.com` (mobile) | ✅ Homepage only | Match results, preview news |
| `liansai.500.com/zuqiu-19476/` | ✅ No login needed | World Cup schedule + European odds |
| `odds.500.com` | ❌ Requires login | Asian handicap / over-under |
| `www.500.com/soccer/` | ❌ 404 | All sub-pages 404 |

## Data Source Priority

1. **NowGoal** — lynx-readable scores + odds tables
2. **Goal.com** — curl+bs4 JSON for match schedules
3. **Wikipedia** — team history, rankings, recent results
4. **FIFA API** — `api.fifa.com/api/v1/ranking`
5. **Polymarket Gamma API** — World Cup markets only
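
The priority order above amounts to a fallback chain: try each source in turn and stop at the first one that returns data. A minimal sketch follows, in which `first_available` is a hypothetical helper and the fetchers are stand-in callables rather than real scrapers.

```python
def first_available(sources):
    """Try (name, fetch) pairs in priority order; return (name, data) or (None, None)."""
    for name, fetch in sources:
        try:
            data = fetch()
        except Exception:
            continue  # treat any failure as "this source is unavailable"
        if data:
            return name, data
    return None, None

# Stand-in fetchers, for illustration only.
def failing_fetch():
    raise OSError("simulated timeout")

result = first_available([
    ("NowGoal", failing_fetch),
    ("Goal.com", lambda: {"matches": 3}),
])
```

When every source fails the function returns `(None, None)`, so the caller can report that nothing was found instead of guessing.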

## Server Network Limitations

- ✅ Reachable: Wikipedia, Polymarket, arXiv, Goal.com
- ❌ Blocked: Google, Reddit, Bilibili, YouTube, ESPN, Transfermarkt, Sofascore API (403)
- ⚠️ Bing/Google search: results are rendered by JS, so curl returns only the search box

## Pitfalls & Lessons

- **Match IDs change per date** — don't reuse an ID across different match days
- **Cron uses UTC** — Beijing time = UTC+8. Schedule cron jobs accordingly.
- **Cron may not trigger** — always verify with `cronjob(action='list')`
- **Don't select matches after 22:00 Beijing time** — 张哥 goes to sleep
- **Friendly/small matches** usually have no public odds — don't waste time
- **Never fabricate odds numbers** — if 2-3 methods fail, say "can't find them"
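
The UTC offset in the cron pitfall above is easy to get wrong by hand, especially for early-morning Beijing times that fall on the previous UTC day. The sketch below converts a daily Beijing wall-clock time into the UTC fields of a standard five-field cron expression; `beijing_to_utc_cron` is a hypothetical helper, and it assumes the scheduler really does interpret times as UTC.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # China has no DST, so a fixed offset is enough

def beijing_to_utc_cron(hour, minute=0):
    """Return (cron_expr, day_shift) for a daily job at hour:minute Beijing time."""
    local = datetime(2026, 1, 2, hour, minute, tzinfo=BEIJING)  # arbitrary reference date
    utc = local.astimezone(timezone.utc)
    day_shift = (utc.date() - local.date()).days  # -1 if UTC falls on the previous day
    return f"{utc.minute} {utc.hour} * * *", day_shift

evening = beijing_to_utc_cron(20, 30)  # 20:30 Beijing
morning = beijing_to_utc_cron(6)       # 06:00 Beijing
```

For a daily job only the expression matters. If the job is pinned to a day of the week or month, shift that field by `day_shift` as well.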
