---
name: web-research-extraction
description: "Web research and structured data extraction — search, scrape, parse, and save data from websites using Python urllib and regex. When you need to look up information from the web, extract structured data (tables, lists, schedules), and save it for later use."
version: 1.0.0
author: 星璇
tags: [web-research, data-extraction, web-scraping, python, urllib, regex, wikipedia]
platforms: [linux, macos, windows]
---

# Web Research & Data Extraction

Search the web, extract structured data from HTML pages, and save it in reusable formats.

## When to Use

- User asks you to look up information from the web
- You need to extract structured data (tables, schedules, lists) from websites
- You need to save research results to files for later reference
- Other skills (arxiv, polymarket, etc.) don't cover the data source

## Core Technique: Python urllib + Regex

The most reliable approach on restricted servers (no browser, no curl for some sites):

```python
import urllib.request
import re

url = "https://example.com/data-page"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=15) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Extract data with regex
pattern = r'your-regex-pattern'
results = re.findall(pattern, html)
```

### Why urllib over curl?
- Works when curl is blocked or times out
- Better error handling in Python
- Can parse HTML inline without piping through grep/sed
- Timeout control
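
The error-handling and timeout points above can be sketched as a small wrapper. This is a minimal sketch, not a fixed part of this skill: `fetch_html` is a hypothetical helper name, and the 15-second default simply mirrors the snippet above.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_html(url: str, timeout: float = 15.0, encoding: str = "utf-8") -> Optional[str]:
    """Fetch a URL and return decoded text, or None if the request fails.

    `fetch_html` is a hypothetical helper, not part of urllib.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode(encoding, errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (403, 404, 412, ...).
        print(f"HTTP {exc.code} for {url}")
    except OSError as exc:
        # URLError, socket timeouts, and refused connections all derive from OSError.
        print(f"Request failed for {url}: {exc}")
    return None
```

Returning `None` instead of raising lets the caller move on to the next tool or data source without wrapping every call in its own `try` block.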

## Data Extraction Patterns

### Wikipedia Tables
Wikipedia HTML uses CSS classes to identify table elements:

| Class | Meaning |
|-------|---------|
| `fhome` | Home team in sports tables |
| `faway` | Away team in sports tables |
| `fscore` | Match number/score |
| `mw-heading` | Section headers (h2, h3) |
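
As an illustration of using these classes, the sketch below pulls the text of a cell by its class name. The sample markup is simplified and hypothetical (real Wikipedia rows carry more attributes), and `cell_text` is an illustrative helper name.

```python
import html as html_module
import re

# Simplified, hypothetical markup with placeholder team names.
SAMPLE = (
    '<tr><th class="fhome"><a href="/wiki/Team_A">Team A</a></th>'
    '<th class="fscore">Match 1</th>'
    '<th class="faway"><a href="/wiki/Team_B">Team B</a></th></tr>'
)


def cell_text(fragment: str, css_class: str) -> str:
    """Return the visible text of the first <td>/<th> cell carrying css_class."""
    m = re.search(
        rf'class="{re.escape(css_class)}"[^>]*>(.*?)</t[dh]>', fragment, re.DOTALL
    )
    if not m:
        return ""
    text = re.sub(r"<[^>]+>", "", m.group(1))  # drop nested tags such as <a>
    return html_module.unescape(text).strip()


row = {name: cell_text(SAMPLE, name) for name in ("fhome", "fscore", "faway")}
print(row)
```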

### Extracting Group-Stage Sports Data
```python
# Find all group headers and their positions
group_headers = [(m.start(), m.group(1)) for m in re.finditer(r'<h3[^>]*>Group ([A-L])</h3>', html)]

# For each group, extract teams from its section
for i, (pos, gname) in enumerate(group_headers):
    end = group_headers[i+1][0] if i+1 < len(group_headers) else len(html)
    section = html[pos:end]
    teams = re.findall(r'<a[^>]*national[^"]*"[^>]*>([^<]+)</a>', section)
    unique_teams = list(dict.fromkeys(teams))  # deduplicate preserving order
```

## Save Pattern: File + Memory

Always save important research results in TWO places:

### 1. File (persistent, shareable)
```python
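# Save as markdown for readability; set the encoding explicitly so
# non-ASCII text (team names, Chinese notes) is written consistently
with open('/root/filename.md', 'w', encoding='utf-8') as f:
    f.write(formatted_data)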
```

### 2. Memory (searchable in future sessions)
```
memory(action='add', target='memory', content='Key facts from the research...')
```
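
One way to combine the two steps is sketched below: write the full result to a file, then build a short note for memory that points back to it. `save_research` is a hypothetical helper, and the 2200-character cap is the approximate memory limit mentioned under Common Pitfalls, not a verified constant.

```python
from pathlib import Path

MEMORY_CHAR_LIMIT = 2200  # approximate limit; treat as an assumption


def save_research(path: str, full_text: str, summary: str) -> str:
    """Write full_text to path and return a memory-sized note referencing it."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(full_text, encoding="utf-8")
    # Put the file path first so it survives truncation of a long summary.
    note = f"[{target}] {summary}"
    if len(note) > MEMORY_CHAR_LIMIT:
        note = note[: MEMORY_CHAR_LIMIT - 3] + "..."
    return note
```

The returned string is what would be passed as `content` to the memory call; the details stay in the file.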

## Research Data Presentation Standards

### ⚠️ Critical Rule: Pure Objective Research = NO Personal Context

When the user asks you to do **purely objective research** (e.g., "survey all video platforms that pay"), you MUST:

1. **ZERO personal context** — Do not mention the user's server location, VPN access, content preferences, or any personal situation. The research itself is neutral.
2. **NO content type splitting** — Do NOT break down a platform's earnings by content category (sports/funny/education/etc.) unless the user explicitly asks. One platform = one row of data.
3. **NO recommendations/analysis** — Pure data tables only. Do not add "this platform is best for you" or "suitable for X type of content." Let the user draw their own conclusions.
4. **One row per entity** — Each platform gets exactly one row in the comparison table. No sub-rows for different content types on the same platform.

### Data Verification Protocol

1. **Cross-reference before presenting** — If a number seems off, verify it from at least 2 independent sources before putting it in the final output.
2. **Real creator feedback > official claims** — Official platform marketing often inflates numbers. Real user reports (Reddit, forums, creator communities) are more reliable.
3. **Own the error** — If the user points out wrong data, don't argue. Go back and re-research properly from scratch.

### Table Format Rules

- Header row in **bold**
- One row per platform
- Columns ordered logically: Name → Country → Payout → Conditions → Competition
- No footnotes, no inline analysis, no "note:" sections unless the data genuinely needs explanation
- Sort by the metric the user cares about (e.g., payout descending)
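
The formatting rules above can be sketched as a small renderer. `to_markdown_table` is a hypothetical helper, and the rows in the usage lines are placeholders, not research data.

```python
from typing import Any, Dict, List, Optional


def to_markdown_table(
    rows: List[Dict[str, Any]],
    columns: List[str],
    sort_key: Optional[str] = None,
    descending: bool = True,
) -> str:
    """Render one row per entity with a bold header, optionally sorted."""
    if sort_key is not None:
        rows = sorted(rows, key=lambda r: r[sort_key], reverse=descending)
    header = "| " + " | ".join(f"**{c}**" for c in columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = ["| " + " | ".join(str(r[c]) for c in columns) + " |" for r in rows]
    return "\n".join([header, divider, *body])


# Placeholder values for illustration only.
example_rows = [
    {"Name": "Platform X", "Payout": 1.5},
    {"Name": "Platform Y", "Payout": 4.0},
]
print(to_markdown_table(example_rows, ["Name", "Payout"], sort_key="Payout"))
```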

## Time Zone Awareness

**Always use Beijing time (UTC+8) for communication with 张哥.**

Server timezone should be set to `Asia/Shanghai`:
```bash
timedatectl set-timezone Asia/Shanghai
```

When displaying match times or schedules, always convert to Beijing time.

### Extracting World Cup Match Schedules (Wikipedia footballbox)

Wikipedia stores each match in a `footballbox` div. Here's the complete extraction pattern:

```python
import urllib.request
import re
import html as html_module

url = "https://en.wikipedia.org/wiki/2026_FIFA_World_Cup"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=15) as resp:
    content = resp.read().decode("utf-8", errors="replace")

# Find all footballbox divs (each contains one match)
boxes = re.findall(r'<div[^>]*class="[^"]*footballbox[^"]*"[^>]*>(.*?)</div>\s*</div>', content, re.DOTALL)

for box in boxes:
    mid = re.search(r'Match (\d+)', box)
    if not mid:
        continue
    
    # Date: class="fdate"
    date_m = re.search(r'class="fdate"[^>]*>(.*?)</td>', box, re.DOTALL)
    # Check the match object (date_m), not date_str, which is not defined yet
    date_str = re.sub(r'<[^>]+>', '', date_m.group(1)).strip() if date_m else ''
    date_str = html_module.unescape(date_str)
    
    # Time + Timezone: <time> tag contains both
    time_str, tz_offset = '', None  # defaults when the box has no <time> tag
    time_m = re.search(r'<time[^>]*>(.*?)</time>', box, re.DOTALL)
    if time_m:
        raw = re.sub(r'<[^>]+>', '', time_m.group(1)).strip()
        raw = html_module.unescape(raw)
        tm = re.search(r'(\d{1,2}:\d{2}\s*[ap]\.m\.)', raw, re.IGNORECASE)
        time_str = tm.group(1) if tm else ''
        tz_m = re.search(r'UTC([−+-]\d+)', time_m.group(1))  # note: Unicode minus sign
        tz_offset = int(tz_m.group(1).replace('−', '-')) if tz_m else None
    
    # Teams: class="fhome" and class="faway"
    home_m = re.search(r'class="fhome"[^>]*>(.*?)</td>', box, re.DOTALL)
    away_m = re.search(r'class="faway"[^>]*>(.*?)</td>', box, re.DOTALL)
    
    def extract_team(block):
        texts = re.findall(r'>([^<]+)<', block)
        for t in texts:
            t = t.strip()
            if t and len(t) > 1 and not t.startswith('&#') and not t.startswith('Match'):
                return html_module.unescape(t)
        return ''
    
    home = extract_team(home_m.group(1)) if home_m else ''
    away = extract_team(away_m.group(1)) if away_m else ''
```

### Beijing Time Conversion

World Cup matches are played across multiple US/Canada/Mexico time zones.
Always convert to Beijing time (UTC+8) for 张哥:

| Venue | Local TZ | → Beijing |
|-------|----------|-----------|
| Mexico City | UTC-6 | +14h |
| Toronto | UTC-4 | +12h |
| Los Angeles | UTC-7 | +15h |
| Dallas/Houston | UTC-5 | +13h |
| New York/Boston | UTC-4 | +12h |

```python
def to_beijing(time_str, tz_offset):
    """Convert local match time to Beijing time (UTC+8)"""
    import re
    tm = re.search(r'(\d{1,2}):(\d{2})\s*([ap])\.?m\.?', time_str, re.IGNORECASE)
    if not tm or tz_offset is None:
        return time_str  # unparseable time or unknown offset: leave unchanged
    hour, minute = int(tm.group(1)), int(tm.group(2))
    # Convert the 12-hour clock to 24-hour
    if tm.group(3).upper() == 'P' and hour != 12:
        hour += 12
    elif tm.group(3).upper() == 'A' and hour == 12:
        hour = 0
    bj_hour = hour + (8 - tz_offset)
    day_note = ""
    if bj_hour >= 24:
        bj_hour -= 24
        day_note = " (+1 day)"
    return f"{bj_hour:02d}:{minute:02d}{day_note}"

**Key pitfall**: The UTC offset in Wikipedia HTML uses Unicode minus sign `−` (U+2212), NOT ASCII hyphen `-`. Always `.replace('−', '-')` before `int()`.
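
When the calendar date matters as well as the clock time, a `datetime`-based variant avoids tracking the day rollover by hand. The following is a sketch under assumptions: `parse_utc_offset` and `local_to_beijing` are hypothetical helpers, the date is assumed to be already normalised to ISO `YYYY-MM-DD`, and the sample input is illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))


def parse_utc_offset(text: str) -> int:
    """Parse 'UTC−6' or 'UTC-6' (Unicode or ASCII minus) into an integer."""
    m = re.search(r"UTC([−+-]\d+)", text)
    if not m:
        raise ValueError(f"no UTC offset in {text!r}")
    return int(m.group(1).replace("\u2212", "-"))


def local_to_beijing(date_str: str, time_str: str, offset_text: str) -> datetime:
    """Combine an ISO date, a 12-hour time, and a UTC offset into Beijing time."""
    m = re.fullmatch(r"\s*(\d{1,2}):(\d{2})\s*([ap])\.?m\.?\s*", time_str, re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognised time: {time_str!r}")
    # 12 a.m. -> 0, 12 p.m. -> 12, 1 p.m. -> 13
    hour = int(m.group(1)) % 12 + (12 if m.group(3).lower() == "p" else 0)
    day = datetime.strptime(date_str, "%Y-%m-%d")
    local_tz = timezone(timedelta(hours=parse_utc_offset(offset_text)))
    local = day.replace(hour=hour, minute=int(m.group(2)), tzinfo=local_tz)
    return local.astimezone(BEIJING)


result = local_to_beijing("2026-06-11", "1:00 p.m.", "UTC−6")
print(result.isoformat())  # 2026-06-12T03:00:00+08:00
```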

## Sports & Prediction Markets Data Sources

Domain-specific data extraction for football/sports analysis and betting odds research. Full reference: `references/sports-data-sources.md`

### Core Tool: lynx (JS-Rendered Page Viewer)

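`lynx -dump -nolist <url>` is often the only way to read live scores and odds tables on JS-heavy sites from the server. lynx does not execute JavaScript; it prints the HTML the server returns as plain text, so it helps only where the data is already present in that HTML.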

**Verified sports sites:**

| Site | lynx | curl | Notes |
|------|------|------|-------|
| **FlashScore** | ★★★ | ❌ | Main page live scores; match details unreliable since 2026-06-10 |
| **LiveScore** | ★★★★ | ✅ | JSON API readable |
| **NowGoal** | ★★★★★ | ❌ | Scores + odds tables fully readable; use `nowgoal.net/oddscomp/{match_id}` |
| **Goal.com** | ★★★ | ✅ curl+bs4 | JSON match data via `goal.com/en/live-scores` (future matches only) |
| **Wikipedia** | ★★★★★ | ✅ | Team history, FIFA rankings, footballbox parsing |
| **Polymarket** | — | ✅ | Gamma API for prediction markets |
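
To call lynx from Python rather than the shell, a wrapper along these lines may help. It is a sketch: `build_lynx_cmd` and `lynx_dump` are hypothetical helpers, and only the `-dump`, `-nolist`, and `-width` flags already used in this document are passed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_lynx_cmd(url: str, width: Optional[int] = None) -> List[str]:
    """Build the argument list for a plain-text dump of url."""
    cmd = ["lynx", "-dump", "-nolist"]
    if width is not None:
        cmd.append(f"-width={width}")  # wider output keeps table rows on one line
    cmd.append(url)
    return cmd


def lynx_dump(url: str, width: Optional[int] = None, timeout: float = 30.0) -> Optional[str]:
    """Return the lynx text dump of url, or None if lynx is missing or fails."""
    if shutil.which("lynx") is None:
        return None
    try:
        proc = subprocess.run(
            build_lynx_cmd(url, width), capture_output=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list (no `shell=True`) avoids quoting problems when a URL contains `&` or `?`.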

### Goal.com JSON Match Extraction

```python
import urllib.request, json

url = "https://www.goal.com/en/live-scores"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
resp = urllib.request.urlopen(req, timeout=10)
html = resp.read().decode('utf-8', errors='replace')

# "matches" JSON array — use bracket depth counting to find boundaries
start = html.find('"matches"')
bracket_start = html.find('[', start)
depth, end = 0, bracket_start
for i in range(bracket_start, len(html)):
    if html[i] == '[': depth += 1
    elif html[i] == ']':
        depth -= 1
        if depth == 0: end = i + 1; break
matches = json.loads(html[bracket_start:end])
```

**Key:** Goal.com only shows FUTURE matches. Use NowGoal via lynx for completed match results.
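
Bracket counting as written above treats a `[` or `]` inside a JSON string as structure, and it does not guard against the key being absent. One possible alternative is `json.JSONDecoder.raw_decode`, which parses a single JSON value starting at a given index and ignores whatever follows. The sketch below assumes the page embeds the array as valid JSON; `extract_json_array` is a hypothetical helper and the sample string is made up.

```python
import json
from typing import Any, Optional


def extract_json_array(page: str, key: str = "matches") -> Optional[Any]:
    """Return the JSON array that follows "key" in page, or None if not found."""
    key_pos = page.find(f'"{key}"')
    if key_pos == -1:
        return None
    start = page.find("[", key_pos)
    if start == -1:
        return None
    try:
        # Parses exactly one JSON value beginning at `start`; brackets inside
        # strings are handled by the JSON parser itself.
        value, _end = json.JSONDecoder().raw_decode(page, start)
    except json.JSONDecodeError:
        return None
    return value


# Made-up sample: the stray "]" inside the string would end a naive depth count early.
sample = '<script>{"matches": [{"teamA": {"long": "Team A]"}}], "other": 1}</script>'
print(extract_json_array(sample))
```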

### 500.com (Chinese Football Data)

| URL | Access | Content |
|-----|--------|---------|
| `www.500.com` (mobile) | ✅ Homepage | Match results, preview news |
| `liansai.500.com/zuqiu-19476/` | ✅ No login | World Cup schedule + European odds |
| `odds.500.com` | ❌ Login required | Asian handicap / over-under |

Encoding: `gb2312`, not utf-8. `urllib.request.urlopen(req).read().decode("gb2312", errors="replace")`
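
Pages declared as `gb2312` sometimes contain characters from the larger GBK or GB18030 sets, which a strict `gb2312` decode rejects. A possible fallback chain is sketched below; whether 500.com needs it is an assumption, and `decode_chinese` is a hypothetical helper.

```python
def decode_chinese(raw: bytes) -> str:
    """Decode bytes from a GB-family page, trying progressively larger charsets."""
    for enc in ("gb2312", "gbk", "gb18030"):
        try:
            return raw.decode(enc)  # strict: raises on bytes outside the charset
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever decodes and mark the rest with U+FFFD.
    return raw.decode("gb18030", errors="replace")
```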

### Server Network Notes for Sports Sites

- ✅ Wikipedia, Polymarket, arXiv, Goal.com, 500.com (homepage)
- ❌ Google, Reddit, Bilibili, YouTube, ESPN, Transfermarkt, Sofascore API (403)
- ⚠️ Yahoo Search — sometimes returns plain text results

### Odds Data Reality

**Almost all odds sites are JS-dynamic.** bet365 is Cloudflare-blocked. OddsPortal times out. The server cannot fetch live betting odds directly. Options:
- **NowGoal** via lynx (reads odds table structure, may show "--" for unopened matches)
- **FIFA API** — rankings only: `api.fifa.com/api/v1/ranking`
- **Polymarket Gamma API** — prediction market odds for major tournaments only
- **Never fabricate odds numbers** — if 2-3 methods fail, say "can't find them"

### Sports Betting Analysis (Related Skill)

For the betting analysis framework (psychology model, 4-step probability method, strategy) use the `sports-betting-analysis` skill.

## Common Pitfalls

0. **Never delegate work back to the user** — If you can't find data, try different tools/sites/methods yourself. Never say "can you tell me which match" or "can you send me the result." The user pays you to do the work. Figure it out yourself: install new tools, try different data sources, use lynx/bs4/Jina AI. Only say "I can't find it" after exhausting 2-3 approaches.

1. **Wikipedia HTML is complex** — don't try to parse the full page at once. Split by section headers first.
2. **Regex timeout** — keep patterns specific. Broad patterns like `.*` on large HTML can be slow.
3. **Encoding issues** — always use `errors="replace"` when decoding. For Chinese sites (500.com), use `gb2312` encoding.
4. **Rate limiting** — Wikipedia allows reasonable use, but don't hammer it. Cache results.
5. **Memory limit** — memory has ~2200 char limit. Keep entries concise, save details to files.
6. **Unicode minus sign** — Wikipedia uses `−` (U+2212) in UTC offsets, not `-` (U+002D). Replace before int conversion.
7. **footballbox class** — The CSS class is `footballbox` (no space), matched via `[^"]*footballbox[^"]*` to handle multiple classes.
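
For the rate-limiting pitfall, a small on-disk cache keeps repeated runs from re-downloading the same page. This is a sketch: `cached_fetch` is a hypothetical helper, the one-hour default is arbitrary, and the fetch function is injected so any downloader can be used.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def cached_fetch(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: str,
    max_age_s: float = 3600.0,
) -> str:
    """Return the cached page for url if it is fresh, otherwise fetch and store it."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe file name.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists() and time.time() - path.stat().st_mtime < max_age_s:
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text
```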

## Server Network Notes

On Hostinger servers, many sites are blocked. Reliable sources:
- ✅ Wikipedia (en.wikipedia.org) — full access
- ✅ Polymarket APIs — full access
- ✅ arXiv — full access
- ✅ **Goal.com** (`goal.com/en/live-scores`): accessible; embeds a full JSON list of today's fixtures
- ✅ **500.com (500彩票网)**: professional football data, accessible; GB2312 encoding
- ❌ Google — blocked
- ❌ Reddit — 403
- ❌ Bilibili — 412
- ❌ YouTube — requires captcha
- ❌ 雷速体育 (leisu.com) — blocked
- ❌ 虎扑 (hupu.com) — blocked
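- ❌ 直播吧 (zhibo8.com): JS-rendered; live data cannot be scraped server-side
- ❌ ESPN / Transfermarkt: JS-rendered or unreachable
- ❌ **Bing / Google search results**: JS-rendered; curl only retrieves the search box
- ✅ **Yahoo Search** (`search.yahoo.com`): sometimes returns plain-text search results
- ✅ **Mojeek** (`mojeek.com`): sometimes returns plain-text results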

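### ⚠️ Fetching Odds Data (final confirmation, 2026-06-10)
**No odds site can be fetched from the server.** All of them render via JavaScript:
- OddsPortal / NowGoal / BetExplorer / Oddschecker / FlashScore: JS-rendered
- Sofascore API: 403 Forbidden
- Google/Bing/Yahoo search results: JS-rendered; the actual odds figures are not retrievable
- Polymarket: only major markets such as the World Cup winner; no markets for friendlies

**What to do**: if 2-3 methods fail to find the odds, tell the user plainly that they can't be found instead of burning time on 10+ further attempts. **Never fabricate odds numbers.**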

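### ✅ lynx: the Server-Side Browsing Breakthrough (found 2026-06-09 to 06-10)

**lynx is the only tool on the server that has been able to read content from these JS-heavy pages.** It does not run JavaScript; it dumps the HTML the server returns as plain text. Installed at `/usr/bin/lynx`.

```bash
# Basic usage
lynx -dump -nolist <url>
lynx -dump -nolist -width=200 <url>  # for wide pages
```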

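**Sites verified readable with lynx:**

| Site | lynx result | Notes |
|------|---------|------|
| FlashScore | ★★★★★ | Full match list, scores, and schedule readable |
| LiveScore | ★★★★ | Match list readable |
| NowGoal | ★★★★ | Scores and odds table structure readable |
| Goal.com | ★★★ | Navigation readable; match data needs curl+bs4 |
| Wikipedia | ★★★★★ | Fully readable |
| OddsPortal | ★★ | Only the search box is visible |
| Google/Bing | ★ | Forced JS redirect; lynx cannot help either |

**Python parsing tools (installed in /opt/hermes-venv/):**
- beautifulsoup4 + lxml + html2text
- Use together with lynx or curl

**Jina AI as a fallback parser:**
```bash
curl -s "https://r.jina.ai/https://<target-site>" -H "User-Agent: Mozilla/5.0"
```
This sometimes retrieves page content despite JS rendering, but results vary by site.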

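### Football Data Sites Ranked by Accessibility (final, 2026-06-10)
1. **FlashScore** (flashscore.com): full match list readable via lynx
2. **LiveScore** (livescore.com): readable via lynx, plus a JSON API
3. **Goal.com**: parse the `matches` JSON with curl+bs4
4. **NowGoal** (nowgoal.net): scores and odds tables readable via lynx
5. **Wikipedia**: history and team data

### Standard Data Lookup Workflow
1. **Try lynx first** → if it can read the page, use it directly
2. **lynx can't read it** → parse the HTML/JSON with curl+bs4
3. **Still failing** → fall back to Jina AI (r.jina.ai)
4. **Nothing works** → switch data source (e.g., FlashScore instead of OddsPortal)
5. **Try at most 3 methods**; if the data can't be found, say so plainly rather than burning time on 10+ attempts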
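
The workflow above can be sketched as a fallback chain that stops after a fixed number of attempts. `first_success` is a hypothetical helper; each method is any callable that takes a URL and returns text, or `None`/an empty string on failure.

```python
from typing import Callable, Iterable, Optional, Tuple

Method = Callable[[str], Optional[str]]


def first_success(
    url: str,
    methods: Iterable[Tuple[str, Method]],
    max_attempts: int = 3,
) -> Optional[Tuple[str, str]]:
    """Try methods in order; return (method_name, text) for the first that works."""
    for attempt, (name, method) in enumerate(methods):
        if attempt >= max_attempts:
            break  # stop rather than keep trying more tools
        try:
            result = method(url)
        except Exception as exc:  # a blocked or broken tool counts as a failed attempt
            print(f"{name} failed: {exc}")
            result = None
        if result:
            return name, result
    return None
```

A `None` return is the signal to report that the data could not be found.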

### Goal.com: Live Match Data Extraction (verified 2026-06-09)
Goal.com's live-scores page embeds JSON data in its HTML, which can be extracted with Python urllib and a bracket-depth scan:

```python
import urllib.request, re, json
from datetime import datetime, timezone, timedelta

url = "https://www.goal.com/en/live-scores"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
resp = urllib.request.urlopen(req, timeout=10)
html = resp.read().decode('utf-8', errors='replace')

# Find where the "matches" JSON array starts
start = html.find('"matches"')
bracket_start = html.find('[', start)

# Find the matching closing bracket (handles nesting); a simple regex cannot do this, so count bracket depth
depth = 0
end = bracket_start
for i in range(bracket_start, len(html)):
    if html[i] == '[': depth += 1
    elif html[i] == ']':
        depth -= 1
        if depth == 0: end = i + 1; break

matches = json.loads(html[bracket_start:end])

# Parse each match
CST = timezone(timedelta(hours=8))
for m in matches:
    start_utc = datetime.fromisoformat(m['startDate'].replace('Z', '+00:00'))
    start_cst = start_utc.astimezone(CST)
    status = m.get('status', '')  # RESULT / LIVE / INPLAY / HALFTIME / '' (not started)
    team_a = m['teamA']['long']
    team_b = m['teamB']['long']
    score_a = m.get('score', {}).get('teamA', '-') if m.get('score') else '-'
    score_b = m.get('score', {}).get('teamB', '-') if m.get('score') else '-'
    print(f"{start_cst.strftime('%H:%M')} {team_a} {score_a}-{score_b} {team_b} [{status}]")
```

**Key point**: the simple regex `r'"matches":\s*(\[.*?\])'` does not work because the JSON contains nested brackets. Use bracket-depth counting to find the complete JSON array.

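### Fetching Odds Data (updated 2026-06-10)
**⚠️ No odds site can be fetched from the server.** All of them render via JavaScript:
- OddsPortal / NowGoal / BetExplorer / Oddschecker / FlashScore: JS-rendered
- Sofascore API: 403 Forbidden
- Google/Bing search results: JS-rendered; the actual content is not retrievable
- Yahoo Search: sometimes returns plain text, but usually without specific odds figures
- Polymarket: only major markets such as the World Cup winner; no markets for friendlies

**What to do**: if 2-3 methods fail to find the odds, tell the user plainly that they can't be found instead of burning time on 10+ further attempts. Do not fabricate odds numbers.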

### 500.com — Professional Football Data
500.com is a professional Chinese football betting data site. It's accessible from the server.
- Use `decode("gb2312", errors="replace")` for encoding
- URL pattern: `https://liansai.500.com/zuqiu-{id}/` for league schedules
- More accurate than Wikipedia for Chinese-market odds and analysis
- 张哥 recommends this over domestic sites like 雷速

When a source is blocked, try:
1. Python urllib (sometimes works when curl doesn't)
2. Alternative sources (Wikipedia instead of Google)
3. Ask 张哥 to look it up on his PC

## Reference Files

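- `references/worldcup-2026-groups.md`: Complete 2026 World Cup group-stage fixture table (48 teams, 12 groups), with time-difference conversion and extraction code
- `references/worldcup-2026-schedule.md`: Schedule of the first 8 matches (Beijing time), groups, time-difference table, and the matches 张哥 can watch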
- `references/sports-data-sources.md` — Football/sports data sources: lynx usage, Goal.com JSON, NowGoal odds tables, 500.com access, server network notes (absorbed from sports-betting-analysis)