# Gateway Diagnostics — Detailed Fix Procedures

## 0. Is Gateway Alive?

```bash
hermes gateway status
ps aux | grep hermes | grep -v grep
```

If the gateway is NOT running, restart it with `hermes gateway run --replace` in a background terminal.

**This server runs the gateway manually (not under systemd)**, so nothing restarts it after a crash.
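
Because nothing restarts the gateway automatically, a small liveness check can be handy in scripts. The following is a minimal sketch, not part of Hermes: the function names are hypothetical, and it assumes the gateway appears in `ps aux` with `hermes gateway` in its command line.

```python
import subprocess


def find_gateway_lines(ps_output: str, needle: str = "hermes gateway") -> list[str]:
    """Return the `ps aux` lines whose command contains `needle`."""
    lines = ps_output.splitlines()
    # Skip the header row and any grep process that matched its own pattern.
    return [ln for ln in lines[1:] if needle in ln and "grep" not in ln]


def gateway_is_running() -> bool:
    """Run `ps aux` and report whether a gateway process is present."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return bool(find_gateway_lines(out))
```

`gateway_is_running()` only detects a missing process; starting the gateway again should still go through the background terminal as described above.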

## Fix A: Corrupted state.db

**Symptom:** `file is not a database`, `SQLite session store unavailable`

```bash
cp /root/.hermes/state.db /root/.hermes/state.db.corrupt
rm /root/.hermes/state.db
# With state.db gone, the gateway falls back to JSONL automatically.
# Then restart it: hermes gateway run --replace
```
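
Before deleting anything, it can help to confirm that the file really is damaged at the header level. A minimal sketch, assuming only the documented SQLite 3 file format (every database file begins with the 16-byte string `SQLite format 3\0`); the helper name is hypothetical:

```python
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def looks_like_sqlite(path: str) -> bool:
    """True if the file exists and starts with the SQLite 3 header string."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(16) == SQLITE_MAGIC
```

For example, `looks_like_sqlite("/root/.hermes/state.db")` returning `False` matches the `file is not a database` symptom. A `True` result only rules out header damage; pages deeper in the file can still be corrupt.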

## Fix B: Schema Mismatch After Update (kanban.db)

**Symptom:** `no such column: session_id`, crash loop, `status=1/FAILURE`

```bash
# Check columns
python3 -c "
import sqlite3
conn = sqlite3.connect('/root/.hermes/kanban.db')
cursor = conn.execute('PRAGMA table_info(tasks)')
print([row[1] for row in cursor.fetchall()])
conn.close()
"

# Add missing columns
python3 -c "
import sqlite3
conn = sqlite3.connect('/root/.hermes/kanban.db')
try:
    conn.execute('ALTER TABLE tasks ADD COLUMN session_id TEXT')
    print('Added session_id')
except Exception as e:
    print(f'session_id: {e}')
conn.commit()
conn.close()
"

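# Restart the gateway; it MUST restart for its cache to clear.
# Run the restart in a background terminal (see Fix E).
hermes gateway run --replace
sleep 15
hermes gateway status  # should still be running; recheck after 30s to rule out a crash loop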
```
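
The check-then-alter steps above can be folded into one idempotent helper, so rerunning it after a partial fix is harmless. This is a sketch under assumptions: `ensure_columns` is a hypothetical name, and the real list of columns Hermes expects is not known here beyond `session_id`.

```python
import sqlite3


def ensure_columns(
    conn: sqlite3.Connection, table: str, wanted: dict[str, str]
) -> list[str]:
    """Add each column in `wanted` (name -> SQL type) that `table` lacks."""
    # Column names are the second field of each PRAGMA table_info row.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in wanted.items():
        if name not in existing:
            # Identifiers cannot be bound as parameters; pass trusted names only.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Usage would look like `ensure_columns(sqlite3.connect('/root/.hermes/kanban.db'), 'tasks', {'session_id': 'TEXT'})`. The return value lists what was actually added, and the gateway restart is still required afterwards.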

## Fix B2: Corrupted kanban.db (File-Level)

**Symptom:** `kanban.db is not a valid SQLite database`

```bash
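python3 -c "
import sqlite3
try:
    conn = sqlite3.connect('/root/.hermes/kanban.db')
    # integrity_check forces SQLite to actually read the file
    result = conn.execute('PRAGMA integrity_check').fetchone()[0]
    conn.close()
    print('OK' if result == 'ok' else f'CORRUPT: {result}')
except sqlite3.Error as e:
    print(f'CORRUPT: {e}')
"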

cp /root/.hermes/kanban.db /root/.hermes/kanban.db.corrupt.$(date +%Y%m%d%H%M%S)
rm /root/.hermes/kanban.db
hermes kanban init
hermes gateway run --replace
```

**⚠️ The gateway caches the DB fingerprint!** Replacing the file is NOT enough; you must restart the gateway process.
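
The backup-and-remove step can also be done as a single move, which avoids a window where both copies exist. A minimal sketch with a hypothetical helper name; it uses the same timestamp format as the `date` call in the commands above:

```python
import shutil
import time
from pathlib import Path


def quarantine(db_path: str) -> Path:
    """Move a damaged database aside under a timestamped name; return the new path."""
    src = Path(db_path)
    stamp = time.strftime("%Y%m%d%H%M%S")
    dest = src.with_name(f"{src.name}.corrupt.{stamp}")
    shutil.move(str(src), str(dest))
    return dest
```

After `quarantine("/root/.hermes/kanban.db")`, run `hermes kanban init` and restart the gateway as usual.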

## Fix C: QQ WebSocket Timeout (code=4009)

**Symptom:** `WebSocket closed: code=4009 reason=Session timed out`

This usually resolves after a gateway restart. If it persists:
1. Check `QQ_APP_ID` and `QQ_CLIENT_SECRET` in `.env`
2. Verify network connectivity from the server
3. Check whether the QQ bot platform changed its API endpoints
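
Step 1 can be scripted. The sketch below assumes `.env` uses plain `KEY=VALUE` lines (no `export` prefix, no multi-line values); the function name is hypothetical and the path to `.env` is left to the caller.

```python
from pathlib import Path

REQUIRED_KEYS = ("QQ_APP_ID", "QQ_CLIENT_SECRET")


def missing_env_keys(env_path: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in a KEY=VALUE file."""
    values = {}
    for raw in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Ignore blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return [k for k in required if not values.get(k)]
```

An empty list means both credentials are present. It does not prove they are valid, only that they are set.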

## Fix D: Full Gateway Reset

```bash
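# Stop the gateway. This server runs it manually, so use pkill here;
# on a systemd install use: sudo systemctl stop hermes-gateway
pkill -f hermes
# Move the session store aside (for kanban.db problems see Fix B / B2)
mv /root/.hermes/state.db /root/.hermes/state.db.bak 2>/dev/null
# Start fresh (in a background terminal, see Fix E)
hermes gateway run --replace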
```

## Fix E: Gateway Process Dead (No Auto-Restart)

**Symptom:** `hermes gateway status` shows "not running", QQ/Telegram silent for hours

```bash
# CORRECT way (NOT nohup, NOT &):
# Via terminal tool: terminal(background=true, command="hermes gateway run --replace")
```

## Fix F: Diagnose QQ WebSocket Disconnection

**Symptom:** QQ bot not responding but gateway shows "running"

```bash
GW_PID=$(pgrep -f "hermes gateway" | head -1)
# Match on "pid=<PID>," so port numbers and longer PIDs don't produce false hits
ss -tnp | grep "pid=$GW_PID,"
# ESTAB connections to 43.159.x.x:443 or 43.128.x.x:443 = OK
# No connections = WebSocket dead, restart gateway
```
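
The same check can be automated by parsing the `ss -tnp` output. This is a sketch under assumptions: it expects the usual column order (State, Recv-Q, Send-Q, Local, Peer, Process) and the `pid=<n>,` form in the process column, and the function name is hypothetical.

```python
import re


def established_peers(ss_output: str, pid: int, port: int = 443) -> list[str]:
    """Return peer addresses of ESTAB connections owned by `pid` on `port`."""
    owner = re.compile(rf"pid={pid},")
    peers = []
    for line in ss_output.splitlines():
        fields = line.split()
        # The header row and non-established sockets are skipped here.
        if len(fields) < 6 or fields[0] != "ESTAB":
            continue
        peer = fields[4]
        process = " ".join(fields[5:])  # process names may contain spaces
        if peer.endswith(f":{port}") and owner.search(process):
            peers.append(peer)
    return peers
```

An empty result for the gateway PID corresponds to the "No connections" case above and suggests a restart.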

## API Relay Switching (OpenRouter → OpenKey)

When OpenRouter KYC blocks account top-ups:

```yaml
# In config.yaml
providers:
  openkey:
    base_url: https://openkey.cloud/v1
    api_key: sk-xxx           # Use api_key DIRECTLY, NOT api_key_env!
    api_mode: chat_completions
model:
  default: gpt-4o-mini        # No provider prefix!
  provider: openkey            # Must match provider name above
  base_url: https://openkey.cloud/v1
  api_mode: chat_completions
```

**Pitfall:** `api_key_env` does NOT work for custom providers — the gateway process caches env vars at startup and won't see new ones from `.env`. Always use `api_key` directly.

**Pitfall:** Model name format — do NOT use `openkey/gpt-4o-mini`. When `model.provider: openkey` is set, Hermes sends the bare model name to the correct API endpoint.
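
Both pitfalls can be caught before restarting the gateway. The sketch below works on an already-parsed `config.yaml` mapping (loading the YAML is left out to avoid a third-party dependency); `check_relay_config` is a hypothetical helper, not a Hermes API.

```python
def check_relay_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed config.yaml mapping."""
    problems = []
    providers = cfg.get("providers", {})
    model = cfg.get("model", {})
    name = model.get("provider")
    if name not in providers:
        # Without a matching provider entry the remaining checks make no sense.
        return [f"model.provider {name!r} is not defined under providers"]
    provider = providers[name]
    if "api_key_env" in provider or not provider.get("api_key"):
        problems.append("set api_key directly instead of api_key_env")
    if str(model.get("default", "")).startswith(f"{name}/"):
        problems.append("model.default must be the bare model name, without the provider prefix")
    return problems
```

An empty list means the configuration avoids the two pitfalls described above; it says nothing about whether the key or endpoint actually works.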
