speak-incident-runbook

Pass

Execute Speak incident response procedures with triage, mitigation, and postmortem. Use when responding to Speak-related outages, investigating errors, or running post-incident reviews for language learning feature failures. Trigger with phrases like "speak incident", "speak outage", "speak down", "speak on-call", "speak emergency", "speak broken".

@jeremylongshore
v1.0.0MIT2/22/2026
57out of 100
(0)
1.4k
0
1

Install Skill

Skills are third-party code from public GitHub repositories. SkillHub scans for known malicious patterns but cannot guarantee safety. Review the source code before installing.

Install globally (user-level):

npx skillhub install jeremylongshore/claude-code-plugins-plus-skills/speak-incident-runbook

Install in current project:

npx skillhub install jeremylongshore/claude-code-plugins-plus-skills/speak-incident-runbook --project

Suggested path: ~/.claude/skills/speak-incident-runbook/

AI Review

Instruction Quality62
Description Precision58
Usefulness50
Technical Soundness60

Scores 57. Genuine exception in jeremylongshore pattern (confirmed exception like speak-install-auth=60). Functionally valuable runbook for Speak.com developers but extremely narrow audience. The 2-file structure and implementation guide show genuine effort vs jeremylongshore auto-gen (35-37). Runbook format with bash scripts and kubectl gives operational value.

SKILL.md Content

---
name: speak-incident-runbook
description: |
  Execute Speak incident response procedures with triage, mitigation, and postmortem.
  Use when responding to Speak-related outages, investigating errors,
  or running post-incident reviews for language learning feature failures.
  Trigger with phrases like "speak incident", "speak outage",
  "speak down", "speak on-call", "speak emergency", "speak broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# Speak Incident Runbook

## Overview
Rapid incident response procedures for Speak language learning-related outages.

## Prerequisites
- Access to Speak dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Speak API unreachable, all lessons failing |
| P2 | Degraded service | < 1 hour | High latency, speech recognition failing |
| P3 | Minor impact | < 4 hours | Specific language unavailable, slow scoring |
| P4 | No user impact | Next business day | Monitoring gaps, minor bugs |

## Quick Triage

```bash
#!/bin/bash
# speak-triage.sh

echo "=== Speak Quick Triage ==="
echo "Time: $(date)"
echo ""

# 1. Check Speak status
echo "1. Speak Status Page:"
curl -s https://status.speak.com/api/status | jq '.status'
echo ""

# 2. Check our integration health
echo "2. Our Health Check:"
curl -s https://api.yourapp.com/health | jq '.services.speak'
echo ""

# 3. Check active sessions
echo "3. Active Sessions:"
curl -s "localhost:9090/api/v1/query?query=speak_active_sessions" | jq '.data.result[0].value[1]'
echo ""

# 4. Check error rate (last 5 min)
echo "4. Error Rate (5m):"
curl -s "localhost:9090/api/v1/query?query=rate(speak_api_errors_total[5m])" | jq '.data.result'
echo ""

# 5. Check lesson completion rate
echo "5. Lesson Completion Rate (1h):"
curl -s "localhost:9090/api/v1/query?query=rate(speak_lessons_completed_total[1h])/rate(speak_lessons_started_total[1h])" | jq '.data.result[0].value[1]'
echo ""

# 6. Recent error logs
echo "6. Recent Errors:"
kubectl logs -l app=speak-integration --since=5m 2>/dev/null | grep -i error | tail -10
```

## Decision Tree

```
Speak API returning errors?
├─ YES: Is status.speak.com showing incident?
│   ├─ YES → Wait for Speak to resolve. Enable fallback/offline mode.
│   └─ NO → Our integration issue. Check credentials, config, code.
└─ NO: Are lessons working?
    ├─ YES: Is speech recognition working?
    │   ├─ YES → Likely resolved or intermittent. Monitor.
    │   └─ NO → Audio processing issue. Check audio infrastructure.
    └─ NO → Our service issue. Check pods, memory, database.
```

## Immediate Actions by Error Type

### 401/403 - Authentication Failures
```bash
# Verify API credentials are set
kubectl get secret speak-secrets -o jsonpath='{.data.api-key}' | base64 -d | head -c 10
echo "..."

# Check if credentials were rotated
# → Verify in Speak developer dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic speak-secrets \
  --from-literal=api-key=NEW_KEY \
  --from-literal=app-id=APP_ID \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment/speak-integration
```

### 429 - Rate Limited
```bash
# Check current rate limit status
curl -v https://api.speak.com/v1/health \
  -H "Authorization: Bearer ${SPEAK_API_KEY}" 2>&1 | grep -i "rate"

# Enable request queuing/throttling
kubectl set env deployment/speak-integration SPEAK_RATE_LIMIT_MODE=queue

# For persistent issues, contact Speak for limit increase
```

### 500/503 - Speak Server Errors
```bash
# Enable graceful degradation (offline lessons if available)
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE=true

# Enable cached responses
kubectl set env deployment/speak-integration SPEAK_CACHE_ONLY=true

# Update status page for users
# Monitor Speak status for resolution
```

### Speech Recognition Failures
```bash
# Check audio processing service
kubectl get pods -l component=audio-processor

# Check audio queue depth
curl -s "localhost:9090/api/v1/query?query=speak_audio_queue_depth"

# Scale audio processors if needed
kubectl scale deployment/audio-processor --replicas=5

# Verify audio storage is accessible
gsutil ls gs://your-audio-bucket/
```

## Communication Templates

### Internal (Slack)
```
🔴 P1 INCIDENT: Speak Language Learning
Status: INVESTIGATING
Impact: Users cannot start or complete lessons
Languages affected: All
Current action: Checking Speak API status and our integration
Next update: [Time + 15 min]
Incident commander: @[name]
War room: #speak-incident-[date]
```

### External (Status Page)
```
Language Learning Service Disruption

We are experiencing issues with our language learning features.
Users may be unable to:
- Start new lessons
- Complete pronunciation exercises
- Access speech recognition

We are actively working with our service provider to resolve this issue.
Offline vocabulary practice remains available.

Last updated: [timestamp]
```

### User Notification (In-App)
```
We're experiencing technical difficulties with live lessons.
While we work on a fix, you can:
• Practice vocabulary flashcards (offline)
• Review previous lesson notes
• Listen to saved audio examples

We'll notify you when full service is restored.
```

## Fallback Modes

### Enable Offline Mode
```typescript
// Fallback when Speak API is unavailable
async function getLesson(config: LessonConfig): Promise<Lesson> {
  if (process.env.SPEAK_FALLBACK_MODE === 'true') {
    return getOfflineLesson(config);
  }

  try {
    return await speakService.startSession(config);
  } catch (error) {
    if (isApiUnavailable(error)) {
      console.warn('Speak API unavailable, using fallback');
      return getOfflineLesson(config);
    }
    throw error;
  }
}
```

### Cached Responses
```typescript
// Serve cached content when API is slow/unavailable
async function getTutorPrompt(sessionId: string): Promise<TutorPrompt> {
  const cacheKey = `prompt:${sessionId}`;

  // Try cache first during incidents
  if (process.env.SPEAK_CACHE_ONLY === 'true') {
    const cached = await cache.get(cacheKey);
    if (cached) return cached;
    throw new Error('Prompt not in cache during incident');
  }

  // Normal flow with cache fallback
  try {
    const prompt = await speakService.tutor.getPrompt(sessionId);
    await cache.set(cacheKey, prompt, 3600);
    return prompt;
  } catch (error) {
    const cached = await cache.get(cacheKey);
    if (cached) return cached;
    throw error;
  }
}
```

## Post-Incident

### Evidence Collection
```bash
#!/bin/bash
# collect-speak-incident-evidence.sh

INCIDENT_DIR="speak-incident-$(date +%Y%m%d-%H%M)"
mkdir -p "$INCIDENT_DIR"

# Export logs
kubectl logs -l app=speak-integration --since=2h > "$INCIDENT_DIR/logs.txt"

# Export metrics
curl "localhost:9090/api/v1/query_range?query=speak_api_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$INCIDENT_DIR/errors.json"

curl "localhost:9090/api/v1/query_range?query=speak_lessons_started_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$INCIDENT_DIR/lessons.json"

# Capture current state
kubectl get pods -l app=speak-integration -o yaml > "$INCIDENT_DIR/pods.yaml"
kubectl get events --sort-by='.lastTimestamp' > "$INCIDENT_DIR/events.txt"

# Generate debug bundle
./speak-debug-bundle.sh
mv speak-debug-*.tar.gz "$INCIDENT_DIR/"

echo "Evidence collected in $INCIDENT_DIR"
```

### Postmortem Template
```markdown
## Incident: Speak [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Affected Users:** N (X% of daily active learners)

### Summary
[1-2 sentence description of what happened]

### User Impact
- Lessons affected: N
- Languages affected: [list]
- Features unavailable: [list]
- Estimated learning time lost: X hours

### Timeline
- HH:MM - [Event]
- HH:MM - Alert fired
- HH:MM - Incident declared
- HH:MM - [Mitigation action]
- HH:MM - Service restored

### Root Cause
[Technical explanation]

### Contributing Factors
- [Factor 1]
- [Factor 2]

### Action Items
| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P1 | [Preventive measure] | @name | [date] |
| P2 | [Improvement] | @name | [date] |

### Lessons Learned
- What went well: [list]
- What could be improved: [list]
```

## Output
- Issue identified and categorized
- Mitigation applied
- Stakeholders notified
- Evidence collected for postmortem
- Fallback modes enabled if needed

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Fallback not working | Cache empty | Pre-warm cache |

## Examples

### One-Line Health Check
```bash
curl -sf https://api.yourapp.com/health | jq '.services.speak.status' || echo "UNHEALTHY"
```

### Quick Fallback Toggle
```bash
# Enable fallback
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE=true

# Disable fallback (restore normal)
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE-
```

## Resources
- [Speak Status Page](https://status.speak.com)
- [Speak Support](https://support.speak.com)
- [Internal Runbook Wiki](#)

## Next Steps
For data handling, see `speak-data-handling`.