uptime-skills

Set up monitoring, triage incidents, and audit SRE coverage with Uptime.com.

10 skills

uptime

MCP skill server: uptime

{
  "type": "http",
  "url": "${UPTIME_MCP_URL:-https://mcp.uptime.com/mcp}",
  "oauth": {
    "clientId": "MYH77e5qvqbYjKU01EBgIYU8ZQwhVGxpmfUKBUHU"
  }
}

api-scripting

# API check scripting

API checks execute multi-step HTTP request sequences with assertions. Scripts are JSON arrays of step objects executed sequentially. If any assertion fails, execution stops and an alert is raised.

For the complete step type catalog, parameters, selectors, and variables, see `references/step-reference.md`.

## Script format

```json
[{ "step_def": "C_GET", "values": { "url": "https://api.example.com/health" } }, { "step_def": "V_HTTP_STATUS_CODE_SUCCESSFUL" }]
```

The key is `step_def` (not `step_type`). All string values support `$VARIABLE$` interpolation.
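
A script in this format can also be assembled programmatically. The following is a minimal sketch, not part of the Uptime.com tooling: the helper names `step` and `interpolate` are hypothetical, and `interpolate` only illustrates the `$VARIABLE$` syntax locally (the real substitution is done by the check runner).

```python
import json


def step(step_def, **values):
    """Build one step object; `values` is omitted when a step takes no parameters."""
    obj = {"step_def": step_def}
    if values:
        obj["values"] = values
    return obj


def interpolate(text, variables):
    """Local illustration of $NAME$ substitution (hypothetical helper)."""
    for name, value in variables.items():
        text = text.replace(f"${name}$", value)
    return text


script = [
    step("C_GET", url="https://api.example.com/health"),
    step("V_HTTP_STATUS_CODE_SUCCESSFUL"),
]
script_json = json.dumps(script)
```

Building steps through one helper makes it harder to mistype the `step_def` key.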

## Authentication patterns

### Bearer token flow

The most common pattern: authenticate, extract the token, then use it in subsequent requests.

```json
[
  {
    "step_def": "C_POST",
    "values": {
      "url": "https://api.example.com/login",
      "content_type": "application/json",
      "data": "{\"email\": \"user@example.com\", \"password\": \"secret\"}"
    }
  },
  { "step_def": "V_HTTP_STATUS_CODE_IS", "values": { "status_code": "200" } },
  {
    "step_def": "C_SET_VARIABLE_SELECTOR",
    "values": { "name": "TOKEN", "selector": "data.access_token" }
  },
  {
    "step_def": "C_GET",
    "values": {
      "url": "https://api.example.com/profile",
      "headers": { "Authorization": "Bearer $TOKEN$" }
    }
  },
  { "step_def": "V_HTTP_STATUS_CODE_SUCCESSFUL" }
]
```

### Static API key

Set the key as a default header in `C_SETTINGS_AND_AUTH`:

```json
{
  "step_def": "C_SETTINGS_AND_AUTH",
  "values": { "headers": { "X-API-Key": "your-api-key" }, "content_type": "application/json" }
}
```

### Client certificate (mTLS)

Set `certificate`, `key`, and optionally `passphrase` in `C_SETTINGS_AND_AUTH`.

## Scripting workflow

1. Understand the API flow to monitor
2. Determine authentication method (bearer token, API key, basic auth, mTLS)
3. Build the script: authenticate, then exercise the key endpoints
4. Add assertions after each request (`V_HTTP_STATUS_CODE_SUCCESSFUL` at minimum)
5. Use `C_SET_VARIABLE_SELECTOR` to chain responses between steps
6. Present the monitoring plan as a numbered list explaining what each step does in plain language
7. Confirm with the user before creating the check via `create_api_check`

## Common pitfalls

- **Using `C_SET_VARIABLE`**: deprecated. Use `C_SET_VARIABLE_SELECTOR` instead.
- **Missing assertions**: every request should have at least a status code assertion. Without one, a 500 error goes undetected.
- **Hardcoded tokens**: use the bearer token flow pattern to authenticate dynamically. Static tokens expire.
- **Wrong selector format**: JSON responses use dot notation (`data.user.email`), XML uses XPath (`//user/email/text()`). The runner auto-detects the format from the response Content-Type.
- **Regex syntax**: API checks use Go RE2 regex, not PCRE. Lookaheads and backreferences are not supported.
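
Some of these pitfalls can be caught before a script is submitted. The sketch below is a heuristic linter under stated assumptions: it treats only `C_GET` and `C_POST` as request steps (the ones shown in this document), treats any `V_`-prefixed step as an assertion, and uses a rough pattern to spot lookarounds and backreferences. The function names are hypothetical.

```python
import re

# Assumption: only the request steps shown in this document are listed here.
REQUEST_STEPS = {"C_GET", "C_POST"}
# Heuristic: lookarounds such as (?= (?! (?<= (?<! and backreferences such as \1.
_RE2_UNSUPPORTED = re.compile(r"\(\?<?[=!]|\\[1-9]")


def lint_script(script):
    """Return a list of human-readable problems found in a script."""
    problems = []
    unasserted = None  # index of the latest request with no assertion yet
    for index, step in enumerate(script):
        name = step.get("step_def")
        if name is None:
            problems.append(f"step {index}: missing step_def")
            continue
        if name == "C_SET_VARIABLE":
            problems.append(f"step {index}: C_SET_VARIABLE is deprecated")
        if name in REQUEST_STEPS:
            if unasserted is not None:
                problems.append(f"step {unasserted}: request has no assertion")
            unasserted = index
        elif name.startswith("V_"):  # assumption: V_ steps are assertions
            unasserted = None
    if unasserted is not None:
        problems.append(f"step {unasserted}: request has no assertion")
    return problems


def uses_unsupported_re2(pattern):
    """True if the pattern appears to use lookarounds or backreferences."""
    return bool(_RE2_UNSUPPORTED.search(pattern))
```

An empty result from `lint_script` does not prove a script is valid; it only means none of the listed pitfalls were detected.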

dashboard-management

# Dashboard management — operational knowledge

Workflow patterns for creating and organizing monitoring dashboards.

## Dashboard creation workflow

### Step 1 — Determine scope

Dashboards work best when they have a clear purpose. Common patterns:

| Pattern                | Scope                             | Use case                                         |
| ---------------------- | --------------------------------- | ------------------------------------------------ |
| Domain dashboard       | All checks for one domain         | Per-service visibility                           |
| Team dashboard         | All checks owned by a team        | Team-level overview                              |
| Service tier dashboard | All critical/production checks    | Executive or on-call view                        |
| Check type dashboard   | All checks of one type (e.g. SSL) | Focused monitoring (certificate expiry overview) |

### Step 2 — Check capacity and gather checks

Call `get_account_usage` to check whether dashboard or widget limits apply to the account. If limits exist and usage is above 80% of a limit, warn the user before creating new dashboards or widgets.

Gather checks based on the chosen scope:

- `list_checks` with tag filter for domain/team dashboards
- `list_checks` then filter by check type for type-specific dashboards
- `list_tags` to discover grouping structure if scope isn't clear

### Step 3 — Create dashboard

`create_dashboard` with:

- **Name**: descriptive, matching the scope (e.g. "example.com monitoring", "Platform team overview")
- **Description**: brief note on what this dashboard covers and who it's for

### Step 4 — Add widgets

Add widgets for the checks gathered in Step 2. Widget selection depends on what information matters for the dashboard's audience.

## Widget selection guidance

| Widget type       | Best for             | Notes                                             |
| ----------------- | -------------------- | ------------------------------------------------- |
| Check status      | At-a-glance up/down  | Use for overview dashboards — shows current state |
| Response time     | Performance trending | Best for HTTP and TCP checks — shows latency      |
| Uptime percentage | SLA reporting        | Shows uptime over a time window                   |
| Alert history     | Incident patterns    | Shows when and how often alerts fired             |

### Widget configuration

When adding a widget, specify:

- **Check**: which monitoring check the widget displays data for
- **Widget type**: one of the types above
- **Time range**: the lookback window for historical data (e.g. 24h, 7d, 30d)

Response time and uptime percentage widgets are most useful with longer time ranges (7d+). Check status widgets are best with short/live views.

### Widget types by check type

Not all widget types make sense for all check types:

| Check type          | Check status | Response time | Uptime % | Alert history |
| ------------------- | ------------ | ------------- | -------- | ------------- |
| HTTP                | Yes          | Yes           | Yes      | Yes           |
| DNS, ICMP, TCP, UDP | Yes          | Yes           | Yes      | Yes           |
| SSL, WHOIS, RDAP    | Yes          | No            | Yes      | Yes           |
| Blacklist, Malware  | Yes          | No            | No       | Yes           |
| Page Speed          | Yes          | No            | No       | No            |
| Group               | Yes          | No            | Yes      | Yes           |
| Transaction, API    | Yes          | Yes           | Yes      | Yes           |

SSL, WHOIS, RDAP, Blacklist, and Malware are auto-located checks with infrequent intervals, so response time widgets add no value. Page Speed produces Lighthouse scores, not uptime data.
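
The matrix above can be encoded as a lookup so that widget choices are validated before any are created. This is an illustrative sketch: the widget identifiers (`check_status`, `response_time`, `uptime_pct`, `alert_history`) and the function name are hypothetical labels, not API values.

```python
WIDGETS = ("check_status", "response_time", "uptime_pct", "alert_history")

# Rows of the table above, in the same column order as WIDGETS.
_MATRIX = {
    ("HTTP", "DNS", "ICMP", "TCP", "UDP", "Transaction", "API"): (True, True, True, True),
    ("SSL", "WHOIS", "RDAP", "Group"): (True, False, True, True),
    ("Blacklist", "Malware"): (True, False, False, True),
    ("Page Speed",): (True, False, False, False),
}


def allowed_widgets(check_type):
    """Return the widget types that make sense for a check type."""
    for check_types, flags in _MATRIX.items():
        if check_type in check_types:
            return [w for w, ok in zip(WIDGETS, flags) if ok]
    raise KeyError(f"unknown check type: {check_type}")
```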

### Recommended layouts

**Domain dashboard**: group widgets by check type

1. HTTP checks (status + response time)
2. DNS checks (status)
3. SSL/Certificate checks (status + expiry)
4. Infrastructure checks (ICMP, TCP — status)
5. Registration checks (WHOIS, RDAP — status)

**On-call dashboard**: prioritize actionable information

1. Currently alerting checks (status, filtered to down)
2. Recently recovered checks
3. Response time anomalies

**Executive dashboard**: focus on outcomes

1. Uptime percentages for key services
2. Page Speed scores
3. SSL certificate expiry countdown

## Naming conventions

Consistent dashboard names help with discovery:

- `{domain} monitoring` — per-domain dashboards
- `{team} overview` — team dashboards
- `{purpose} dashboard` — special-purpose (e.g. "SSL expiry dashboard")

## Updating existing dashboards

When monitoring evolves, dashboards need to keep up.

### Adding checks to a dashboard

1. `get_dashboard` to see current widgets.
2. Identify which new checks are missing.
3. Add widgets for each new check, matching the existing layout pattern.

### Rebuilding a dashboard

When a dashboard has drifted significantly (many stale widgets, missing checks):

1. `list_checks` with the domain tag to get the current check set.
2. Compare against existing widgets.
3. Remove widgets for deleted checks.
4. Add widgets for new checks.
5. Reorder to match the layout conventions above.
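
Steps 2 to 4 amount to a set difference between the current check set and the checks that existing widgets reference. A minimal sketch, assuming both sides have already been reduced to lists of check identifiers (the function name and input shape are hypothetical):

```python
def plan_rebuild(current_check_ids, widget_check_ids):
    """Compare live checks with the checks referenced by dashboard widgets."""
    current = set(current_check_ids)
    widgets = set(widget_check_ids)
    return {
        "remove": sorted(widgets - current),  # widgets for deleted checks
        "add": sorted(current - widgets),     # checks with no widget yet
    }
```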

### When to create a new dashboard vs update

- **Update**: checks were added or removed, but the scope is the same
- **New dashboard**: monitoring scope changed (e.g. a domain was split into multiple services, or a new team took ownership)

## Maintenance

Dashboards can drift as checks are added or removed:

- When creating new checks, add them to the relevant dashboard
- When deleting checks, associated widgets may need cleanup
- Periodically review dashboards for stale or orphaned widgets
- When `monitoring-optimization` finds configuration issues, update widgets to reflect changes

## Deleting dashboards

Use `delete_dashboard` only for permanent removal. Deletion removes all widgets. If a dashboard is temporarily unneeded, consider removing widgets instead and keeping the dashboard shell for later reuse.

Before deleting, confirm the dashboard isn't referenced by other team members or linked from external tools (e.g. Slack bookmarks, wiki pages).

## Linking dashboards to setup workflow

When `monitoring-planning` creates checks for a new domain, Phase 4 of its setup workflow creates a dashboard. This ensures every monitored domain has a corresponding dashboard from day one. The dashboard should include widgets for all checks created during setup.

incident-triage

# Incident triage

Workflow for investigating alerts, outages, and service degradation.

## Alert review workflow

### Step 1: get current alerts

`list_alerts` to see all active alerts. Key fields:

| Field        | Meaning                                            |
| ------------ | -------------------------------------------------- |
| `alert_type` | Check type that triggered (HTTP, DNS, SSL, etc.)   |
| `is_up`      | `false` = currently down, `true` = recovered       |
| `created_at` | When the alert fired                               |
| `output`     | Raw check output, the most useful diagnostic field |

### Step 2: investigate the check

`get_check` on the alerting check to see:

- Current configuration (is it checking the right thing?)
- Last response time and status
- Whether the check is paused
- Contact groups (is anyone being notified?)

### Step 3: get outage details

`list_outages` filtered by check to see the timeline:

- Outage start and end times
- Duration
- Which probe locations detected the failure

## Correlation patterns

Multiple simultaneous alerts often point to a root cause upstream of any individual check.

### DNS + HTTP failures

| Alerts firing         | Likely root cause                                                |
| --------------------- | ---------------------------------------------------------------- |
| DNS A + HTTP          | DNS resolution failure; HTTP can't connect because DNS is broken |
| DNS NS + DNS A + HTTP | Nameserver failure, cascading into all resolution                |
| DNS MX + SMTP         | DNS-level mail routing failure                                   |

**Triage action**: check DNS checks first. If NS is down, that's the root cause.

### SSL + HTTP failures

| Alerts firing                            | Likely root cause                                      |
| ---------------------------------------- | ------------------------------------------------------ |
| SSL + HTTP (certificate error in output) | Expired or misconfigured certificate                   |
| SSL only (HTTP still passing)            | Certificate issue browsers warn on but don't block yet |

### Widespread failures (many checks, many domains)

If checks across multiple unrelated domains fail simultaneously:

- Likely a probe location issue, not a target issue
- Check if all failing checks share the same probe locations
- May indicate a monitoring platform issue rather than a real outage

### Single check failure

- HTTP down + DNS OK + ICMP OK: application-level issue (web server, load balancer)
- ICMP down + everything else down: host/network unreachable
- TCP port check down + HTTP OK: possible firewall change on the specific port
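
The correlation patterns above can be sketched as an ordered rule list. This is a simplification under stated assumptions: check types are represented as plain labels such as `"DNS NS"` or `"HTTP"`, the rule order is a judgment call (DNS rules are evaluated first, following the triage action above), and the function name is hypothetical.

```python
def likely_root_cause(failing, monitored):
    """Map a set of failing check labels to a likely cause, or None."""
    failing, monitored = set(failing), set(monitored)
    if "DNS NS" in failing:
        return "nameserver failure"
    if {"DNS MX", "SMTP"} <= failing:
        return "DNS-level mail routing failure"
    if {"DNS A", "HTTP"} <= failing:
        return "DNS resolution failure"
    if {"SSL", "HTTP"} <= failing:
        return "expired or misconfigured certificate (confirm in output)"
    if "ICMP" in failing and failing == monitored:
        return "host or network unreachable"
    if failing == {"HTTP"}:
        return "application-level issue"
    if failing == {"TCP"}:
        return "possible firewall change on the port"
    return None
```

A `None` result means no listed pattern matched, not that the alerts are unrelated.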

## Upstream provider correlation

**This step is mandatory during every triage.** Many outages that appear local are actually caused by infrastructure provider incidents.

### Always check major providers

- Check existing CloudStatus checks; if upstream dependency monitoring is configured, they will already show provider incidents
- If no CloudStatus checks exist, use web search to check for ongoing incidents at providers identified via DNS inference

Key providers to check: Cloudflare, AWS, Google Cloud, Microsoft Azure, Fastly, Akamai.

### When to suspect upstream cause

- Multiple unrelated domains failing simultaneously
- Failures concentrated at specific probe locations (regional provider outage)
- DNS timeouts across many checks (upstream DNS provider issue)
- SSL/TLS handshake failures across domains (CDN or certificate provider issue)
- Outage timing coincides with a known provider incident

### DNS-based provider detection

Even without explicit dependency monitoring, infer providers from DNS records:

| Record type | Pattern                | Provider         |
| ----------- | ---------------------- | ---------------- |
| CNAME       | `*.cloudfront.net`     | AWS CloudFront   |
| CNAME       | `*.cdn.cloudflare.net` | Cloudflare CDN   |
| CNAME       | `*.fastly.net`         | Fastly           |
| NS          | `*.cloudflare.com`     | Cloudflare DNS   |
| NS          | `awsdns-*`             | AWS Route 53     |
| MX          | `*.google.com`         | Google Workspace |
| MX          | `*.outlook.com`        | Microsoft 365    |

If DNS checks show CNAME pointing to `*.cloudfront.net` and AWS CloudFront is reporting an incident, that's the root cause.
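
The table above translates directly into glob matching on record targets. A minimal sketch using Python's standard `fnmatch` module; the function name is hypothetical, and the Route 53 pattern is widened to `*awsdns-*` on the assumption that the marker appears mid-hostname (for example `ns-2048.awsdns-64.com`).

```python
from fnmatch import fnmatchcase

PROVIDER_PATTERNS = [
    ("CNAME", "*.cloudfront.net", "AWS CloudFront"),
    ("CNAME", "*.cdn.cloudflare.net", "Cloudflare CDN"),
    ("CNAME", "*.fastly.net", "Fastly"),
    ("NS", "*.cloudflare.com", "Cloudflare DNS"),
    ("NS", "*awsdns-*", "AWS Route 53"),  # widened from `awsdns-*` (assumption)
    ("MX", "*.google.com", "Google Workspace"),
    ("MX", "*.outlook.com", "Microsoft 365"),
]


def detect_provider(record_type, target):
    """Return the provider implied by a DNS record target, or None."""
    host = target.rstrip(".").lower()  # drop the trailing root dot, normalize case
    for rtype, pattern, provider in PROVIDER_PATTERNS:
        if rtype == record_type and fnmatchcase(host, pattern):
            return provider
    return None
```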

### Reporting upstream correlation

> **Upstream incident detected**: Cloudflare is reporting degraded performance in EU regions (started 14:23 UTC). Your checks failing from EU probe locations are likely caused by this. No action needed on your side; monitor Cloudflare's status page for resolution.

## Escalation guidelines

### Check escalation rules

Before triaging, note whether alerting checks have escalation rules. If escalation rules exist, someone may already be notified. If none exist and the outage is critical, notify the appropriate stakeholders manually.

### Immediate action needed

- HTTP check on production service with `is_up: false` for > 5 minutes
- SSL certificate expiring within 24 hours
- DNS NS failure (cascading impact)
- All checks for a domain failing simultaneously

### Can wait / investigate further

- Single probe location failure (likely probe issue)
- Blacklist alert (check if it's a false positive)
- WHOIS/RDAP alert (registration expiry is usually weeks away)
- Page Speed degradation (performance issue, not outage)

## False positives

False positives are alerts that fire when the service is actually healthy.

### Indicators

- Only 1 of N probe locations reports failure: likely a probe or regional network issue
- Alert fires and clears within one check interval: transient network blip
- Output shows timeout but other check types for the same target pass: check-specific timeout, not real downtime
- Repeated short-lived alerts from the same location: that location may have connectivity issues
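
The first two indicators are simple enough to check mechanically. The sketch below assumes the caller has already extracted the number of failing locations, the total number of locations, and the outage duration from the alert and outage data; the function and parameter names are hypothetical.

```python
def false_positive_indicators(failing_locations, total_locations,
                              duration_seconds, interval_seconds):
    """Return the false-positive indicators that apply to one alert."""
    hints = []
    if total_locations > 1 and failing_locations == 1:
        hints.append("only 1 of N probe locations reports failure")
    # duration_seconds is None while the alert is still open.
    if duration_seconds is not None and duration_seconds <= interval_seconds:
        hints.append("cleared within one check interval")
    return hints
```

Matching indicators are a reason to investigate further, not a verdict; a real but brief outage can look the same.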

### Tuning recommendations

If false positives are frequent, recommend adjusting the monitoring configuration:

| Problem                         | Fix                                                                  |
| ------------------------------- | -------------------------------------------------------------------- |
| Single-location flapping        | Increase sensitivity to >= 2 (require multiple locations to confirm) |
| Probe location unreliable       | Replace with a different location in the same region                 |
| Timeout-based false alerts      | Increase `timeout` value to accommodate normal latency variance      |
| Interval too aggressive         | Increase interval for non-critical checks (e.g. 1 min -> 5 min)      |
| All checks share same locations | Diversify locations across regions to reduce correlated false alerts |

## False negatives

False negatives are real outages that monitoring fails to detect. These are more dangerous than false positives because they create a false sense of security.

### Indicators

- Users report downtime but no alerts fired
- Outage visible in external tools (e.g. Down Detector) but not in Uptime.com
- Post-incident review reveals the service was down for minutes/hours without alerting
- CloudStatus shows upstream provider incident but no corresponding alerts on dependent checks

### Common causes

| Cause                    | Why it happens                                                                 | Fix                                                             |
| ------------------------ | ------------------------------------------------------------------------------ | --------------------------------------------------------------- |
| Missing check types      | Only HTTP monitored, but DNS was the actual failure point                      | Add DNS, SSL, ICMP checks for comprehensive coverage            |
| Wrong endpoint monitored | Health endpoint returns 200 even when the app is broken                        | Monitor a functional endpoint that exercises the real code path |
| `expect_string` not set  | HTTP check passes on any 200 response, even error pages                        | Add `expect_string` to verify response content                  |
| Too few locations        | All probes are in one region; outages affecting other regions go undetected    | Use 3-5 locations across multiple continents                    |
| Check is paused          | Forgotten manual pause or stale maintenance window                             | Review paused checks; convert to scheduled maintenance windows  |
| No upstream monitoring   | Provider outage causes degradation but no check covers the dependency          | Add CloudStatus checks for critical upstream providers          |

### When false negatives are frequent

Frequent false negatives indicate the monitoring strategy needs a broader review. Recommend invoking the `monitoring-optimization` skill to run a full audit: gap analysis, configuration review, and upstream dependency check.

## Ignoring alerts

Use `ignore_alert` to exclude a confirmed false positive from outage calculations. This is important for accurate SLA reporting: ignored alerts don't count as downtime.

When to ignore:

- Confirmed false positive (probe issue, not real outage)
- Alert caused by planned maintenance that wasn't covered by a maintenance window
- One-time transient error that doesn't reflect real availability

When NOT to ignore:

- Real outages, even brief ones: they should be reflected in uptime stats
- Alerts you haven't investigated yet: investigate first, ignore after

Always confirm with the user before ignoring alerts, as it affects SLA metrics.

## Communicating findings

1. **Summary**: how many alerts, how many active vs resolved
2. **Root cause assessment**: what's most likely causing the alerts
3. **Impact**: which services/domains are affected
4. **Recommended action**: what to do next

monitoring-optimization

# Monitoring optimization

Workflow for reviewing existing monitoring, identifying gaps, optimizing configuration, and managing checks.

## Audit workflow

### Step 1: inventory

Gather the full picture:

1. `get_account_usage`: plan limits and current consumption (check slots, per-type limits).
2. `list_checks`: all checks with types, targets, and status.
3. `list_tags`: how checks are grouped.
4. `list_contacts`: verify notification routing exists.

All four calls are independent and can run in parallel.

### Step 2: group by domain

Organize checks by target domain:

- Extract the registered domain from each check's address
- Group checks under their domain
- Note which check types exist per domain
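
A rough sketch of this grouping step follows. It is deliberately naive: the registered domain is approximated as the last two labels of the hostname, which is wrong for suffixes such as `co.uk` (a real implementation would consult the Public Suffix List). The `address` and `type` keys and both function names are assumptions about the shape of the check data.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def registered_domain(address):
    """Naive approximation: last two labels of the hostname."""
    # Bare hostnames need a "//" prefix for urlsplit to parse them as a host.
    parts = urlsplit(address if "//" in address else "//" + address)
    labels = (parts.hostname or "").split(".")
    return ".".join(labels[-2:])


def group_by_domain(checks):
    """Map each registered domain to the set of check types covering it."""
    groups = defaultdict(set)
    for check in checks:
        groups[registered_domain(check["address"])].add(check["type"])
    return dict(groups)
```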

### Step 3: gap analysis

For each domain, compare against recommended coverage:

**Critical gaps** (should almost always exist):

- No HTTP check: service availability not monitored
- No SSL check: certificate expiry not monitored
- No DNS A check: name resolution not monitored

**Important gaps**:

- No DNS NS check: nameserver health not monitored (cascading risk)
- No WHOIS/RDAP: domain registration expiry not monitored
- Mail-sending domain without SMTP check: delivery not monitored
- Mail-sending domain without Blacklist check: IP reputation not monitored

**Nice-to-have gaps**:

- No Page Speed: performance regression not tracked
- No Malware check: compromise detection missing
- Single DNS record type: limited DNS coverage
- No Group check: no aggregate view of domain health (standard for production domains)
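
The gap tiers above can be expressed as a comparison between the check types present for a domain and the recommended set. A minimal sketch, assuming check types are represented as labels like `"DNS A"` and that the caller knows whether the domain sends mail; the function name and the `sends_mail` flag are hypothetical.

```python
CRITICAL = ("HTTP", "SSL", "DNS A")
NICE_TO_HAVE = ("Page Speed", "Malware", "Group")


def coverage_gaps(present, sends_mail=False):
    """Return missing check types for one domain, grouped by severity."""
    present = set(present)
    important = []
    if "DNS NS" not in present:
        important.append("DNS NS")
    if not present & {"WHOIS", "RDAP"}:  # either one covers registration expiry
        important.append("WHOIS/RDAP")
    if sends_mail:
        important += [t for t in ("SMTP", "Blacklist") if t not in present]
    return {
        "critical": [t for t in CRITICAL if t not in present],
        "important": important,
        "nice_to_have": [t for t in NICE_TO_HAVE if t not in present],
    }
```

The "single DNS record type" gap is left out of this sketch because it depends on counting DNS record types rather than on presence alone.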

### Step 4: configuration review

Check for suboptimal configurations:

| Issue                     | Detection                                       | Recommendation                                    |
| ------------------------- | ----------------------------------------------- | ------------------------------------------------- |
| Sensitivity = 1           | `sensitivity` field on location-based checks    | Increase to >= 2 to reduce false positives        |
| Very long intervals       | `interval` > 10 min for HTTP checks             | Consider 1-5 min for critical services            |
| No contact group          | Empty `contact_groups`                          | Assign a contact group or checks alert silently   |
| No escalation rules       | Missing `escalations` on critical checks        | Add escalation so unacknowledged alerts intensify |
| All checks same locations | Same `locations` array everywhere               | Diversify to catch regional issues                |
| Paused checks             | `is_paused: true`                               | Verify if intentional or forgotten                |
| No tags                   | Empty `tags`                                    | Add domain-based tags for organization            |
| No Group check per domain | Checks exist but no aggregate                   | Create Group with tag-based auto-selection        |
| No maintenance windows    | Critical checks with no scheduled maintenance   | Set up windows for known maintenance periods      |
| IPv4-only on dual-stack   | `use_ip_version` = IPv4 on IPv6-capable targets | Consider `ANY` or explicit IPv6 checks            |
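
Several rows of this table can be checked per check with simple field tests. The sketch below covers only the rows that need no cross-check context. The field names `sensitivity`, `interval`, `contact_groups`, `is_paused`, and `tags` come from the table; the `type` key and the function name are assumptions.

```python
def config_issues(check):
    """Return configuration issues detectable from a single check record."""
    issues = []
    if check.get("sensitivity") == 1:
        issues.append("sensitivity is 1: increase to >= 2")
    if check.get("type") == "HTTP" and check.get("interval", 0) > 10:
        issues.append("HTTP interval above 10 minutes")
    if not check.get("contact_groups"):
        issues.append("no contact group: alerts are silent")
    if check.get("is_paused"):
        issues.append("check is paused: verify this is intentional")
    if not check.get("tags"):
        issues.append("no tags")
    return issues
```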

### Step 5: upstream dependency audit

Scan existing check addresses for known third-party domains (e.g. `api.stripe.com`, `*.auth0.com`). These are dependencies that should have CloudStatus monitoring.

For each detected dependency without a CloudStatus check, recommend adding one.

### Step 6: report

Structure the audit report as:

1. **Account usage**: plan limits vs. current consumption, remaining capacity, any types at >80% utilization
2. **Inventory summary**: N checks across M domains, K tags
3. **Coverage by domain**: table showing which check types exist per domain
4. **Gaps found**: prioritized list of missing checks with severity
5. **Upstream dependencies**: detected providers, which have status monitoring
6. **Configuration issues**: suboptimal settings that should be adjusted
7. **Recommendations**: specific checks to create, ordered by priority (flagging any that would exceed plan limits)

## Modifying existing checks

Use `update_check` to modify check properties in place. Only specified fields change; omitted fields retain current values.

### Update workflow

1. **Get current state**: `get_check` to see existing configuration.
2. **Apply changes**: `update_check` with only the fields that need to change.
3. **Verify**: `get_check` again to confirm.

### Commonly updated fields

| Field            | Notes                                                                             |
| ---------------- | --------------------------------------------------------------------------------- |
| `interval`       | Minutes between checks. Some types have minimums (e.g. Page Speed >= 1440).       |
| `locations`      | Probe locations. Only for location-based checks; never set on auto-located types. |
| `sensitivity`    | Number of confirming locations before alerting. Use >= 2.                         |
| `contact_groups` | Array of contact group names/IDs.                                                 |
| `is_paused`      | `true` to pause, `false` to resume.                                               |
| `tags`           | Replaces the full tag list. Include existing tags you want to keep.               |
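
Because `tags` replaces the full list, a safe update starts from the tags returned by `get_check`. A minimal sketch of computing the value to send; the helper name is hypothetical and the result is only the `tags` portion of an update.

```python
def tags_update(existing_tags, add=(), remove=()):
    """Build the full replacement tag list, preserving existing tags."""
    dropped = set(remove)
    merged = [tag for tag in existing_tags if tag not in dropped]
    for tag in add:
        if tag not in merged:  # avoid duplicates while keeping order
            merged.append(tag)
    return {"tags": merged}
```

Sending only the new tag instead of the merged list would silently remove the domain tag and detach the check from tag-based groupings.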

### Batch updates

1. `list_checks` with tag filter to get the target set.
2. Issue `update_check` calls in parallel.
3. Verify a sample with `get_check`.

## Maintenance windows

Scheduled maintenance is preferred over manual pausing:

- Define recurring windows (e.g. every Sunday 02:00-04:00 UTC)
- Checks pause and resume automatically
- Better audit trail than manual pause/resume

Use manual pausing only for unplanned, one-off situations.

## Escalations

Escalations are separate from contact groups. They define how alerts intensify:

- Contact groups receive the initial alert
- Escalation rules trigger if the alert is not acknowledged within a configured window
- Can notify different contacts or use different channels (e.g. phone call after SMS)

Verify critical checks have both a contact group _and_ escalation rules.

## Tag hygiene

Common issues:

- **Ungrouped checks**: no tags, hard to manage at scale
- **Stale tags**: tags with no checks (orphaned after deletions)
- **Inconsistent naming**: mix of `example.com`, `Example.com`, `example`

Recommend: one tag per registered domain, optional tags for environment and team.

## Interval recommendations

| Service tier                    | HTTP interval | DNS interval | Other intervals |
| ------------------------------- | ------------- | ------------ | --------------- |
| Critical (revenue-generating)   | 1 min         | 5 min        | 5-10 min        |
| Standard (internal tools)       | 5 min         | 10 min       | 10-30 min       |
| Low priority (informational)    | 10 min        | 30 min       | 60 min          |
| Auto-located (SSL, WHOIS, etc.) | -             | -            | 60-1440 min     |

## Deleting checks

Use `delete_check` only for permanent removal. Deletion is irreversible; check history is lost. Prefer pausing for temporary deactivation.

## Known MCP server issues

- **`get_contact` type parsing**: may fail with integer/string type mismatch on the `id` parameter. Use `list_contacts` as a workaround.

monitoring-planning

# Monitoring planning

End-to-end workflow for planning and creating monitoring checks for a domain.

For check type parameters and constraints, see `references/check-types.md`. For domain-specific check recommendations, see `references/checklist-domain-monitoring.md`. For check selection and configuration guidance, see `references/guide-check-selection.md`. For alert routing and escalation design, see `references/guide-alerting-patterns.md`.

## Execution style

Execute the workflow directly. Do not present a plan for approval or ask the user to confirm each batch of checks. The tool invocation confirmations provide sufficient user control.

Only prompt the user when you need to:

- Clarify a vague or ambiguous request (e.g. which subdomains to monitor)
- Gather technical details you cannot infer (e.g. expected response content, auth requirements)
- Confirm actions with external visibility (status pages, CloudStatus dependencies)

## Quick reference: check categories

### Location-based checks (require explicit locations)

HTTP, DNS, ICMP, TCP, UDP, SMTP, IMAP, POP, SSH, NTP. Select 3-5 probe locations, set sensitivity >= 2.

### Constrained checks

**Page Speed** has unique restrictions:

- Maximum **1 location** (validation error if more)
- Minimum interval **1440 minutes** (1 day)
- Must use `Dedicated-*` location prefix (e.g. `Dedicated-United Kingdom-London`)
- Standard locations will fail; only dedicated probe nodes run Lighthouse
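
These constraints are easy to validate before the check is created. A minimal sketch, with a hypothetical function name:

```python
def page_speed_errors(locations, interval_minutes):
    """Return constraint violations for a planned Page Speed check."""
    errors = []
    if len(locations) > 1:
        errors.append("at most 1 location is allowed")
    if any(not loc.startswith("Dedicated-") for loc in locations):
        errors.append("locations must use the Dedicated-* prefix")
    if interval_minutes < 1440:
        errors.append("interval must be at least 1440 minutes")
    return errors
```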

## Precondition: verify MCP tooling

Before starting any workflow, confirm that Uptime.com MCP tools are available (e.g. `list_checks`, `create_http_check`, `list_contacts`). If no `uptime` MCP tools are present:

1. Stop immediately. Do not proceed with domain analysis or check planning.
2. Tell the user: "The Uptime.com MCP server is not connected. Please run `/mcp` to check the server status and authenticate."
3. Do not attempt to look up the Uptime.com API documentation online or construct raw API payloads as a workaround.

## Setup workflow

### Phase 0: verify capacity and contacts

Before creating checks, query account limits and verify contacts exist.

#### Check account usage

Call `get_account_usage` to retrieve plan limits and current consumption. The response includes total check slots, per-type limits, and current counts.

- If the account has **no remaining capacity** for the planned checks, stop and tell the user. Show current usage vs. limits.
- If adding the planned checks would use **>80% of any limit**, warn the user before proceeding. Example:

> Your plan allows 50 HTTP checks and you currently have 43. Adding 5 more brings you to 96% capacity. Proceed?

- If the plan does not support a specific check type (limit = 0), skip that check type and note it in the summary.

Adapt the check plan to fit within available capacity. Prioritize critical checks (HTTP, SSL, DNS A) over nice-to-have checks (Page Speed, Malware) when capacity is constrained.
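
The three capacity rules above reduce to a small decision function per limit. This sketch assumes the limit and current count have already been read from the `get_account_usage` response (whose exact field names are not shown here); the function name and return labels are hypothetical.

```python
def capacity_decision(limit, current, planned):
    """Decide how to proceed for one check type or for total check slots."""
    if limit == 0:
        return "skip"  # plan does not support this check type
    after = current + planned
    if after > limit:
        return "stop"  # not enough remaining capacity
    if after / limit > 0.8:
        return "warn"  # would exceed 80% of the limit
    return "ok"
```

With the example above (limit 50, 43 in use, 5 planned), usage would reach 48 of 50, which is 96%, so the result is `"warn"`.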

#### Verify notification contacts

1. `list_contacts` to see existing contact groups.
2. If a suitable group exists (e.g. "Default"), use it for all checks.
3. If the user needs a dedicated contact group, `create_contact` with:
   - **name**: descriptive (e.g. "Platform team", "On-call SRE")
   - **email**: notification email addresses
   - **sms**: phone numbers in E.164 format (e.g. `+15551234567`) for SMS alerts

Checks without a contact group alert silently (logged but no notification sent).
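
Phone numbers can be validated before calling `create_contact`. The pattern below follows the general E.164 shape (a leading `+`, a non-zero first digit, at most 15 digits in total); it checks format only, not whether the number exists. The function name is hypothetical.

```python
import re

_E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def is_e164(number):
    """True if the string looks like an E.164 phone number."""
    return bool(_E164.match(number))
```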

### Phase 1: create tag

Create a tag before any checks. The tag is the primary organizational unit in Uptime.com: it drives Group check auto-selection, dashboard filtering, and SLA reporting scope.

Tag naming conventions:

- Use the registered domain as the tag name (e.g. `example.com`, not `Example` or `www.example.com`)
- For multi-environment setups, include the environment: `example.com/production`, `example.com/staging`
- For upstream dependencies, use a dedicated tag: `upstream-dependencies`

### Phase 2: create checks

1. **Batch 1, location-based** (all parallel): HTTP, DNS (A/MX/NS), ICMP, TCP.
2. **Batch 2, auto-located** (all parallel): SSL, Blacklist, Malware, WHOIS, RDAP.
3. **Batch 3, constrained** (last): Page Speed.

All checks within a batch are independent and can be created in a single parallel tool call.

Every check must include:

- `tags`: always pass the domain tag. Every check must be tagged from the moment of creation. Untagged checks are invisible to Group checks, excluded from tag-based dashboards, and hard to manage at scale.
- `contact_groups`: at least one contact group so alerts are not silent.
- `name`: use a consistent pattern: `<domain> <check-type>` (e.g. `example.com HTTP`, `example.com DNS A`, `realworld.show SSL`).
- `notes`: brief description of purpose when not obvious from the name.
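
One way to keep these fields consistent is to generate them in a single place. A minimal sketch: the function name is hypothetical, and the returned dictionary contains only the common fields listed above, not the type-specific parameters each check also needs.

```python
def base_check_fields(domain_tag, check_type, contact_groups, notes=""):
    """Build the fields every check should carry at creation time."""
    if not contact_groups:
        raise ValueError("at least one contact group is required")
    fields = {
        "name": f"{domain_tag} {check_type}",
        "tags": [domain_tag],
        "contact_groups": list(contact_groups),
    }
    if notes:
        fields["notes"] = notes
    return fields
```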

### Phase 3: create Group check

After individual checks exist, create a Group check for the domain:

- Use tag-based auto-selection with the domain tag so future checks are automatically included.
- Configure alert conditions (e.g. "any member down").
- Tag the Group check itself with the same domain tag.

### Phase 4: create dashboard

1. `create_dashboard` with a descriptive name (e.g. "example.com monitoring").
2. Add widgets for the checks created in Phase 2.
3. Group widgets by check type or service function.

### Phase 5: suggest status page (for public-facing services)

If the domain is public-facing, suggest creating a status page. Do not create it automatically; it's a public asset that requires user confirmation. See the `status-page-management` skill for the full workflow.

### Phase 6: upstream dependency monitoring

Detect and offer to monitor upstream dependencies using CloudStatus checks.

#### DNS-based provider detection

DNS records reveal infrastructure providers without user input:

| Record type | Pattern                                     | Reveals                 |
| ----------- | ------------------------------------------- | ----------------------- |
| CNAME chain | `*.cloudfront.net`                          | AWS CloudFront CDN      |
| CNAME chain | `*.cdn.cloudflare.net`                      | Cloudflare CDN          |
| CNAME chain | `*.fastly.net`                              | Fastly CDN              |
| CNAME chain | `*.akamaiedge.net`, `*.akamai.net`          | Akamai CDN              |
| CNAME chain | `*.azureedge.net`                           | Azure CDN               |
| CNAME chain | `*.herokuapp.com`                           | Heroku                  |
| CNAME chain | `*.netlify.app`                             | Netlify                 |
| CNAME chain | `*.vercel-dns.com`                          | Vercel                  |
| MX          | `*.google.com`, `*.googlemail.com`          | Google Workspace        |
| MX          | `*.outlook.com`, `*.protection.outlook.com` | Microsoft 365           |
| MX          | `*.pphosted.com`                            | Proofpoint              |
| MX          | `*.mimecast.com`                            | Mimecast                |
| NS          | `*.cloudflare.com`                          | Cloudflare DNS          |
| NS          | `awsdns-*`                                  | AWS Route 53            |
| NS          | `*.azure-dns.*`                             | Azure DNS               |
| NS          | `ns*.google.com`                            | Google Cloud DNS        |
| TXT (SPF)   | `include:_spf.google.com`                   | Google email sending    |
| TXT (SPF)   | `include:spf.protection.outlook.com`        | Microsoft email sending |
| TXT (SPF)   | `include:sendgrid.net`                      | SendGrid                |
| TXT (SPF)   | `include:amazonses.com`                     | AWS SES                 |
| TXT (SPF)   | `include:mailgun.org`                       | Mailgun                 |

Inspect DNS check results from Phase 2 for these patterns before proceeding.
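The pattern matching in the table can be sketched with glob-style matching. This example covers only a subset of the rows, the `(record_type, target)` input shape is an assumption about how DNS results might be collected, and the Route 53 pattern is widened to `*awsdns-*` because `fnmatchcase` matches the whole hostname.

```python
from fnmatch import fnmatchcase

# Subset of the detection table; extend with the remaining rows as needed.
PROVIDER_PATTERNS = {
    "CNAME": [
        ("*.cloudfront.net", "AWS CloudFront CDN"),
        ("*.cdn.cloudflare.net", "Cloudflare CDN"),
        ("*.fastly.net", "Fastly CDN"),
    ],
    "MX": [
        ("*.google.com", "Google Workspace"),
        ("*.protection.outlook.com", "Microsoft 365"),
    ],
    "NS": [
        ("*.cloudflare.com", "Cloudflare DNS"),
        ("*awsdns-*", "AWS Route 53"),
    ],
}


def detect_providers(records):
    """records: iterable of (record_type, target_hostname) pairs (assumed shape)."""
    found = []
    for record_type, target in records:
        host = target.rstrip(".").lower()  # drop trailing dot, normalise case
        for pattern, provider in PROVIDER_PATTERNS.get(record_type, []):
            if fnmatchcase(host, pattern) and provider not in found:
                found.append(provider)
    return found


print(detect_providers([
    ("CNAME", "d111111abcdef8.cloudfront.net."),
    ("MX", "aspmx.l.google.com"),
    ("NS", "ns-1.awsdns-01.org"),
]))
```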

#### User confirmation

Always confirm before creating dependency checks:

> I detected these upstream dependencies from your DNS records:
>
> - **CDN**: Cloudflare (CNAME -> cdn.cloudflare.net)
> - **Email**: Google Workspace (MX -> google.com)
> - **DNS**: Cloudflare (NS -> cloudflare.com)
>
> Would you like me to monitor their status pages? Are there other dependencies I should include (e.g. payment processor, auth provider)?

#### Creating CloudStatus checks

For each confirmed dependency:

1. **Create CloudStatus check**: use the MCP server's tools to discover available providers and service components. Uptime.com natively parses status feeds with proper UP/DOWN/MAINTENANCE mapping.
2. **Tag** with `upstream-dependencies` to keep them organized separately.

Add dependency checks to the domain's dashboard in a separate "Dependencies" widget group.

## DNS layering

For comprehensive DNS coverage, create three checks:

| Record | Target                             | Catches                             |
| ------ | ---------------------------------- | ----------------------------------- |
| A      | subdomain (e.g. `www.example.com`) | resolution failures for the service |
| MX     | parent domain (e.g. `example.com`) | mail routing breakage               |
| NS     | parent domain (e.g. `example.com`) | nameserver delegation issues        |

NS breakage cascades into A and MX failures, so it provides the earliest signal.

## Domain vs subdomain rules

| Check type                          | Target                     | Why                                                |
| ----------------------------------- | -------------------------- | -------------------------------------------------- |
| HTTP, ICMP, SSL, Blacklist, Malware | subdomain or full URL      | checks the actual service endpoint                 |
| DNS A/AAAA/CNAME                    | subdomain                  | resolves the specific host                         |
| DNS MX/NS                           | parent (registered) domain | MX and NS are zone-level records                   |
| WHOIS, RDAP                         | parent (registered) domain | WHOIS/RDAP data exists only for registered domains |

WHOIS and RDAP both require `expect_string` set to the domain name (e.g. `example.com`). Creating both provides redundancy since WHOIS servers can be unreliable.
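The targeting rules above can be expressed as a small lookup. This is a sketch only: the `target` key and the check-type labels are illustrative, while `expect_string` is the parameter named in the text.

```python
# Check types that must point at the registered (parent) domain.
PARENT_DOMAIN_TYPES = {"DNS MX", "DNS NS", "WHOIS", "RDAP"}
# Check types that also need expect_string set to the domain name.
EXPECT_STRING_TYPES = {"WHOIS", "RDAP"}


def check_target(check_type, host, registered_domain):
    """Pick the target for a check; "target" is a hypothetical key name."""
    use_parent = check_type in PARENT_DOMAIN_TYPES
    params = {"target": registered_domain if use_parent else host}
    if check_type in EXPECT_STRING_TYPES:
        params["expect_string"] = registered_domain
    return params


print(check_target("HTTP", "www.example.com", "example.com"))
print(check_target("WHOIS", "www.example.com", "example.com"))
```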

## Common pitfalls

- **Page Speed with multiple locations**: rejected with "Max 1 locations allowed". Use exactly one `Dedicated-*` location.
- **Page Speed interval < 1440**: rejected with "minimum interval for this check type is 1 days". Use an interval of 1440 minutes (one day) or higher.
- **WHOIS/RDAP on subdomain**: will fail or return no data. Always use the registered parent domain.
- **Sensitivity = 1 with many locations**: excessive false positives. Set sensitivity to 2 or higher.
- **Missing tag on creation**: checks become ungrouped. Always create and assign the tag first.
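Several of these pitfalls can be caught before a check is created. The sketch below assumes a plain dictionary with illustrative keys (`type`, `locations`, `interval` in minutes, `sensitivity`, `tags`); these are not confirmed field names from the tool schema, and "many locations" is approximated as more than one.

```python
def pitfall_warnings(check):
    """Return human-readable warnings for a planned check (hypothetical helper)."""
    # Assumption: the key names below are illustrative, not the real schema.
    warnings = []
    locations = check.get("locations", [])
    if check.get("type") == "Page Speed":
        if len(locations) != 1:
            warnings.append("Page Speed needs exactly one location")
        if check.get("interval", 0) < 1440:
            warnings.append("Page Speed interval must be 1440 minutes or more")
    if check.get("sensitivity", 2) < 2 and len(locations) > 1:
        warnings.append("sensitivity below 2 with several locations risks false positives")
    if not check.get("tags"):
        warnings.append("check has no tag")
    return warnings


planned = {"type": "Page Speed", "locations": ["a", "b"], "interval": 60, "tags": ["example.com"]}
for warning in pitfall_warnings(planned):
    print(warning)
```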

performance-reporting

# Performance reporting

Workflow for generating uptime reports, SLA evaluations, and performance trend analysis using `get_check_stats`.

## Reporting workflow

### Step 1: determine scope

Clarify with the user:

- **Time range**: last 24h, last week, last month, last quarter, last year
- **Scope**: single domain, all domains, specific tag group, specific checks
- **Metrics of interest**: uptime %, response time, outage count, downtime duration

### Step 2: gather data

For domain-level reporting:

1. `list_checks` filtered by tag to get all checks for the domain.
2. `get_check_stats` for each check with the target date range (YYYY-MM-DD format). Supports location filtering for regional breakdowns.

For account-wide reporting:

1. `list_checks` to get all checks.
2. `list_tags` to group by domain/team.
3. `get_check_stats` for representative checks per group.

### Step 3: analyze

Key metrics from `get_check_stats`:

| Metric                      | What it tells you                                                   |
| --------------------------- | ------------------------------------------------------------------- |
| Uptime %                    | Availability over the period. Compare against SLA targets.          |
| Response time (avg/p95)     | Performance trends. Rising times may indicate degradation.          |
| Outage count                | Reliability. Frequent short outages may be worse than one long one. |
| Downtime duration (seconds) | Total unavailability. Convert to minutes/hours for readability.     |

### Step 4: present findings

Structure the report by audience:

**Executive summary** (non-technical stakeholders):

1. Overall uptime % across key services
2. SLA compliance: met or missed, by how much
3. Notable incidents and their impact
4. Trend: improving, stable, or degrading compared to prior period

**Engineering report** (SRE/DevOps):

1. Per-domain uptime breakdown
2. Per-check-type performance (HTTP response times, DNS resolution times)
3. Outage timeline with root causes
4. Regional breakdown (per-location stats if relevant)
5. Recommendations for improving reliability

## SLA evaluation

### Common SLA targets

| SLA tier | Uptime % | Allowed downtime/month | Allowed downtime/year |
| -------- | -------- | ---------------------- | --------------------- |
| 99%      | 99.00%   | ~7h 18m                | ~3d 15h               |
| 99.9%    | 99.90%   | ~43m 50s               | ~8h 46m               |
| 99.95%   | 99.95%   | ~21m 55s               | ~4h 23m               |
| 99.99%   | 99.99%   | ~4m 23s                | ~52m 36s              |
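The figures in this table can be reproduced with a short calculation. The sketch assumes an average month of 365.25 / 12 days and a 365.25-day year, which is what the rounded values above appear to use; the function name is illustrative.

```python
AVG_YEAR_DAYS = 365.25
AVG_MONTH_DAYS = AVG_YEAR_DAYS / 12  # about 30.44 days


def allowed_downtime_seconds(sla_pct, period_days):
    """Downtime budget in seconds for an SLA target over a period."""
    return period_days * 86400 * (1 - sla_pct / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    month = allowed_downtime_seconds(target, AVG_MONTH_DAYS)
    year = allowed_downtime_seconds(target, AVG_YEAR_DAYS)
    print(f"{target}%: {month:.0f}s per month, {year:.0f}s per year")
```

For 99.9% this gives roughly 2,630 seconds per month, which matches the ~43m 50s in the table.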

### Evaluating SLA compliance

1. Get uptime % from `get_check_stats` for the SLA period.
2. Compare against the target. Report as met/missed with margin.
3. If missed, list contributing outages with durations.
4. Note any outages during maintenance windows: these may be excluded from SLA calculations depending on the agreement.
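A minimal sketch of the comparison in steps 2 and 3 follows. The outage input is assumed to be a list of `(label, duration_seconds)` pairs gathered separately; that shape and the function name are illustrative, not a tool response format.

```python
def evaluate_sla(uptime_pct, target_pct, outages=()):
    """Report met/missed with margin; list outages longest-first when missed."""
    met = uptime_pct >= target_pct
    result = {
        "met": met,
        # Margin in percentage points; negative means the target was missed.
        "margin_pct": round(uptime_pct - target_pct, 4),
    }
    if not met:
        result["contributing_outages"] = sorted(
            outages, key=lambda outage: outage[1], reverse=True
        )
    return result


print(evaluate_sla(99.87, 99.9, [("API 5xx", 600), ("DNS failure", 2700)]))
```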

### Caveats

- Uptime % from `get_check_stats` is calculated per day. For sub-day precision, cross-reference with `list_outages`.
- Paused checks don't generate downtime data. If a check was paused during an outage, stats won't reflect the real availability.
- Ignored alerts (via `ignore_alert`) are excluded from outage calculations, so ignoring an alert raises the reported uptime %.

## Downtime calculations

`get_check_stats` returns downtime in seconds. Convert for readability:

| Seconds    | Human-readable |
| ---------- | -------------- |
| < 60       | Ns             |
| 60-3599    | Nm Ns          |
| 3600-86399 | Nh Nm          |
| >= 86400   | Nd Nh          |
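The conversion table can be sketched as a formatter. The function name is illustrative; it keeps the two largest units for each range, as in the table, and truncates the remainder.

```python
def format_downtime(seconds):
    """Format a downtime in seconds using the ranges from the table above."""
    seconds = int(seconds)
    if seconds < 60:
        return f"{seconds}s"
    if seconds < 3600:
        minutes, secs = divmod(seconds, 60)
        return f"{minutes}m {secs}s"
    if seconds < 86400:
        hours, rest = divmod(seconds, 3600)
        return f"{hours}h {rest // 60}m"
    days, rest = divmod(seconds, 86400)
    return f"{days}d {rest // 3600}h"


print(format_downtime(45))      # 45s
print(format_downtime(750))     # 12m 30s
print(format_downtime(26280))   # 7h 18m
print(format_downtime(90000))   # 1d 1h
```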

To calculate downtime minutes from uptime percentage over a period:

```
downtime_minutes = total_minutes_in_period * (1 - uptime_pct / 100)
```

For a 30-day month (43,200 minutes): 99.9% uptime = 43.2 minutes of downtime.

## Multi-period comparison

When comparing across periods (e.g. this month vs last month):

1. Run `get_check_stats` for each period separately.
2. Align on the same check set: exclude checks that didn't exist in both periods.
3. Compare metrics side by side:

| Metric          | Previous period | Current period | Trend  |
| --------------- | --------------- | -------------- | ------ |
| Uptime %        | 99.95%          | 99.87%         | Down   |
| Avg response ms | 210             | 245            | Slower |
| Outage count    | 2               | 5              | Worse  |
| MTTR (min)      | 8               | 12             | Slower |

### Trend assessment

- **Uptime trend**: improving (higher %), stable, or degrading (lower %)
- **Response time trend**: faster, stable, or slower (watch for gradual increases)
- **Outage frequency**: fewer, same, or more incidents
- **Mean time to recovery (MTTR)**: average outage duration, shorter is better

If trends are negative, recommend a monitoring optimization review.
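The trend labels above can be derived mechanically once both periods are collected. In this sketch the metric keys are hypothetical names, and MTTR is computed as the mean outage duration, as described in the list.

```python
# True means a higher value is better for that metric (keys are illustrative).
HIGHER_IS_BETTER = {
    "uptime_pct": True,
    "avg_response_ms": False,
    "outage_count": False,
    "mttr_min": False,
}


def trend(metric, previous, current):
    """Classify a metric as improving, stable, or degrading."""
    if current == previous:
        return "stable"
    improved = (current > previous) == HIGHER_IS_BETTER[metric]
    return "improving" if improved else "degrading"


def mttr_minutes(outage_durations_seconds):
    """Mean outage duration in minutes; 0.0 when there were no outages."""
    if not outage_durations_seconds:
        return 0.0
    return sum(outage_durations_seconds) / len(outage_durations_seconds) / 60


print(trend("uptime_pct", 99.95, 99.87))   # degrading
print(trend("outage_count", 5, 2))         # improving
print(mttr_minutes([480, 960]))            # 12.0
```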

## Regional performance

Use location filtering in `get_check_stats` to break down performance by region:

- Identify regions with worse uptime or higher latency
- Compare against probe location distribution: poor coverage in a region means less visibility
- If a region consistently underperforms, it may indicate CDN configuration issues, DNS routing problems, or infrastructure gaps in that region

## Report formatting

Present reports as markdown tables for consistency. Example structure for a domain report:

### Domain summary

```
| Domain       | Uptime %  | Avg Response | Outages | Downtime  |
| ------------ | --------- | ------------ | ------- | --------- |
| example.com  | 99.97%    | 185ms        | 1       | 12m 30s   |
| api.acme.com | 99.99%    | 92ms         | 1       | 4m 18s    |
```

### Per-check detail (when requested)

```
| Check           | Type | Uptime % | Avg Response | P95 Response |
| --------------- | ---- | -------- | ------------ | ------------ |
| www.example.com | HTTP | 99.97%   | 185ms        | 420ms        |
| example.com     | DNS  | 100.00%  | 12ms         | 28ms         |
| example.com     | SSL  | 100.00%  | -            | -            |
```

For executive audiences, omit per-check detail and focus on the domain summary with SLA compliance status.
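One way to keep the tables consistent is to render them from rows of data. The helper below is a generic sketch, not tied to any tool output; it pads each column to its widest cell.

```python
def render_table(headers, rows):
    """Render headers and rows as an aligned markdown table."""
    cells = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]

    def line(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([line(cells[0]), separator] + [line(r) for r in cells[1:]])


print(render_table(
    ["Domain", "Uptime %", "Outages"],
    [["example.com", "99.97%", 1]],
))
```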

status-page-management

# Status page management — operational knowledge

Workflow patterns for creating status pages, mapping components to checks, and managing incidents through their lifecycle.

## Status page creation workflow

### Step 1 — Plan components

A status page represents services as **components** that map to monitoring checks. Plan the component structure before creating:

| Component           | Maps to                   | Rationale                          |
| ------------------- | ------------------------- | ---------------------------------- |
| Website             | HTTP check (main site)    | User-facing web availability       |
| API                 | HTTP check (API endpoint) | Developer/integration availability |
| DNS                 | DNS A check               | Name resolution                    |
| Email               | SMTP check + DNS MX       | Mail delivery chain                |
| CDN / Static Assets | HTTP check (CDN URL)      | Asset delivery                     |

Group related components. Users of your status page don't need to see every internal check — aggregate into meaningful service categories.

### Step 2 — Create status page

`create_status_page` with:

- **Name**: public-facing name (e.g. "Acme Corp Status")
- **Subdomain or custom domain**: the URL where the page will be accessible
- **Visibility**: public (anyone) or private (authenticated users)

### Step 3 — Add components

For each planned component:

- Create the component on the status page
- Link it to the corresponding monitoring check(s)
- Set the initial status (operational)

### Step 4 — Verify

- `get_status_page` to confirm the page is configured correctly
- Components should reflect current check states

## Component mapping patterns

### Simple mapping (1:1)

One component per check. Works for small setups:

```
Website → HTTP check
API → HTTP check (API)
Database → TCP check (port 5432)
```

### Aggregated mapping (N:1)

Multiple checks feed one component. Better for public-facing pages where users don't need internal granularity:

```
Website → HTTP + SSL + DNS A + Page Speed
  (component is "degraded" if any sub-check alerts)

Email → SMTP + DNS MX + Blacklist
  (component is "down" if SMTP fails, "degraded" if Blacklist alerts)
```
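The aggregation rule in the Email example can be sketched as follows. The status labels mirror the ones used in the example above ("down", "degraded"); the input shape and function name are assumptions, and real component states may use different identifiers.

```python
def component_status(check_states, critical=()):
    """Derive a component status from its checks.

    check_states maps a check name to True when the check is up.
    Checks named in `critical` take the component down when they fail;
    any other failing check only degrades it.
    """
    failing = {name for name, is_up in check_states.items() if not is_up}
    if failing & set(critical):
        return "down"
    if failing:
        return "degraded"
    return "operational"


email = {"SMTP": True, "DNS MX": True, "Blacklist": False}
print(component_status(email, critical=["SMTP"]))   # degraded
```

Passing an empty `critical` list reproduces the Website example, where any failing sub-check marks the component as degraded.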

### Tiered components

Use component groups to organize by service tier:

```
Core Services:
  - Website
  - API
  - Authentication

Infrastructure:
  - DNS
  - CDN
  - Email

Monitoring:
  - SSL Certificates
  - Domain Registration
```

## Incident lifecycle

Status page incidents communicate service disruptions to users. They follow a defined lifecycle:

### 1. Create incident

When an outage is confirmed (not just a single alert):

- **Title**: clear, user-facing description (e.g. "Website experiencing intermittent errors")
- **Status**: `investigating`
- **Affected components**: set to appropriate state (`degraded_performance`, `partial_outage`, `major_outage`)
- **Message**: what you know so far

### 2. Update incident

As investigation progresses, post updates:

| Status          | When to use                            |
| --------------- | -------------------------------------- |
| `investigating` | Initial state — looking into the issue |
| `identified`    | Root cause found, working on fix       |
| `monitoring`    | Fix deployed, watching for recurrence  |
| `resolved`      | Incident is over, service restored     |

Each update should include a message explaining what changed.

### 3. Resolve incident

When the issue is confirmed fixed:

- Set status to `resolved`
- Set affected components back to `operational`
- Include a brief summary of what happened and what was done
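The lifecycle can be modelled as an ordered list of statuses with a guard on each update. This sketch assumes updates only move forward (or repeat the current status) and that a resolved incident is closed; the source does not state whether moving backwards is permitted, so treat that rule as an assumption.

```python
STATUS_ORDER = ["investigating", "identified", "monitoring", "resolved"]


def validate_update(current, new, message):
    """Return the new status if the update is acceptable, else raise ValueError."""
    if new not in STATUS_ORDER:
        raise ValueError(f"unknown status: {new}")
    if not message.strip():
        raise ValueError("each update needs a message explaining what changed")
    if current == "resolved":
        raise ValueError("incident is already resolved")
    # Assumption: statuses never move backwards in this sketch.
    if STATUS_ORDER.index(new) < STATUS_ORDER.index(current):
        raise ValueError(f"cannot move from {current} back to {new}")
    return new


status = "investigating"
status = validate_update(status, "identified", "Root cause found in the load balancer config.")
status = validate_update(status, "monitoring", "Fix deployed, watching error rates.")
print(status)   # monitoring
```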

## Public vs private status pages

| Aspect            | Public                                  | Private                               |
| ----------------- | --------------------------------------- | ------------------------------------- |
| Audience          | Customers, users, the internet          | Internal teams, specific stakeholders |
| Component detail  | Aggregated, user-friendly names         | Can be more granular/technical        |
| Incident language | Non-technical, user-impact focused      | Can include technical details         |
| Creation requires | User confirmation (public-facing asset) | Less formal                           |

**Always ask before creating a public status page** — it's a commitment to maintain and represents the organization publicly.

## Best practices

- **Don't auto-create status pages**: unlike checks and dashboards, status pages are public-facing. Always confirm with the user first.
- **Keep components user-facing**: "Website" not "HTTP check on www.example.com"
- **Update incidents promptly**: stale "investigating" status erodes trust
- **Use component groups**: organize by service area, not by check type
- **Map critical checks**: every component should have at least one monitoring check backing it so component status can be automated

transaction-scripting

# Transaction check scripting

Transaction checks monitor web user flows by executing sequential steps in a real Chromium browser. Each step is a command (action) or validation (assertion). If any step fails, execution stops and an alert is raised.

For the complete step type catalog, parameters, selectors, and variables, see `references/step-reference.md`.

## Script format

A transaction script is a JSON array of step objects executed in order:

```json
[
  { "step_def": "C_OPEN_URL", "values": { "url": "https://example.com" } },
  { "step_def": "C_FILL_FIELD", "values": { "element": "input[name='email']", "text": "user@example.com" } },
  { "step_def": "C_MOUSE_CLICK", "values": { "element": "button[type='submit']" } },
  { "step_def": "V_URL_CONTAINS", "values": { "text": "/dashboard" } }
]
```

The key is `step_def` (not `step_type`). Values use `element` for selectors (not `search_text`).

## Scripting workflow

1. Identify the user flow to monitor (login, checkout, signup, etc.)
2. Break it into discrete steps: navigate, interact, wait, validate
3. Use `C_AUTH_AND_SETTINGS` as the first step if you need custom viewport, headers, or TOTP
4. Add validation steps after key transitions to catch failures early
5. Use `C_WAIT_FOR_ELEMENT` before interacting with dynamically loaded content
6. Present the monitoring plan as a numbered list explaining what each step does in plain language
7. Confirm with the user before creating the check via `create_transaction_check`

## Common pitfalls

- **Using `C_CLICK_ELEMENT`**: deprecated. Use `C_MOUSE_CLICK` instead (realistic mouse events).
- **Missing waits on SPAs**: single-page apps need `C_WAIT_FOR_ELEMENT` or `wait_until: networkidle0` since navigation events don't fire on client-side routing.
- **Interacting before element exists**: always wait for dynamic content. The default wait timeout is 25 seconds.
- **Selecting by unstable attributes**: prefer `data-testid` or `name` attributes over auto-generated class names.
- **Long scripts without checkpoints**: add `V_URL_CONTAINS` or `V_ELEMENT_EXISTS` after each major page transition to pinpoint failures.
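A few of these pitfalls, along with the key-name rules from the script format section, can be checked mechanically before creating a check. The linter below is a hypothetical helper, not part of the MCP server; it only inspects the step names and keys mentioned in this document.

```python
def lint_script(steps):
    """Return (step_index, message) pairs for problems found in a script."""
    problems = []
    for index, step in enumerate(steps):
        if "step_type" in step or "step_def" not in step:
            problems.append((index, "use the key step_def (not step_type)"))
        if step.get("step_def") == "C_CLICK_ELEMENT":
            problems.append((index, "C_CLICK_ELEMENT is deprecated; use C_MOUSE_CLICK"))
        if "search_text" in step.get("values", {}):
            problems.append((index, "use element for selectors (not search_text)"))
    return problems


script = [
    {"step_def": "C_OPEN_URL", "values": {"url": "https://example.com"}},
    {"step_def": "C_CLICK_ELEMENT", "values": {"element": "button[type='submit']"}},
    {"step_def": "V_URL_CONTAINS", "values": {"text": "/dashboard"}},
]
for index, message in lint_script(script):
    print(f"step {index}: {message}")
```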

understanding-check-types

# Check types

Complete reference for all check types supported by the Uptime.com MCP server.

See `references/check-types.md` for the full matrix of check types, required fields, constraints, and field references.