baoyu-url-to-markdown

Fetch any URL and convert to markdown using baoyu-fetch CLI (Chrome CDP with site-specific adapters). Built-in adapters for X/Twitter, YouTube transcripts, Hacker News threads, and generic pages via Defuddle. Handles login/CAPTCHA via interaction wait modes. Use when user wants to save a webpage as markdown.

89.316,415 installsjimliu/baoyu-skills

skills.sh GitHub

Add to your agent

curl "https://api.skyll.app/skill/URL"

SKILL.md

# URL to Markdown

Fetches any URL via `baoyu-fetch` CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.

## User Input Tools

When this skill prompts the user, follow this tool-selection rule (priority order):

1. **Prefer built-in user-input tools** exposed by the current agent runtime — e.g., `AskUserQuestion`, `request_user_input`, `clarify`, `ask_user`, or any equivalent.
2. **Fallback**: if no such tool exists, emit a numbered plain-text message and ask the user to reply with the chosen number/answer for each question.
3. **Batching**: if the tool supports multiple questions per call, combine all applicable questions into a single call; if only single-question, ask them one at a time in priority order.

Concrete `AskUserQuestion` references below are examples — substitute the local equivalent in other runtimes.

## CLI Setup

**Important**: The CLI is provided by the npm package dependency `baoyu-fetch`. 

**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as `{baseDir}`
2. Resolve `${BUN}` runtime: if `bun` installed → `bun`; else suggest installing Bun
3. If `{baseDir}/scripts/node_modules/.bin/baoyu-fetch` does not exist, run `${BUN} install --cwd {baseDir}/scripts`
4. `${READER}` = `{baseDir}/scripts/node_modules/.bin/baoyu-fetch`
5. Replace all `${READER}` in this document with the resolved value

## Preferences (EXTEND.md)

Check EXTEND.md in priority order — the first one found wins:

| Priority | Path | Scope |
|----------|------|-------|
| 1 | `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Project |
| 2 | `${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | XDG |
| 3 | `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | User home |

| Result | Action |
|--------|--------|
| Found | Read, parse, apply settings |
| Not found | **MUST** run first-time setup (see below) — do NOT silently create defaults |

**EXTEND.md supports**: download media by default, default output directory.

### First-Time Setup ⛔ BLOCKING

When EXTEND.md is not found, you **MUST** use `AskUserQuestion` to gather preferences before creating EXTEND.md. **NEVER** create EXTEND.md with silent defaults. Generation is BLOCKED until setup completes. Batch all three questions into a single call:

- **Q1 — Media** (header "Media"): "How to handle images and videos in pages?"
  - "Ask each time (Recommended)" — Prompt after each save
  - "Always download" — Download to local `imgs/` and `videos/`
  - "Never download" — Keep remote URLs
- **Q2 — Output** (header "Output"): "Default output directory?"
  - "url-to-markdown (Recommended)" — Save to `./url-to-markdown/{domain}/{slug}.md`
  - User may pick "Other" and type a custom path
- **Q3 — Save** (header "Save"): "Where to save preferences?"
  - "User (Recommended)" — `~/.baoyu-skills/` (all projects)
  - "Project" — `.baoyu-skills/` (this project only)

After answers, write EXTEND.md, confirm "Preferences saved to [path]", then continue.

Full template: [references/config/first-time-setup.md](references/config/first-time-setup.md).

### Supported Keys

| Key | Default | Values | Description |
|-----|---------|--------|-------------|
| `download_media` | `ask` | `ask` / `1` / `0` | `ask` = prompt each time, `1` = always, `0` = never |
| `default_output_dir` | empty | path or empty | Default output directory (empty = `./url-to-markdown/`) |

**EXTEND.md → CLI mapping**:

| EXTEND.md key | CLI argument | Notes |
|---------------|-------------|-------|
| `download_media: 1` | `--download-media` | Requires `--output` to be set |
| `default_output_dir: ./posts/` | Agent constructs `--output ./posts/{domain}/{slug}.md` | Agent generates path, not a direct flag |

**Value priority**: CLI arguments → EXTEND.md → skill defaults.

## Usage

```bash
# Default: headless capture, markdown to stdout
${READER} <url>

# Save to file
${READER} <url> --output article.md

# Save with media download
${READER} <url> --output article.md --download-media

# Wait for interaction (login/CAPTCHA) — auto-detect and continue
${READER} <url> --wait-for interaction --output article.md

# Wait for interaction — manual control (Enter to continue)
${READER} <url> --wait-for force --output article.md

# JSON output
${READER} <url> --format json --output article.json

# Force specific adapter
${READER} <url> --adapter youtube --output transcript.md
```

## Options

| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `--output <path>` | Output file path (default: stdout) |
| `--format <type>` | Output format: `markdown` (default) or `json` |
| `--json` | Shorthand for `--format json` |
| `--adapter <name>` | Force adapter: `x`, `youtube`, `hn`, or `generic` (default: auto-detect) |
| `--headless` | Force headless Chrome (no visible window) |
| `--wait-for <mode>` | Interaction wait mode: `none` (default), `interaction`, or `force` |
| `--wait-for-interaction` | Alias for `--wait-for interaction` |
| `--wait-for-login` | Alias for `--wait-for interaction` |
| `--timeout <ms>` | Page load timeout (default: 30000) |
| `--interaction-timeout <ms>` | Login/CAPTCHA wait timeout (default: 600000 = 10 min) |
| `--interaction-poll-interval <ms>` | Poll interval for interaction checks (default: 1500) |
| `--download-media` | Download images/videos to local `imgs/` and `videos/`, rewrite markdown links. Requires `--output` |
| `--media-dir <dir>` | Base directory for downloaded media (default: same as `--output` directory) |
| `--cdp-url <url>` | Reuse existing Chrome DevTools Protocol endpoint |
| `--browser-path <path>` | Custom Chrome/Chromium binary path |
| `--chrome-profile-dir <path>` | Chrome user data directory (default: `BAOYU_CHROME_PROFILE_DIR` env or `./baoyu-skills/chrome-profile`) |
| `--debug-dir <dir>` | Write debug artifacts (document.json, markdown.md, page.html, network.json) |

## Agent Quality Gate

**CRITICAL**: treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without failing the CLI.

After every headless run, inspect the saved markdown. See [references/quality-gate.md](references/quality-gate.md) for the full checklist, recovery workflow, and capture-mode table. Read it whenever a run looks suspicious or the user asks about login/CAPTCHA handling.

## Output Path Generation

The agent must construct the output file path — `baoyu-fetch` does not auto-generate paths.

**Algorithm**:
1. Determine base directory from EXTEND.md `default_output_dir` or default `./url-to-markdown/`
2. Extract domain from URL (e.g., `example.com`)
3. Generate slug from URL path or page title (kebab-case, 2-6 words)
4. Construct: `{base_dir}/{domain}/{slug}/{slug}.md` — each URL gets its own directory so media files stay isolated
5. Conflict resolution: append timestamp `{slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md`

Pass the constructed path to `--output`. Media files (`--download-media`) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.

## Adapters & Media

See [references/adapters.md](references/adapters.md) for the adapter catalog (X, YouTube, Hacker News, generic), per-adapter notes, the media download flow (`ask` / always / never), and the JSON output schema. Read it before answering adapter-specific questions or handling media prompts.

## Environment Variables

| Variable | Description |
|----------|-------------|
| `BAOYU_CHROME_PROFILE_DIR` | Chrome user data directory (can also use `--chrome-profile-dir`) |

**Troubleshooting**: Chrome not found → use `--browser-path`. Timeout → increase `--timeout`. Login/CAPTCHA → `--wait-for interaction`. Debug → `--debug-dir` to inspect captured HTML and network logs.

## Extension Support

Custom configurations via EXTEND.md. See **Preferences** section above for paths and supported keys.

References (3)

📎 adapters.md

# Adapters & Media

Read when choosing an adapter, handling media, or answering adapter-specific questions.

## Built-in Adapters

| Adapter | URLs | Key Features |
|---------|------|-------------|
| `x` | x.com, twitter.com | Tweets, threads, X Articles, media, login detection |
| `youtube` | youtube.com, youtu.be | Transcript/captions, chapters, cover image, metadata |
| `hn` | news.ycombinator.com | Threaded comments, story metadata, nested replies |
| `generic` | Any URL (fallback) | Defuddle extraction, Readability fallback, auto-scroll, network idle detection |

Adapter is auto-selected based on URL. Override with `--adapter <name>`.

### YouTube

- Extracts transcripts/captions when available
- Transcript format: `[MM:SS] Text segment` with chapter headings
- Availability depends on YouTube exposing a caption track; videos with captions disabled or restricted playback may produce description-only output
- Use `--wait-for force` if the page needs time to finish loading player metadata

### X/Twitter

- Extracts single tweets, threads, and X Articles
- Auto-detects login state; if logged out and content requires auth, JSON output shows `"status": "needs_interaction"`
- Use `--wait-for interaction` for login-protected content

### Hacker News

- Parses threaded comments with proper nesting and reply hierarchy
- Includes story metadata (title, URL, author, score, comment count)
- Shows comment deletion/dead status

## Media Download Workflow

Driven by `download_media` in EXTEND.md:

| Setting | Behavior |
|---------|----------|
| `1` (always) | Run CLI with `--download-media --output <path>` |
| `0` (never) | Run CLI with `--output <path>` (no media download) |
| `ask` (default) | Follow the ask-each-time flow below |

### Ask-Each-Time Flow

1. Run the CLI **without** `--download-media` with `--output <path>` → markdown saved
2. Check the saved markdown for remote media URLs (`https://` in image/video links)
3. **If no remote media found** → done, no prompt needed
4. **If remote media found** → ask via `AskUserQuestion`:
   - header: "Media", question: "Download N images/videos to local files?"
   - "Yes" — Download to local directories
   - "No" — Keep remote URLs
5. If the user confirms → run the CLI **again** with `--download-media --output <same-path>` (overwrites markdown with localized links)

### Media Layout

When `--download-media` is enabled:

- Images → `imgs/` next to the output file (or `--media-dir`)
- Videos → `videos/` next to the output file (or `--media-dir`)
- Markdown media links are rewritten to local relative paths

## Output Format

Markdown to stdout (or file with `--output`).

JSON output (`--format json`) returns structured data:

- `adapter` — which adapter handled the URL
- `status` — `"ok"` or `"needs_interaction"`
- `login` — login state detection (`logged_in`, `logged_out`, `unknown`)
- `interaction` — interaction gate details (kind, provider, prompt)
- `document` — structured content (url, title, author, publishedAt, content blocks, metadata)
- `media` — collected media assets with url, kind, role
- `markdown` — converted markdown text
- `downloads` — media download results (when `--download-media` used)

📎 first-time-setup.md

---
name: first-time-setup
description: First-time setup flow for baoyu-url-to-markdown preferences
---

# First-Time Setup

## Overview

When no EXTEND.md is found, guide user through preference setup.

**BLOCKING OPERATION**: This setup MUST complete before ANY other workflow steps. Do NOT:
- Start converting URLs
- Ask about URLs or output paths
- Proceed to any conversion

ONLY ask the questions in this setup flow, save EXTEND.md, then continue.

## Setup Flow

```
No EXTEND.md found
        |
        v
+---------------------+
| AskUserQuestion     |
| (all questions)     |
+---------------------+
        |
        v
+---------------------+
| Create EXTEND.md    |
+---------------------+
        |
        v
    Continue conversion
```

## Questions

**Language**: Use user's input language or saved language preference.

Use AskUserQuestion with ALL questions in ONE call:

### Question 1: Download Media

```yaml
header: "Media"
question: "How to handle images and videos in pages?"
options:
  - label: "Ask each time (Recommended)"
    description: "After saving markdown, ask whether to download media"
  - label: "Always download"
    description: "Always download media to local imgs/ and videos/ directories"
  - label: "Never download"
    description: "Keep original remote URLs in markdown"
```

### Question 2: Default Output Directory

```yaml
header: "Output"
question: "Default output directory?"
options:
  - label: "url-to-markdown (Recommended)"
    description: "Save to ./url-to-markdown/{domain}/{slug}.md"
```

Note: User will likely choose "Other" to type a custom path.

### Question 3: Save Location

```yaml
header: "Save"
question: "Where to save preferences?"
options:
  - label: "User (Recommended)"
    description: "~/.baoyu-skills/ (all projects)"
  - label: "Project"
    description: ".baoyu-skills/ (this project only)"
```

## Save Locations

| Choice | Path | Scope |
|--------|------|-------|
| User | `~/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | All projects |
| Project | `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Current project |

## After Setup

1. Create directory if needed
2. Write EXTEND.md
3. Confirm: "Preferences saved to [path]"
4. Continue with conversion using saved preferences

## EXTEND.md Template

```md
download_media: [ask/1/0]
default_output_dir: [path or empty]
```

## Modifying Preferences Later

Users can edit EXTEND.md directly or delete it to trigger setup again.

📎 quality-gate.md

# Quality Gate & Recovery

Headless Chrome can silently return low-quality content — layout shells, login walls, or framework payloads — without the CLI returning a non-zero exit code. Read this after every headless run so you can catch and recover from those cases.

## Checks the Agent Must Run

1. Confirm the markdown title matches the target page, not a generic site shell
2. Confirm the body contains the expected article/page content, not just navigation, footer, or a generic error
3. Watch for obvious failure signs:
- `Application error`
- `This page could not be found`
- Login, signup, subscribe, or verification shells
- Extremely short markdown for a page that should be long-form
- Raw framework payloads or mostly boilerplate content
4. Do NOT accept a run as successful just because the CLI exited `0`

**Tip**: run with `--format json` to get structured signals including `status`, `login.state`, and `interaction`. `"status": "needs_interaction"` means the page requires manual interaction.

## Recovery Workflow

1. Start headless (default) unless there is already a clear reason to use interaction mode
2. Review markdown quality immediately after the run
3. If the content is low quality or indicates login/CAPTCHA:
- `--wait-for interaction` for auto-detected gates (login, CAPTCHA, Cloudflare)
- `--wait-for force` when the page needs manual browsing, scroll loading, or complex interaction
4. If `--wait-for` is used, tell the user exactly what to do:
- Login required → sign in in the browser
- CAPTCHA visible → solve it
- Slow loading → wait until content is visible
- `--wait-for force` → press Enter when ready
5. If JSON output shows `"status": "needs_interaction"`, switch to `--wait-for interaction` automatically

## Capture Modes

| Mode | Behavior | Use When |
|------|----------|----------|
| Default | Headless Chrome, auto-extract on network idle | Public pages, static content |
| `--headless` | Explicit headless (same as default) | Clarify intent |
| `--wait-for interaction` | Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues | Login-required, CAPTCHA-protected |
| `--wait-for force` | Opens visible Chrome, auto-detects OR accepts Enter keypress to continue | Complex flows, lazy loading, paywalls |

**Interaction gate auto-detection**: Cloudflare Turnstile / "just a moment" pages, Google reCAPTCHA, hCaptcha, custom challenge / verification screens.