baoyu-url-to-markdown
Fetch any URL and convert to markdown using baoyu-fetch CLI (Chrome CDP with site-specific adapters). Built-in adapters for X/Twitter, YouTube transcripts, Hacker News threads, and generic pages via Defuddle. Handles login/CAPTCHA via interaction wait modes. Use when user wants to save a webpage as markdown.
89.316,415 installsjimliu/baoyu-skills
Add to your agent
curl "https://api.skyll.app/skill/URL"SKILL.md
# URL to Markdown
Fetches any URL via `baoyu-fetch` CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.
## User Input Tools
When this skill prompts the user, follow this tool-selection rule (priority order):
1. **Prefer built-in user-input tools** exposed by the current agent runtime — e.g., `AskUserQuestion`, `request_user_input`, `clarify`, `ask_user`, or any equivalent.
2. **Fallback**: if no such tool exists, emit a numbered plain-text message and ask the user to reply with the chosen number/answer for each question.
3. **Batching**: if the tool supports multiple questions per call, combine all applicable questions into a single call; if only single-question, ask them one at a time in priority order.
Concrete `AskUserQuestion` references below are examples — substitute the local equivalent in other runtimes.
## CLI Setup
**Important**: The CLI is provided by the npm package dependency `baoyu-fetch`.
**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as `{baseDir}`
2. Resolve `${BUN}` runtime: if `bun` installed → `bun`; else suggest installing Bun
3. If `{baseDir}/scripts/node_modules/.bin/baoyu-fetch` does not exist, run `${BUN} install --cwd {baseDir}/scripts`
4. `${READER}` = `{baseDir}/scripts/node_modules/.bin/baoyu-fetch`
5. Replace all `${READER}` in this document with the resolved value
## Preferences (EXTEND.md)
Check EXTEND.md in priority order — the first one found wins:
| Priority | Path | Scope |
|----------|------|-------|
| 1 | `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Project |
| 2 | `${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | XDG |
| 3 | `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | User home |
| Result | Action |
|--------|--------|
| Found | Read, parse, apply settings |
| Not found | **MUST** run first-time setup (see below) — do NOT silently create defaults |
**EXTEND.md supports**: download media by default, default output directory.
### First-Time Setup ⛔ BLOCKING
When EXTEND.md is not found, you **MUST** use `AskUserQuestion` to gather preferences before creating EXTEND.md. **NEVER** create EXTEND.md with silent defaults. Generation is BLOCKED until setup completes. Batch all three questions into a single call:
- **Q1 — Media** (header "Media"): "How to handle images and videos in pages?"
- "Ask each time (Recommended)" — Prompt after each save
- "Always download" — Download to local `imgs/` and `videos/`
- "Never download" — Keep remote URLs
- **Q2 — Output** (header "Output"): "Default output directory?"
- "url-to-markdown (Recommended)" — Save to `./url-to-markdown/{domain}/{slug}.md`
- User may pick "Other" and type a custom path
- **Q3 — Save** (header "Save"): "Where to save preferences?"
- "User (Recommended)" — `~/.baoyu-skills/` (all projects)
- "Project" — `.baoyu-skills/` (this project only)
After answers, write EXTEND.md, confirm "Preferences saved to [path]", then continue.
Full template: [references/config/first-time-setup.md](references/config/first-time-setup.md).
### Supported Keys
| Key | Default | Values | Description |
|-----|---------|--------|-------------|
| `download_media` | `ask` | `ask` / `1` / `0` | `ask` = prompt each time, `1` = always, `0` = never |
| `default_output_dir` | empty | path or empty | Default output directory (empty = `./url-to-markdown/`) |
**EXTEND.md → CLI mapping**:
| EXTEND.md key | CLI argument | Notes |
|---------------|-------------|-------|
| `download_media: 1` | `--download-media` | Requires `--output` to be set |
| `default_output_dir: ./posts/` | Agent constructs `--output ./posts/{domain}/{slug}.md` | Agent generates path, not a direct flag |
**Value priority**: CLI arguments → EXTEND.md → skill defaults.
## Usage
```bash
# Default: headless capture, markdown to stdout
${READER} <url>
# Save to file
${READER} <url> --output article.md
# Save with media download
${READER} <url> --output article.md --download-media
# Wait for interaction (login/CAPTCHA) — auto-detect and continue
${READER} <url> --wait-for interaction --output article.md
# Wait for interaction — manual control (Enter to continue)
${READER} <url> --wait-for force --output article.md
# JSON output
${READER} <url> --format json --output article.json
# Force specific adapter
${READER} <url> --adapter youtube --output transcript.md
```
## Options
| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `--output <path>` | Output file path (default: stdout) |
| `--format <type>` | Output format: `markdown` (default) or `json` |
| `--json` | Shorthand for `--format json` |
| `--adapter <name>` | Force adapter: `x`, `youtube`, `hn`, or `generic` (default: auto-detect) |
| `--headless` | Force headless Chrome (no visible window) |
| `--wait-for <mode>` | Interaction wait mode: `none` (default), `interaction`, or `force` |
| `--wait-for-interaction` | Alias for `--wait-for interaction` |
| `--wait-for-login` | Alias for `--wait-for interaction` |
| `--timeout <ms>` | Page load timeout (default: 30000) |
| `--interaction-timeout <ms>` | Login/CAPTCHA wait timeout (default: 600000 = 10 min) |
| `--interaction-poll-interval <ms>` | Poll interval for interaction checks (default: 1500) |
| `--download-media` | Download images/videos to local `imgs/` and `videos/`, rewrite markdown links. Requires `--output` |
| `--media-dir <dir>` | Base directory for downloaded media (default: same as `--output` directory) |
| `--cdp-url <url>` | Reuse existing Chrome DevTools Protocol endpoint |
| `--browser-path <path>` | Custom Chrome/Chromium binary path |
| `--chrome-profile-dir <path>` | Chrome user data directory (default: `BAOYU_CHROME_PROFILE_DIR` env or `./baoyu-skills/chrome-profile`) |
| `--debug-dir <dir>` | Write debug artifacts (document.json, markdown.md, page.html, network.json) |
## Agent Quality Gate
**CRITICAL**: treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without failing the CLI.
After every headless run, inspect the saved markdown. See [references/quality-gate.md](references/quality-gate.md) for the full checklist, recovery workflow, and capture-mode table. Read it whenever a run looks suspicious or the user asks about login/CAPTCHA handling.
## Output Path Generation
The agent must construct the output file path — `baoyu-fetch` does not auto-generate paths.
**Algorithm**:
1. Determine base directory from EXTEND.md `default_output_dir` or default `./url-to-markdown/`
2. Extract domain from URL (e.g., `example.com`)
3. Generate slug from URL path or page title (kebab-case, 2-6 words)
4. Construct: `{base_dir}/{domain}/{slug}/{slug}.md` — each URL gets its own directory so media files stay isolated
5. Conflict resolution: append timestamp `{slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md`
Pass the constructed path to `--output`. Media files (`--download-media`) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.
## Adapters & Media
See [references/adapters.md](references/adapters.md) for the adapter catalog (X, YouTube, Hacker News, generic), per-adapter notes, the media download flow (`ask` / always / never), and the JSON output schema. Read it before answering adapter-specific questions or handling media prompts.
## Environment Variables
| Variable | Description |
|----------|-------------|
| `BAOYU_CHROME_PROFILE_DIR` | Chrome user data directory (can also use `--chrome-profile-dir`) |
**Troubleshooting**: Chrome not found → use `--browser-path`. Timeout → increase `--timeout`. Login/CAPTCHA → `--wait-for interaction`. Debug → `--debug-dir` to inspect captured HTML and network logs.
## Extension Support
Custom configurations via EXTEND.md. See **Preferences** section above for paths and supported keys.References (3)
📎 adapters.md
# Adapters & Media Read when choosing an adapter, handling media, or answering adapter-specific questions. ## Built-in Adapters | Adapter | URLs | Key Features | |---------|------|-------------| | `x` | x.com, twitter.com | Tweets, threads, X Articles, media, login detection | | `youtube` | youtube.com, youtu.be | Transcript/captions, chapters, cover image, metadata | | `hn` | news.ycombinator.com | Threaded comments, story metadata, nested replies | | `generic` | Any URL (fallback) | Defuddle extraction, Readability fallback, auto-scroll, network idle detection | Adapter is auto-selected based on URL. Override with `--adapter <name>`. ### YouTube - Extracts transcripts/captions when available - Transcript format: `[MM:SS] Text segment` with chapter headings - Availability depends on YouTube exposing a caption track; videos with captions disabled or restricted playback may produce description-only output - Use `--wait-for force` if the page needs time to finish loading player metadata ### X/Twitter - Extracts single tweets, threads, and X Articles - Auto-detects login state; if logged out and content requires auth, JSON output shows `"status": "needs_interaction"` - Use `--wait-for interaction` for login-protected content ### Hacker News - Parses threaded comments with proper nesting and reply hierarchy - Includes story metadata (title, URL, author, score, comment count) - Shows comment deletion/dead status ## Media Download Workflow Driven by `download_media` in EXTEND.md: | Setting | Behavior | |---------|----------| | `1` (always) | Run CLI with `--download-media --output <path>` | | `0` (never) | Run CLI with `--output <path>` (no media download) | | `ask` (default) | Follow the ask-each-time flow below | ### Ask-Each-Time Flow 1. Run the CLI **without** `--download-media` with `--output <path>` → markdown saved 2. Check the saved markdown for remote media URLs (`https://` in image/video links) 3. **If no remote media found** → done, no prompt needed 4. **If remote media found** → ask via `AskUserQuestion`: - header: "Media", question: "Download N images/videos to local files?" - "Yes" — Download to local directories - "No" — Keep remote URLs 5. If the user confirms → run the CLI **again** with `--download-media --output <same-path>` (overwrites markdown with localized links) ### Media Layout When `--download-media` is enabled: - Images → `imgs/` next to the output file (or `--media-dir`) - Videos → `videos/` next to the output file (or `--media-dir`) - Markdown media links are rewritten to local relative paths ## Output Format Markdown to stdout (or file with `--output`). JSON output (`--format json`) returns structured data: - `adapter` — which adapter handled the URL - `status` — `"ok"` or `"needs_interaction"` - `login` — login state detection (`logged_in`, `logged_out`, `unknown`) - `interaction` — interaction gate details (kind, provider, prompt) - `document` — structured content (url, title, author, publishedAt, content blocks, metadata) - `media` — collected media assets with url, kind, role - `markdown` — converted markdown text - `downloads` — media download results (when `--download-media` used)
📎 first-time-setup.md
---
name: first-time-setup
description: First-time setup flow for baoyu-url-to-markdown preferences
---
# First-Time Setup
## Overview
When no EXTEND.md is found, guide user through preference setup.
**BLOCKING OPERATION**: This setup MUST complete before ANY other workflow steps. Do NOT:
- Start converting URLs
- Ask about URLs or output paths
- Proceed to any conversion
ONLY ask the questions in this setup flow, save EXTEND.md, then continue.
## Setup Flow
```
No EXTEND.md found
|
v
+---------------------+
| AskUserQuestion |
| (all questions) |
+---------------------+
|
v
+---------------------+
| Create EXTEND.md |
+---------------------+
|
v
Continue conversion
```
## Questions
**Language**: Use user's input language or saved language preference.
Use AskUserQuestion with ALL questions in ONE call:
### Question 1: Download Media
```yaml
header: "Media"
question: "How to handle images and videos in pages?"
options:
- label: "Ask each time (Recommended)"
description: "After saving markdown, ask whether to download media"
- label: "Always download"
description: "Always download media to local imgs/ and videos/ directories"
- label: "Never download"
description: "Keep original remote URLs in markdown"
```
### Question 2: Default Output Directory
```yaml
header: "Output"
question: "Default output directory?"
options:
- label: "url-to-markdown (Recommended)"
description: "Save to ./url-to-markdown/{domain}/{slug}.md"
```
Note: User will likely choose "Other" to type a custom path.
### Question 3: Save Location
```yaml
header: "Save"
question: "Where to save preferences?"
options:
- label: "User (Recommended)"
description: "~/.baoyu-skills/ (all projects)"
- label: "Project"
description: ".baoyu-skills/ (this project only)"
```
## Save Locations
| Choice | Path | Scope |
|--------|------|-------|
| User | `~/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | All projects |
| Project | `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Current project |
## After Setup
1. Create directory if needed
2. Write EXTEND.md
3. Confirm: "Preferences saved to [path]"
4. Continue with conversion using saved preferences
## EXTEND.md Template
```md
download_media: [ask/1/0]
default_output_dir: [path or empty]
```
## Modifying Preferences Later
Users can edit EXTEND.md directly or delete it to trigger setup again.
📎 quality-gate.md
# Quality Gate & Recovery Headless Chrome can silently return low-quality content — layout shells, login walls, or framework payloads — without the CLI returning a non-zero exit code. Read this after every headless run so you can catch and recover from those cases. ## Checks the Agent Must Run 1. Confirm the markdown title matches the target page, not a generic site shell 2. Confirm the body contains the expected article/page content, not just navigation, footer, or a generic error 3. Watch for obvious failure signs: - `Application error` - `This page could not be found` - Login, signup, subscribe, or verification shells - Extremely short markdown for a page that should be long-form - Raw framework payloads or mostly boilerplate content 4. Do NOT accept a run as successful just because the CLI exited `0` **Tip**: run with `--format json` to get structured signals including `status`, `login.state`, and `interaction`. `"status": "needs_interaction"` means the page requires manual interaction. ## Recovery Workflow 1. Start headless (default) unless there is already a clear reason to use interaction mode 2. Review markdown quality immediately after the run 3. If the content is low quality or indicates login/CAPTCHA: - `--wait-for interaction` for auto-detected gates (login, CAPTCHA, Cloudflare) - `--wait-for force` when the page needs manual browsing, scroll loading, or complex interaction 4. If `--wait-for` is used, tell the user exactly what to do: - Login required → sign in in the browser - CAPTCHA visible → solve it - Slow loading → wait until content is visible - `--wait-for force` → press Enter when ready 5. If JSON output shows `"status": "needs_interaction"`, switch to `--wait-for interaction` automatically ## Capture Modes | Mode | Behavior | Use When | |------|----------|----------| | Default | Headless Chrome, auto-extract on network idle | Public pages, static content | | `--headless` | Explicit headless (same as default) | Clarify intent | | `--wait-for interaction` | Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues | Login-required, CAPTCHA-protected | | `--wait-for force` | Opens visible Chrome, auto-detects OR accepts Enter keypress to continue | Complex flows, lazy loading, paywalls | **Interaction gate auto-detection**: Cloudflare Turnstile / "just a moment" pages, Google reCAPTCHA, hCaptcha, custom challenge / verification screens.