docpull

Fetch the web.
Get clean Markdown.

Turn server-rendered docs into clean Markdown your agents can actually use. No browser, no API key, no data leaving your machine.

pip install docpull

Best for static docs, API references, and server-rendered sites. JS-rendered SPAs are detected and skipped; pass --strict-js-required to make that an error so your agent can route elsewhere.

How it works

Three steps from URL to usable Markdown.

[Diagram: Step 01 takes an https:// URL, Step 02 discovers pages and converts HTML to Markdown, Step 03 feeds a vector store, RAG pipeline, or agent skill.]

Point

Give docpull a docs URL, public or gated.

Fetch

It discovers pages, respects robots.txt, and converts server HTML.

Use

Use the Markdown in search, RAG, offline archives, or skills.
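To make the robots.txt part of the Fetch step concrete, here is a minimal sketch of an allow/deny check using Python's standard `urllib.robotparser`. It is an illustration of the general technique, not docpull's own code; the `build_robots_checker` helper and the sample robots.txt body are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical robots.txt body, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(SAMPLE_ROBOTS)
print(allowed("https://example.com/docs/intro"))    # True
print(allowed("https://example.com/private/keys"))  # False
```

A crawler would typically fetch `/robots.txt` once per host and reuse the resulting checker for every discovered URL on that host.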

Features

The boring pieces that make documentation ingestion dependable.

Markdown Agents Can Use

Every page includes clean Markdown plus frontmatter for title, source URL, headings, and description. Drop it into RAG, search, or a skill directory.

No Duplicate Slop

Pages are SHA-256 hashed while they stream in, so duplicates are caught before they hit disk instead of cleaned up later.
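The idea of hashing a page while it streams can be sketched with `hashlib`, which accepts data incrementally. This is a minimal illustration under the assumption that a page arrives as a sequence of byte chunks; `stream_digest` and `Deduplicator` are hypothetical names, not part of docpull.

```python
import hashlib
from typing import Iterable, Set


def stream_digest(chunks: Iterable[bytes]) -> str:
    """Hash a page chunk by chunk, without buffering the whole body."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


class Deduplicator:
    """Remember the digests seen so far and flag repeats."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_new(self, chunks: Iterable[bytes]) -> bool:
        digest = stream_digest(chunks)
        if digest in self._seen:
            return False  # identical bytes were already seen; skip the write
        self._seen.add(digest)
        return True


dedup = Deduplicator()
print(dedup.is_new([b"<h1>", b"Hi</h1>"]))  # True
print(dedup.is_new([b"<h1>Hi", b"</h1>"]))  # False: same bytes, different chunking
```

Because the digest depends only on the concatenated bytes, how the network happens to split the response does not affect the result.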

Safe for Agent-Chosen URLs

HTTPS-only, robots.txt compliant, SSRF-protected, and DNS-pinned at connect time. Use --require-pinned-dns when proxy settings weaken that guarantee.
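As a rough sketch of what an HTTPS-only, SSRF-aware check can look like, the function below validates a URL against a list of already-resolved IP addresses. It assumes the caller resolves the hostname once and then connects only to the addresses that passed the check, which is the general idea behind pinning DNS at connect time. `is_safe_target` is a hypothetical helper, not docpull's implementation.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_target(url: str, resolved_ips: list[str]) -> bool:
    """Accept only https URLs whose resolved addresses are all public."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    if not resolved_ips:
        return False  # nothing resolved, so nothing to pin to
    # is_global is False for loopback, private, and link-local ranges.
    return all(ipaddress.ip_address(ip).is_global for ip in resolved_ips)


print(is_safe_target("https://example.com/docs", ["93.184.216.34"]))      # True
print(is_safe_target("http://example.com/docs", ["93.184.216.34"]))       # False
print(is_safe_target("https://metadata.example/", ["169.254.169.254"]))   # False
```

Checking the resolved addresses rather than the hostname matters because a public-looking name can resolve to an internal address.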

Cheap to Re-run

Cached pages use If-None-Match and If-Modified-Since. Re-runs fetch what changed, and saved frontier state lets interrupted crawls resume.
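The conditional-request mechanics can be sketched in a few lines. This assumes a cache that stored the `ETag` and `Last-Modified` values from the previous response; the function names are hypothetical and the sketch makes no claim about how docpull stores its cache.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from validators saved on the last fetch."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_rewrite(status: int) -> bool:
    """304 Not Modified means the cached copy is still current."""
    return status != 304


print(conditional_headers('"abc123"', "Wed, 01 Jan 2025 00:00:00 GMT"))
print(needs_rewrite(304))  # False
print(needs_rewrite(200))  # True
```

A server that answers 304 sends no body, which is why re-runs against unchanged docs stay cheap.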

Crawl the Parts That Matter

Include and exclude path globs during discovery, so your model gets the relevant docs instead of every route the site exposes.
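Include and exclude filtering can be sketched with the standard `fnmatch` module. Note the assumptions: in `fnmatch`, `*` also matches `/`, and docpull's own glob semantics may differ; `path_selected` is a hypothetical helper, and letting excludes win over includes is a choice made for this sketch.

```python
from fnmatch import fnmatchcase
from typing import Sequence
from urllib.parse import urlsplit


def path_selected(url: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """Keep a URL if its path matches an include glob and no exclude glob."""
    path = urlsplit(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False  # excludes take priority in this sketch
    # An empty include list means "everything not excluded".
    return not include or any(fnmatchcase(path, pattern) for pattern in include)


include = ["/docs/*"]
exclude = ["/docs/changelog/*"]
print(path_selected("https://example.com/docs/api/auth", include, exclude))        # True
print(path_selected("https://example.com/docs/changelog/2023", include, exclude))  # False
print(path_selected("https://example.com/blog/post", include, exclude))            # False
```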

Profiles

Choose the output shape before you crawl.

RAG

Clean Markdown with metadata and deduping for retrieval.

docpull URL --profile rag

Mirror

A fuller local archive with cache, resume, and stable paths.

docpull URL --profile mirror

Quick

A 50-page sample when you need to inspect output first.

docpull URL --profile quick

LLM

Token-aware NDJSON chunks that skip JS-only pages unless strict mode is enabled.

docpull URL --profile llm --stream | jq .
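If you would rather consume the stream in Python than in jq, NDJSON is simply one JSON object per line. The sketch below shows a generic reader; the `url` and `text` field names in the sample are hypothetical, since the actual chunk schema is not described here.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical records, for illustration only.
sample = [
    '{"url": "https://example.com/a", "text": "First chunk"}\n',
    "\n",
    '{"url": "https://example.com/b", "text": "Second chunk"}\n',
]

records = list(iter_ndjson(sample))
print(len(records))        # 2
print(records[0]["text"])  # First chunk
```

Passing `sys.stdin` instead of `sample` lets the same function read directly from a pipe, one record at a time.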

Examples

See the command, then see the artifact it leaves behind.

Input
docpull https://docs.stripe.com
Output
./docs/authentication.md:

---
title: "Authentication"
source: https://docs.stripe.com/authentication
---

# Authentication

The Stripe API uses API keys to authenticate requests.
You can view and manage your API keys in the Stripe
Dashboard.

Test mode secret keys have the prefix sk_test_ and live
mode secret keys have the prefix sk_live_...
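Downstream code usually needs to separate that frontmatter from the body. Here is a minimal sketch that handles only flat `key: value` lines like the two shown above; real frontmatter is typically YAML and may contain lists (for headings, say), which would call for a YAML parser. `split_frontmatter` is a hypothetical helper.

```python
from typing import Dict, Tuple


def split_frontmatter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, body) for a document that starts with a --- block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    meta: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, markdown  # no closing delimiter found; treat as plain Markdown


SAMPLE = """\
---
title: "Authentication"
source: https://docs.stripe.com/authentication
---

# Authentication
"""

meta, body = split_frontmatter(SAMPLE)
print(meta["title"])   # Authentication
print(meta["source"])  # https://docs.stripe.com/authentication
```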

Install

Install once, then crawl from your terminal, scripts, or agent workflow. Requires Python 3.10 or newer.

pip install docpull
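For the script and agent-workflow case, one option is to call the CLI through `subprocess`. The sketch below uses only the `--profile` flag and the four profile names listed above; `build_command` and `run_docpull` are hypothetical wrappers, and the meaning of docpull's exit codes is not assumed beyond returning whatever the process reports.

```python
import shutil
import subprocess
from typing import List

PROFILES = {"rag", "mirror", "quick", "llm"}


def build_command(url: str, profile: str) -> List[str]:
    """Assemble the argument list for a docpull run."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return ["docpull", url, "--profile", profile]


def run_docpull(url: str, profile: str = "quick") -> int:
    """Run docpull and return its exit code."""
    if shutil.which("docpull") is None:
        raise RuntimeError("docpull is not on PATH; run `pip install docpull` first")
    # Passing a list (not a shell string) avoids shell-quoting problems with URLs.
    return subprocess.run(build_command(url, profile), check=False).returncode


print(build_command("https://docs.stripe.com", "quick"))
```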

Why docpull?

Answers to questions people ask before installing.