Webpage Screenshot for AI Agents: One HTTP Call

Short answer: call https://screenshotrender.com/api/v1/screenshot?apiKey=YOUR_API_KEY&url=https://news.ycombinator.com&fullPage=true from your agent, pass the returned data.screenshot URL straight into Claude, GPT-4o, or Gemini as an image input, and the vision model sees the page exactly the way a human does.

By the end you will have a working call that feeds Claude, GPT-4o, or Gemini a pixel-perfect PNG of any public URL, including Cloudflare protected and JavaScript heavy pages, with one HTTP GET and zero headless browser code in your agent loop.

Why can't you just give an AI agent the raw HTML?

Raw HTML is the wrong tool for three reasons: it is huge, it is layout blind, and it strips out the exact visual signals vision models were built to read. The HTTP Archive page weight report puts the median web page at over 2 MB of total transfer, and the HTML alone is hundreds of kilobytes once you include analytics scripts, hydration payloads, and inline CSS the agent does not care about. A 1280 by 720 PNG of the same page is typically smaller, and it contains the information the model actually needs.

Layout blindness is the bigger problem. HTML tells the model what elements exist, not what the user sees. A "Buy now" button hidden behind a cookie banner is in the DOM, but no human can click it, and an agent that trusts the HTML will report success on a page that is in fact unusable. The vision model sees the cookie banner; the HTML parser does not.

Then there is the cost. Pasting a full HTML page into Claude or GPT-4o burns tens of thousands of input tokens before the model even reads the question, because every character of script, every inline style, every analytics payload counts. Sending a screenshot costs a flat per-image token rate published by each vendor, regardless of how complex the underlying page is. For an agent making many page reads per task, the screenshot path is an order of magnitude cheaper.

How do you turn any URL into a screenshot for a vision model in one call?

Send one HTTP GET to ScreenshotRender and you get back a hosted PNG URL that any vision model can read directly. The complete request is one line you can copy:

https://screenshotrender.com/api/v1/screenshot?apiKey=YOUR_API_KEY&url=https://news.ycombinator.com&fullPage=true

Replace YOUR_API_KEY with your sr- prefixed key from the ScreenshotRender dashboard. The url parameter is the page you want to capture. fullPage=true tells the renderer to capture the whole scrollable document instead of the 1280 by 720 viewport, which is what you want when the agent needs to reason about anything below the fold.

The response is JSON. The image is at data.screenshot, a hosted URL valid for download. You can hand that URL directly to a vision model, no base64 step required, no temp file written to disk inside your agent.

Two more parameters matter for agents. wait takes a millisecond value and is what you reach for when the target is a React or Vue dashboard that hydrates after the initial paint: append &wait=2000 and the renderer pauses two seconds before snapping. timeout caps the whole request so a slow page does not stall an agent step.

Test the exact call against any target URL in the interactive playground before wiring it into an agent that will hit it a thousand times.

How do you pass the screenshot into Claude, GPT-4o, or Gemini?

Claude and GPT-4o accept image URLs directly in the message payload; Gemini wants base64 or a Files API upload. The ScreenshotRender response gives you a hosted URL, which is the format the first two providers prefer and the format that adds zero latency to your agent loop.

Three patterns by provider:

Anthropic Claude: in the user message, add a content block of type image with source.type = "url" and source.url set to data.screenshot from the ScreenshotRender response. The Anthropic vision docs have the full payload shape.
OpenAI GPT-4o: in the chat completions messages array, add a content part of type image_url with image_url.url set to the same data.screenshot value. The OpenAI vision guide covers the request format.
Google Gemini: Gemini does not accept remote URLs directly. Fetch the URL inside your agent, base64 encode the bytes per the MDN base64 reference, then pass it as inline_data with mime_type = "image/png". The Gemini vision docs show the exact field names.

Skip the Chromium build, the Cloudflare fight, and the EC2 fleet.

One HTTP GET from your agent to ScreenshotRender returns a hosted PNG with cookie banners and ads already stripped. No headless browser inside the agent loop, no 170 MB Chromium binary in your Docker image, no proxy budget for Cloudflare.

Get an API key

Why do AI agents need full page screenshots and not viewport ones?

Agents that take action on a page (clicking, filling forms, comparing prices) need to see the whole document because the next reasoning step usually depends on something below the fold. A 1280 by 720 viewport screenshot of a pricing page often shows only the Free tier; the Hobby and Growth tiers your agent was asked to compare are 1,500 pixels further down. Reading the viewport, the agent will confidently report "the only plan is Free." Reading the full page, it gets the actual answer.

ScreenshotRender's fullPage=true handles the scrolling, lazy-load triggering, and image stitching so the PNG the model sees is the complete document. Cookie banners, GDPR consent popups, ad overlays, and chat widgets are removed automatically before the capture, which means the vision model is not burning tokens describing a "We use cookies" banner on every screenshot.

For Cloudflare protected targets (which a surprising number of agent use cases hit, especially anything involving G2, Capterra, or marketplace listings), vanilla headless Chromium gets a 403 because of the bot fingerprints documented in the Cloudflare Turnstile docs. ScreenshotRender's Stealth Mode (bundled from the Hobby plan, see pricing) handles those challenges so the agent never sees a "Checking your browser" page.

The agent reads the same page a human would, not the page a bot would.

When does this approach fail?

Three scenarios where the URL plus hosted API path needs a workaround:

Login gated pages. A public screenshot API takes a URL, not a session cookie. If the agent needs to read a page behind authentication, generate a signed share URL on the source app first, host the browser yourself, or ask the user to paste a screenshot.
Pixel-budget limits per model. Every vision model down-samples oversized images before reading them (see the OpenAI vision guide for the exact rules). A full-page screenshot of a very tall landing page gets compressed before the model reads it, which can lose small text. When fine detail matters, send the viewport screenshot or split a long page into segments at known H2 boundaries.
High frequency agent loops on the free tier. The free plan is rate limited to 40 requests per minute. An aggressive agent fan-out can blow through that in seconds. Throttle the agent's concurrency, or upgrade to a plan with a higher per-minute cap.

Everything else (the 90 percent of cases where an agent wants to look at a public marketing page, docs site, product listing, search result, or competitor page) the HTTP GET path is the right answer.

Common questions about screenshots for AI agents

Which AI models accept screenshot images directly?

All current frontier vision models accept screenshot images: Anthropic Claude (Sonnet, Opus, Haiku), OpenAI GPT-4o and GPT-4 Turbo with Vision, Google Gemini (1.5 Pro and Flash, 2.0), and most open source vision models like Llama 3.2 Vision and Qwen-VL. Claude and GPT-4o accept image URLs directly; Gemini wants base64 or a Files API upload. ScreenshotRender returns a hosted PNG URL, which means Claude and GPT-4o can read it with zero extra steps.

How big should the screenshot be for a vision model?

For a viewport screenshot, 1280 by 720 is the default and matches what every vision model is happiest with. For a full page screenshot of a long landing page, you can end up with images 4000 pixels tall or taller, which most models down sample anyway. The rule: send full page when the agent needs to reason about the whole document; send viewport when it only needs the above the fold view. ScreenshotRender's fullPage=true switches between the two with no other config.

Can my AI agent screenshot a page behind a login wall?

Not with a public screenshot API like ScreenshotRender, which takes a URL but not a session cookie. For login gated pages your options are: (1) generate a signed share URL on the source app and screenshot that, (2) host your own headless browser inside the same authenticated session, or (3) ask the user to paste a screenshot directly. For 90 percent of agent use cases (public marketing pages, docs, product listings, search results) a URL-only API is the right tradeoff.

Is this cheaper than running Puppeteer inside my agent?

Almost always yes. A warm Chromium uses hundreds of megabytes of RAM per concurrent screenshot, which means a cheap VPS caps out at a handful of parallel agent calls before it falls over. ScreenshotRender's Hobby plan is $10 per month billed annually for 2,000 screenshots, roughly 0.5 cents per render with zero infrastructure overhead. Once you factor in server time, font installs, and the hours spent on stealth flags, the API tier pays for itself almost immediately.

Do I need a paid plan to send screenshots to Claude or GPT-4o?

No. ScreenshotRender's free plan includes 100 screenshots per month with no credit card required, full page support, cookie banner removal, and ad blocking enabled by default. That is enough to prototype an agent end to end before deciding if the workflow is worth paying for. If your target sites are Cloudflare protected, Stealth Mode starts on the Hobby plan.

The honest verdict: if your agent reads webpages, give it screenshots. HTML is the wrong shape, raw screenshots from Puppeteer cost more than they save, and a hosted URL plugs into Claude and GPT-4o with no glue code. The time saved on Chromium maintenance and the tokens saved on HTML bloat usually pay for the API tier inside the first week of an agent actually running in production.