Vision-Based Web Scraping into Structured JSON + Sheets
This SOP extracts product data from web pages without relying on brittle XPath/CSS selectors, using screenshots as the primary source of truth. You keep a URL list in Google Sheets, capture a full-page screenshot with ScrapingBee, then ask Gemini 1.5 Pro to read the screenshot and return strict JSON.
If a page is ambiguous (dynamic price blocks, small text, variants), you fall back to HTML extraction and re-run the same JSON schema. To control token cost, convert the HTML into a compact markdown representation before sending it to the model. This workflow is built for e-commerce scraping, but it generalizes to directories, marketplaces, and SaaS pricing pages.
Who Is This For?
What Problem Does It Solve?
Challenge
Selectors break whenever the DOM changes.
Scraping dynamic pages requires constant engineering babysitting.
Costs explode when you send full HTML to LLMs.
Solution
Use screenshots as the stable extraction surface, then only fall back to HTML when needed.
Vision extraction handles many layout changes and reduces maintenance overhead.
Convert HTML to compact markdown + enforce strict JSON outputs to reduce tokens and rework.
What You'll Achieve with This Toolkit
A resilient scraping methodology that keeps working when the DOM changes, delivering structured JSON you can trust in a spreadsheet.
Extract data without selector fragility
Vision-based extraction reads what users see, so minor DOM shifts stop breaking your pipeline.
Keep accuracy with HTML fallback
When screenshots are ambiguous, HTML provides a deterministic fallback for edge cases.
Control token spend with markdown compaction
Compact markdown reduces tokens versus raw HTML, while strict JSON output reduces cleanup.
How It Works
Step 1: Queue URLs in Google Sheets
Create a table in Google Sheets with columns like url, product_name, target_fields, and status so scraping becomes trackable and auditable.
Google Sheets queue of URLs to scrape
Chosen because its shared, row-based queue lets non-technical operators add URLs and review extracted JSON without engineering support.
Google Sheets
Smart, collaborative spreadsheets with Gemini AI power
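A minimal sketch of the queue-reading step. It assumes the worksheet was read with gspread's get_all_records(), which returns one dict per row keyed by the header row; the column names (url, status) and the "pending" status value follow the table layout described above and are otherwise assumptions.

```python
# Filter the Sheets queue for rows still awaiting scraping.
# `rows` is the list of dicts that gspread's get_all_records() returns.

def pending_urls(rows):
    """Return (row_number, url) pairs whose status column is 'pending'.

    Row numbers are 1-based and offset by 1 for the header row,
    matching how Google Sheets displays them.
    """
    return [
        (i + 2, row["url"])  # +2: 1-based index plus the header row
        for i, row in enumerate(rows)
        if row.get("status", "").strip().lower() == "pending"
    ]

rows = [
    {"url": "https://example.com/p/1", "product_name": "", "status": "pending"},
    {"url": "https://example.com/p/2", "product_name": "", "status": "done"},
]
print(pending_urls(rows))  # [(2, 'https://example.com/p/1')]
```

Keeping the filter as a pure function makes the queue logic testable without Sheets credentials; only the gspread read/write calls touch the network.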
Step 2: Capture full-page screenshots
Use ScrapingBee to capture a full-page screenshot for each URL so the model can read the page visually, including content loaded after render.
Full-page screenshot of a product page
ScrapingBee is selected for its dedicated Screenshot API that captures full-page renders, giving the vision model stable pixels even when the DOM is brittle.
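A stdlib-only sketch of the capture call. The endpoint and the render_js / screenshot_full_page parameters follow ScrapingBee's documented Screenshot API, but verify the parameter names against the current docs before relying on them:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"

def screenshot_url(api_key: str, page_url: str) -> str:
    """Build the ScrapingBee request URL for a full-page screenshot."""
    params = {
        "api_key": api_key,
        "url": page_url,
        "render_js": "true",             # wait for JS so lazy content renders
        "screenshot_full_page": "true",  # capture the whole page, not just the viewport
    }
    return SCRAPINGBEE_ENDPOINT + "?" + urlencode(params)

def capture(api_key: str, page_url: str, out_path: str) -> None:
    """Fetch the rendered screenshot and save it as PNG bytes."""
    with urlopen(screenshot_url(api_key, page_url), timeout=90) as resp:
        png = resp.read()
    with open(out_path, "wb") as f:
        f.write(png)
```

The generous timeout matters: full-page rendering with JavaScript can take tens of seconds on heavy product pages.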
Step 3: Extract structured JSON with Gemini Vision
Send the screenshot to Gemini and request a strict schema (product name, price, currency, availability, variants). Reject outputs that are not valid JSON and retry with clearer constraints.
Structured JSON extracted from a product screenshot
Chosen because its multimodal vision capability can read the rendered page layout, which makes extraction resilient when HTML structure is inconsistent.
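A sketch of the extract-validate-retry loop, assuming the google-generativeai Python SDK (the model name, upload_file helper, and response_mime_type option reflect that SDK and may differ in your version). The strict-JSON gate is the part that carries over regardless of SDK details:

```python
import json

SCHEMA_FIELDS = ("product_name", "price", "currency", "availability", "variants")

PROMPT = (
    "Read this product-page screenshot and return ONLY a JSON object with "
    "keys: product_name, price, currency, availability, variants. "
    "Use null for any field you cannot read confidently."
)

def parse_strict(text: str) -> dict:
    """Reject any model output that is not a JSON object with every schema key."""
    data = json.loads(text)  # raises ValueError on non-JSON output
    missing = [k for k in SCHEMA_FIELDS if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

def extract(image_path: str, api_key: str, retries: int = 2) -> dict:
    # Imported here so the validator above stays usable without the SDK installed.
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-pro")
    image = genai.upload_file(image_path)
    for attempt in range(retries + 1):
        resp = model.generate_content(
            [PROMPT, image],
            generation_config={"response_mime_type": "application/json"},
        )
        try:
            return parse_strict(resp.text)
        except ValueError:
            if attempt == retries:
                raise  # surface the failure so the row can be marked for HTML fallback
```

Requesting application/json output nudges the model toward parseable responses, but the validator still runs on every reply; schema enforcement should never rest on the prompt alone.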
Step 4: Fallback to HTML when screenshots are ambiguous
If the model cannot confidently extract a field from pixels, fetch the HTML, convert it to compact markdown to reduce token cost, then re-run the same schema extraction.
HTML-to-markdown compact view used for extraction
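One way to sketch the compaction step with only the standard library (libraries like html2text or markdownify do this more thoroughly): drop script/style entirely, keep visible text, and mark headings so the model retains some page structure at a fraction of the raw-HTML token count.

```python
from html.parser import HTMLParser

class CompactMarkdown(HTMLParser):
    """Tiny HTML -> markdown-ish compactor: keeps visible text,
    turns h1-h6 into # headings, and skips script/style entirely."""
    SKIP = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in self.BLOCK:
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.out.append(data.strip() + " ")

def html_to_markdown(html: str) -> str:
    p = CompactMarkdown()
    p.feed(html)
    lines = (" ".join(l.split()) for l in "".join(p.out).splitlines())
    return "\n".join(l for l in lines if l)
```

For example, a page like `<h1>Widget</h1><p>Price: $9.99</p>` wrapped in boilerplate scripts compacts to two short lines, which is exactly what the fallback extraction prompt needs.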
Step 5: Write results back to Google Sheets
Append the extracted JSON fields to Google Sheets and keep a run timestamp so changes in price and availability are easy to track over time.
Sheets table with extracted product fields and timestamps
Chosen because appending rows creates an audit trail, and the table format makes price monitoring and QA simple without additional BI tooling.
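A sketch of the write-back step. The column order and the use of gspread's append_row are assumptions to match the queue layout above; the row-building function is kept pure so it can be tested without Sheets credentials.

```python
from datetime import datetime, timezone

FIELDS = ("product_name", "price", "currency", "availability")

def result_row(url: str, data: dict, now=None) -> list:
    """Flatten one extraction result into a Sheets row.

    Appending (never overwriting) rows creates the audit trail:
    each run adds a timestamped snapshot of price/availability.
    """
    ts = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    return [url] + [data.get(f, "") for f in FIELDS] + [ts]

def write_back(worksheet, url: str, data: dict) -> None:
    # `worksheet` is a gspread Worksheet; RAW stops Sheets from
    # reinterpreting values like "9.99" or ISO timestamps.
    worksheet.append_row(result_row(url, data), value_input_option="RAW")
```

Passing a fixed `now` in tests keeps the timestamp deterministic; in production the default UTC clock is used.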
Similar Workflows
Looking for different tools? Explore these alternative workflows.
This workflow fully automates the creation and social media distribution of AI-generated news videos. Combine GPT-4o for caption writing, HeyGen for avatar video generation, and Postiz for unified publishing to Instagram, Facebook, and YouTube.
Turn one campaign brief into platform-optimized posts using GPT-4o and Gemini, run double approvals via Gmail, then schedule publishing with Buffer and send status updates to Telegram.
Solo AI Media Factory is a comprehensive Content Creation workflow designed to transform creative ideas into 4K photorealistic videos in hours. By integrating GPT-4o, Sora, and ElevenLabs, this toolkit helps revenue teams automate storytelling and replace expensive film crews with automated AI loops. Ideal for Solopreneurs looking to scale cinematic output.
Frequently Asked Questions
Can I run this workflow manually before automating it?
Yes. The manual method is: capture a screenshot, ask Gemini for strict JSON, then paste the results into Google Sheets. Automation only scales the queue.
Why use screenshots instead of CSS/XPath selectors?
Because screenshots stay stable when the DOM structure changes. Selectors are precise but break often; vision extraction reads the rendered truth and reduces maintenance.
What are the drawbacks of vision-based extraction?
Cost and ambiguity. Images can increase processing cost, and small text or overlays may reduce certainty, which is why the HTML fallback and strict schemas are important.
How do I keep token costs under control?
Start with screenshots only for pages where selectors fail, compress HTML to markdown when using the text fallback, and enforce strict JSON to avoid repeated retries.
Is this kind of scraping legal?
It depends. You must comply with local laws, the site's terms of service, and robots guidance, especially for e-commerce data. The template explicitly warns you to check legal regulations.
When should I skip vision extraction?
If the DOM is stable, use classic HTML parsing with selectors and add Gemini only for edge-case cleanup. Vision-based extraction is best when selector maintenance is your bottleneck.