B2B Prospecting

Data Scraping for Lead Generation: Tools, Methods, and Ethics (2026)

15 min read

Mitchell Keller

Founder & CEO, LeadGrow · Managed 3,626+ cold email campaigns. 6.74% average reply rate. Booked 2,230+ meetings in 2025.

TL;DR

  • **Data quality determines campaign quality.** The best cold email copy in the world doesn't matter if you're emailing the wrong people at the wrong addresses. We process 5M+ leads per month. The scraping and enrichment layer is where campaigns are won or lost.
  • **Use a waterfall scraping approach.** Spider ($0.0004/page) for most sites, Spider Unblocker ($0.003/page) for anti-bot protected sites, Firecrawl ($0.00067/page) for heavily protected sites. This costs 60% less than using one tool for everything.
  • **Ethics matter.** Scrape public data. Respect robots.txt. Never scrape personal information from platforms that explicitly prohibit it. The legal line is clear: publicly available business information is fair game. Private data behind logins is not.


Why data scraping matters for outbound

Every cold email campaign starts with a list. The quality of that list determines everything downstream. Send to the wrong people and you get silence. Send to the right people with wrong email addresses and you get bounces. Send to the right people with verified emails and the right situation signal, and you get replies. The full B2B list building process depends on the quality of this data layer.

We process 5M+ leads per month through our pipeline. The scraping and enrichment layer is where we spend the most engineering time. Not on copy. Not on sequencing. On data.

That's because data platforms like Apollo, ZoomInfo, and Cognism have a fundamental problem: their data is stale. Apollo updates on a 3 to 6 month cycle. ZoomInfo claims real-time, but job titles are often months behind. By the time you pull a list from a database, 20% to 30% of the contacts have changed roles, changed companies, or changed email addresses.

Scraping lets you build lists from the source. Company websites, job boards, Google Maps, industry directories, conference attendee lists, government filings. The data is current because you're pulling it in real time.

Key Statistic: Scraped and enriched lists in our pipeline average 4.2% bounce rate compared to 8.7% for lists pulled directly from data platforms without enrichment.


Source: LeadGrow internal data, 5M+ leads processed, January 2026

What's legal and what's not

Before we get into tools and methods, let's be clear about what's legal and what's not. This isn't a gray area. The rules are well established.

What you can scrape

    • Publicly available business information. Company websites, "About Us" pages, team pages, press releases. If someone published it on the public internet, it's public.
    • Government filings and public records. SEC filings, patent databases, business registration databases. These are public by law.
    • Job postings. Posted publicly on company websites, Indeed, LinkedIn Jobs. The posting itself is public content.
    • Conference and event attendee lists. Published speaker lists, sponsor pages, attendee directories that are publicly accessible.
    • Google Maps business listings. Publicly listed businesses with contact information they chose to publish.
    • Industry directories. Professional associations, trade group member lists, industry databases.

What you should not scrape

    • Data behind login walls. If you need to authenticate to access it, don't scrape it. That includes LinkedIn profiles (beyond what's publicly visible), private databases, and gated content.
    • Personal data on platforms that prohibit scraping. LinkedIn's terms of service explicitly prohibit scraping. While the hiQ Labs v. LinkedIn case established some legal precedent for public data, the safest approach is to use LinkedIn's own tools (Sales Navigator) for LinkedIn data.
    • Healthcare, financial, or other regulated data. HIPAA, GLBA, and other regulations have strict rules. Don't touch this data without legal counsel.
    • Data from sites that explicitly block scraping in robots.txt. Respect it. There are plenty of other sources.
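The robots.txt rule above can be enforced in code before any fetch. Python's standard library ships a parser for exactly this. A minimal sketch (the sample policy and the `LeadBot` user agent are illustrative; a real crawler would load the target site's live robots.txt with `set_url()` + `read()`):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt parsed inline so the sketch is self-contained.
# In production, fetch the live file: parser.set_url(...); parser.read()
sample_policy = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_policy)

def allowed_to_scrape(url: str, user_agent: str = "LeadBot") -> bool:
    """Return True only if robots.txt permits fetching this URL."""
    return parser.can_fetch(user_agent, url)

print(allowed_to_scrape("https://example.com/team"))       # public page: allowed
print(allowed_to_scrape("https://example.com/private/x"))  # disallowed path
```

Running this check before every fetch costs nothing and keeps the "respect robots.txt" rule from depending on someone remembering it.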

The practical rule

If a business published the information publicly to attract customers or partners, it's fair game for B2B outreach. If the information was shared in a private context or behind access controls, leave it alone.

We've never had a legal issue with our scraping practices. The framework is simple: public business data is fine. Everything else, don't touch it.

The scraping tool stack

We use a waterfall approach for scraping. Start with the cheapest tool. If it fails, escalate to the next one. This saves 60% compared to using one tool for everything.

Tier 1: Spider ($0.0004/page)

Spider is our default scraper. It handles 80% of sites with no issues. Fast, cheap, and reliable for standard web pages without heavy anti-bot protection.

Best for: Company websites, blog posts, simple directories, job boards, news sites, documentation pages.

Optimal parameters:

    • Timeout: 120 seconds (gives enough time for slow-loading pages without wasting resources)
    • readability: true (strips navigation, ads, and boilerplate so you get clean content)
    • Block images and stylesheets (cuts page load time by 40% to 60%)
    • Return format: markdown (easier to parse downstream than raw HTML)
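The parameters above translate into a request payload along these lines. Note the field names here are illustrative, not Spider's exact API schema:

```python
# Hypothetical Spider request config mirroring the parameters above.
# Field names are assumptions, not the vendor's documented schema.
spider_params = {
    "url": "https://example.com/team",
    "request_timeout": 120,                      # seconds; slow pages get time to load
    "readability": True,                         # strip nav, ads, and boilerplate
    "block_resources": ["image", "stylesheet"],  # cuts load time 40% to 60%
    "return_format": "markdown",                 # easier to parse than raw HTML
}
print(spider_params["return_format"])
```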

Limitations: Fails on sites with JavaScript-heavy rendering, CAPTCHAs, or aggressive bot detection. When Spider fails, it fails fast. You'll know within 10 seconds if the site needs a different approach.

Tier 2: Spider Unblocker ($0.003/page)

Spider Unblocker is Spider with anti-bot bypass. It handles JavaScript rendering, basic CAPTCHAs, and most bot-detection systems. Costs 7.5x more than base Spider, so you only use it when Tier 1 fails.

Best for: Sites with moderate anti-bot protection. E-commerce sites, larger SaaS company websites, some directories that use Cloudflare or similar protection.

When to use: When base Spider returns empty results, error codes, or CAPTCHA pages. The waterfall handles this automatically. Spider fails, Unblocker picks up.

Tier 3: Firecrawl ($0.00067/page)

Firecrawl is the heavy hitter. It handles the most protected sites with full browser rendering and advanced anti-bot techniques. It costs more per page than base Spider but less than Unblocker.

Best for: Heavily protected sites, sites that require full browser rendering, and specific platforms that block traditional scrapers.

Protected sites to send straight to Firecrawl: Crunchbase, ZoomInfo (public pages), PitchBook, some government databases with anti-bot protection. Don't waste Spider attempts on these. They'll fail 100% of the time. Go straight to Firecrawl.
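The three tiers plus the straight-to-Firecrawl shortlist reduce to a short routing loop. This is a sketch with stubbed fetchers, not a vendor integration; real code would call each tool's API and define "failure" per its error responses:

```python
# Tiered waterfall sketch. Fetchers are stubs; a real pipeline would call
# each vendor's API. An empty result stands in for "this tier failed."
ALWAYS_FIRECRAWL = {"crunchbase.com", "zoominfo.com", "pitchbook.com"}

def scrape(url: str, fetchers: dict) -> tuple[str, str]:
    """Try tiers cheapest-first; return (tier_used, content)."""
    domain = url.split("/")[2]
    # Known-protected domains skip straight to the top tier.
    tiers = (["firecrawl"] if domain in ALWAYS_FIRECRAWL
             else ["spider", "unblocker", "firecrawl"])
    for tier in tiers:
        content = fetchers[tier](url)
        if content:
            return tier, content
    return "failed", ""

# Stub fetchers: base Spider "fails" on one hypothetical protected site.
fetchers = {
    "spider":    lambda u: "" if "protected" in u else "page text",
    "unblocker": lambda u: "page text",
    "firecrawl": lambda u: "page text",
}
print(scrape("https://acme.com/about", fetchers))             # handled by spider
print(scrape("https://protected.example/x", fetchers))        # escalates to unblocker
print(scrape("https://crunchbase.com/org/acme", fetchers))    # straight to firecrawl
```

The key property: the expensive tiers only ever see traffic the cheap tier already failed on, which is what keeps the blended cost near the base Spider price.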

Cost comparison

| Tool | Cost per Page | 10,000 Pages | 100,000 Pages | Best For |
| --- | --- | --- | --- | --- |
| Spider | $0.0004 | $4.00 | $40.00 | Standard sites (80% of scraping) |
| Spider Unblocker | $0.003 | $30.00 | $300.00 | Anti-bot protected sites |
| Firecrawl | $0.00067 | $6.70 | $67.00 | Heavily protected sites |
| Waterfall (blended) | ~$0.0008 | ~$8.00 | ~$80.00 | All sites, optimized cost |

The waterfall approach averages about $0.0008 per page because 80% of pages go through base Spider. If you used Unblocker for everything, you'd spend 275% more.
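The blended figure is easy to reproduce. The 80% Spider share comes from the text; the exact split of the remaining 20% between Unblocker and Firecrawl is an illustrative assumption:

```python
# Back-of-envelope blended cost. The 80% Spider share is stated above;
# the 16%/4% Unblocker/Firecrawl split is an illustrative guess.
costs = {"spider": 0.0004, "unblocker": 0.003, "firecrawl": 0.00067}
mix   = {"spider": 0.80,   "unblocker": 0.16,  "firecrawl": 0.04}

blended = sum(costs[t] * mix[t] for t in costs)
print(f"${blended:.4f}/page")                      # $0.0008/page
print(f"${blended * 100_000:.0f} per 100K pages")  # roughly the ~$80 above
```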

Key Statistic: Our waterfall scraping approach processes 100,000 pages for approximately $80, compared to $300 if we used Spider Unblocker for everything.


Source: LeadGrow cost analysis, Q1 2026

Scraping methods by data source

Google Maps scraping for local businesses

Google Maps is one of the richest data sources for local B2B lead generation. Every business listing includes name, address, phone number, website, hours, reviews, and sometimes the owner's name.

How we scrape Google Maps:

    • Define search queries by geography and category. "Dental practices in Austin, TX" or "Manufacturing companies in Ohio." Be specific. Broad queries return noisy results.
    • Extract listing data. Business name, address, phone, website URL, review count, rating, categories. This gives you the company-level data.
    • Scrape company websites from the listings. Follow the website URL from each listing and scrape the team page, about page, and contact page. This gets you individual names, titles, and sometimes email addresses.
    • Enrich with email finding. Use the company domain and person name to find verified email addresses through enrichment tools (Hunter, Apollo, or our AI Arc pipeline).

Google Maps scraping works well for industries where businesses are listed locally: dental practices, law firms, manufacturing, construction, restaurants (for foodtech companies), real estate, and healthcare facilities.

Volume expectation: a well-structured Google Maps scrape for a major metro area returns 500 to 2,000 relevant businesses per category. Across 10 metro areas, that's 5,000 to 20,000 businesses. More than enough for a targeted cold email campaign.
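Step 1 above (specific category-by-geography queries) is just a cross product. A minimal sketch, with example categories and metros rather than a recommended target list:

```python
# Step 1 of the Google Maps workflow: build specific category x metro
# queries. Categories and metros here are examples only.
categories = ["dental practice", "law firm", "manufacturing company"]
metros = ["Austin, TX", "Columbus, OH", "Denver, CO"]

queries = [f"{cat} in {metro}" for cat in categories for metro in metros]
print(len(queries))  # 9 queries
print(queries[0])    # 'dental practice in Austin, TX'
```

At 500 to 2,000 businesses per query, even this small 3x3 matrix yields thousands of candidate companies.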

Company website scraping

When you need to build a list of people at specific companies, scraping their websites directly is often more accurate than using a database.

Target pages:

    • Team/About pages: Names, titles, sometimes email and LinkedIn
    • Leadership pages: C-suite and VP-level contacts
    • Contact pages: Direct email addresses, phone numbers
    • Blog author pages: Content creators and thought leaders
    • Career pages: Current openings (situation signals for hiring companies)

We process these pages through Claude Code to extract structured data from unstructured HTML. The model identifies names, titles, and contact information with 95%+ accuracy on well-structured pages.
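For the cleanest team pages, a simple heuristic can pre-filter before (or instead of) LLM extraction. This sketch assumes a name-in-heading, title-in-paragraph layout; it only illustrates the target output shape, and messier markup is exactly where the model-based extraction earns its keep:

```python
import re

# Heuristic for well-structured team pages: pair an <h3> name with the
# <p> title that follows it. Sample HTML is illustrative.
sample_html = """
<div class="team"><h3>Jane Doe</h3><p>VP of Sales</p>
<h3>John Smith</h3><p>Head of Engineering</p></div>
"""

pattern = re.compile(r"<h3>([^<]+)</h3>\s*<p>([^<]+)</p>")
team = [{"name": n, "title": t} for n, t in pattern.findall(sample_html)]
print(team)
```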

Job board scraping (situation signals)

Job postings are gold for outbound. A company hiring an SDR is a company that needs pipeline. A company hiring a VP of Marketing is a company about to invest in growth. These are buying situations, not just data points.

What to scrape from job boards:

    • Company name and size (from the job listing)
    • Role being hired (the situation signal)
    • Requirements section (reveals their tech stack, budget signals, and priorities)
    • Posting date (recency matters for situation-based outreach)

Sources: company career pages (most reliable), Indeed, LinkedIn Jobs (use the API, don't scrape the site), Greenhouse/Lever public boards, AngelList/Wellfound for startups.
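Greenhouse and Lever public boards return structured JSON, which makes them the easiest source to turn into the four fields above. A sketch against a Greenhouse-style payload (the payload below is a stub, and the `company_name` field is an assumption; a live pull would GET the board's public jobs endpoint and feed the JSON to the same function):

```python
from datetime import datetime, timezone

# Stub of a Greenhouse-style public board payload. Field names are
# assumptions for illustration, not the exact vendor schema.
payload = {
    "jobs": [
        {"title": "Sales Development Representative",
         "updated_at": "2026-01-10T00:00:00Z",
         "location": {"name": "Austin, TX"},
         "company_name": "Acme"},
    ]
}

def extract_signals(payload: dict) -> list[dict]:
    """Reduce a job board payload to the outreach-relevant fields."""
    signals = []
    for job in payload["jobs"]:
        posted = datetime.fromisoformat(job["updated_at"].replace("Z", "+00:00"))
        signals.append({
            "company": job.get("company_name", ""),
            "role": job["title"],   # the situation signal itself
            "location": job["location"]["name"],
            "days_old": (datetime.now(timezone.utc) - posted).days,
        })
    return signals

print(extract_signals(payload))
```

The `days_old` field is what makes the recency filter in step 4 trivial: drop anything older than your outreach window before the list ever reaches a campaign.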

Public records and filings

Government databases are underused for lead generation. They're public, they're accurate (legally required to be), and they're updated regularly.

    • SEC EDGAR: Public company filings. 10-K reports reveal strategic priorities, technology spending, and vendor relationships. When a company mentions a pain point in their annual report, that's a situation signal.
    • SBA databases: Small business data, government contract awards. If a company just won a federal contract, they're growing.
    • State business registrations: New business filings, address changes, officer changes. Useful for detecting recent activity.
    • Patent databases: Companies filing patents in your space are investing in innovation. They need tools and services to support that innovation.

Conference and event attendee lists

Conferences publish speaker lists, sponsor directories, and sometimes full attendee lists. These are high-intent prospects because they've already shown interest in the topic by attending.

Scraping approach:

    • Find event websites for conferences in your target industry
    • Scrape speaker pages, sponsor pages, and exhibitor directories
    • Extract company names and individual names
    • Cross-reference with LinkedIn or enrichment tools for contact details
    • Time your outreach around the event (before for pre-booking, after for follow-up)

One of our clients (data center company) used this approach to pre-book 48 meetings from a single industry event. They scraped the attendee list 3 weeks before the conference, ran a targeted cold email campaign, and showed up with a full calendar. That's the power of combining scraping with situation-based outreach.

Data cleaning and deduplication

Raw scraped data is messy. Names are inconsistent. Email formats vary. Duplicates are everywhere. Cleaning is where most teams cut corners, and it costs them in deliverability and reply rates.

Our cleaning pipeline

    • Deduplication: Match on email address first, then company domain + name combination. Remove exact and fuzzy duplicates. We typically find 15% to 25% duplicate rates in raw scraped data.
    • Email verification: Waterfall verification through NeverBounce first, escalating inconclusive results to ZeroBounce, then MillionVerifier. Any email that fails verification or stays unverifiable gets flagged. Target: under 2% bounce rate.
    • Title normalization: "VP Sales," "Vice President of Sales," "VP, Sales," and "Vice Pres. Sales" are the same person. Normalize to standard titles so your targeting filters work.
    • Company name normalization: "Acme Inc," "Acme, Inc.," "ACME," and "Acme Corporation" need to match. This prevents sending 4 emails to the same company.
    • Data enrichment: Fill in missing fields. If you have the email, find the title. If you have the name and company, find the email. AI Arc does this at 100x cheaper than Apollo for bulk processing.
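Steps 1 and 3 condense to a few lines. This is a sketch, not the production pipeline: the title map is a tiny example (real normalization needs a much larger mapping or fuzzy matching), and the dedup key falls back from email to domain + name exactly as described above:

```python
# Condensed sketch of dedup (step 1) and title normalization (step 3).
# TITLE_MAP is a toy example; real pipelines need far broader coverage.
TITLE_MAP = {
    "vp sales": "VP of Sales",
    "vp, sales": "VP of Sales",
    "vice president of sales": "VP of Sales",
    "vice pres. sales": "VP of Sales",
}

def normalize_title(title: str) -> str:
    return TITLE_MAP.get(title.lower().strip(), title)

def dedupe(rows: list[dict]) -> list[dict]:
    seen, out = set(), []
    for r in rows:
        # Email is the strongest key; fall back to domain + lowercased name.
        key = (r["email"].lower() if r.get("email")
               else (r["domain"], r["name"].lower()))
        if key not in seen:
            seen.add(key)
            r["title"] = normalize_title(r["title"])
            out.append(r)
    return out

rows = [
    {"name": "Jane Doe", "email": "jane@acme.com", "domain": "acme.com", "title": "VP Sales"},
    {"name": "jane doe", "email": "JANE@acme.com", "domain": "acme.com", "title": "VP, Sales"},
    {"name": "Bo Li", "email": "", "domain": "acme.com", "title": "Vice President of Sales"},
]
clean = dedupe(rows)
print(len(clean), clean[0]["title"])  # 2 rows survive, both titled 'VP of Sales'
```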

We run this entire pipeline through Claude Code. 272K rows per second. A list of 100,000 contacts gets cleaned, deduplicated, and enriched in under 10 minutes. Details on the Claude Code setup in our Claude Code guide.

Building custom scrapers vs. using platforms

The question every outbound team faces: do you build custom scrapers or use existing platforms?

| Factor | Custom Scrapers | Platforms (Apollo, ZoomInfo, etc.) |
| --- | --- | --- |
| Data freshness | Real-time (scraped on demand) | 3 to 6 months stale |
| Cost at scale | $0.0004 to $0.003 per page | $0.01 to $0.05 per contact |
| Setup time | 2 to 4 weeks | Minutes |
| Maintenance | Ongoing (sites change) | None (platform handles it) |
| Customization | Total control | Limited to platform fields |
| Niche data | Can scrape any public source | Limited to platform's database |

Our recommendation: use both. Platforms for quick initial lists and enrichment. Custom scrapers for niche data, situation signals, and fresh data that platforms don't have.

The companies that build their own scraping infrastructure have a compounding advantage. Every month, their data gets better, their targeting gets tighter, and their cost per lead goes down. The companies that rely entirely on platforms are paying the same price for the same stale data as everyone else.


Want us to run this playbook for you?

Book a strategy call and we'll show you how these frameworks apply to your business.

Book Strategy Call