Why In-House Web Scraping Falls Apart After 20 Sites

Last Updated: January 27, 2026

At 10 sites, scraper maintenance is a task. At 20, it starts becoming a job. At 50, it's half an engineer's year. Here's where the math breaks — and how to know when in-house stops making sense.

You built in-house scraping. It works. Your engineer proved the value, leadership loves the data, and now they want more — 30 sites, maybe 50, eventually "all our major competitors." The natural assumption: expanding to 50 sites means 5× more of the same work. That assumption is wrong.

At 10 sites, scraper maintenance is a task. At 20 sites, the warning signs start. At 50 sites, it's consuming half an engineer's year. The relationship between sites and maintenance isn't linear. It's compounding — driven by four forces that don't exist at small scale but activate together as you grow.

This article breaks down exactly where the workload turns, what the numbers look like at each level, and how to know when in-house stops making economic sense.

1–2% — weekly scraper break rate, even in mature systems
7–11× — maintenance multiplier from 10 to 50 sites
2,500+ — scraping jobs we operate daily

Your 10-site success doesn't predict your 50-site reality

Let's start with what you already know, because you're right about it: in-house scraping works at small scale. A team monitoring 10 competitor sites, collecting prices weekly, typically runs about 30–80 scraping jobs (each site may need separate jobs for product pages, search results, variant tracking).

At that volume, you'll see roughly one break every week or two — maybe 15–30 per year. Total maintenance including break-fix, spot-checking, and infrastructure: somewhere around 60–120 hours per year.

That's 3% of one engineer's time. It barely registers.

DIY genuinely makes sense here. At 10 sites with weekly collection, the economics favor in-house. The maintenance is real but minor. If this is your scale and things are working, keep doing what you're doing — and bookmark this article for when leadership asks for 30 more.

The exception is when high SKU volume or brittle sites push the work up even at small site counts. Our customer Landmark Group, a furniture retailer in the Middle East, had roughly 6 competitor sites — but tens of thousands of products across them. Their retail analyst was spending 6 hours per week on maintenance and still only achieving 60–70% data coverage.

Landmark's workload wasn't driven by site count. It was driven by SKU volume and data complexity. Small site count doesn't guarantee low maintenance if the data is brittle.

But for most teams at 10 simple sites, the experience is genuinely smooth. And that's the danger. When it works, your brain builds a model: sites and effort scale together in a straight line. Ten sites costs a couple of hours a week, so fifty sites should cost about ten.

Four forces conspire to make that model wrong.

[Chart: Sites vs. Annual Maintenance Hours — what actually happens (120 hrs at 10 sites, 350 at 20, 1,075 at 50) vs. what teams expect from linear scaling (~600 hrs at 50 sites), a roughly 2× gap.]
The dashed line is what most teams budget for. The solid line is what we see in takeover audits. The gap at 50 sites is where expansion plans break.
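The gap between the two lines can be reproduced from the article's own figures. A minimal sketch in Python (the hour values are the audit midpoints quoted in this article; the linear model is simply the 10-site baseline scaled by site count):

```python
# Linear expectation vs. observed maintenance hours (midpoints from this article).
observed = {10: 120, 20: 350, 50: 1075}  # hrs/year from takeover audits
baseline_sites, baseline_hours = 10, observed[10]

def linear_estimate(sites: int) -> float:
    """What a 'sites and effort scale in a straight line' model predicts."""
    return baseline_hours * sites / baseline_sites

for sites, actual in observed.items():
    est = linear_estimate(sites)
    print(f"{sites} sites: linear {est:.0f} hrs vs actual {actual} hrs "
          f"(gap {actual / est:.1f}x)")
```

At 50 sites the linear model predicts 600 hours against an observed 1,075 — the roughly 2× gap the chart shows.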

The four forces that break linear scaling

Individual scrapers break at predictable rates — roughly 1–2% of active jobs per week — and each fix runs through a multi-step process most teams don't realize they're running. We manage 2,500+ scraping jobs daily, so we see these rates at scale. What changes as you grow isn't the per-incident reality — it's how incidents interact with each other and with everything else your team does.

Force 1: Interaction effects. At 10 sites, breaks are isolated — fix one, move on. At 50, they overlap. You're diagnosing Site A when Site B goes down. You fix B, then realize Site C's data has been wrong since Tuesday — but you didn't notice because you were buried in A and B.

Each break increases the cost of every other break happening that week by fragmenting attention.

Force 2: Coordination overhead. At 10 sites, one person knows everything. At 20+, knowledge distributes. The engineer who handles most scrapers is out sick — nobody knows why the Zalando scraper uses a different proxy configuration. Status meetings appear. Slack threads multiply.

"Who's handling the ASOS issue?" becomes a daily question. At 50 sites, that's 2–3 hours per week of people talking about scraper problems instead of fixing them.

Those first two forces are already enough to push past linear. But what happens to the infrastructure underneath?

Force 3: Infrastructure complexity. At 10 sites, one proxy provider and a single scheduling setup. At 50, you need multiple proxy providers (no single pool works against every anti-bot system), dedicated browser sessions for dynamically-loaded sites, scheduling that spaces out requests so targets don't detect patterns, and monitoring for all of it.

An hour per week becomes 4–5 — and it's the kind of work that generates urgent interruptions, not planned tasks.

Force 4: Firefighting overlap. At 10 sites, Black Friday means keeping 10 scrapers alive under pressure — stressful, but one person can manage. At 50 sites, all scrapers run at peak frequency while the sites they target simultaneously deploy extra anti-bot measures because they're under traffic pressure too. Critical periods stop being "all hands" events and become triage — you're choosing which data to lose.

None of these four forces operate at 10 sites. All four activate somewhere in the 15–30 range — 20 is where most teams first feel the shift. By 50, they're fully compounding.

So what does this actually cost in hours? Here's the math when you account for all four forces — not just break-fix, but QA, spot-checking, silent-failure investigations, infrastructure management, and coordination. The full breakdown across all seven maintenance categories is documented here.

| Sites | Scraping Jobs | Annual Maint. Hours | % of Engineer FTE | What It Feels Like |
|---|---|---|---|---|
| 10 | 30–80 | 60–120 | 3% | A task. Barely registers. |
| 20 | 60–160 | 250–450 | 12–22% | "Just a quick fix" 2–3×/week. Other projects slip. |
| 50 | 150–400 | 850–1,300 | 40–63% | Half an engineer's year. $65–150K in labor. |
| 150 | 450–1,200 | 2,500–5,000+ | 1.2–2.5 FTE | You're staffing a scraping team, not maintaining a tool. |
Based on maintenance patterns across customer takeovers (2023–2026). Ranges reflect site complexity, anti-bot protection, collection frequency, and team distribution.

Look at the jump from 10 to 50 sites. Not 5×. Seven to eleven times more total maintenance hours.

The gap is coordination, validation, and infrastructure — the work that compounds. At 150 sites, the multiplier reaches 20–40×.

The linear model your team is using to plan the expansion is wrong by a factor of two to four.

The expansion math is broken. If maintenance scaled linearly, 50 sites would require about 300–600 hours per year. The actual range is 850–1,300. The expansion budget your team is planning is based on one person's slice of the work, not the total across everyone who touches the data.
Get your actual number

Send us your current site list and collection requirements. Within 48 hours, we'll scope what your operation actually costs — in-house vs. managed — so you can compare real numbers before committing to an expansion.

Get a Scaling Assessment
No commitment. If your current approach is working, we'll tell you that.

The 20-site inflection: where the warning signs appear

The title says "after 20 sites" deliberately. Not because 20 is a magic number — but because it's the threshold where the four forces first become visible. Below 20, maintenance is absorbed into existing work. Above 20, it starts becoming the work.

This is a qualitative shift, not just a quantitative one. At 10 sites, your engineer fixes a broken scraper and goes back to their real project. At 25 sites, they fix one scraper, discover another broke while they were fixing the first, get a Slack message about suspicious data on a third site, and realize they haven't started their planned work for the day.

The context-switching alone erodes productivity on everything they touch.

Look at the table above: 20 sites means 250–450 maintenance hours per year. That's a full day every week that nobody planned for and nobody's tracking. The projects that slip aren't scrapers — they're the product features, the analytics work, the strategic initiatives your engineer was actually hired to deliver. At 12–22% of an FTE, it doesn't show up in any resource plan. But it shows up in every sprint that runs long and every deadline that slides.

That's what "falls apart" looks like at 20. Not a catastrophic failure — a slow, invisible reallocation of your best people toward work that shouldn't be theirs.

Our customer WiTailor, an eCommerce agency, lived this trajectory. Their business analysts started by writing Python scripts and managing proxies to collect marketplace data for brand clients — beginning with a single website. As they scaled past their first few clients, reaching dozens of sites across brands and countries, the maintenance burden consumed analyst time that should have gone to client insights. By the time they needed 100+ sites across multiple markets, building in-house was no longer an option — they'd been our customer for five years because the alternative was consuming their entire analytics team.

That trajectory — manageable at single digits, creeping at 20, overwhelming past 50 — is what virtually every in-house team we've worked with has experienced. Nobody decides to build a maintenance operation. It grows one incident at a time.

Here's why it stays invisible: the engineer sees 4 hours of break-fix, the analyst sees 2 hours of data validation, the product manager sees 1 hour of "checking if the numbers look right." Nobody adds it up.

If your team estimates 10 hours a month on scraper maintenance, the real number is probably 40–60. The gap is invisible because the hours are scattered across roles nobody is aggregating. Based on the operations we audit before onboarding, the underestimate runs 4–6× consistently.

That gap — between what teams budget and what they actually spend — is where the verification tax compounds fastest.

The 20-site mark is where this distributed maintenance first forces multiple people to touch the problem in the same week, and coordination overhead appears for the first time. By 50 sites, it's undeniable. But by then, you've already committed engineering resources, built technical debt, and concentrated knowledge in one person's head. The transition is harder the longer you wait.

Score your operation: the 17-indicator assessment

Before planning your next expansion — or deciding whether your current approach is sustainable — honestly assess where you stand. Two minutes.

Self-Assessment — 2 Minutes
Where Does Your Operation Stand?
Complexity Indicators
We're scraping 30+ sites
We collect daily or more frequently
Some of our sites have anti-bot protection
Some sites need browser simulation to load properly
We have regional or language variations
Product matching is a challenge
Strain Indicators
Block rates are increasing over time
We've had to "de-prioritize" sites we wanted
Maintenance takes more than 40% of engineer time
We don't have documentation someone new could follow
QA happens manually or not at all
We've been surprised by cost spikes
Risk Indicators
One person has most of the knowledge
We've never calculated true cost per record
We don't have monitoring for data quality
Business decisions are waiting on data coverage gaps
We don't have a plan for 2× scale
0–3 checks: Your current approach is likely sustainable. Keep going.
4–8 checks: You're approaching breaking points. The expansion you're planning will hit the compounding forces above. Decision time.
9+ checks: Fundamental change needed. The longer you wait, the more technical debt accumulates and the harder the transition becomes.
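The scoring bands translate into a tiny helper if you want to script the assessment; a sketch, with band labels shortened from the descriptions above:

```python
def assessment_band(checks: int) -> str:
    """Map a 0-17 check count to the article's three bands."""
    if not 0 <= checks <= 17:
        raise ValueError("score must be between 0 and 17")
    if checks <= 3:
        return "sustainable"    # 0-3: current approach likely fine
    if checks <= 8:
        return "decision time"  # 4-8: approaching breaking points
    return "change needed"      # 9+: fundamental change needed

print(assessment_band(6))  # → decision time
```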

Whatever your score, you now know where you stand. That clarity alone changes the conversation from "maybe we have a problem" to "here's exactly what we need to decide."

What scaling looks like when maintenance isn't your problem

The four forces above don't disappear with a managed service. Scrapers still break. Sites still deploy anti-bot. Coordination and QA are still required. The difference is whose team absorbs it — yours, or a team that does nothing else.

For the teams where the inflection above applies — here's what the alternative looks like. These are three real expansions where the maintenance burden didn't grow with the customer's site count.

Our customer Portwest, a global safety brand, came to us limited to 15 sites with a previous provider delivering a 60% success rate. Today they monitor 400 sites — across Amazon in 15 countries, eBay, Walmart, Google Shopping, and hundreds of individual retailer sites. After switching, they reached full coverage and discovered over 700 unauthorized sellers they'd never known about. Their head of eCommerce went from troubleshooting scraper issues to actual MAP enforcement work.

That's not incremental improvement. That's a different category of visibility.

One of our customers, a global luxury marketplace, started at 48 sites monitoring seller assortments across 200+ brands. Their account sales team — roughly 20 people — was collectively spending over 100 hours per week on manual data collection, and still achieving only about 10% of the coverage they needed. We expanded their monitoring to 150 sites across 10+ categories — 300,000 pages checked weekly.

Assortment visibility went from patchy to near-complete. Seller negotiations shifted from vague requests to specific, benchmarked conversations. The time that came back didn't go to maintaining scrapers — it went to using the data.

Our customer Arthur D. Little (management consultancy) needed comparable pricing data across 32 pharmacy and beauty websites in four countries — Italy, Sweden, Netherlands, Romania — for a client engagement on competitive pricing strategy. They needed it within a week. Building scrapers for 32 sites with different structures, currencies, and anti-bot systems wasn't feasible in that timeline.

We delivered a complete, analysis-ready dataset within 48 hours. Custom schema per site. Variant-level pricing with discount detection. Their analysts went straight to insight work — zero time on data plumbing.

The common thread: these teams didn't scale by getting better at maintenance. They scaled by removing maintenance from their team's workload entirely. In every case, the expansion they needed was only possible once maintenance was no longer their problem.

The decision framework: when does in-house stop making sense?

This isn't "always outsource." Plenty of operations should stay in-house. The question is whether yours is one of them — and the answer is arithmetic, not philosophy.

Here's a formula you can run in five minutes — with a worked example for a 30-site operation monitoring anti-bot-protected competitors daily:

| Step | Formula | 30 Sites Example |
|---|---|---|
| 1. Annual incidents | (Sites × jobs/site) × break rate × 52 | 30 × 8 × 2% × 52 = ~250/year |
| 2. Break-fix hours | Incidents × avg fix time (50–90 min) | ~210–375 hrs/year |
| 3. Total maintenance | Break-fix × 2.2–2.9 (multiplier based on audits) | ~470–1,070 hrs/year |
| 4. Annual cost | Total hours × loaded rate ($85/hr) | ~$40K–$91K/year |
| 5. Compare | Get a managed-service quote for the same scope | — |
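The five steps are easy to script. A sketch using the article's coefficients (2% weekly break rate, 50–90 minute fixes, 2.2–2.9 overhead multiplier, $85/hr loaded rate); `jobs_per_site` defaults to the worked example's 8, but plug in whatever your setup actually runs. Rounding differs slightly from the table above:

```python
def maintenance_estimate(sites: int, jobs_per_site: int = 8,
                         weekly_break_rate: float = 0.02,
                         fix_minutes: tuple = (50, 90),
                         overhead_multiplier: tuple = (2.2, 2.9),
                         loaded_rate: float = 85.0) -> dict:
    """Back-of-envelope annual maintenance estimate; returns low/high ranges."""
    incidents = sites * jobs_per_site * weekly_break_rate * 52
    break_fix = [incidents * m / 60 for m in fix_minutes]            # hrs/year
    # Multiplier covers QA, silent-failure hunts, infra, coordination.
    total = [h * k for h, k in zip(break_fix, overhead_multiplier)]
    return {"incidents": round(incidents),
            "break_fix_hours": [round(h) for h in break_fix],
            "total_hours": [round(h) for h in total],
            "annual_cost_usd": [round(h * loaded_rate) for h in total]}

print(maintenance_estimate(30))  # the 30-site worked example
```

Run it at your target site count, not your current one, and compare step 4 against a managed-service quote for the same scope.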

Where DIY still makes sense: 5–15 sites, weekly collection, simple sites, engineering team with spare capacity. Total maintenance under 300 hours/year. Stay in-house.

Where managed service wins: 30+ sites, daily collection, sites with anti-bot, reliability needed during critical periods. Maintenance above 500 hours/year — outsourcing wins on economics.

The gray zone (15–30 sites): Depends on site complexity, team capacity, trajectory. If you're at 20 sites planning 40, run the math at 40.

For the complete cost comparison across DIY, SaaS tools, and managed services — including the infrastructure and opportunity costs this formula doesn't capture — here's the full TCO breakdown with three scenarios.

What to do with this

Run the formula above for your operation. Plug in your actual site count, estimated jobs per site, and collection frequency. Compare the projected maintenance hours to what your team currently estimates. If there's a 3×+ gap, you're in the underestimation pattern that affects virtually every team we've audited.

If you're planning expansion, run the formula at your target site count — not your current one. The expansion math is what breaks teams, not the current maintenance load. Twenty sites might feel manageable. Fifty will not be "just more of the same." Here's what each individual break actually costs in time and downstream disruption, so you can pressure-test the per-incident numbers.

If you want to see what the alternative looks like:

Portwest Global Safety Brand · 400 sites across 30+ countries
Before: Previous provider delivering 60% success rate. Limited to 15 sites. Expansion estimated at 6+ months of engineering time.
After: 400 sites monitored. 700+ unauthorized sellers discovered. Full MAP enforcement evidence. 4-year customer.
Read the Portwest case study
Global Luxury Marketplace · 150 sites across 200+ brands
Before: 48 sites. Account team collectively spending 100+ hours/week on manual collection. ~10% data coverage achieved.
After: 150 sites, 300,000 pages/week. Assortment visibility from ~10% to 90–98%. Manual collection hours eliminated.
Request a sample to see the difference
Arthur D. Little Management Consultancy · 32 sites across 4 countries
Before: Needed pricing data across 32 pharmacy/beauty websites in 4 countries. One-week deadline. Building in-house wasn't feasible in that timeline.
After: Complete dataset delivered in 48 hours. Custom schema per site. Variant-level pricing with discount detection. Analysts went straight to insights.
Request a sample delivery
Two Ways to Get Your Number

Find out what your current setup actually costs — or see what managed data delivery looks like for your competitors.

Get a TCO Estimate
48 hours. No commitment. If your current approach is working, we'll tell you that.