At 10 sites, scraper maintenance is a task. At 20, it starts becoming a job. At 50, it's half an engineer's year. Here's where the math breaks — and how to know when in-house stops making sense.
You built in-house scraping. It works. Your engineer proved the value, leadership loves the data, and now they want more — 30 sites, maybe 50, eventually "all our major competitors." The natural assumption: expanding to 50 sites means 5× more of the same work. That assumption is wrong.
At 10 sites, scraper maintenance is a task. At 20 sites, the warning signs start. At 50 sites, it's consuming half an engineer's year. The relationship between sites and maintenance isn't linear. It's compounding — driven by four forces that don't exist at small scale but activate together as you grow.
This article breaks down exactly where the workload turns, what the numbers look like at each level, and how to know when in-house stops making economic sense.
Let's start with what you already know, because you're right about it: in-house scraping works at small scale. A team monitoring 10 competitor sites, collecting prices weekly, typically runs about 30–80 scraping jobs (each site may need separate jobs for product pages, search results, variant tracking).
At that volume, you'll see roughly one break every week or two — maybe 15–30 per year. Total maintenance including break-fix, spot-checking, and infrastructure: somewhere around 60–120 hours per year.
That's 3–6% of one engineer's time. It barely registers.
The exception is when high SKU volume or brittle sites push the work up even at small site counts. Our customer Landmark Group, a furniture retailer in the Middle East, had roughly 6 competitor sites — but tens of thousands of products across them. Their retail analyst was spending 6 hours per week on maintenance and still only achieving 60–70% data coverage.
Landmark's workload wasn't driven by site count. It was driven by SKU volume and data complexity. Small site count doesn't guarantee low maintenance if the data is brittle.
But for most teams at 10 simple sites, the experience is genuinely smooth. And that's the danger. When it works, your brain builds a model: sites and effort scale together in a straight line. Ten sites costs a couple of hours a week, so fifty sites should cost about ten.
Four forces conspire to make that model wrong.
Individual scrapers break at predictable rates — roughly 1–2% of active jobs per week — and each fix runs through a multi-step process most teams don't realize they're running. We manage 2,500+ scraping jobs daily, so we see these rates at scale. What changes as you grow isn't the per-incident reality — it's how incidents interact with each other and with everything else your team does.
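To make that concrete, here's a back-of-the-envelope sketch in Python (the language most in-house teams start with). It simply multiplies the 1–2% weekly break rate by a 3–8 jobs-per-site assumption, the range implied by the 30–80 jobs at 10 sites figure earlier; treat the outputs as rough orders of magnitude, not a model.

```python
# Rough sketch: expected scraper breaks per week at different scales,
# using the figures above (1-2% of active jobs break per week) and an
# assumed 3-8 jobs per monitored site. Illustrative only.

BREAK_RATE = (0.01, 0.02)   # weekly break probability per job (low, high)
JOBS_PER_SITE = (3, 8)      # assumed range of jobs per monitored site

def weekly_breaks(sites: int) -> tuple:
    """Return (low, high) expected breaks per week for a given site count."""
    low = sites * JOBS_PER_SITE[0] * BREAK_RATE[0]
    high = sites * JOBS_PER_SITE[1] * BREAK_RATE[1]
    return low, high

for sites in (10, 20, 50, 150):
    low, high = weekly_breaks(sites)
    print(f"{sites:>3} sites: ~{low:.1f}-{high:.1f} breaks/week")

#  10 sites: ~0.3-1.6 breaks/week  -> an occasional interruption
#  50 sites: ~1.5-8.0 breaks/week  -> multiple overlapping incidents most weeks
```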
Force 1: Interaction effects. At 10 sites, breaks are isolated — fix one, move on. At 50, they overlap. You're diagnosing Site A when Site B goes down. You fix B, then realize Site C's data has been wrong since Tuesday — but you didn't notice because you were buried in A and B.
Each break increases the cost of every other break happening that week by fragmenting attention.
Force 2: Coordination overhead. At 10 sites, one person knows everything. At 20+, knowledge distributes. The engineer who handles most scrapers is out sick — nobody knows why the Zalando scraper uses a different proxy configuration. Status meetings appear. Slack threads multiply.
"Who's handling the ASOS issue?" becomes a daily question. At 50 sites, that's 2–3 hours per week of people talking about scraper problems instead of fixing them.
Those first two forces are already enough to push past linear. But what happens to the infrastructure underneath?
Force 3: Infrastructure complexity. At 10 sites, one proxy provider and a single scheduling setup. At 50, you need multiple proxy providers (no single pool works against every anti-bot system), dedicated browser sessions for dynamically-loaded sites, scheduling that spaces out requests so targets don't detect patterns, and monitoring for all of it.
An hour per week becomes 4–5 — and it's the kind of work that generates urgent interruptions, not planned tasks.
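If that sounds abstract, here's a minimal Python sketch of the kind of plumbing this force adds: several proxy pools instead of one, per-site overrides, and jittered scheduling so request timing doesn't form an obvious pattern. The pool names, weights, and costs are placeholders, not a recommendation.

```python
# Minimal illustration of Force 3's moving parts: multiple proxy pools,
# per-site overrides, and jittered request scheduling. All values here
# are placeholders for illustration.
import random
from typing import Optional

PROXY_POOLS = {
    "datacenter": {"weight": 5, "cost_per_gb": 0.5},    # cheap, easily blocked
    "residential": {"weight": 3, "cost_per_gb": 8.0},   # for anti-bot-heavy sites
    "mobile": {"weight": 1, "cost_per_gb": 20.0},       # last resort
}

def pick_pool(site_requires: Optional[str] = None) -> str:
    """Choose a proxy pool: honor a per-site override, otherwise weighted random."""
    if site_requires:
        return site_requires
    pools = list(PROXY_POOLS)
    weights = [PROXY_POOLS[p]["weight"] for p in pools]
    return random.choices(pools, weights=weights, k=1)[0]

def jittered_delay(base_seconds: float, jitter: float = 0.4) -> float:
    """Spread requests out so the target doesn't see a fixed cadence."""
    return base_seconds * random.uniform(1 - jitter, 1 + jitter)

# Every extra pool, override, and schedule is one more thing to monitor,
# which is how an hour a week of infrastructure work becomes four or five.
job = {"site": "example-competitor.com", "requires_pool": "residential"}
print(pick_pool(job["requires_pool"]), f"{jittered_delay(30):.1f}s until next request")
```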
Force 4: Firefighting overlap. At 10 sites, Black Friday means keeping 10 scrapers alive under pressure — stressful, but one person can manage. At 50 sites, all scrapers run at peak frequency while the sites they target simultaneously deploy extra anti-bot measures because they're under traffic pressure too. Critical periods stop being "all hands" events and become triage — you're choosing which data to lose.
None of these four forces operate at 10 sites. All four activate somewhere in the 15–30 range — 20 is where most teams first feel the shift. By 50, they're fully compounding.
So what does this actually cost in hours? Here's the math when you account for all four forces — not just break-fix, but QA, spot-checking, silent-failure investigations, infrastructure management, and coordination. The full breakdown across all seven maintenance categories is documented here.
| Sites | Scraping Jobs | Annual Maint. Hours | % of Engineer FTE | What It Feels Like |
|---|---|---|---|---|
| 10 | 30–80 | 60–120 | 3–6% | A task. Barely registers. |
| 20 | 60–160 | 250–450 | 12–22% | "Just a quick fix" — 2–3×/week. Other projects slip. |
| 50 | 150–400 | 850–1,300 | 40–63% | Half an engineer’s year. $65–150K labor. |
| 150 | 450–1,200 | 2,500–5,000+ | 1.2–2.5 FTE | You're staffing a scraping team, not maintaining a tool. |
Look at the jump from 10 to 50 sites. Not 5×. Seven to eleven times more total maintenance hours.
The gap is coordination, validation, and infrastructure — the work that compounds. At 150 sites, the multiplier reaches 20–40×.
The linear model your team is using to plan the expansion is wrong by a factor of two to four.
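If you want to check that multiplier yourself, here's the arithmetic, reading the conservative end of the 50-site range against the generous end of the 10-site range from the table above:

```python
# How the 7-11x figure falls out of the table's ranges: compare the 50-site
# maintenance hours against the 10-site hours, conservatively and at the top end.
hours_10_sites = (60, 120)     # annual maintenance hours at 10 sites
hours_50_sites = (850, 1300)   # annual maintenance hours at 50 sites

conservative = hours_50_sites[0] / hours_10_sites[1]   # 850 / 120  ~= 7.1x
upper        = hours_50_sites[1] / hours_10_sites[1]   # 1300 / 120 ~= 10.8x
linear_model = 50 / 10                                  # what "5x the sites" predicts

print(f"actual multiplier: {conservative:.0f}-{upper:.0f}x vs linear {linear_model:.0f}x")
# -> actual multiplier: 7-11x vs linear 5x
```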
Send us your current site list and collection requirements. Within 48 hours, we'll scope what your operation actually costs — in-house vs. managed — so you can compare real numbers before committing to an expansion.
Get a Scaling Assessment

The title says "after 20 sites" deliberately. Not because 20 is a magic number — but because it's the threshold where the four forces first become visible. Below 20, maintenance is absorbed into existing work. Above 20, it starts becoming the work.
This is a qualitative shift, not just a quantitative one. At 10 sites, your engineer fixes a broken scraper and goes back to their real project. At 25 sites, they fix one scraper, discover another broke while they were fixing the first, get a Slack message about suspicious data on a third site, and realize they haven't started their planned work for the day.
The context-switching alone erodes productivity on everything they touch.
Look at the table above: 20 sites means 250–450 maintenance hours per year. That's a full day every week that nobody planned for and nobody's tracking. The projects that slip aren't scrapers — they're the product features, the analytics work, the strategic initiatives your engineer was actually hired to deliver. At 12–22% of an FTE, it doesn't show up in any resource plan. But it shows up in every sprint that runs long and every deadline that slides.
That's what "falls apart" looks like at 20. Not a catastrophic failure — a slow, invisible reallocation of your best people toward work that shouldn't be theirs.
Our customer WiTailor, an eCommerce agency, lived this trajectory. Their business analysts started by writing Python scripts and managing proxies to collect marketplace data for brand clients — beginning with a single website. As they scaled past their first few clients, reaching dozens of sites across brands and countries, the maintenance burden consumed analyst time that should have gone to client insights. By the time they needed 100+ sites across multiple markets, building in-house was no longer an option — they'd been our customer for five years because the alternative was consuming their entire analytics team.
That trajectory — manageable at single digits, creeping at 20, overwhelming past 50 — is what virtually every in-house team we've worked with has experienced. Nobody decides to build a maintenance operation. It grows one incident at a time.
Here's why it stays invisible: the engineer sees 4 hours of break-fix, the analyst sees 2 hours of data validation, the product manager sees 1 hour of "checking if the numbers look right." Nobody adds it up.
If your team estimates 10 hours a month on scraper maintenance, the real number is probably 40–60. The gap is invisible because the hours are scattered across roles nobody is aggregating. Based on the operations we audit before onboarding, the underestimate runs 4–6× consistently.
The 20-site mark is where this distributed maintenance first forces multiple people to touch the problem in the same week, and coordination overhead appears for the first time. By 50 sites, it's undeniable. But by then, you've already committed engineering resources, built technical debt, and concentrated knowledge in one person's head. The transition is harder the longer you wait.
Before planning your next expansion — or deciding whether your current approach is sustainable — honestly assess where you stand. Two minutes.
Whatever your score, you now know where you stand. That clarity alone changes the conversation from "maybe we have a problem" to "here's exactly what we need to decide."
The four forces above don't disappear with a managed service. Scrapers still break. Sites still deploy anti-bot. Coordination and QA are still required. The difference is whose team absorbs it — yours, or a team that does nothing else.
For teams past that inflection point, here's what the alternative looks like. These are three real expansions where the maintenance burden didn't grow with the customer's site count.
One customer, a brand whose team handles its own MAP enforcement, today monitors 400 sites — across Amazon in 15 countries, eBay, Walmart, Google Shopping, and hundreds of individual retailer sites. After switching, they reached full coverage and discovered over 700 unauthorized sellers they'd never known about. Their head of eCommerce went from troubleshooting scraper issues to actual MAP enforcement work.
That's not incremental improvement. That's a different category of visibility.
One of our customers, a global luxury marketplace, started at 48 sites monitoring seller assortments across 200+ brands. Their account sales team — roughly 20 people — was collectively spending over 100 hours per week on manual data collection, and still achieving only about 10% of the coverage they needed. We expanded their monitoring to 150 sites across 10+ categories — 300,000 pages checked weekly.
Assortment visibility went from patchy to near-complete. Seller negotiations shifted from vague requests to specific, benchmarked conversations. The time that came back didn't go to maintaining scrapers — it went to using the data.
Our customer Arthur D. Little (management consultancy) needed comparable pricing data across 32 pharmacy and beauty websites in four countries — Italy, Sweden, Netherlands, Romania — for a client engagement on competitive pricing strategy. They needed it within a week. Building scrapers for 32 sites with different structures, currencies, and anti-bot systems wasn't feasible in that timeline.
We delivered a complete, analysis-ready dataset within 48 hours. Custom schema per site. Variant-level pricing with discount detection. Their analysts went straight to insight work — zero time on data plumbing.
The common thread: these teams didn't scale by getting better at maintenance. They scaled by removing maintenance from their team's workload entirely. In every case, the expansion they needed was only possible once maintenance was no longer their problem.
This isn't "always outsource." Plenty of operations should stay in-house. The question is whether yours is one of them — and the answer is arithmetic, not philosophy.
Here's a formula you can run in five minutes — with a worked example for a 30-site operation monitoring anti-bot-protected competitors daily:
| Step | Formula | 30 Sites Example |
|---|---|---|
| 1. Annual incidents | (Sites × jobs/site) × break rate × 52 | 30 × 8 × 2% × 52 = ~250/year |
| 2. Break-fix hours | Incidents × avg fix time (50–90 min) | ~210–375 hrs/year |
| 3. Total maintenance | Break-fix × 2.2–2.9 (multiplier based on audits) | ~470–1,070 hrs/year |
| 4. Annual cost | Total hours × loaded rate ($85/hr) | ~$40K–$91K/year |
| 5. Compare | Get managed service quote for same scope | — |
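If it's easier to adapt than a table, here's the same five-step formula as a short Python sketch. The defaults mirror the 30-site worked example (8 jobs per site, a 2% weekly break rate, 50–90-minute fixes, a 2.2–2.9 overhead multiplier, an $85/hr loaded rate); swap in your own numbers.

```python
# The five-step formula above as a script. Defaults mirror the 30-site
# worked example; pass your own site count, jobs per site, and rates.

def maintenance_estimate(
    sites: int,
    jobs_per_site: float = 8,          # separate jobs for product pages, search, variants
    weekly_break_rate: float = 0.02,   # 1-2% of active jobs break per week
    fix_minutes: tuple = (50, 90),     # average time per incident
    overhead_multiplier: tuple = (2.2, 2.9),  # QA, validation, infra, coordination
    loaded_rate: float = 85.0,         # fully loaded engineering cost, $/hr
) -> dict:
    jobs = sites * jobs_per_site
    incidents = jobs * weekly_break_rate * 52                      # step 1: annual incidents
    break_fix_hours = (incidents * fix_minutes[0] / 60,
                       incidents * fix_minutes[1] / 60)            # step 2: break-fix hours
    total_hours = (break_fix_hours[0] * overhead_multiplier[0],
                   break_fix_hours[1] * overhead_multiplier[1])    # step 3: total maintenance
    annual_cost = (total_hours[0] * loaded_rate,
                   total_hours[1] * loaded_rate)                   # step 4: annual cost
    return {
        "incidents_per_year": round(incidents),
        "break_fix_hours": tuple(round(h) for h in break_fix_hours),
        "total_maintenance_hours": tuple(round(h) for h in total_hours),
        "annual_cost_usd": tuple(int(round(c, -3)) for c in annual_cost),
    }

print(maintenance_estimate(sites=30))
# {'incidents_per_year': 250, 'break_fix_hours': (208, 374),
#  'total_maintenance_hours': (458, 1086), 'annual_cost_usd': (39000, 92000)}
# Close to the ~470-1,070 hours and ~$40-91K in the table, which rounds at each step.
```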
Where DIY still makes sense: 5–15 sites, weekly collection, simple sites, engineering team with spare capacity. Total maintenance under 300 hours/year. Stay in-house.
Where managed service wins: 30+ sites, daily collection, sites with anti-bot, reliability needed during critical periods. Maintenance above 500 hours/year — outsourcing wins on economics.
The gray zone (15–30 sites): Depends on site complexity, team capacity, trajectory. If you're at 20 sites planning 40, run the math at 40.
For the complete cost comparison across DIY, SaaS tools, and managed services — including the infrastructure and opportunity costs this formula doesn't capture — here's the full TCO breakdown with three scenarios.
Run the formula above for your operation. Plug in your actual site count, estimated jobs per site, and collection frequency. Compare the projected maintenance hours to what your team currently estimates. If there's a 3×+ gap, you're in the underestimation pattern that affects virtually every team we've audited.
If you're planning expansion, run the formula at your target site count — not your current one. The expansion math is what breaks teams, not the current maintenance load. Twenty sites might feel manageable. Fifty will not be "just more of the same." Here's what each individual break actually costs in time and downstream disruption, so you can pressure-test the per-incident numbers.
If you want to see what the alternative looks like:
Find out what your current setup actually costs — or see what managed delivery of your competitor data looks like.
Get a TCO Estimate