A while back I needed to crawl B*teB*teGo, a solid system design learning platform, and ran straight into a wall that every scraper eventually hits: Single Page Applications.
Here’s what I learned.
The SPA Problem
B*teB*teGo is built on React. When you open View Page Source, you get something like this:
<!DOCTYPE html>
<html>
<head>...</head>
<body>
<div id="root"></div>
<script src="/static/js/main.chunk.js"></script>
</body>
</html>
No content. Just a shell. Everything (titles, articles, images) is rendered client-side after JavaScript executes.
This kills the traditional scraping stack immediately:
# This gets you nothing useful on a React site
import requests
from bs4 import BeautifulSoup
res = requests.get("https://b\*teb\*teg\*.com/courses/system-design-interview/scale-from-zero-to-millions-of-users")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.find("article")) # None
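You can confirm the shell is empty without even making a request. A minimal sketch using only the standard library's HTMLParser, run against a stand-in copy of the shell markup shown above (not B*teB*teGo's actual HTML):

```python
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Accumulates visible text, ignoring script/style contents."""
    def __init__(self):
        super().__init__()
        self.text = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text.append(data.strip())

# A stand-in SPA shell, like the one View Page Source shows
shell = """<!DOCTYPE html><html><head></head><body>
<div id="root"></div>
<script src="/static/js/main.chunk.js"></script>
</body></html>"""

parser = TextCounter()
parser.feed(shell)
visible = "".join(parser.text)
print(len(visible))  # 0: the server sent no readable content
```

Zero visible characters: everything the reader sees arrives only after the JavaScript bundle runs.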
Two Approaches
Approach 1: Selenium or Playwright
The standard solution for SPA scraping is to drive a real browser. Playwright is the modern choice.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://b\*teb\*teg\*.com/...")
    page.wait_for_selector("article")  # wait for JS to render
    content = page.inner_html("article")
    browser.close()
This works, but comes with real costs:
- URL discovery: You need to build a crawler to traverse the full URL tree. B*teB*teGo’s content is nested across topics, chapters, and lessons. Mapping this reliably takes time.
- Edge cases: Lazy-loaded images, gated content, auth cookies, navigation state. Each one is a rabbit hole.
- Maintenance: SPAs update their DOM structure. Your selectors break silently.
Worth it for a long-running production scraper. Overkill for a one-off data collection job.
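To make the URL-discovery cost concrete, here is a minimal sketch of the crawl loop you would have to build around Playwright. The page-fetching step is stubbed out as a function argument so the traversal logic stands alone; the link graph below is hypothetical, not B*teB*teGo's real structure:

```python
from collections import deque

def discover_urls(start, get_links, max_pages=1000):
    """Breadth-first traversal of a site's internal link graph.
    In a real crawler, get_links(url) would wrap Playwright: load
    the page, wait for it to render, and return the links found."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph standing in for rendered pages
graph = {
    "/courses": ["/courses/a", "/courses/b"],
    "/courses/a": ["/courses/a/1", "/courses/b"],
    "/courses/b": ["/courses/b/1"],
    "/courses/a/1": [],
    "/courses/b/1": [],
}
print(discover_urls("/courses", lambda u: graph.get(u, [])))
```

The loop itself is small; the real work is everything hiding inside get_links: waiting for render, auth, lazy loading, and selectors that keep changing.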
Approach 2: Semi-Automated with Console Script + SingleFile (recommended)
B*teB*teGo organizes content by Topic, where each Topic maps to a distinct URL. This structure makes a much simpler approach viable.
Step 1: Extract all URLs from the browser console
Log into B*teB*teGo, open the course page, then open DevTools > Console and run:
// Finds all lesson links in the sidebar navigation tree
const links = [...document.querySelectorAll('nav a[href]')]
  .map(a => a.href)
  .filter(href => href.includes('/courses/'))
  .filter((v, i, arr) => arr.indexOf(v) === i); // deduplicate
console.log(links.join('\n'));
// Copy the output
You get a clean list of every lesson URL in seconds, no crawler needed.
Step 2: Open in batch tabs
Paste the URLs into a script that opens them in batches:
// Run in console, opens 20 tabs at a time
const urls = `
https://b\*teb\*teg\*.com/courses/...
https://b\*teb\*teg\*.com/courses/...
`.trim().split('\n');
const BATCH_SIZE = 20; // keep this low to avoid RAM issues
urls.slice(0, BATCH_SIZE).forEach(url => window.open(url, '_blank'));
Keep batches at 10-30 tabs. Opening 100+ at once will either crash your browser or trigger rate limiting.
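The console snippet above only opens the first batch; to work through the whole list you advance the slice each round. The chunking logic is simple enough to sketch outside the browser (the URLs here are placeholders):

```python
def batches(items, size=20):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Stand-in URLs; in practice, paste the list from Step 1
urls = [f"https://example.com/lesson/{n}" for n in range(1, 48)]
for n, batch in enumerate(batches(urls, size=20), start=1):
    print(f"batch {n}: {len(batch)} tabs")  # 20, 20, then the 7 left over
```

Forty-seven URLs become three manageable rounds of tabs instead of one browser-crashing burst.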
Step 3: Save with SingleFile
Install the SingleFile browser extension. Once your batch of tabs is loaded:
- Click the SingleFile icon > Save tabs > Save all tabs
- Each tab gets saved as a self-contained .html file with CSS, images, and text all inlined
The output is clean, portable, and offline-readable. No dependencies, no broken image links.
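A quick sanity check on the saved files, sketched under the assumption that a fully inlined page no longer references external resources via http(s) src attributes (SingleFile normally rewrites them to data: URIs):

```python
import re

EXTERNAL_SRC = re.compile(r'src\s*=\s*["\']https?://', re.IGNORECASE)

def looks_self_contained(html: str) -> bool:
    """Heuristic: a fully inlined page should not load any
    resource from the network via an http(s) src attribute."""
    return not EXTERNAL_SRC.search(html)

# Illustrative fragments, not real saved output
inlined = '<img src="data:image/png;base64,iVBORw0KGgo=">'
remote = '<img src="https://cdn.example.com/diagram.png">'
print(looks_self_contained(inlined))  # True
print(looks_self_contained(remote))   # False
```

Running this over the batch flags any page that was saved before its images finished loading, so you can re-save it rather than discover the gap offline later.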
Comparison
| | Playwright | Semi-automated |
|---|---|---|
| Setup time | Hours | ~15 minutes |
| Handles auth | Yes (with work) | Yes (you’re already logged in) |
| Full automation | Yes | Partial (manual batch opening) |
| Fragility | Medium (selector changes) | Low |
| Best for | Production pipelines | One-off data collection |
When to Use Which
Use Playwright when you need fully automated, repeatable scraping at scale: CI pipelines, scheduled jobs, large sites with thousands of pages.
Use the semi-automated approach when you need data once (or occasionally) from a site you’re logged into and the content is organized into a discoverable URL structure. You trade automation for speed-to-data.
For B*teB*teGo specifically, the semi-automated approach took about 20 minutes end-to-end. Writing a robust Playwright crawler for the same site would have taken most of a day.
SingleFile extension: github.com/gildas-lormeau/SingleFile