Crawling SPA Websites: Lessons from Scraping a React Learning Platform

Tags: python, web-scraping, tips

A while back I needed to crawl B*teB*teGo, a solid system design learning platform, and ran straight into a wall that every scraper eventually hits: Single Page Applications.

Here’s what I learned.

The SPA Problem

B*teB*teGo is built on React. When you open View Page Source, you get something like this:

html
<!DOCTYPE html>
<html>
  <head>...</head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>

No content. Just a shell. Everything (titles, articles, images) is rendered client-side after JavaScript executes.

This kills the traditional scraping stack immediately:

python
# This gets you nothing useful on a React site
import requests
from bs4 import BeautifulSoup

res = requests.get("https://b*teb*teg*.com/courses/system-design-interview/scale-from-zero-to-millions-of-users")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.find("article"))  # None

Two Approaches

Approach 1: Selenium or Playwright

The standard solution for SPA scraping is to drive a real browser. Playwright is the modern choice.

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://b*teb*teg*.com/...")
    page.wait_for_selector("article")  # wait for JS to render
    content = page.inner_html("article")
    browser.close()

This works, but comes with real costs:

  • URL discovery: You need to build a crawler to traverse the full URL tree. B*teB*teGo’s content is nested across topics, chapters, and lessons. Mapping this reliably takes time.
  • Edge cases: Lazy-loaded images, gated content, auth cookies, navigation state. Each one is a rabbit hole.
  • Maintenance: SPAs update their DOM structure. Your selectors break silently.
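To make the URL-discovery cost concrete, here is a minimal sketch of the BFS traversal such a crawler needs. The names are hypothetical: in practice `fetch_links` would be backed by Playwright (goto, wait for render, collect hrefs), which is exactly where the time and fragility live.

```python
from collections import deque
from typing import Callable, Iterable

def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          scope: str,
          limit: int = 1000) -> list[str]:
    """Breadth-first traversal of in-scope links.

    fetch_links(url) returns the hrefs found on a rendered page;
    scope restricts the crawl to one URL prefix (e.g. the course tree).
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link.startswith(scope) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

The traversal itself is trivial; the hard part is everything hidden behind `fetch_links` on a JavaScript-rendered site.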

Worth it for a long-running production scraper. Overkill for a one-off data collection job.

Approach 2: Semi-Automated Extraction

B*teB*teGo organizes content by Topic, and each Topic maps to a distinct URL. That structure makes a much simpler, semi-automated approach viable.

Step 1: Extract all URLs from the browser console

Log into B*teB*teGo, open the course page, then open DevTools > Console and run:

javascript
// Finds all lesson links in the sidebar navigation tree
const links = [...document.querySelectorAll('nav a[href]')]
  .map(a => a.href)
  .filter(href => href.includes('/courses/'))
  .filter((v, i, arr) => arr.indexOf(v) === i); // deduplicate

console.log(links.join('\n'));
// Copy the output

You get a clean list of every lesson URL in seconds, no crawler needed.
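If you'd rather tidy the copied list in Python before the next step, an order-preserving dedupe is a one-liner. A small helper sketch, nothing site-specific:

```python
def clean_urls(raw: str) -> list[str]:
    """Strip whitespace and blank lines, dedupe while preserving order."""
    urls = [line.strip() for line in raw.splitlines() if line.strip()]
    return list(dict.fromkeys(urls))  # dict keys keep first-seen order
```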

Step 2: Open in batch tabs

Paste the URLs into a script that opens them in batches:

javascript
// Run in console, opens 20 tabs at a time
const urls = `
https://b\*teb\*teg\*.com/courses/...
https://b\*teb\*teg\*.com/courses/...
`.trim().split('\n');

const BATCH_SIZE = 20; // keep this low to avoid RAM issues
urls.slice(0, BATCH_SIZE).forEach(url => window.open(url, '_blank'));

Keep batches at 10-30 tabs. Opening 100+ at once will either crash your browser or trigger rate limiting.
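If you prefer scripting this locally instead of in the console, the standard library's `webbrowser.open_new_tab` opens tabs in your default (already logged-in) browser. A sketch, with the chunking kept as a separate helper:

```python
import time
import webbrowser

def batches(items: list[str], size: int):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def open_in_batches(urls: list[str], batch_size: int = 20, pause: float = 60.0) -> None:
    """Open one batch of tabs, then pause so you can save them before the next."""
    for batch in batches(urls, batch_size):
        for url in batch:
            webbrowser.open_new_tab(url)
        time.sleep(pause)
```

The pause is doing the rate limiting here; tune it to however long a batch takes to load and save.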

Step 3: Save with SingleFile

Install the SingleFile browser extension. Once your batch of tabs is loaded:

  1. Click the SingleFile icon > Save tabs > Save all tabs
  2. Each tab gets saved as a self-contained .html file with CSS, images, and text all inlined

The output is clean, portable, and offline-readable. No dependencies, no broken image links.
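A nice side effect: because SingleFile saves the already-rendered DOM, the BeautifulSoup stack that failed earlier now works fine on the saved files. A sketch of that post-processing step (the `<article>` selector is an assumption about the saved markup; adjust to what you actually see):

```python
from pathlib import Path

from bs4 import BeautifulSoup

def extract_article_text(html_path: Path) -> str:
    """Return the plain text of the <article> element in a saved page."""
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    article = soup.find("article")  # assumed container; inspect your files
    return article.get_text(separator="\n", strip=True) if article else ""
```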

Comparison

|                 | Playwright                | Semi-automated                 |
|-----------------|---------------------------|--------------------------------|
| Setup time      | Hours                     | ~15 minutes                    |
| Handles auth    | Yes (with work)           | Yes (you’re already logged in) |
| Full automation | Yes                       | Partial (manual batch opening) |
| Fragility       | Medium (selector changes) | Low                            |
| Best for        | Production pipelines      | One-off data collection        |

When to Use Which

Use Playwright when you need fully automated, repeatable scraping at scale: CI pipelines, scheduled jobs, large sites with thousands of pages.

Use the semi-automated approach when you need data once (or occasionally) from a site you’re logged into and the content is organized into a discoverable URL structure. You trade automation for speed-to-data.

For B*teB*teGo specifically, the semi-automated approach took about 20 minutes end-to-end. Writing a robust Playwright crawler for the same site would have taken most of a day.


SingleFile extension: github.com/gildas-lormeau/SingleFile