cd ..>How Web Scraping Works: A Deep Dive Using Node.js and Cheerio
May 19, 2025 Comments

How Web Scraping Works: A Deep Dive Using Node.js and Cheerio

Learn how to scrape structured data from websites using Node.js and Cheerio, with real examples from an Internshala Automation project.
How Web Scraping Works: A Deep Dive Using Node.js and Cheerio

🕸️ What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It's useful when the data you need is not available via an API. With tools like Cheerio in Node.js, you can parse and navigate HTML just like jQuery.

In this post, we'll explore:

  • How web scraping works
  • Traversing the DOM using Cheerio
  • Handling anti-scraping techniques
  • Ethical considerations
  • Real code from an Internshala Automation project

⚙️ Tools We’ll Use

  • Node.js – Server-side JavaScript
  • Axios – For making HTTP requests
  • Cheerio – For parsing and traversing HTML
  • Internshala – The target site (educational purposes only)

📄 Installing Dependencies

npm install axios cheerio

🔍 Scraping Internshala – The Basics

Here’s a sample scraping function to fetch internship listings from Internshala.

const axios = require("axios");
const cheerio = require("cheerio");

const scrapeInternships = async (keywords = "web development") => {
  const url = `https://internshala.com/internships/keywords-${keywords.replace(
    " ",
    "-"
  )}`;

  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  const internships = [];

  $(".internship_meta").each((i, el) => {
    const title = $(el).find(".profile").text().trim();
    const company = $(el).find(".company_name").text().trim();
    const location = $(el).find(".location_link").text().trim();
    const startDate = $(el).find(".start_immediately_desktop").text().trim();

    internships.push({ title, company, location, startDate });
  });

  return internships;
};

scrapeInternships().then(console.log).catch(console.error);

🧭 How DOM Traversal Works

Using Cheerio, you can traverse HTML using jQuery-like selectors.

Example:

const title = $(el).find(".profile").text().trim();

This finds the .profile class inside each internship block and extracts the title.


🛡️ Anti-Scraping Protections

Websites often try to block scrapers. Here’s how to handle common techniques:

1. User-Agent Blocking

Some sites block requests that don’t have a real browser User-Agent.

const { data } = await axios.get(url, {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  },
});

2. Rate Limiting

Avoid sending too many requests too quickly. Use setTimeout or libraries like p-queue.

3. JavaScript-Rendered Content

If the site loads data via JavaScript, tools like Cheerio won’t work. You’ll need Puppeteer or Playwright for such cases.


✅ Ethical Web Scraping

While scraping is powerful, it comes with responsibilities:

  • Respect robots.txt – Check if scraping is allowed
  • Don’t overload servers – Use rate limits
  • Use scraped data only for personal or educational use
  • Avoid login-based or paid content without permission

🧪 Real Use Case – Automating Internshala

In our Internshala Automation Project, we scrape internship data and filter based on user preferences like location, skills, and start date.

Here’s a simplified snippet:

const filterByLocation = (data, location) =>
  data.filter((job) =>
    job.location.toLowerCase().includes(location.toLowerCase())
  );

const filtered = filterByLocation(await scrapeInternships(), "remote");
console.log(filtered);

📦 Wrapping Up

Web scraping using Node.js and Cheerio is a practical skill that opens up automation and data collection possibilities. Whether you're tracking job listings or extracting prices from e-commerce sites, the fundamentals remain the same.

✅ Keep it ethical ✅ Avoid scraping sensitive or protected data ✅ Respect site policies


📚 Resources


Happy Scraping! 🚀