DevPortfolio

🕸️ What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It's useful when the data you need is not available via an API. With tools like Cheerio in Node.js, you can parse and navigate HTML just like jQuery.

In this post, we'll explore:

How web scraping works
Traversing the DOM using Cheerio
Handling anti-scraping techniques
Ethical considerations
Real code from an Internshala Automation project

⚙️ Tools We’ll Use

Node.js – Server-side JavaScript
Axios – For making HTTP requests
Cheerio – For parsing and traversing HTML
Internshala – The target site (educational purposes only)

📄 Installing Dependencies

npm install axios cheerio

🔍 Scraping Internshala – The Basics

Here’s a sample scraping function to fetch internship listings from Internshala.

const axios = require("axios");
const cheerio = require("cheerio");

const scrapeInternships = async (keywords = "web development") => {
  const url = `https://internshala.com/internships/keywords-${keywords.replace(
    " ",
    "-"
  )}`;

  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  const internships = [];

  $(".internship_meta").each((i, el) => {
    const title = $(el).find(".profile").text().trim();
    const company = $(el).find(".company_name").text().trim();
    const location = $(el).find(".location_link").text().trim();
    const startDate = $(el).find(".start_immediately_desktop").text().trim();

    internships.push({ title, company, location, startDate });
  });

  return internships;
};

scrapeInternships().then(console.log).catch(console.error);

🧭 How DOM Traversal Works

Using Cheerio, you can traverse HTML using jQuery-like selectors.

Example:

const title = $(el).find(".profile").text().trim();

This finds the .profile class inside each internship block and extracts the title.

🛡️ Anti-Scraping Protections

Websites often try to block scrapers. Here’s how to handle common techniques:

1. User-Agent Blocking

Some sites block requests that don’t have a real browser User-Agent.

const { data } = await axios.get(url, {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  },
});

2. Rate Limiting

Avoid sending too many requests too quickly. Use setTimeout or libraries like p-queue.

3. JavaScript-Rendered Content

If the site loads data via JavaScript, tools like Cheerio won’t work. You’ll need Puppeteer or Playwright for such cases.

✅ Ethical Web Scraping

While scraping is powerful, it comes with responsibilities:

Respect robots.txt – Check if scraping is allowed
Don’t overload servers – Use rate limits
Use scraped data only for personal or educational use
Avoid login-based or paid content without permission

🧪 Real Use Case – Automating Internshala

In our Internshala Automation Project, we scrape internship data and filter based on user preferences like location, skills, and start date.

Here’s a simplified snippet:

const filterByLocation = (data, location) =>
  data.filter((job) =>
    job.location.toLowerCase().includes(location.toLowerCase())
  );

const filtered = filterByLocation(await scrapeInternships(), "remote");
console.log(filtered);

📦 Wrapping Up

Web scraping using Node.js and Cheerio is a practical skill that opens up automation and data collection possibilities. Whether you're tracking job listings or extracting prices from e-commerce sites, the fundamentals remain the same.

✅ Keep it ethical ✅ Avoid scraping sensitive or protected data ✅ Respect site policies

📚 Resources

Happy Scraping! 🚀

How Web Scraping Works: A Deep Dive Using Node.js and Cheerio