Let's Learn Web Scraping with Crawlee and Puppeteer

A journey into web scraping code


Welcome to Your Web Scraping Journey!

Hey friend! 👋 I'm excited to guide you through the wonderful world of web scraping using Crawlee and Puppeteer. Whether you're a complete beginner or have some experience, we'll explore everything together, step by step.

"But Wait, What Exactly is Crawlee?" ๐Ÿค”

Great question! Think of Crawlee as your trusty assistant for web scraping. It's like having a super-smart robot that can visit websites, collect data, and organize it for you. When we pair it with Puppeteer, we get a powerful duo that can handle even the trickiest websites with lots of JavaScript and dynamic content.

Let's Get Started! 🚀

First Things First: Setting Up Your Project

"How do I begin?" I hear you ask. Let's start by setting up your project! Open your terminal and type these commands:

npm init -y   # Creates your project
npm install crawlee puppeteer   # Installs Crawlee and Puppeteer (Crawlee bundles its own sub-packages)
npm install -D typescript tsx   # TypeScript tooling (tsx runs .ts files directly)

Think of this like packing your backpack before an adventure - we're just getting our tools ready!

"How Should I Organize My Files?" ๐Ÿ“

I recommend this simple structure to keep things neat and tidy:

your-awesome-project/
├── src/
│   ├── main.ts      # Where the magic begins!
│   └── routes.ts    # Your scraping instructions
├── package.json
└── tsconfig.json
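
That routes.ts file is where Crawlee's router pattern shines: instead of one giant requestHandler, you register a handler per page type. Here's a minimal sketch of what it might contain (the 'DETAIL' label and the a.product-link selector are placeholders for whatever your target site uses):

// src/routes.ts
import { createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

// The default handler runs for any request without a label
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Visiting a listing page...');
    // Queue product pages and tag them for the DETAIL handler below
    await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
});

// Runs for every request labeled 'DETAIL'
router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Scraping details from ${request.url}`);
});

In main.ts you'd then wire it up with new PuppeteerCrawler({ requestHandler: router }).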

Let's Build Something Cool! 🛠️

"Enough theory, show me something real!" - I got you! Let's create a simple scraper that collects product information. I'll explain each part as we go:

import { PuppeteerCrawler, Dataset } from 'crawlee';

// Here's our crawler - think of it as your personal web explorer
const crawler = new PuppeteerCrawler({
    maxConcurrency: 10,  // We'll run 10 pages at once - not too greedy!

    async requestHandler({ page, request, log }) {
        // Let's tell our scraper what to do on each page
        log.info(`🔍 Taking a look at ${request.url}`);

        // Wait for products to appear - patience is key!
        await page.waitForSelector('.product-item');

        // Now, let's collect some data!
        const products = await page.$$eval('.product-item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('.product-title')?.textContent?.trim(),
                price: el.querySelector('.product-price')?.textContent?.trim(),
                description: el.querySelector('.product-description')?.textContent?.trim(),
                imageUrl: el.querySelector('img')?.getAttribute('src'),
            }))
        );

        // Save our treasure (data)!
        await Dataset.pushData(products);
        log.info('💫 Successfully saved product data!');
    },
});

// Time to start our adventure!
await crawler.run(['https://example-shop.com/products']);
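
A quick note before you run it: the top-level await on the last line only works in an ES module, so add "type": "module" to your package.json (or wrap the call in an async function). Then, assuming the tsx runner we installed earlier:

npx tsx src/main.ts   # Runs the TypeScript file directly

When running locally, the scraped items end up as JSON files under ./storage/datasets/default/.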

"But What If I Need to Log Into a Website?" ๐Ÿ”

No worries! Here's how we can handle login pages:

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {  // grab `log` from the context too
        if (request.label === 'LOGIN') {
            log.info('🔑 Logging in...');
            // Tip: in real code, read credentials from environment variables
            await page.type('#username', 'your-username');
            await page.type('#password', 'your-password');
            // Start waiting for the navigation *before* the click that triggers it
            await Promise.all([
                page.waitForNavigation(),
                page.click('#login-button'),
            ]);
            log.info('✅ Successfully logged in!');
        }
        // Continue with our scraping adventure...
    },
});
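
Where does that 'LOGIN' label come from? You attach it when you queue the request. A quick sketch (the URL is a placeholder):

// Queue the login page first, tagged so the handler above
// knows to run the login steps
await crawler.addRequests([
    { url: 'https://example-shop.com/login', label: 'LOGIN' },
]);
await crawler.run();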

Common Challenges and Solutions 💡

"Help! The Content Keeps Loading as I Scroll!" ๐Ÿ“œ

Got you covered! Here's how to handle infinite scroll:

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, log }) {  // remember to destructure `log`
        log.info('🔄 Scrolling to load more content...');
        await page.evaluate(async () => {
            await new Promise<void>((resolve) => {
                let scrolled = 0;
                const interval = setInterval(() => {
                    window.scrollBy(0, 100);
                    scrolled += 100;

                    if (scrolled >= document.body.scrollHeight) {
                        clearInterval(interval);
                        resolve();
                    }
                }, 100);
            });
        });
        log.info('✨ Finished loading all content!');
    },
});
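
One caveat with the approach above: document.body.scrollHeight keeps growing as new items load, so the loop can end before the last batch has rendered. It often helps to let the network settle afterwards with Puppeteer's page.waitForNetworkIdle(), dropped in right after the evaluate call (the timings below are just reasonable starting points):

// Wait until no network requests have fired for 500 ms,
// giving up after 10 seconds - tune both for your site
await page.waitForNetworkIdle({ idleTime: 500, timeout: 10_000 });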

Pro Tips from a Friend! 🌟

  1. Be Nice to Websites
const crawler = new PuppeteerCrawler({
    maxRequestsPerMinute: 60,    // Don't overwhelm the website!
    respectRobotsTxtFile: true,  // Follow the site's robots.txt rules
});
  2. Handle Errors Like a Pro
const crawler = new PuppeteerCrawler({
    // Note: Crawlee passes the error as a second argument, after the context
    async failedRequestHandler({ request, log }, error) {
        log.error(`😢 Oops! ${request.url} failed: ${error.message}`);
        // Save the error to a named dataset so we can fix it later
        const errors = await Dataset.open('errors');
        await errors.pushData({
            url: request.url,
            error: error.message,
            timestamp: new Date().toISOString(),
        });
    },
});

"What Can I Do with All This Data?" ๐Ÿ“Š

Great question! Let's save our findings:

// Export the default dataset as JSON - perfect for APIs
await Dataset.exportToJSON('my-findings');

// Or as CSV - spreadsheet lovers rejoice!
await Dataset.exportToCSV('my-findings');

// (There's no built-in Excel export, but Excel opens CSV files just fine.
// When running locally, the exported files land in ./storage/key_value_stores/default/.)

Fun Project Ideas! 🎨

  1. Price Tracker (see the sketch after this list)

    • Track your favorite products

    • Get notified when prices drop

    • Find the best deals!

  2. News Aggregator

    • Collect articles you love

    • Never miss important news

    • Create your personal digest

  3. Job Hunter

    • Find your dream job

    • Track new openings

    • Compare salaries
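
To make the first idea concrete, here's a rough sketch of the core of a price tracker, using Crawlee's key-value store to remember the last price you saw. The store name, product id, and the notify step are all placeholder choices:

import { KeyValueStore } from 'crawlee';

// Compare a freshly scraped price against the last one we stored
async function checkPrice(productId: string, currentPrice: number) {
    const store = await KeyValueStore.open('price-history');
    const lastPrice = await store.getValue<number>(productId);

    if (lastPrice !== null && currentPrice < lastPrice) {
        console.log(`Price drop for ${productId}: ${lastPrice} -> ${currentPrice}`);
        // notify(productId, currentPrice);  // hypothetical - email, webhook, etc.
    }

    await store.setValue(productId, currentPrice);
}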

Before You Go! 👋

Remember these friendly tips:

  • Always check if a website allows scraping

  • Use delays between requests (be nice!)

  • Save your data regularly

  • Handle errors gracefully

Need Help? 🆘

Don't hesitate to check the official Crawlee documentation at crawlee.dev or ask the community when you get stuck.

Let's Stay in Touch! 🤝

I hope this guide helps you start your web scraping journey! Remember, everyone was a beginner once, so don't feel discouraged if things don't work perfectly the first time. Keep experimenting, and most importantly, have fun with it!

Got questions? Need clarification? Feel stuck? That's totally normal! Drop your questions in the comments or reach out to the community. We're all here to help each other learn and grow!

Happy scraping! 🎉
