Let's Learn Web Scraping with Crawlee and Puppeteer
A journey into web scraping code
Table of contents
- Welcome to Your Web Scraping Journey!
- "But Wait, What Exactly is Crawlee?"
- Let's Get Started!
- Let's Build Something Cool!
- "But What If I Need to Log Into a Website?"
- Common Challenges and Solutions
- Pro Tips from a Friend!
- "What Can I Do with All This Data?"
- Fun Project Ideas!
- Before You Go!
- Need Help?
- Let's Stay in Touch!
Welcome to Your Web Scraping Journey!
Hey friend! I'm excited to guide you through the wonderful world of web scraping using Crawlee and Puppeteer. Whether you're a complete beginner or have some experience, we'll explore everything together, step by step.
"But Wait, What Exactly is Crawlee?"
Great question! Think of Crawlee as your trusty assistant for web scraping. It's like having a super-smart robot that can visit websites, collect data, and organize it for you. When we pair it with Puppeteer, we get a powerful duo that can handle even the trickiest websites with lots of JavaScript and dynamic content.
Let's Get Started!
First Things First: Setting Up Your Project
"How do I begin?" I hear you ask. Let's start by setting up your project! Open your terminal and type these commands:
npm init -y # Creates your project
npm install crawlee puppeteer # Installs the tools we need
Think of this like packing your backpack before an adventure - we're just getting our tools ready!
"How Should I Organize My Files?" ๐
I recommend this simple structure to keep things neat and tidy:
your-awesome-project/
├── src/
│   ├── main.ts      # Where the magic begins!
│   └── routes.ts    # Your scraping instructions
├── package.json
└── tsconfig.json
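Wondering how those two files talk to each other? Here's a minimal sketch of how they might fit together, assuming you've also set up TypeScript tooling (e.g. npm install -D typescript tsx @types/node). The URL and handler logic are just placeholders:

// src/routes.ts - your scraping instructions live here
import { createPuppeteerRouter, Dataset } from 'crawlee';

export const router = createPuppeteerRouter();

// The default handler runs for every request that has no special label
router.addDefaultHandler(async ({ page, request, log }) => {
    log.info(`Visiting ${request.url}`);
    await Dataset.pushData({ url: request.url, title: await page.title() });
});

// src/main.ts - where the magic begins
import { PuppeteerCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PuppeteerCrawler({ requestHandler: router });
await crawler.run(['https://example.com']); // placeholder start URL

Don't worry if the router feels like overkill right now. For a small scraper you can skip it entirely and write the handler inline, like in the example below.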
Let's Build Something Cool!
"Enough theory, show me something real!" - I got you! Let's create a simple scraper that collects product information. I'll explain each part as we go:
import { PuppeteerCrawler, Dataset } from 'crawlee';

// Here's our crawler - think of it as your personal web explorer
const crawler = new PuppeteerCrawler({
    maxConcurrency: 10, // We'll run 10 pages at once - not too greedy!
    async requestHandler({ page, request, log }) {
        // Let's tell our scraper what to do on each page
        log.info(`Taking a look at ${request.url}`);

        // Wait for products to appear - patience is key!
        await page.waitForSelector('.product-item');

        // Now, let's collect some data!
        const products = await page.$$eval('.product-item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('.product-title')?.textContent?.trim(),
                price: el.querySelector('.product-price')?.textContent?.trim(),
                description: el.querySelector('.product-description')?.textContent?.trim(),
                imageUrl: el.querySelector('img')?.getAttribute('src'),
            }))
        );

        // Save our treasure (data)!
        await Dataset.pushData(products);
        log.info('Successfully saved product data!');
    },
});

// Time to start our adventure!
await crawler.run(['https://example-shop.com/products']);
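If the products are spread across multiple pages, you don't have to collect every URL yourself. Here's a minimal sketch using the enqueueLinks helper from the crawling context; the .next-page selector is a placeholder you'd swap for the real pagination link on your target site:

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log, enqueueLinks }) {
        log.info(`Taking a look at ${request.url}`);
        // ...collect the product data exactly like above...

        // Queue the "next page" link so the crawler keeps exploring on its own
        await enqueueLinks({ selector: 'a.next-page' }); // hypothetical selector
    },
});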
"But What If I Need to Log Into a Website?" ๐
No worries! Here's how we can handle login pages:
const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        if (request.label === 'LOGIN') {
            log.info('Logging in...');
            await page.type('#username', 'your-username');
            await page.type('#password', 'your-password');
            await page.click('#login-button');
            await page.waitForNavigation();
            log.info('Successfully logged in!');
        }
        // Continue with our scraping adventure...
    },
});
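So where does that LOGIN label come from? You attach it yourself when you queue the request, and the handler branches on it. A tiny sketch (the URLs are placeholders):

// Labels travel with the request, so the handler knows what to do with each page
await crawler.run([
    { url: 'https://example-shop.com/login', label: 'LOGIN' }, // placeholder URL
    'https://example-shop.com/products',
]);

And a friendly word of caution: keep real credentials in environment variables, never hard-coded in your source files.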
Common Challenges and Solutions
"Help! The Content Keeps Loading as I Scroll!"
Got you covered! Here's how to handle infinite scroll:
const crawler = new PuppeteerCrawler({
    async requestHandler({ page, log }) {
        log.info('Scrolling to load more content...');

        await page.evaluate(async () => {
            await new Promise<void>((resolve) => {
                let scrolled = 0;
                const interval = setInterval(() => {
                    window.scrollBy(0, 100);
                    scrolled += 100;
                    if (scrolled >= document.body.scrollHeight) {
                        clearInterval(interval);
                        resolve();
                    }
                }, 100);
            });
        });

        log.info('Finished loading all content!');
    },
});
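By the way, if you'd rather not hand-roll the scrolling, newer Crawlee versions ship an infiniteScroll helper in puppeteerUtils. Double-check the docs for your version, but roughly it looks like this:

import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, log }) {
        log.info('Letting Crawlee do the scrolling...');
        // Keeps scrolling until the page stops growing or the timeout is reached
        await puppeteerUtils.infiniteScroll(page, { timeoutSecs: 30 });
    },
});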
Pro Tips from a Friend!
- Be Nice to Websites
const crawler = new PuppeteerCrawler({
    maxRequestsPerMinute: 60, // Don't overwhelm the website!
    respectRobotsTxtFile: true, // Follow the rules! (available in newer Crawlee versions)
});
- Handle Errors Like a Pro
const crawler = new PuppeteerCrawler({
    async failedRequestHandler({ request, log }, error) {
        log.error(`Oops! ${request.url} failed: ${error.message}`);

        // Save the error to its own dataset so we can fix it later
        const errorDataset = await Dataset.open('errors');
        await errorDataset.pushData({
            url: request.url,
            error: error.message,
            timestamp: new Date().toISOString(),
        });
    },
});
"What Can I Do with All This Data?" ๐
Great question! Let's save our findings:
// Save as JSON - perfect for APIs
await Dataset.exportToJSON('my-findings');

// Save as CSV - spreadsheet lovers rejoice! Open it in Excel for that professional touch
await Dataset.exportToCSV('my-findings');

Both calls write the exported file into the default key-value store under the key you give them, so you'll find the results right next to your other crawl data.
Fun Project Ideas!
Price Tracker
- Track your favorite products
- Get notified when prices drop
- Find the best deals! (see the little sketch after this list)

News Aggregator
- Collect articles you love
- Never miss important news
- Create your personal digest

Job Hunter
- Find your dream job
- Track new openings
- Compare salaries
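To give you a taste of the first idea, here's a rough sketch of a price tracker. Everything in it (the URL, the selector, the target price) is a placeholder you'd adapt to your own favorite shop:

import { PuppeteerCrawler } from 'crawlee';

const TARGET_PRICE = 50; // hypothetical threshold

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        await page.waitForSelector('.product-price'); // placeholder selector
        const text = await page.$eval('.product-price', (el) => el.textContent ?? '');
        const price = parseFloat(text.replace(/[^0-9.]/g, ''));

        if (price <= TARGET_PRICE) {
            log.info(`Price drop spotted on ${request.url}: ${price}`);
            // ...send yourself an email, a Slack message, whatever you like...
        }
    },
});

await crawler.run(['https://example-shop.com/my-favourite-product']); // placeholder URL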
Before You Go!
Remember these friendly tips:
- Always check if a website allows scraping
- Use delays between requests (be nice!)
- Save your data regularly
- Handle errors gracefully
Need Help?
Don't hesitate to:
- Experiment and have fun (even if you break your Visual Studio Code setup along the way)!
Let's Stay in Touch!
I hope this guide helps you start your web scraping journey! Remember, everyone was a beginner once, so don't feel discouraged if things don't work perfectly the first time. Keep experimenting, and most importantly, have fun with it!
Got questions? Need clarification? Feel stuck? That's totally normal! Drop your questions in the comments or reach out to the community. We're all here to help each other learn and grow!
Happy scraping!