Let's Learn Web Scraping with Crawlee and Puppeteer
A journey into web scraping code
Table of contents
- Welcome to Your Web Scraping Journey!
- "But Wait, What Exactly is Crawlee?"
- Let's Get Started!
- Let's Build Something Cool!
- "But What If I Need to Log Into a Website?"
- Common Challenges and Solutions
- Pro Tips from a Friend!
- "What Can I Do with All This Data?"
- Fun Project Ideas!
- Before You Go!
- Need Help?
- Let's Stay in Touch!
Welcome to Your Web Scraping Journey!
Hey friend! I'm excited to guide you through the wonderful world of web scraping using Crawlee and Puppeteer. Whether you're a complete beginner or have some experience, we'll explore everything together, step by step.
"But Wait, What Exactly is Crawlee?"
Great question! Think of Crawlee as your trusty assistant for web scraping. It's like having a super-smart robot that can visit websites, collect data, and organize it for you. When we pair it with Puppeteer, we get a powerful duo that can handle even the trickiest websites with lots of JavaScript and dynamic content.
Let's Get Started!
First Things First: Setting Up Your Project
"How do I begin?" I hear you ask. Let's start by setting up your project! Open your terminal and type these commands:
npm init -y # Creates your project
npm install crawlee puppeteer # Installs the tools we need
Think of this like packing your backpack before an adventure - we're just getting our tools ready!
"How Should I Organize My Files?" ๐
I recommend this simple structure to keep things neat and tidy:
your-awesome-project/
├── src/
│   ├── main.ts      # Where the magic begins!
│   └── routes.ts    # Your scraping instructions
├── package.json
└── tsconfig.json
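Wondering how those two files talk to each other? Here's a minimal sketch of how they might fit together, assuming you've also set up TypeScript tooling (e.g. npm install -D typescript tsx @types/node). The URL and handler logic are just placeholders:

// src/routes.ts - your scraping instructions live here
import { createPuppeteerRouter, Dataset } from 'crawlee';

export const router = createPuppeteerRouter();

// The default handler runs for every request that has no special label
router.addDefaultHandler(async ({ page, request, log }) => {
    log.info(`Visiting ${request.url}`);
    await Dataset.pushData({ url: request.url, title: await page.title() });
});

// src/main.ts - where the magic begins
import { PuppeteerCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PuppeteerCrawler({ requestHandler: router });
await crawler.run(['https://example.com']); // placeholder start URL

Don't worry if the router feels like overkill right now. For a small scraper you can skip it entirely and write the handler inline, like in the example below.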
Let's Build Something Cool!
"Enough theory, show me something real!" - I got you! Let's create a simple scraper that collects product information. I'll explain each part as we go:
import { PuppeteerCrawler, Dataset } from 'crawlee';

// Here's our crawler - think of it as your personal web explorer
const crawler = new PuppeteerCrawler({
    maxConcurrency: 10, // We'll run 10 pages at once - not too greedy!
    async requestHandler({ page, request, log }) {
        // Let's tell our scraper what to do on each page
        log.info(`Taking a look at ${request.url}`);

        // Wait for products to appear - patience is key!
        await page.waitForSelector('.product-item');

        // Now, let's collect some data!
        const products = await page.$$eval('.product-item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('.product-title')?.textContent?.trim(),
                price: el.querySelector('.product-price')?.textContent?.trim(),
                description: el.querySelector('.product-description')?.textContent?.trim(),
                imageUrl: el.querySelector('img')?.getAttribute('src'),
            }))
        );

        // Save our treasure (data)!
        await Dataset.pushData(products);
        log.info('Successfully saved product data!');
    },
});

// Time to start our adventure!
await crawler.run(['https://example-shop.com/products']);
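If the products are spread across multiple pages, you don't have to collect every URL yourself. Here's a minimal sketch using the enqueueLinks helper from the crawling context; the .next-page selector is a placeholder you'd swap for the real pagination link on your target site:

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log, enqueueLinks }) {
        log.info(`Taking a look at ${request.url}`);
        // ...collect the product data exactly like above...

        // Queue the "next page" link so the crawler keeps exploring on its own
        await enqueueLinks({ selector: 'a.next-page' }); // hypothetical selector
    },
});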
"But What If I Need to Log Into a Website?" ๐
No worries! Here's how we can handle login pages:
const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        if (request.label === 'LOGIN') {
            log.info('Logging in...');
            await page.type('#username', 'your-username');
            await page.type('#password', 'your-password');
            await page.click('#login-button');
            await page.waitForNavigation();
            log.info('Successfully logged in!');
        }
        // Continue with our scraping adventure...
    },
});
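So where does that LOGIN label come from? You attach it yourself when you queue the request, and the handler branches on it. A tiny sketch (the URLs are placeholders):

// Labels travel with the request, so the handler knows what to do with each page
await crawler.run([
    { url: 'https://example-shop.com/login', label: 'LOGIN' }, // placeholder URL
    'https://example-shop.com/products',
]);

And a friendly word of caution: keep real credentials in environment variables, never hard-coded in your source files.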
Common Challenges and Solutions
"Help! The Content Keeps Loading as I Scroll!"
Got you covered! Here's how to handle infinite scroll:
const crawler = new PuppeteerCrawler({
    async requestHandler({ page, log }) {
        log.info('Scrolling to load more content...');

        await page.evaluate(async () => {
            await new Promise<void>((resolve) => {
                let scrolled = 0;
                const interval = setInterval(() => {
                    window.scrollBy(0, 100);
                    scrolled += 100;
                    if (scrolled >= document.body.scrollHeight) {
                        clearInterval(interval);
                        resolve();
                    }
                }, 100);
            });
        });

        log.info('Finished loading all content!');
    },
});
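By the way, if you'd rather not hand-roll the scrolling, newer Crawlee versions ship an infiniteScroll helper in puppeteerUtils. Double-check the docs for your version, but roughly it looks like this:

import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, log }) {
        log.info('Letting Crawlee do the scrolling...');
        // Keeps scrolling until the page stops growing or the timeout is reached
        await puppeteerUtils.infiniteScroll(page, { timeoutSecs: 30 });
    },
});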
Pro Tips from a Friend!
- Be Nice to Websites
const crawler = new PuppeteerCrawler({
    maxRequestsPerMinute: 60, // Don't overwhelm the website!
    respectRobotsTxtFile: true, // Follow the rules! (available in newer Crawlee versions)
});
- Handle Errors Like a Pro
const crawler = new PuppeteerCrawler({
    async failedRequestHandler({ request, log }, error) {
        log.error(`Oops! ${request.url} failed: ${error.message}`);

        // Save the error to its own dataset so we can fix it later
        const errorDataset = await Dataset.open('errors');
        await errorDataset.pushData({
            url: request.url,
            error: error.message,
            timestamp: new Date().toISOString(),
        });
    },
});
"What Can I Do with All This Data?" ๐
Great question! Let's save our findings:
// Save as JSON - perfect for APIs
await Dataset.exportToJSON('my-findings');

// Save as CSV - spreadsheet lovers rejoice! Open it in Excel for that professional touch
await Dataset.exportToCSV('my-findings');

Both calls write the exported file into the default key-value store under the key you give them, so you'll find the results right next to your other crawl data.
Fun Project Ideas!
Price Tracker
- Track your favorite products
- Get notified when prices drop
- Find the best deals! (see the little sketch after this list)

News Aggregator
- Collect articles you love
- Never miss important news
- Create your personal digest

Job Hunter
- Find your dream job
- Track new openings
- Compare salaries
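To give you a taste of the first idea, here's a rough sketch of a price tracker. Everything in it (the URL, the selector, the target price) is a placeholder you'd adapt to your own favorite shop:

import { PuppeteerCrawler } from 'crawlee';

const TARGET_PRICE = 50; // hypothetical threshold

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        await page.waitForSelector('.product-price'); // placeholder selector
        const text = await page.$eval('.product-price', (el) => el.textContent ?? '');
        const price = parseFloat(text.replace(/[^0-9.]/g, ''));

        if (price <= TARGET_PRICE) {
            log.info(`Price drop spotted on ${request.url}: ${price}`);
            // ...send yourself an email, a Slack message, whatever you like...
        }
    },
});

await crawler.run(['https://example-shop.com/my-favourite-product']); // placeholder URL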
Before You Go!
Remember these friendly tips:
- Always check if a website allows scraping
- Use delays between requests (be nice!)
- Save your data regularly
- Handle errors gracefully
Need Help?
Don't hesitate to:
- Experiment and have fun (even if you break your Visual Studio Code setup along the way)!
Let's Stay in Touch!
I hope this guide helps you start your web scraping journey! Remember, everyone was a beginner once, so don't feel discouraged if things don't work perfectly the first time. Keep experimenting, and most importantly, have fun with it!
Got questions? Need clarification? Feel stuck? That's totally normal! Drop your questions in the comments or reach out to the community. We're all here to help each other learn and grow!
Happy scraping!