Certain skills add a feather to your cap in the bustling world of web development and data analysis. One such skill is web scraping - the art of extracting useful information from the vast expanse of the internet. Web scraping with Puppeteer, a powerful Node.js library, can be a game-changer in your data extraction tasks. This article introduces you to the Puppeteer Python API and gives you hands-on experience with real-world web scraping examples. Whether you're a programmer, data analyst, IT student, or web developer with basic Python knowledge, this guide will equip you with the practical knowledge needed to use Puppeteer efficiently.
This is the second article in the series "Web scraping tools in practice". In the previous one, we covered the basics of Beautiful Soup.
Web scraping is a technique employed to extract large amounts of data from websites, which can then be saved to a local file or database on your computer for later use. There are many solutions on the market that allow you to scrape data from the web. One of them is Puppeteer.
Puppeteer is a Node.js library providing a high-level API to control headless Chrome or Chromium browsers. While it's native to JavaScript, it has also been ported to other languages, such as Python, giving Python developers a chance to leverage its powerful features. It can automate a variety of tasks, including generating screenshots and PDFs of pages, testing browser compatibility, automating form submission, UI testing, keyboard input, and much more. It's especially effective in web scraping because it simulates the actions of a real user, allowing you to scrape even the most JavaScript-heavy websites.
Puppeteer has many powerful features, making it one of the most versatile web scraping tools. Here are some of its core features:
- Headless browser control: a high-level API for driving Chrome or Chromium without a visible UI
- Screenshot and PDF generation for any page
- Automated form submission, UI testing, and keyboard input
- Browser compatibility testing
- Real-user simulation, which lets you scrape even JavaScript-heavy websites
Now that we've covered the basics, let's delve into some practical examples of web scraping with Puppeteer.
When it comes to web scraping in Python, Beautiful Soup is another popular tool in the arsenal of many developers. But how does it compare to Puppeteer? Let's examine both tools in depth to help you make an informed choice.
Beautiful Soup is a Python library designed for web scraping purposes to extract data from HTML and XML documents. It creates a parse tree from page source code that can be used to extract data in a hierarchical and readable manner.
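To make the comparison concrete, here is a minimal Beautiful Soup sketch; the HTML snippet and the `.post` class are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="post"><h2>First post</h2></div>
  <div class="post"><h2>Second post</h2></div>
</body></html>
"""

# Parse the markup into a navigable tree, then pull out the headings.
soup = BeautifulSoup(html, 'html.parser')
titles = [h2.get_text() for h2 in soup.select('.post h2')]
print(titles)  # ['First post', 'Second post']
```

Note that Beautiful Soup only parses markup you already have; fetching the page (for example with the requests library) is a separate step.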
Beautiful Soup Pros:
- Simple, beginner-friendly API for parsing HTML and XML
- Lightweight and fast for static pages, since no browser is involved
- Works with several underlying parsers (html.parser, lxml, html5lib)
- Mature library with extensive documentation and community support
Beautiful Soup Cons:
- Cannot execute JavaScript, so dynamically generated content is out of reach
- Does not fetch pages itself; it must be paired with an HTTP library such as requests
- No built-in support for user interactions such as clicks or form submission
As we've discussed, Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers.
Puppeteer Pros:
- Controls a real browser, so it can scrape JavaScript-heavy, dynamically rendered sites
- Can simulate user actions: clicks, typing, scrolling, and form submission
- Supports screenshots, PDF generation, and automated UI testing
- Can wait for network activity and page events, making scrapes more reliable
Puppeteer Cons:
- Heavier and slower than a plain HTML parser, since it runs a full browser
- Natively a Node.js library; the Python port (Pyppeteer) can lag behind the original
- Higher resource consumption, which matters when scraping at scale
The choice between Puppeteer and Beautiful Soup primarily depends on your project needs and familiarity with the languages involved.
Beautiful Soup is a straightforward, effective choice if you're dealing with static websites and prefer working in Python. However, if the website is JavaScript-heavy and dynamically generates content, Puppeteer might be a better tool for the job due to its ability to interact with JavaScript and mimic a real user's actions.
In conclusion, there's no "one size fits all" tool. Both Puppeteer and Beautiful Soup have their strengths and are excellent in their own right. Assess your needs, experiment with both, and pick the one that works best for your project.
Here's a step-by-step guide on how to use the Puppeteer Python API (Pyppeteer) to extract article titles, links, and dates from the blog posts on https://www.forloop.ai/blog.
First, make sure you have Python and pip (Python's package manager) installed in your environment. If not, you can download Python from python.org.
Next, install Pyppeteer using pip by running the following command in your terminal:
pip install pyppeteer
Create a new Python script and import the necessary libraries:
import asyncio
from pyppeteer import launch
asyncio is a library to write single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources. We will use it to manage our asynchronous tasks.
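If you haven't worked with asyncio before, here is a tiny self-contained sketch of the pattern Pyppeteer relies on: coroutines that `await` slow operations (here simulated with `asyncio.sleep`), run concurrently on a single thread. The function names are invented for illustration:

```python
import asyncio

async def fetch_fake(name: str, delay: float) -> str:
    # Simulate an I/O-bound task, e.g. waiting for a page to load.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list:
    # Run both coroutines concurrently; total time is roughly the longest delay.
    return await asyncio.gather(fetch_fake("a", 0.1), fetch_fake("b", 0.1))

results = asyncio.run(main())
print(results)  # ['a done', 'b done']
```

Every Pyppeteer call you'll see below (`launch()`, `goto()`, `evaluate()`) follows this same `await` pattern.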
Define an asynchronous function called scrape_data() that will take care of launching the browser, opening a new page, and navigating to the blog.
async def scrape_data():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.forloop.ai/blog')
Now, add the code to scrape the blog titles, links, and dates. The page.evaluate() method is used to execute JavaScript code in the context of the page.
    titles = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card .title')).map(elem => elem.innerText);
    }''')
    links = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card a')).map(elem => elem.href);
    }''')
    dates = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card .date')).map(elem => elem.innerText);
    }''')
This code extracts the text of the blog title and date elements, and the href attribute of each link.
Close the browser once the data has been extracted, and return the scraped data:
    await browser.close()
    return titles, links, dates
Finally, call the scrape_data() function using the asyncio event loop:
data = asyncio.get_event_loop().run_until_complete(scrape_data())
print(data)
This will print the scraped blog titles, links, and dates.
To run the script, execute the Python file in your terminal.
While Puppeteer is powerful and flexible, there are some best practices you should follow to ensure smooth operation, particularly in production environments:
- Use the waitUntil option in page.goto() to ensure that the page is fully loaded before proceeding. You can set waitUntil to 'networkidle0' or 'networkidle2' to wait until there are no more than 0 or 2 network connections for at least 500 ms.
- Wrap your Puppeteer code in a try/except block to handle errors gracefully.

Puppeteer is a powerful tool that allows you to explore the vast world of web scraping in Python. It offers many possibilities, from simple data extraction to more complex tasks like automated testing and form submission.
As you’ve seen in our hands-on example, extracting data from a webpage is quite straightforward with Puppeteer. It's all about understanding the webpage's structure and using Puppeteer’s API to navigate and extract the necessary data.
It's important to remember that while web scraping is a powerful tool, it's also your responsibility to use it ethically and respect the terms of service of the websites you are scraping.
By now, you should understand how to use Puppeteer for your web scraping tasks. It’s time to apply what you’ve learned. Start your journey into web scraping by creating your own project, experimenting with different websites, and seeing what amazing data you can uncover.
For more detailed information, refer to the official Puppeteer documentation. It's an excellent resource, with comprehensive explanations of Puppeteer's API and more complex examples.
Keep exploring, keep learning, and most importantly, keep scraping!