Hello there, fellow developers, data scientists, and tech enthusiasts! Today, we are diving into the exciting world of web scraping. Specifically, we'll focus on harnessing the power of the Playwright Python API for our data extraction ventures. Whether you're a beginner dipping your toes into the web scraping scene or an intermediate programmer looking for an efficient tool, you're at the right place. By the end of this comprehensive guide, you'll be adept at web scraping with Playwright, allowing you to effortlessly extract data from even the most dynamic and complex web pages. So, roll up your sleeves, rev up your IDE, and let's get started!
This is the 5th article in our series “Web scraping tools in practice”.
Web scraping is a process of extracting information from websites. It involves making HTTP requests to the URLs of specific websites, downloading the HTML of the pages, and then parsing that data to extract meaningful information. In the era of data-driven decisions, web scraping has become an essential tool in various fields, from data science and market research to journalism and SEO.
But why should we use Playwright for web scraping?
Playwright is an open-source library developed by Microsoft that provides a high-level API for automating browsers. It supports all modern rendering engines, Chromium, Firefox, and WebKit, and lets you write scripts in JavaScript/TypeScript, Python, Java, and C#. It also offers automatic waiting, which ensures that elements are ready before a command is executed. This makes it a powerful and reliable tool for web scraping, automating form submissions, UI testing, and more. Playwright's Python API provides an easy-to-use, Pythonic way of harnessing the power of that tool.
Web scraping has been around for a while, and there are several well-known tools available, including Selenium, Scrapy, Beautiful Soup, and Puppeteer. So how does Playwright fit in, and why should you consider using it?
Selenium is a popular tool for automating browsers. While it's robust and flexible, it has its drawbacks. It's slower than Playwright and Puppeteer because it doesn't communicate with browsers directly but rather through the WebDriver protocol.
Scrapy and Beautiful Soup are powerful Python libraries for web scraping. However, since neither executes JavaScript on its own, they struggle with modern, dynamic websites that rely heavily on JavaScript to load and display data.
Puppeteer is another popular tool for browser automation and web scraping, developed by the Chrome team. It's similar to Playwright, and the two share many features. However, Playwright supports more browsers, handles waiting more robustly, and offers an official, first-class Python API.
So, if you're dealing with JavaScript-heavy websites and want a fast, reliable, and flexible tool for web scraping, Playwright is a fantastic choice.
Before we can start using Playwright, we need to install it. The installation process is straightforward. Open your terminal or command prompt and run the following command:
pip install playwright
After installing the package, you have to run the following command to download the supported browsers:
playwright install
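If you only need a single engine, you can also install just that one browser and save some disk space, for example:

playwright install chromium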
Voila! You've successfully installed Playwright and its required browsers. You're now ready to embark on your web scraping journey.
Let's begin our web scraping venture. Our target for this exercise is https://www.forloop.ai/blog. We'll be navigating this website and extracting information from it. Here's how you can open a webpage.
from playwright.async_api import async_playwright
url = "https://www.forloop.ai/blog"
# Open webpage
pw = await async_playwright().start()
browser = await pw.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto(url)
# Close the browser and stop Playwright when you're done
await browser.close()
await pw.stop()
In the above script, we first import the async_playwright function from the playwright.async_api module. We then launch a new Chromium browser, create a new page, and navigate to the URL of the website. After completing these steps, we close the browser and stop Playwright.
Note that we're using Playwright's async API here. This is required in Jupyter Notebooks: the notebook itself already runs an asyncio event loop, and Playwright's sync API cannot run inside an existing loop.
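For completeness, outside a notebook (in a plain Python script) you could use the sync API instead. Here's a minimal sketch of the same steps:

from playwright.sync_api import sync_playwright

url = "https://www.forloop.ai/blog"

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(url)
    browser.close()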
While navigating websites, you might also want to interact with them, like clicking buttons or filling out forms. With Playwright, this is quite simple. Let's say you want to click a button. Here's how you can do it:
import asyncio
from playwright.async_api import async_playwright

url = "https://www.forloop.ai/blog"

# Open the 1st article.
async def run(playwright):
    browser = await playwright.chromium.launch(headless=False)  # Set headless to False to see the browser in action
    context = await browser.new_context()
    # Open a new page
    page = await context.new_page()
    # Navigate to your website
    await page.goto(url)
    # Use the CSS selector to find the link to the first article and click it
    # In this case, the CSS selector is 'a.cms-item-link', which selects <a> elements with the class 'cms-item-link'
    await page.click('a.cms-item-link')
    # Wait a bit for the article page to load
    await asyncio.sleep(5)  # waits for 5 seconds
    # Close the browser
    await browser.close()

# Run the function
async with async_playwright() as playwright:
    await run(playwright)
You just have to replace 'a.cms-item-link' with the CSS selector of the button or link you want to click.
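Filling out forms works the same way, using page.fill. Here's a minimal sketch (the selectors and the value are hypothetical; adjust them to the form you're targeting):

# Fill a text input identified by its name attribute, then submit the form
await page.fill('input[name="email"]', 'user@example.com')
await page.click('button[type="submit"]')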
Now, let's get into the meat of web scraping – extracting data. Our goal is to extract the title, tag, and publication date of all articles from https://www.forloop.ai/blog.
To do this, we need to find the CSS selectors for these elements. Once we have the CSS selectors, we can use the page.query_selector or page.query_selector_all method to locate the elements, and the element_handle.inner_text method to extract the text from them.
Here's an example of how you can do this:
import asyncio
from playwright.async_api import async_playwright

url = "https://www.forloop.ai/blog"

async def run(playwright):
    browser = await playwright.chromium.launch()
    context = await browser.new_context()
    # Open a new page
    page = await context.new_page()
    # Navigate to the forloop.ai blog page
    await page.goto(url)
    # Find all articles
    articles = await page.query_selector_all('div.article-item')
    # Extract titles, tags, and dates
    for item in articles:
        title_element = await item.query_selector('h4')
        title = await title_element.inner_text()
        tag_element = await item.query_selector('div.text-white')
        tag = await tag_element.inner_text()
        date_element = await item.query_selector('div.blog-post-date')
        date = await date_element.inner_text()
        print(f'Title: {title}\nTag: {tag}\nDate: {date}\n---')
    # Close the browser
    await browser.close()

# Run the function
async with async_playwright() as playwright:
    await run(playwright)
This script will print the title, tag, and publication date of all articles on the webpage.
Web scraping isn't always a walk in the park. You may encounter various challenges and errors, from not being able to locate elements and handling dynamic content, to dealing with websites that load content via AJAX or dealing with anti-bot measures. Let's explore how you can handle these challenges.
If an element isn't immediately available, Playwright's automatic waiting will come in handy. However, sometimes you might need to wait explicitly for an element to appear before interacting with it. You can use the page.wait_for_selector method for that:

await page.wait_for_selector('selector')
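The method also accepts optional parameters, such as the state you're waiting for and a timeout in milliseconds; the selector below is just an illustration:

# Wait up to 10 seconds for the element to become visible
await page.wait_for_selector('div.article-item', state='visible', timeout=10000)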
Many websites nowadays load content dynamically using AJAX. If you try to scrape such websites, you might find that the content you're trying to extract isn't available immediately. Playwright provides several ways to handle this, like waiting for a network idle:
await page.goto(url, wait_until="networkidle")
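The same idea applies after an interaction that triggers AJAX requests: you can pause until the network goes quiet again with page.wait_for_load_state. For example, reusing the article link from earlier:

# After clicking, wait until no network requests have been made for 500 ms
await page.click('a.cms-item-link')
await page.wait_for_load_state('networkidle')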
Some websites use anti-bot measures to prevent scraping. These measures might include checking whether JavaScript and cookies are enabled, or whether user interactions like mouse movements and keystrokes are present. Playwright, being a browser automation tool, enables JavaScript and cookies by default. It can also emulate user interactions, thus allowing you to bypass most anti-bot measures.
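For example, you can make a session look more like an ordinary user's by setting a realistic user agent on the context and simulating mouse movement. Here's a minimal sketch, assuming a browser launched as in the earlier snippets (the user-agent string is only an example):

# Create a context with a realistic user agent
context = await browser.new_context(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)
page = await context.new_page()
await page.goto(url)
# Simulate human-like mouse movements across the page
await page.mouse.move(100, 100)
await page.mouse.move(250, 300)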
When dealing with websites that open pop-ups or new tabs, you can use Playwright's context.on('page') event to handle new pages:

def handle_new_page(page):
    print(f'New page URL: {page.url}')

context.on('page', handle_new_page)
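If you need to wait for the pop-up and then interact with it, Playwright also provides context.expect_page, which captures the page opened by an action (the selector here is hypothetical):

# Click a link that opens a new tab and capture the resulting page
async with context.expect_page() as new_page_info:
    await page.click('a[target="_blank"]')
new_page = await new_page_info.value
await new_page.wait_for_load_state()
print(new_page.url)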
Despite your best efforts, errors might still occur. Maybe the selectors have changed, the website layout has been updated, or your IP has been blocked. When this happens, it's important to handle errors gracefully. Wrap your scraping code in a try-except block, log the errors, and handle them appropriately.
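As a minimal sketch, you can catch Playwright's TimeoutError (raised when a navigation or selector wait times out) separately from other failures:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    await page.goto(url)
    await page.wait_for_selector('div.article-item', timeout=10000)
except PlaywrightTimeoutError:
    print('Timed out waiting for the page or element; the layout may have changed')
except Exception as e:
    print(f'Unexpected error: {e}')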
Screenshots and video recordings are essential for debugging. If your script doesn't work as expected, you can capture a screenshot or record a video to observe what's happening in the browser. Here's how to capture a screenshot:
import asyncio
from playwright.async_api import async_playwright
from IPython.display import Image

url = "https://www.forloop.ai/blog"

async def run(playwright):
    browser = await playwright.chromium.launch()
    page = await browser.new_page()
    await page.goto(url)
    await page.wait_for_load_state('networkidle')
    await page.screenshot(path='screenshot.png')
    await browser.close()

async with async_playwright() as playwright:
    await run(playwright)

# Display the screenshot inline in the notebook
Image("screenshot.png")
Here you have the resulting screenshot, displayed directly in the Jupyter Notebook.
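Recording a video works similarly: you enable recording on the browser context, and the video file is finalized when the context closes. Here's a minimal sketch, assuming a browser launched as in the previous example:

# Record a video of everything that happens in this context
context = await browser.new_context(record_video_dir='videos/')
page = await context.new_page()
await page.goto(url)
# The video is saved to 'videos/' when the context is closed
await context.close()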
In modern web applications, Shadow DOM is a popular technique for encapsulating components and keeping the global namespace clean. Interacting with such elements is tricky in many automation tools, but Playwright handles Shadow DOM seamlessly: its CSS selectors pierce open shadow roots by default, so you can target elements inside a shadow tree as if they were part of the regular DOM. Here's a simplified example (custom-player stands in for a hypothetical custom element):

# No special API is needed: CSS selectors pierce open shadow roots,
# so this clicks a <button> inside the shadow DOM of <custom-player>
await page.click('custom-player button')
Websites often load content dynamically. Playwright's explicit wait functions can pause your script until an element is visible (wait_for_selector) or hidden (wait_for_selector with state='hidden'), helping ensure that your interactions or data extractions don't fail because an element hasn't loaded yet.
await page.goto('https://www.forloop.ai')
await page.wait_for_selector('div.article-item')
Geolocation emulation is crucial when dealing with websites that show different content based on the user's location. Playwright allows you to emulate different geographical locations by setting geolocation coordinates. Here's how you can set geolocation to New York:
import asyncio
from playwright.async_api import async_playwright

async def run(playwright):
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context(
        geolocation={"longitude": -74.1, "latitude": 40.7}, permissions=['geolocation']
    )
    page = await context.new_page()
    await page.goto('https://browserleaks.com/geo')
    await asyncio.sleep(5)  # Wait for a few seconds to check the results on the page
    print(f"Current URL: {page.url}")
    # await browser.close()  # Uncomment once you've inspected the page

async with async_playwright() as playwright:
    await run(playwright)
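Beyond coordinates, a browser context can also emulate the user's locale and timezone, which many sites consult when localizing content:

# Emulate a US-English user in the New York timezone
context = await browser.new_context(
    locale='en-US',
    timezone_id='America/New_York',
)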
When testing or scraping multiple websites concurrently, isolation is critical. A change in one scenario shouldn't affect the others. Playwright lets you create isolated browser contexts to ensure that scenarios run independently. Here's how you can create a new context:
import asyncio
from playwright.async_api import async_playwright

url = "https://www.forloop.ai/blog"

async def run(playwright):
    browser = await playwright.chromium.launch()
    # Create a new context
    context = await browser.new_context()
    # Open a new page within this context
    page = await context.new_page()
    await page.goto(url)
    await browser.close()

async with async_playwright() as playwright:
    await run(playwright)
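To see the isolation in action, you can run two contexts side by side within a single browser; each one gets its own cookies, cache, and storage, much like two separate incognito profiles:

# Two contexts in one browser: cookies, sessions, and storage are fully separate
context_a = await browser.new_context()
context_b = await browser.new_context()
page_a = await context_a.new_page()
page_b = await context_b.new_page()
# Logging in on page_a will not affect page_b, and vice versa
await page_a.goto(url)
await page_b.goto(url)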
These features, among many others, make Playwright an incredibly powerful tool for automating, testing, and scraping websites. Be sure to dive deep into Playwright's extensive documentation to explore its full potential.
In the rapidly advancing tech landscape, web scraping has become an essential skill. Understanding and using tools like the Playwright Python API can enhance your data gathering and contribute to your professional development. The best way to learn is by doing, so go ahead, experiment with the API on different websites, and share your experiences with us!
Don't forget to join our Slack channel for more insights and tutorials from the tech world. Keep scraping, and keep learning!