In today's fast-paced business world, data is more important than ever. Companies are constantly looking for ways to collect data and analyze it to make better decisions, improve their products and services, and gain a competitive edge. However, collecting that data can often be time-consuming and tedious. But... what if I told you there is a way to collect data from any website in minutes?
In the following article, we will guide you step by step through collecting data from a Wikipedia page using Simplescraper.
As you might guess from the tool, in this tutorial we will perform web scraping. If you are unfamiliar with the term, we highly recommend checking out A web scraping quick guide with a hands-on tutorial.
The main goal of this tutorial is to show a straightforward and quick way to collect data from a website. It is also important to mention that the final result will not be a production-ready solution. For that, you often need higher security and privacy standards, rotating proxies, schedulers, and more... We’ll cover those in a different article.
We decided to scrape Wikipedia because, as its robots.txt puts it, “Friendly, low-speed bots are welcome viewing article pages.”
To collect the data, we will use Simplescraper, which was recently a Product of the Day on Product Hunt. It is an easy-to-use tool that enables users to efficiently extract data from any website and turn it into organized information. The user-friendly Chrome extension provided by Simplescraper makes it easy to select and extract content from any website and make it instantly accessible as an API endpoint, ready to be downloaded in CSV or JSON format or sent directly to your preferred web applications. The Simplescraper dashboard lets you manage all your scraping recipes with ease.
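To give a feel for what that API access looks like, here is a minimal Python sketch of pulling a recipe’s results over HTTP. The endpoint URL below is a hypothetical placeholder; the real one appears in your Simplescraper dashboard once a recipe has been created.

```python
import requests

# Hypothetical placeholder: copy the real endpoint URL for your recipe
# from the Simplescraper dashboard once it exists.
API_ENDPOINT = "https://example.com/your-simplescraper-recipe-endpoint"

response = requests.get(API_ENDPOINT, timeout=30)
response.raise_for_status()

# The endpoint can return the scraped values as JSON.
data = response.json()
print(data)
```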
In this tutorial, we will extract the title, text, and image from the Wikipedia page about the hot dog. Let’s dive in!
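Before we do, for readers curious about what such a tool automates under the hood, here is a rough Python sketch of how the same three values could be pulled by hand with requests and BeautifulSoup. The CSS selectors are assumptions based on Wikipedia’s standard article layout; nothing like this is needed when using Simplescraper.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Hot_dog"

# A descriptive User-Agent is good practice for friendly, low-speed bots.
headers = {"User-Agent": "simple-tutorial-bot/0.1 (contact: you@example.com)"}
html = requests.get(URL, headers=headers, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# These selectors assume Wikipedia's standard article layout.
title = soup.select_one("h1#firstHeading").get_text(strip=True)
first_paragraph = soup.select_one("div.mw-parser-output > p:not(.mw-empty-elt)")
text = first_paragraph.get_text(strip=True) if first_paragraph else ""
infobox_image = soup.select_one("table.infobox img")
image_url = "https:" + infobox_image["src"] if infobox_image else ""

print(title)
print(text[:200])
print(image_url)
```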
First, install the Chrome extension, which can be downloaded here. The extension allows you to visually select the parts of the website that you would like to extract. Simplescraper also lets you create a scraper via its dashboard, but the Chrome extension is a much faster and more intuitive approach.
Once the Chrome extension is installed, go to the en.wikipedia.org/wiki/Hot_dog page. At the top, you should see Simplescraper’s navbar, which appears every time you run the extension.
Now we will collect the data. To extract the first value, the page title, click the + button in the top left corner, name the value (e.g. “Title”), and click the title on the webpage. Simplescraper should automatically detect the area and save it. See the short GIF below.
Perform the same action for the remaining values: text and image. The ready-made scraper should look as follows:
When the scraper is ready, click View Results in the top right corner, and that’s it!
Simplescraper will collect the data, and you will be redirected to a dashboard where you can see your values. You can download them as data.csv or data.json.
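If you prefer to keep working with the results in code, a few lines of Python are enough to load either export; the file names below simply match the data.csv and data.json downloads mentioned above.

```python
import csv
import json

# Load the JSON export downloaded from the dashboard.
with open("data.json", encoding="utf-8") as f:
    records = json.load(f)
print(records)

# Or load the CSV export instead.
with open("data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)
```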
In this tutorial, we showed a quick and simple way to collect data from a Wikipedia page in just a few minutes using Simplescraper. The main goal was to give you a sense of what web scraping is and how it can be done quickly to obtain the values that interest you most. Of course, the scraper could be improved significantly: we could add more data to extract or automate the process for several pages rather than just one. If you are interested in web scraping, we highly encourage you to join our weekly webinars and Slack channel to learn more.