Introduction to Web-Scraping with Python: Extracting Data from a Page

Last Updated on September 15, 2019
Python, Tutorials

In this serie of articles/tutorials, I want to introduce you to the world of web scraping. By the end of it, my goal is for you to have the skills and know-how to be able to go on most of the websites around the web and pull out their data for your own research purposes.

Web-Scraping is a lot of fun and the possibilities it opens should not be underestimated. Research is the future of Content Marketing and as such, having the capability of performing your own research and analysis will definitely open up your perspectives and opportunities.

This serie of tutorials does not come without prerequisite though. You should know the basics of Python before throwing yourself here. If you don’t know where to start, I highly recommend following the free beginner lessons offered by Dataquest.io . It’s free, it worked for me and it will give you the basic understanding that you need to go further here.

If you already know a bit of Python but are not sure whether this will be enough, you only need to know how to create lists, dictionaries, and loops. Knowledge of functions is definitely a bonus. Furthermore, a basic grasp of how HTML works is a plus but not mandatory.

I suggest that you tag along with me, open a Jupyter Notebook and replicate the steps as we’re moving forward together. I believe one can only learn by doing. If you don’t have a Python environment at the ready, you can read my articles that explains how to code in Python in your browser. Google Colaboratory is one of the tools mentioned so you should be able to follow my lead from within your browser :).

If you don’t want to stay longer on this page, I also compiled this tutorial in a Jupyter Notebook that you can access through this Google Colaboratory link.

No more talking. Ready ? Let’s do it.

Import Necessary Libraries

A library is the equivalent of an add-on for Google Sheets but for Python. A library adds functionality to your Python code. There is a library for pretty much everything. Today we want to learn how to use one of the libraries used for web-scraping that is called BeautifulSoup.

For web-scraping, you will need the following libraries :

Requests (http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)
BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Every library is different and has their own variables and functions. Knowing Python is knowing how to create lists, dictionaries and how to learn libraries. When you do not remember how to use a library, don’t hesitate to Google it. The documentation of this kind of popular libraries is super detailed and will cover every possible use case you can think of.

The code below shows you how to import a library to your current environment. In other words, importing a library is teaching your Python code how to speak a new language. So by importing BeautifulSoup and Requests, we now taught our Python code how to scrape websites.

If your environment does not have BeautifulSoup installed, open up your terminal, type this up and run your code again:

pip install beautifulsoup4

Getting the content of a page

Now that your Python code knows how to web-scrape, we want to get ready. One of the common problems faced by scrapers is that websites will recognize them as robots. To counter that, we use “headers”. “Headers” are a way for the robot to disguise itself and look like a real computer.

The code below shows how to create a header.

The second thing we want to input is the url we want to extract data from. For the purpose of this tutorial, we will look into extracting the cost of a macbook Pro from Apple’s website.

Extracting Data from a page

Now that we have our url and that Python speaks the right language, it is time to call the libraries and get the code of this webpage.

The first thing we want to do to achieve that is doing a request. Meaning calling the URL we gave to see if it exists or not.
Then we want to extract the content out of it
Finally, we want to give all of this content to BeautifulSoup, which is the library that is going to give us the power to look into the content of the page and do stuff with it

The variable “soup” now contains all of the code of the webpage. If you’re familiar with HTML/CSS, the soup is basically the source code of the page.

So now that we have the content. We can basically interact with it and look for specific things.

The way content works in HTML is that everything is in tags like “<p>Content</p>“. Everything between these tags can be extracted.

Usually, tags like this also contains class or ids.

To look for content, we need to inspiect the webpage we’re looking to extract the data of.

Right click on thelement of the page you want to extract
Click “Inspect”

This will bring a panel in your window with the code that contains the element you inspected. In this case we want to extract the “title” of the computer. The code looks like this :

Now, we know that we’re looking to extract the text that is inside the “h3” tag and has a class named “as-macbundle-modelvariationtitle”

The code to find elements is as follow :

need to find a way to show results inside the gist ?

As you can notice, the variable “find_text” now contains the exact text that we were inspecting.

Using the same method, we can now look for the price of this laptop.

By right-click -> Inspect on the price in the wbepage, we find out that the price of the laptop is in the content of this tag

<span class="as-price-currentprice" data-autom="price-MNYF2">
        <span>
            $1,299.00
        </span>
</span>

There is two ways to find this specific price :

Also, all of this code does not actually have to be in so many lines. You can get the same content in one line like this :

Summary

This wraps up this introduction on how to get data from a webpage.

In the next article, we will learn how to apply this in conjunction with loops in order to go through a big amount of data and automating its retrieving.

The goal there will be to scrape an entire table from a Numbeo City Page… If you’re ever interested in building a dataset about the cost of living across cities, this is exactly what we’re going to do…

If you liked this tutorial and want to see more of it, I’ll link the second part here as soon as it’s published but you can definitely subscribe to my newsletter where I’ll be sure to mention the release of the subsequent articles as soon as they come out.

If you have any question, feel free to leave a reply below and I’ll get back to you as soon as I can to sort this out1

As always, I hope you learned something new today, and thank you for your time.

Join the top marketers who read our newsletter each week.

Yaniss Illoul

You might also like these posts:

One Response

Pingback: How to Create Twitter Lists with Python (+Scraping Example)

Python Tutorials

Google Sheets Tutorials

Mautic

Email Marketing

Social Media

Introduction to Web-Scraping with Python: Extracting Data from a Page

Table of Contents

Import Necessary Libraries

Getting the content of a page

Extracting Data from a page

Summary

Yaniss Illoul

You might also like these posts:

How to Export any Business’ Google Reviews for Free

How to Add Published Date and other Custom Fields to your Google Analytics’ Data

How to get a list of all pages from a website with Python

One Response

Leave a Reply Cancel reply