My job involves a lot of scraping. So much, in fact, that I created a piece of starter code that I use in every one of my scraping projects. It helps me remember how to use the most common functions and makes sure I have imported everything I need in my Python notebook. I realized this could be useful for others, so I finally decided to share it with all of you.
Now, I realize this article would be a very short one if I only shared my starter kit, so I decided to add a few tips, in separate sections, on how to query specific things on a page. I find myself doing the same things over and over for each scraping project, and because I'm sure I am not alone here, I thought it would be interesting to share my methods with all of you. So feel free to browse through the rest of the post once you've retrieved my Starter Code :). Thank you!
You obviously need BeautifulSoup to be installed, along with requests, which the starter code relies on:
pip install beautifulsoup4 requests
One of my favorite libraries is called “slugify”. It allows you to “slugify” a string into something computers can comprehend: if you have a string like “Book #1: Chapter 6”, slugifying it would output “book-1-chapter-6”. It’s an extremely useful library that I use very often, for instance to create filenames based on titles, but it can be used for many other things.
To install this library, simply paste the following command into your terminal or command shell:
pip install python-slugify
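A quick sanity check once it's installed (the output below matches the “Book #1: Chapter 6” example above):

```python
from slugify import slugify

# "#" and ":" are dropped, spaces become hyphens, everything is lowercased
slug = slugify("Book #1: Chapter 6")
print(slug)  # book-1-chapter-6
```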
Scraping in Python: Starter Code
Without further ado, here is the code I am using at the beginning of each one of my projects:
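The embedded snippet from the original post isn't reproduced here, so the following is a minimal reconstruction laid out to roughly match the line-by-line walkthrough below. The exact imports on the first few lines, the User-Agent string, and the placeholder URL and tag query are my assumptions:

```python
import requests                      # HTTP requests
from bs4 import BeautifulSoup        # HTML parsing
import re                            # handy for cleaning up scraped text
import time                          # polite delays between requests

from slugify import slugify
# slugify("Book #1: Chapter 6") -> "book-1-chapter-6"

# User-Agent strings for any browser/device can be found in online catalogues
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

url = "https://example.com"  # the first URL you are interested in

# the classic requests-to-soup combo
response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

tags = soup.find_all("div", attrs={"class": "title"})  # tags carrying a specific attribute
```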
Let’s debrief every set of lines here.
Lines 1-4 are the standard library imports that a lot of people use for their scraping projects.
Lines 6-7 are about the extra slugify library I mentioned in the Requirements section. Line 7 is a specific example of how to use the library.
Lines 9-10 are about the User-Agent I use when scraping. Line 9 is a link to a website that catalogues all of the User-Agents one can use to simulate any device when querying content through the requests library.
Lines 14-16 are your classic requests-to-soup combo. You just have to specify the first URL you are interested in on Line 11.
Finally, Line 18 is an example of how to get a list of specific HTML tags that contain a specific attribute.
Basically, all of these lines are things whose exact formatting I tend to forget, so by using this starter code, I’m able to get a new scraping project up and running in no time.
How to Get the Href Value of Every Link with Python and BeautifulSoup
One thing I find myself doing very often is getting a list of all of the internal and external links on a webpage. To do that, this is the code I am using:
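The embedded snippet isn't reproduced here, but the idea boils down to `find_all` with `href=True`. Here is a self-contained sketch; the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """<a href="/about">About</a>
<a href="https://example.com/blog">Blog</a>
<span>no link here</span>"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['/about', 'https://example.com/blog']
```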
Boom, super simple.
How to Extract Domain from URL with Python
This function is extremely simple but often goes hand in hand with the code above. Extracting the href value of every link often means your list comes back with just the path portion of each URL, without its domain attached, so I often use this custom-built function to extract the domain name from a URL and later prepend it to the extracted href values:
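The original function isn't shown here; below is a sketch of one way to do it with the standard library's `urlparse` (the author's actual implementation may differ):

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Return the scheme + domain of a URL, e.g. 'https://example.com'."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"

# prepend the domain to a relative href collected earlier
domain = extract_domain("https://example.com/blog/post-1")
full_url = domain + "/about"
print(full_url)  # https://example.com/about
```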
If you are seeing this right after only a few “tips”, it means I published this article very recently and decided I would rather publish it with a small amount of information than not publish it at all. I am working on cleaning up a few of my most-used scraping functions in order to share them here.
Anyway, I hope you found this article useful and that this starter code will help you kickstart your future scraping projects more efficiently.
If you have any questions or comments, feel free to reach out by email or by using the comments section down below.
If you’re interested in more Python tutorials, you can read some of my tutorials down below or go to the Python category of my blog here.
With all of that said, thanks again for tuning in. Hope I’ll see you soon.