While creating a variety of tools for technical SEO purposes, I found that one of the common functions I had to develop to get the job done was one that could find all of the indexed URLs of a website with Python.
After a bit of research, I found what I believe to be the most effective way of doing so and I wanted to share this method with you all.
This article is going to be a short one, but I hope you will find it helpful. So without further ado, let's get into it.
In my search for this method, I found a third-party Python library that will do the job for us. The name of this module is Ultimate Sitemap Parser, and its project page can be found here: https://github.com/mediacloud/ultimate-sitemap-parser
To install it, simply run `pip install ultimate-sitemap-parser` and the library will then be available in your Python environment.
Method to Get All Webpages from a Website with Python
The code is quite simple, really. Here are the functions I came up with using this library in order to perform this job:
The first function uses the library, taking a domain URL as an argument, to find the sitemap associated with the provided domain and crawl it for all indexed webpages.
You end up with a list containing all of the webpages the library could find.
The only downside of this function is that it does not remove duplicate pages by itself.
To counter that downside, I created a second function that takes the previously created list and purges it of all duplicates, so that we end up with a clean list of every unique URL the website is hosting.
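A deduplication helper along these lines (again, a sketch rather than the original code) can be written in one line of standard Python:

```python
def deduplicate_urls(urls):
    """Return the list with duplicates removed, keeping the original order.

    dict.fromkeys preserves insertion order (Python 3.7+), so this is a
    tidy order-preserving alternative to list(set(urls)).
    """
    return list(dict.fromkeys(urls))
```

For example, `deduplicate_urls(["/a", "/b", "/a"])` returns `["/a", "/b"]`.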
It’s as easy as that. I did say it would be a short one. Now, if you came across this article without a use case in mind, keep reading, as I am going to share the few projects where this particular method found its way in.
The latest project I used this method for was a web application that finds any broken external link on a web property. You can find the tutorial and a link to this app over there.
One of the steps needed to realise this project was the ability to retrieve all of the webpages of a website, which I achieved with the aforementioned method.
Another project of mine where this method came in clutch was my “External Link Finder”, which you can find at this address. It works by scouring an entire website’s pages to retrieve a list of all external links present on it.
I will keep updating this post whenever I use this method, so that you can keep finding ideas on how to use it in your own projects.
If you have any ideas or suggestions on how to use this function, do not hesitate to share them with us in the comments section down below; that would be extremely appreciated!
And of course, if you have any questions or comments, you can also use the comments section at the bottom of this article or reach out directly to me by email and I will do my best to get back to you.