How to get a list of all pages from a website with Python

While building various tools for technical SEO, I found that one function I needed again and again was one that could find all of the indexed URLs of a website with Python.

After a bit of research, I found what I believe to be the most effective way of doing so and I wanted to share this method with you all.

This article is going to be a short one but I hope you will find it helpful. So without further ado, let’s get into it.

Requirements

In my search for this method, I found a third-party Python library that will do the job for us. It is called Ultimate Sitemap Parser, and its project page can be found here: https://github.com/mediacloud/ultimate-sitemap-parser

To install it, run “pip install ultimate-sitemap-parser” in your terminal; the library will then be available in your Python environment.

Method to Get All Webpages from a Website with Python

The code is quite simple: two short functions built on this library do the job.

The first function passes a domain URL to the library, which locates the sitemaps associated with that domain and crawls them to collect every webpage they list.

You end up with a list containing all of the webpages the library could find.
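As a minimal sketch of that first function (not necessarily the exact code used in the original tools), the library is imported under the name usp, and its sitemap_tree_for_homepage function returns a tree whose all_pages() method yields page objects with a url attribute. The function names get_pages_from_sitemap and urls_from_tree, and the example domain, are my own choices for illustration.

```python
def urls_from_tree(tree):
    """Return the URL of every page in a parsed sitemap tree, in order."""
    # all_pages() walks the root sitemap and every nested sub-sitemap.
    return [page.url for page in tree.all_pages()]


def get_pages_from_sitemap(homepage_url):
    """Find the sitemaps of a site and list every page URL they contain."""
    # Imported inside the function so that urls_from_tree() stays usable
    # even where ultimate-sitemap-parser is not installed.
    from usp.tree import sitemap_tree_for_homepage

    # This call performs real HTTP requests (robots.txt, sitemap files).
    tree = sitemap_tree_for_homepage(homepage_url)
    return urls_from_tree(tree)


# Usage, with a hypothetical domain:
#   pages = get_pages_from_sitemap("https://www.example.com/")
#   print(len(pages), "URLs found (duplicates included)")
```

Splitting the work in two keeps the network call in one place, so the list-building part can be tried out on any object that exposes all_pages().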

The only downside of this function is that it does not remove duplicate pages on its own.

To work around that, I created a second function that takes the previously created list and removes all of its duplicates, so that we end up with a clean list of every unique URL the website is hosting.
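One possible way to write that second function, assuming the input is a plain list of URL strings, relies on the fact that Python dictionaries keep insertion order: building a dictionary from the list drops repeated URLs while preserving the order in which they were first seen. The name get_unique_pages is hypothetical.

```python
def get_unique_pages(raw_pages):
    """Return the URLs from raw_pages without duplicates, in original order."""
    # dict.fromkeys() keeps only the first occurrence of each key, and
    # dictionaries preserve insertion order, so the order is unchanged.
    return list(dict.fromkeys(raw_pages))


raw = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/",
]
print(get_unique_pages(raw))
```

This treats URLs as exact strings, so two addresses that differ only by a trailing slash or by letter case would still count as different pages.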

It’s as easy as that. I did say it would be a short one. Now, if you came across this article without a use case for this method, keep reading: I am going to share a few of the projects where I have used it.

Use Cases

The latest project I used this method for was a web application that finds any broken external link on a web property. You can find the tutorial and a link to this app over there.

One of the steps needed to realise this project was the ability to retrieve all of the webpages of a website, which I achieved with the aforementioned method.

Another project of mine where this method proved very useful was my “External Link Finder”, which you can find at this address. This project works by scouring every page of a website in order to retrieve a list of all the external links present on it.

I will keep updating this post whenever I use this method, so that you can keep finding ideas on how to use it in your own projects.
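To give an idea of what the link-finding step can look like for a single page, here is a rough sketch using only the standard library. It is not the code behind the External Link Finder; the names LinkCollector and find_external_links are made up for this example, and it assumes the HTML of the page has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_external_links(html, page_url):
    """Return the absolute http(s) links in html that point to another host."""
    site_host = urlparse(page_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)

    external = []
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page itself.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        # Skip mailto:, tel:, javascript: and links to the same host.
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != site_host:
            external.append(absolute)
    return external


sample = '<a href="/about">About</a> <a href="https://other.example.org/x">Out</a>'
print(find_external_links(sample, "https://www.example.com/blog/"))
```

The host comparison is deliberately simple: a subdomain such as blog.example.com would be reported as external to www.example.com, which may or may not be what you want.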

Conclusion

If you have any ideas or suggestions on how to use this function, do not hesitate to share them with us in the comments section down below. That would be extremely appreciated!

And of course, if you have any questions or comments, you can also use the comments section at the bottom of this article or reach out directly to me by email and I will do my best to get back to you.

Yaniss Illoul
