One of my favorite types of quick side project is one that involves web scraping with Python. Obviously, the Internet houses a ton of useful data, and you may want to fetch lots of that data to use within your own programs. Python has a few excellent tools which can be combined to create a powerful, easy to use system for quickly harvesting this data from webpages. I've used Python's web scraping tools for fun projects like downloading music lyrics, and then using them to make a Twitter bot that replies to you with Taylor Swift lyrics. But these tools could also be useful for serious projects, like aggregating course enrollment or evaluation data from your college. So how would you go about simple web scraping in Python? Let's dive in!
Web scraping involves writing code that uses HTTP or HTTPS (like a web browser) to request pages from websites. The Python library requests is great for that, and as a bonus, it is widely recognized as having one of the best APIs out there! Requests will make an HTTP(S) request and get you the page you asked for. Once you have the page, instead of displaying it to a user like a browser would, you want to search through it for information. This isn’t a completely automatic process. Typically, you need to go to the website you want to scrape, and look at the HTML source to figure out where your information is in the page.
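Before we get to a full example, here's roughly what a single request looks like (the URL is just a stand-in for whatever page you want to scrape):
>>> import requests
>>> page = requests.get('https://example.com')
### A status code of 200 means the server returned the page successfully.
>>> page.status_code
200
### page.text holds the raw HTML of the page, as one big string.
>>> text = page.text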
Since webpage content is primarily written in HTML, getting your program to find the data you want could be difficult. Thankfully, the Python library lxml makes things a lot easier. Not only will it parse HTML, but it also supports a powerful query language called XPath, which allows you to craft a "query" that matches particular HTML tags in a webpage. You could think of it like regular expressions, but for HTML (because regular expressions won't work on HTML... but that's a discussion for another, lengthy blog post).
With these tools, the fundamental tasks of web scraping are very easy. For example, this code will get you the biosketch on the front page of my website (as of this writing):
>>> import requests, lxml.html
>>> page = requests.get('http://brennan.io')
>>> html = lxml.html.fromstring(page.text)
>>> content_divs = html.xpath('//main/div[@class="content"]')
>>> print(content_divs[0].text_content().strip())
And here you see my biosketch ...
The search query (//main/div[@class="content"]) is a bit complex, but if you research HTML and XPath a little bit, and then look at my website's source, it'll all make sense. All it's asking is to find a <div class="content"> tag within a <main> tag. If you were to look at the HTML of my website's home page, you could find exactly that, and inside of it, my biosketch. If you don't believe me, try it! Click "Home" at the top left of this page, then right click anywhere and select "View page source". Scroll down to about line 109 (again, at the time of this writing) and you'll see just what I'm talking about.
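If you want a feel for what else XPath can do, here are a couple of other queries you could try against the same html object from the example above (what they return depends, of course, on the page you fetched):
### Every link URL on the page:
>>> html.xpath('//a/@href')
### The text inside every <h1> tag:
>>> html.xpath('//h1/text()')
### Every <p> element nested inside a div with class "content":
>>> html.xpath('//div[@class="content"]//p')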
Unfortunately, not every website is this easy to scrape. There are many reasons that a website may not be easy to scrape, but one common one is that it requires you to log in. For instance, colleges provide school directories and course evaluation data to students, faculty and staff. But they’d rather not provide that to the rest of the world. So, they make you log in to view the data.
Logging into a website is a process that can vary wildly between sites. But here's an overview of some of the common things that happen:
- Your browser requests the login page, and the site responds with an HTML form asking for your username and password (and sometimes more).
- You fill out the form and submit it, which sends your credentials back to the server, usually in a POST request.
- If the credentials check out, the server responds with a cookie identifying your new login session, and your browser sends that cookie along with every request you make from then on.
There are two very common parts of this process that you should understand:
- Session cookies. After you log in, the server hands you a cookie that it uses to recognize you on every later request. If you lose that cookie, you're effectively logged out again.
- CSRF tokens. Many login forms also include a hidden, randomly generated value that has to be sent back along with your username and password. If it's missing or wrong, the server will reject the login.
So, with these concepts in mind, we can begin to understand how we might successfully log into a website in our own code. We’d need to (a) make sure that we keep track of all the cookies a website gives us, because one of them is probably our session cookie. And (b), we’d also need to request the form beforehand so we can grab a CSRF token out of the form before we submit our login credentials.
In order to accomplish goal (a), we can use a nifty feature of requests called a Session. If you create a session and do all of your HTTP requests using that session, requests will save all your cookies and use them in subsequent requests. The only code change you'll notice is that we'll create a session object like so: s = requests.session(). Then, we use that for making HTTP requests instead of the normal library. That is, you'll see s.get() or s.post() instead of requests.get() or requests.post(). Tada!
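To see the cookie tracking in action, try something like this (I'm using Yelp here only because it shows up again below; any site that sets cookies will do):
>>> import requests
>>> s = requests.session()
>>> r = s.get('https://www.yelp.com')
### s.cookies now holds every cookie the site handed back, and those cookies
### are sent automatically with every later s.get() or s.post().
>>> print(s.cookies)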
In order to accomplish goal (b), we will have to do some manual inspection of the login form we're using. Since CSRF tokens are almost always <input type="hidden"> tags within a form, we'll probably be able to write a simple XPath to match any hidden tags in the login form, and then we'll be sure to put them in our login message.
To put all this information together, let’s try to log into a real-world site using requests. Since I’ll be working there this summer, why not try Yelp?
>>> import requests, lxml.html
>>> s = requests.session()
### Here, we're getting the login page and then grabbing hidden form
### fields. We're probably also getting several session cookies too.
>>> login = s.get('https://www.yelp.com/login')
>>> login_html = lxml.html.fromstring(login.text)
>>> hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
>>> form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
>>> print(form)
{'csrftok': '9e34ca7e492a0dda743369433e78ccf10c1e68bbb1f453cbb80ce6eaeeebe928',
'context': ''}
### Now that we have the hidden form fields, let's add in our
### username and password.
>>> form['email'] = 'someone@example.com'  # Enter an email here. Not mine.
>>> form['password'] = 'your_password'     # I'm definitely not telling you my password.
>>> response = s.post('https://www.yelp.com/login', data=form)
### How can we tell that we logged in? Well, these worked for me:
>>> response.url
'https://www.yelp.com/cleveland'
>>> 'Stephen' in response.text
True
So, everything probably looks familiar right up until the line with XPath. All that the XPath does is search for any <input type="hidden"> elements within a <form>. The next line turns them into a dictionary mapping names to values in the form. At the end of that little section of code, you can see that we have captured two form elements that are hidden and set by the server already. Great!
All that's left is to set the username and password, and then submit the login form. Once that's done, you can experiment a bit with the response Yelp gives you to confirm that you are logged in. For me, Yelp takes me directly to a Cleveland page, and it has my name in the response as well. From then on, since the session s contains all the session cookies associated with your account, you can use that to make requests as a logged in user.
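For example, any further request made through s is authenticated. The URL below is only an illustration; substitute whatever logged-in page you actually care about:
### This URL is just a placeholder for a page that requires a login.
>>> profile = s.get('https://www.yelp.com/user_details')
### Look for something only a logged-in user would see, like your name.
>>> print('Stephen' in profile.text)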
Of course, not all websites directly ask you for a username and password.
Sometimes, they delegate authentication to somebody else. The best example I
have of this is at my school, CWRU. Everyone here has a single “account” that
they can log into all school-related websites with. The thing is, your school
account is too valuable to let every single school website handle your username
and password. So the way the system is designed, there is only one login page.
In the case of my school, it's https://login.case.edu/cas/login. When a site wants you to log in with your school account, it does the following:
1. The site sends you over to https://login.case.edu/cas/login. It adds on a parameter named service that indicates where the login server should send you when you're all done.
2. You log in there, and the login server sends you back to the service specified in step 1. It adds on the "service ticket" to the URL.
3. The site checks that service ticket with login.case.edu. If the ticket is valid, the site will set its session cookie, marking you logged in!
This system has a name: Central Authentication Service, or CAS. At first glance, it sounds like all hope is lost for using Python to log into sites that use CAS. But, that's not true! I've tried to do exactly that, and the code I came up with isn't too much different from the last example. The main differences are:
- Your username and password go to login.case.edu, not to the site you're actually trying to use.
- You tell the login server which site you want to get into using the service GET parameter.
So, without further ado, here's a function that will return a requests Session where you are logged into a website via CWRU's CAS!
import requests
import lxml.html
def cas_login(service, username, password):
    # GET parameters - URL we'd like to log into.
    params = {'service': service}
    LOGIN_URL = 'https://login.case.edu/cas/login'

    # Start session and get login form.
    session = requests.session()
    login = session.get(LOGIN_URL, params=params)

    # Get the hidden elements and put them in our form.
    login_html = lxml.html.fromstring(login.text)
    hidden_elements = login_html.xpath('//form//input[@type="hidden"]')
    form = {x.attrib['name']: x.attrib['value'] for x in hidden_elements}

    # "Fill out" the form.
    form['username'] = username
    form['password'] = password

    # Finally, login and return the session.
    session.post(LOGIN_URL, data=form, params=params)
    return session
The exciting thing is that, since Requests follows redirects by default, the final session.post() call makes it all the way to the final URL, which will set your session cookie. So the returned Session object will have all the necessary cookies to access the site at service as a logged in user. Say that you wanted to log into the CWRU course evaluation site (say, to scrape some evaluation reports). All you'd have to do is call cas_login('https://webapps.case.edu/courseevals/', 'username', 'password'). Then, you could use the returned session to make all further requests.
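Here's a rough sketch of what that could look like (the XPath at the end is only a placeholder; you'd need to inspect the evaluation pages yourself to write a real query):
session = cas_login('https://webapps.case.edu/courseevals/', 'username', 'password')

# Every request made through this session is now authenticated via CAS.
page = session.get('https://webapps.case.edu/courseevals/')
html = lxml.html.fromstring(page.text)

# Placeholder query - swap in an XPath matching whatever data you're after.
rows = html.xpath('//table//tr')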
Hopefully this article helps you understand web scraping, and also how logins work on the web. If you decide to try this sort of thing on other sites, keep in mind that none of this is exact - a strategy that works on one site may not work on another, usually because that site uses an additional security technique to protect its login form. If you get comfortable with your browser's developer tools, you can inspect these login forms carefully and figure out how to adapt your code. In general, just keep in mind that this process is almost always manual. With the basic concepts under your belt, you should be able to dive right into logging into different sites.