rmed.blog

Using Selenium to parse a timetable

27 Sep 2016

The University of Greenwich portal offers a link to view your own timetable for any week or term. Useful as that is, it would be even more useful if the portal let you download the full timetable as an iCalendar file to import into calendar applications such as Google Calendar (other apps are available). I like being able to quickly access my calendar from my phone, but I am too lazy to copy the timetable to a piece of paper (which is what a friend suggested), so I decided to try writing a script that automatically converts the timetable to a .ics file.

Step 1: Login

I've been wanting to use Scrapy for some time now, and I thought it might be a good way to log in, access the timetable and parse the data. After several tries filling the form with the scrapy.FormRequest class (see the docs), I just couldn't manage to log in to the portal.

Turns out the login form is kind of weird. Normally, you would expect something like this:

<form action="/login" method="post">
    <input type="text" placeholder="username" name="user">
    <input type="password" placeholder="password" name="pass">
    <button type="submit">Login</button>
</form>

But the code of the login page looks kind of like this:

<form name="userid">
    <input id="user" type="text" placeholder="username" name="user">
</form>

<form name="login" action="/login" method="post" onsubmit="login();">
    <input type="password" placeholder="password" name="pass">

    <input type="hidden" name="user" value="">
    <input type="hidden" name="uuid" value="">
</form>

<a href="javascript:login();">
    <div id="loginbutton"></div>
</a>

Soooo... why are there two forms again?

After further inspecting the source of the page, I found that the login() JavaScript function fills the user field in the second form, generates a uuid from the timestamp and some delta, and then submits the form.

Several tries later, I decided that I couldn't do what I wanted with Scrapy (or at least I didn't know how), so I turned to Selenium (unofficial readable docs at http://selenium-python.readthedocs.io/).

Selenium interacts with websites by means of a WebDriver. In my case, I chose to download a static build of PhantomJS rather than using a browser or the remote server. Selecting elements with Selenium is actually very easy:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/path/to/phantomjs')

# Get login page
driver.get('URL_TO_LOGIN')

# Select form elements
user_input = driver.find_element_by_id('user')
pass_input = driver.find_element_by_name('pass')
login_btn = driver.find_element_by_id('loginbutton')

# Introduce credentials
user_input.clear()
pass_input.clear()

user_input.send_keys('USERNAME')
pass_input.send_keys('PASSWORD')

# Login
login_btn.click()

With that, the login is complete, the cookies are set and the portal is correctly loaded.

Step 2: Accessing the timetable

The link to the timetable is actually a JavaScript function that opens a new window. Rather than clicking that link, let's get the driver to load the page directly, since it should already have all the cookies required to access that section. Of course, I had to use Firefox Developer Tools to get the URL that is loaded once the link is clicked.

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/path/to/phantomjs')

# Login
# ...

# Access timetable
driver.get('URL_TO_TIMETABLE')

Once the page is loaded, it is necessary to select which dates to show and how to show them. In the script, I get the timetable per term after asking the user for the term to fetch. Furthermore, I chose to use the list timetable format, because it is easier to parse:

from selenium.webdriver.support.ui import Select

# input() returns a string, so convert it before doing arithmetic
term = int(input('Term to fetch [1-3]: '))

# Select term
select_term = Select(driver.find_element_by_name('lbWeeks'))
select_term.select_by_index(term - 1)

# Select days
select_days = Select(driver.find_element_by_name('lbDays'))
select_days.select_by_visible_text('All Week')

# Select list timetable
select_list = Select(driver.find_element_by_name('dlType'))
select_list.select_by_visible_text('List Timetable')

# Get timetable
view_btn = driver.find_element_by_id('bGetTimetable')
view_btn.click()

Clicking the button should return the specified timetable.

Step 3: Parsing the timetable

Being in HTML, the timetable itself is very easy to parse. It consists of several tables (one for each day of the week, starting on Monday) with the following structure:

+----------+----------------+---------+-------+-------+--------+------+-------+
| Activity | Description    | Type    | Start | End   | Weeks  | Room | Staff |
+==========+================+=========+=======+=======+========+======+=======+
| CODE     | Cyber Security | Lecture | 13:00 | 15:00 | Term 1 | ROOM | STAFF |
| CODE     | Cyber Security | Lab     | 16:00 | 17:00 | 2-4    | ROOM | STAFF |
+----------+----------------+---------+-------+-------+--------+------+-------+

Here is a small description of each column:

  • Activity: internal code of the activity, not really relevant for the calendar
  • Description: human-readable name of the activity
  • Type: type of activity (lecture, lab, tutorial, etc.)
  • Start: starting time
  • End: ending time
  • Weeks: term weeks in which the activity takes place
  • Room: room in which the activity takes place
  • Staff: people in charge of the activity

Most of these fields are copied as is, but the Weeks one has to be converted to actual dates. To do this, I have defined a list that contains the date on which each week starts; the actual day is then obtained by adding a delta that depends on the day of the week.

Furthermore, for the cells in which Term X appears, it is necessary to know in which weeks the term starts and ends, so a simple tuple (start, end) of week numbers for each term should be enough.

import datetime
import pytz

# Day deltas (Mon, Tue, Wed, Thu, Fri, Sat, Sun)
DAYS = [
    datetime.timedelta(days=0),
    datetime.timedelta(days=1),
    datetime.timedelta(days=2),
    datetime.timedelta(days=3),
    datetime.timedelta(days=4),
    datetime.timedelta(days=5),
    datetime.timedelta(days=6)
]

# Week starting dates
TZONE = pytz.timezone('Europe/London')
START_DATE = TZONE.localize(datetime.datetime(year=2016, month=9, day=19))
WEEKS = ['dummy']

for w in range(52):
    WEEKS.append(START_DATE + datetime.timedelta(days=(7 * w)))

# Term bounds
TERMS = [(2, 13), (18, 29), (34, 52)]
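
As a rough sketch of the conversion described above (not the exact code from the repository), the Weeks cell could be expanded into week numbers and then into dates along these lines. The function names parse_weeks and week_to_date are made up for this example, the constants are repeated with naive datetimes to keep it self-contained, and the comma-separated form is an assumption, since the sample table only shows 'Term 1' and '2-4':

```python
import datetime

# Assumed values mirroring the constants above (naive datetimes for brevity)
START_DATE = datetime.datetime(2016, 9, 19)  # Monday of week 1
TERMS = [(2, 13), (18, 29), (34, 52)]        # (first week, last week) per term


def parse_weeks(cell, terms=TERMS):
    """Expand a Weeks cell such as 'Term 1' or '2-4' into week numbers."""
    cell = cell.strip()
    if cell.lower().startswith('term'):
        first, last = terms[int(cell.split()[1]) - 1]
        return list(range(first, last + 1))

    weeks = []
    # Assumption: several ranges may be separated by commas, e.g. '2-4, 6'
    for part in cell.split(','):
        first, _, last = part.strip().partition('-')
        weeks.extend(range(int(first), int(last or first) + 1))
    return weeks


def week_to_date(week, day_index, start_date=START_DATE):
    """Return the date of a weekday (0 = Monday) in a 1-based week number."""
    return start_date + datetime.timedelta(days=7 * (week - 1) + day_index)
```

With these values, parse_weeks('2-4') gives [2, 3, 4], and week_to_date(2, 2) is Wednesday 28 September 2016.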

The actual parsing of the table rows goes like this:

tables = driver.find_elements_by_class_name('spreadsheet')

for tindex, table in enumerate(tables):
    rows = table.find_elements_by_tag_name('tr')

    for rindex, row in enumerate(rows):
        # ...
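
What happens inside the loop is left out here, but a minimal sketch of one way to handle a row might look like this. It assumes the text of the eight cells has already been read into a list (with Selenium, probably something like [td.text for td in row.find_elements_by_tag_name('td')], skipping the header row), and the helper names row_to_record and at_time are invented for the example:

```python
import datetime

# Column order taken from the table shown earlier
COLUMNS = ('activity', 'description', 'type', 'start', 'end',
           'weeks', 'room', 'staff')


def row_to_record(cells):
    """Map the text of the eight cells in a row to column names."""
    return dict(zip(COLUMNS, (c.strip() for c in cells)))


def at_time(day, hhmm):
    """Combine a date with an 'HH:MM' string from the Start or End column."""
    hours, minutes = (int(p) for p in hhmm.split(':'))
    return day.replace(hour=hours, minute=minutes)


# Sample row from the table above, standing in for the scraped cell texts
cells = ['CODE', 'Cyber Security', 'Lab', '16:00', '17:00',
         '2-4', 'ROOM', 'STAFF']
record = row_to_record(cells)

monday = datetime.datetime(2016, 9, 26)  # example day: start of week 2
begin = at_time(monday, record['start'])
end = at_time(monday, record['end'])
```

The resulting begin and end values are what would later be handed to the calendar event, once for every week listed in the Weeks column.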

With the information extracted, creating a .ics file is very easy using the ics module:

from ics import Calendar, Event

calendar = Calendar()
new_event = Event(
    name='Cyber Security',
    description='TYPE; STAFF',
    location='ROOM',
    begin='Start datetime',
    end='End datetime'
)

calendar.events.append(new_event)

# Export calendar
with open('out.ics', 'w') as ics:
    ics.writelines(calendar)
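
The ics module takes care of the file format, but to give an idea of what ends up in the .ics file, here is a hand-rolled sketch of the same event using only the standard library. It is not how the module works internally: the function names and the PRODID value are made up, times are written as floating local times, and long lines are not folded as the iCalendar specification asks for.

```python
import datetime
import uuid


def escape_text(value):
    """Escape backslash, semicolon, comma and newline for iCalendar text."""
    for char in ('\\', ';', ','):
        value = value.replace(char, '\\' + char)
    return value.replace('\n', '\\n')


def vevent(name, description, location, begin, end):
    """Return the content lines of one VEVENT using floating local times."""
    fmt = '%Y%m%dT%H%M%S'
    stamp = datetime.datetime.now(datetime.timezone.utc)
    return [
        'BEGIN:VEVENT',
        'UID:' + str(uuid.uuid4()),
        'DTSTAMP:' + stamp.strftime(fmt) + 'Z',
        'DTSTART:' + begin.strftime(fmt),
        'DTEND:' + end.strftime(fmt),
        'SUMMARY:' + escape_text(name),
        'DESCRIPTION:' + escape_text(description),
        'LOCATION:' + escape_text(location),
        'END:VEVENT',
    ]


def vcalendar(events):
    """Wrap VEVENT lines in a VCALENDAR and join them with CRLF."""
    lines = ['BEGIN:VCALENDAR', 'VERSION:2.0',
             'PRODID:-//example//timetable//EN']  # hypothetical identifier
    for event in events:
        lines.extend(event)
    lines.append('END:VCALENDAR')
    return '\r\n'.join(lines) + '\r\n'
```

If the result is written to disk by hand, opening the file with newline='' keeps the CRLF line endings intact.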

The remaining step is to import the .ics file into the calendar tool of your choice, where the events should hopefully show up correctly.

Conclusions

There must be better ways to do what I wanted to do, surely, but Selenium is very useful for scenarios such as automatic testing of web applications in different browser environments.

The complete source code can be found at https://github.com/rmed/gre-timetable (note that it is somewhat different to the examples in this post).

Tags: selenium university parse greenwich spider python script web scrapy

rmed

My name is Rafael Medina, and I like code.
