Using Selenium to parse a timetable
27 Sep 2016
The University of Greenwich portal offers a link to view your own timetable for any week or term. Useful as that is, it would be even more useful if it offered a way of downloading the full timetable as an iCalendar file to import into calendar applications such as Google Calendar (other apps are available). I like being able to quickly access my calendar from my phone, but I am too lazy to copy the timetable to a piece of paper (which is what a friend suggested), so I wanted to try implementing a script to automatically convert the timetable to an iCalendar (.ics) file.
Step 1: Login
I've been wanting to use Scrapy for some time now, and I thought it might be a good approach to log in, access the timetable and parse the data. After several tries filling the form with the scrapy.FormRequest class (see the docs), I just couldn't manage to log in to the portal.
Turns out the login form is kind of weird. Normally, you would expect something like this:
    <form action="/login" method="post">
      <input type="text" placeholder="username" name="user">
      <input type="password" placeholder="password" name="pass">
      <button type="submit">Login</button>
    </form>
But the code of the login page looks kind of like this:
Soooo... why are there two forms again?
After further inspecting the source of the page, I found that a script on the page handles the user field in the second form, generates a uuid using the timestamp and some delta, and then submits the form.
Several tries later, I decided that I couldn't do what I wanted with Scrapy (or at least I didn't know how), so I turned to Selenium (unofficial readable docs at http://selenium-python.readthedocs.io/).
Selenium interacts with websites by means of a WebDriver. In my case, I chose to download a static build of PhantomJS rather than using a browser or the remote server. It is actually very easy to select elements with Selenium:
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path='/path/to/phantomjs')

    # Get login page
    driver.get('URL_TO_LOGIN')

    # Select form elements
    user_input = driver.find_element_by_id('user')
    pass_input = driver.find_element_by_name('pass')
    login_btn = driver.find_element_by_id('loginbutton')

    # Introduce credentials
    user_input.clear()
    pass_input.clear()
    user_input.send_keys('USERNAME')
    pass_input.send_keys('PASSWORD')

    # Login
    login_btn.click()
With that, the login is complete, the cookies are set, and the portal is correctly loaded.
Step 2: Accessing the timetable
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path='/path/to/phantomjs')

    # Login
    # ...

    # Access timetable
    driver.get('URL_TO_TIMETABLE')
Once the page is loaded, it is necessary to select which dates to show and how to show them. In the script, I get the timetable per term after asking the user for the term to fetch. Furthermore, I chose to use the list timetable format, because it is easier to parse:
    from selenium.webdriver.support.ui import Select

    # input() returns a string, so convert it before using it as an index
    term = int(input('Term to fetch [1-3]: '))

    # Select term
    select_term = Select(driver.find_element_by_name('lbWeeks'))
    select_term.select_by_index(term - 1)

    # Select days
    select_days = Select(driver.find_element_by_name('lbDays'))
    select_days.select_by_visible_text('All Week')

    # Select list timetable
    select_list = Select(driver.find_element_by_name('dlType'))
    select_list.select_by_visible_text('List Timetable')

    # Get timetable
    view_btn = driver.find_element_by_id('bGetTimetable')
    view_btn.click()
Clicking the button should return the specified timetable.
Step 3: Parsing the timetable
Being in HTML, the timetable itself is very easy to parse. It consists of several tables (one for each day of the week, starting on Monday) with the following structure:
+----------+----------------+---------+-------+-------+--------+------+-------+
| Activity | Description    | Type    | Start | End   | Weeks  | Room | Staff |
+==========+================+=========+=======+=======+========+======+=======+
| CODE     | Cyber Security | Lecture | 13:00 | 15:00 | Term 1 | ROOM | STAFF |
| CODE     | Cyber Security | Lab     | 16:00 | 17:00 | 2-4    | ROOM | STAFF |
+----------+----------------+---------+-------+-------+--------+------+-------+
Here is a small description of each column:
Activity: internal code of the activity, not really relevant for the calendar
Description: human-readable name of the activity
Type: type of activity (e.g. lecture, lab, tutorial, etc.)
Start: starting time
End: ending time
Weeks: term weeks in which the activity takes place
Room: room in which the lecture takes place
Staff: people in charge of the lecture
Most of these fields are copied as is, but the Weeks one has to be converted to an actual date. In order to do this, I have defined a list that contains the dates on which each week starts, and then the actual day is obtained by adding a delta depending on the day of the week.
Furthermore, for the cells in which Term X appears, it is necessary to know when the term starts, so a simple tuple (start, end) for each term should be enough.
    import datetime
    import pytz

    # Day deltas (M, T, W, Th, F, S, Su)
    DAYS = [
        datetime.timedelta(days=0),
        datetime.timedelta(days=1),
        datetime.timedelta(days=2),
        datetime.timedelta(days=3),
        datetime.timedelta(days=4),
        datetime.timedelta(days=5),
        datetime.timedelta(days=6)
    ]

    # Week starting dates
    TZONE = pytz.timezone('Europe/London')
    START_DATE = TZONE.localize(datetime.datetime(year=2016, month=9, day=19))

    # Index 0 is a placeholder so that WEEKS[n] is the start of week n
    WEEKS = ['dummy']
    for w in range(52):
        WEEKS.append(START_DATE + datetime.timedelta(days=(7 * w)))

    # Term bounds (first week, last week)
    TERMS = [(2, 13), (18, 29), (34, 52)]
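As a rough illustration of the Weeks conversion described above, here is a minimal sketch that expands a Weeks cell into week numbers and then into calendar dates. The helper names `parse_weeks` and `activity_dates` are hypothetical, the comma-separated form (`2-4, 6`) is an assumption about what the portal might show, and plain dates are used instead of timezone-aware datetimes to keep the example self-contained.

```python
import datetime
import re

# Assumed calendar, mirroring the constants above (plain dates for simplicity)
START_DATE = datetime.date(2016, 9, 19)   # Monday of week 1
TERMS = [(2, 13), (18, 29), (34, 52)]     # (first week, last week) per term


def parse_weeks(cell):
    """Expand a Weeks cell such as 'Term 1', '2-4' or '2-4, 6' into week numbers."""
    term = re.fullmatch(r'Term (\d+)', cell.strip())
    if term:
        first, last = TERMS[int(term.group(1)) - 1]
        return list(range(first, last + 1))

    weeks = []
    for part in cell.split(','):
        # A part is either a single week ('6') or an inclusive range ('2-4')
        bounds = [int(n) for n in part.strip().split('-')]
        weeks.extend(range(bounds[0], bounds[-1] + 1))
    return weeks


def activity_dates(cell, day_index):
    """Return the dates of an activity; day_index is 0 for Monday up to 6 for Sunday."""
    return [
        START_DATE + datetime.timedelta(weeks=week - 1, days=day_index)
        for week in parse_weeks(cell)
    ]
```

Calling `activity_dates('2-4', 1)` would give the Tuesdays of weeks 2 to 4. The day index can come straight from the position of the table on the page, since the tables start on Monday.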
The actual parsing of the table rows goes like this:
    tables = driver.find_elements_by_class_name('spreadsheet')

    for tindex, table in enumerate(tables):
        rows = table.find_elements_by_tag_name('tr')

        for rindex, row in enumerate(rows):
            # ...
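The body of the inner loop is elided above. As one possible sketch (not the code from the repository), the cell texts of each row could be mapped onto the column names. The function `rows_to_records` and the extra `Day` key are hypothetical, and the rows are assumed to have already been reduced to lists of strings, for instance from the `.text` of each `td` element.

```python
HEADERS = ['Activity', 'Description', 'Type', 'Start', 'End', 'Weeks', 'Room', 'Staff']


def rows_to_records(rows, day_index):
    """Turn the rows of one day's table into dicts, skipping the header row.

    `rows` is a list of rows, each one a list of cell strings.
    """
    records = []
    for cells in rows:
        cells = [cell.strip() for cell in cells]
        if len(cells) != len(HEADERS) or cells == HEADERS:
            continue  # header row, empty row, or unexpected layout
        record = dict(zip(HEADERS, cells))
        record['Day'] = day_index  # 0 = Monday, following the table order
        records.append(record)
    return records
```

Keeping the parsing separate from the Selenium calls makes it possible to check it against hand-written rows without loading the portal.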
With the information extracted, creating a .ics file is very easy using the ics module:
    from ics import Calendar, Event

    calendar = Calendar()

    new_event = Event(
        name='Cyber Security',
        description='TYPE; STAFF',
        location='ROOM',
        begin='Start datetime',
        end='End datetime'
    )

    # Add the event that was just created
    calendar.events.append(new_event)

    # Export calendar
    with open('out.ics', 'w') as ics:
        ics.writelines(calendar)
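The 'Start datetime' and 'End datetime' placeholders above have to be built from the date of the activity and the Start and End cells. A minimal sketch of that step follows, assuming a row is available as a dict keyed by the column names and that the date has already been worked out from the Weeks cell. The helper `event_fields` is hypothetical, and the datetimes it returns are naive, so a timezone would still need to be attached (for example with the pytz zone defined earlier).

```python
import datetime


def event_fields(record, date):
    """Build keyword arguments for an event from one parsed row and its date."""
    def at(clock):
        # Combine the calendar date with an 'HH:MM' cell value
        time = datetime.datetime.strptime(clock, '%H:%M').time()
        return datetime.datetime.combine(date, time)

    return {
        'name': record['Description'],
        'description': '{}; {}'.format(record['Type'], record['Staff']),
        'location': record['Room'],
        'begin': at(record['Start']),
        'end': at(record['End']),
    }
```

The resulting dict uses the same keyword names as the Event call above, so it could be passed along as `Event(**event_fields(record, date))`.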
The remaining step is to import the .ics file into the calendar tool of preference, and hopefully it should show the events correctly.
There must be better ways to do what I wanted to do, surely, but Selenium is very useful for scenarios such as automatic testing of web applications in different browser environments.
The complete source code can be found at https://github.com/rmed/gre-timetable (note that it is somewhat different to the examples in this post).
My name is Rafael Medina, and I like code.