Pipelines

JamesyJi edited this page Nov 22, 2021 · 1 revision


Overview

To make our pipelines easy to run, the backend/ directory contains a runprocessors.py file which runs several commands in one go depending on the arguments you give it.
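As a rough illustration, the script might dispatch stages from a single command-line argument. This is a hypothetical sketch: the stage names, helper functions, and argument shape here are invented, not the real runprocessors.py interface.

```python
# Hypothetical sketch of how a script like runprocessors.py could
# dispatch pipeline stages from one command-line argument.
import argparse


def scrape():
    return "scraped"


def format_data():
    return "formatted"


def process():
    return "processed"


# Map an argument name to the ordered list of steps it should run.
PIPELINES = {
    "scrape": [scrape],
    "format": [format_data],
    "process": [process],
    "all": [scrape, format_data, process],
}


def run(pipeline_name):
    """Run every step registered under the given pipeline name, in order."""
    return [step() for step in PIPELINES[pipeline_name]]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run data pipelines")
    parser.add_argument("pipeline", choices=PIPELINES)
    args = parser.parse_args()
    print(run(args.pipeline))
```

Registering steps in a dict like this keeps adding a new pipeline to a one-line change.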

Scrape-Format-Process

The UNSW Handbook can be split into 3 main sections: Programs, Specialisations and Courses. For each section, we run a pipeline as follows:

  1. Scrape (get the raw data from the UNSW handbook API)
  2. Format (strip the unnecessary raw data so it's readable by a human)
  3. Process (transform the data into our desired format)
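The three steps above can be sketched on a single course record. The field names and values below are invented for the example (the real handbook API returns a much larger payload), but the shape of the flow is the same: each stage's output feeds the next.

```python
# Illustrative scrape -> format -> process flow for one course record.
# All field names here are stand-ins, not the real handbook API schema.

def scrape():
    # Stand-in for a request to the handbook API: raw, noisy data.
    return {
        "code": "COMP1511",
        "title": "Programming Fundamentals",
        "UOC": "6",
        "internal_id": "xyz-123",   # junk we don't need
        "html_blob": "<p>...</p>",  # junk we don't need
    }


def format_raw(raw):
    # Keep only the fields a human would want to read.
    return {k: raw[k] for k in ("code", "title", "UOC")}


def process(formatted):
    # Coerce values into the types our algorithms expect.
    formatted["UOC"] = int(formatted["UOC"])
    return formatted


course = process(format_raw(scrape()))
print(course)  # {'code': 'COMP1511', 'title': 'Programming Fundamentals', 'UOC': 6}
```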

Conditions

Each course has enrolment conditions, which must be handled with extra care. You can read more about this in the other wiki pages. The pipeline is as follows:

  1. Process (Run regexes on the conditions to get them into a nice format for our algorithm)
  2. Manual fixes (apply manual fixes to the conditions that couldn't be fixed with regexes)
  3. Tokenise (Convert the conditions into a format our algorithm can parse)
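The process and tokenise steps can be sketched on one enrolment condition. The regexes and token format below are simplified stand-ins, not the project's actual rules:

```python
import re

# Simplified sketch of the process -> tokenise idea for one condition.

def process(condition):
    # Normalise wording so the tokeniser sees a consistent format.
    condition = condition.upper()
    condition = re.sub(r"PREREQUISITE:\s*", "", condition)
    condition = re.sub(r"\s+", " ", condition).strip()
    return condition


def tokenise(condition):
    # Split into course codes, logical operators, and parentheses.
    return re.findall(r"[A-Z]{4}\d{4}|AND|OR|\(|\)", condition)


raw = "Prerequisite: COMP1511  and (COMP2521 or COMP1531)"
tokens = tokenise(process(raw))
print(tokens)  # ['COMP1511', 'AND', '(', 'COMP2521', 'OR', 'COMP1531', ')']
```

A token list like this is what a parser for the requirements algorithm can walk recursively.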

Algorithms

The algorithms also do some additional caching to gain quick access to certain information. Some of the steps involved are:

  • Exclusions (Map exclusions to each other for quick access)
  • Warnings (Some courses have notes in the handbook such as "Consult with Faculty before enrolling")
  • Mapping (Maps useful information to each other such as courses to their school code)
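The exclusions step above can be sketched as building a bidirectional lookup: if course A lists B as an exclusion, looking up B should also return A. The course data below is hypothetical and the function name is invented for the example:

```python
from collections import defaultdict

# Hypothetical scraped data: each course -> the exclusions it declares.
declared = {
    "COMP1511": ["COMP1911"],
    "MATH1131": ["MATH1141", "MATH1151"],
}


def build_exclusion_cache(declared):
    """Map exclusions to each other so lookups work in both directions."""
    cache = defaultdict(set)
    for course, exclusions in declared.items():
        for other in exclusions:
            # Record the exclusion both ways for quick access.
            cache[course].add(other)
            cache[other].add(course)
    return {course: sorted(excl) for course, excl in cache.items()}


cache = build_exclusion_cache(declared)
print(cache["COMP1911"])  # ['COMP1511'] even though COMP1911 declared nothing
```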