Design

Terminology

Grading Stage

A grading stage consists of a Docker image and zero or more environment variables, some of which may be templates. For example, you can have an image named autograder with a templated environment variable STUDENT_ID that specifies which student to run the autograder for. A grading stage fails if the container returns a non-zero exit code or times out.
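As a rough illustration of how a templated environment variable could be filled in per student, here is a minimal sketch. The `GradingStage` class, the `$netid` placeholder syntax, and the helper names are assumptions made for this example; the actual template syntax is not specified in this section.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class GradingStage:
    """Hypothetical in-memory model of a stage: an image plus env templates."""
    image: str
    env: dict = field(default_factory=dict)


def render_env(stage, values):
    """Return the stage's environment with $placeholders filled from `values`."""
    return {key: Template(raw).substitute(values) for key, raw in stage.env.items()}


def stage_failed(exit_code, timed_out):
    """A stage fails on a non-zero exit code or on a timeout."""
    return timed_out or exit_code != 0


# One stage definition, rendered for one particular student.
stage = GradingStage(image="autograder", env={"STUDENT_ID": "$netid", "MODE": "grade"})
rendered = render_env(stage, {"netid": "student1"})
```

Rendering the same stage with a different `netid` yields the environment for another student, which is how one stage definition can serve many grading jobs.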

Grading Pipeline

A grading pipeline consists of one or more grading stages, which are run sequentially in the order specified in the array.

Grading Job

A grading job is an instance of a grading pipeline. For example, you can specify one grading pipeline to grade a student; there will then be as many instances of this pipeline as there are students you specify. A grading job fails if any of its grading stages fails. On a stage failure the job is aborted (subsequent stages are not executed) and marked as failed.
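The abort-on-first-failure behaviour can be sketched as follows. This is only an illustration: `run_job` and the `run_stage` callable (which stands in for starting a container and waiting for it) are hypothetical names, not part of Broadway.

```python
def run_job(stages, run_stage):
    """Run stages in order and stop at the first failure.

    `run_stage` is a hypothetical callable returning (exit_code, timed_out).
    Returns (succeeded, stages_executed).
    """
    executed = []
    for stage in stages:
        executed.append(stage)
        exit_code, timed_out = run_stage(stage)
        if timed_out or exit_code != 0:
            # Abort: the remaining stages are never started.
            return False, executed
    return True, executed


# Canned outcomes standing in for real container runs.
outcomes = {"compile": (0, False), "test": (1, False), "report": (0, False)}
ok, executed = run_job(["compile", "test", "report"], outcomes.__getitem__)
```

Here the job is marked as failed after the second stage, and the third stage is never executed.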

Grading Run

A grading run usually represents the grading of a single assignment. It consists of the following:

Pre-processing job: This is optional. This is executed before any of the other jobs are scheduled for this run. If this fails, the grading run is marked as failed and none of the other jobs are scheduled. If not defined, student jobs are scheduled right away.
Student grading jobs: These are many grading jobs that are executed simultaneously and distributed across the grading machines. Ideally, each student job grades one student. The grading run is unaffected by the failure of any of these jobs (since a student's code might break things, cause the container to time out, etc.).
Post-processing job: This is optional. This is executed after all student jobs have finished. The grading run is marked as failed if this job fails.

Worker Nodes

The Broadway graders (worker nodes) communicate with the API by making requests. We keep track of live worker nodes using a heartbeat protocol. The worker nodes are currently not capable of listening for requests, so the API cannot notify them when events occur (for example, when a grading job is ready to run). As a result, all communication is initiated by the grader. The worker nodes are responsible for:

Polling the API for grading jobs and running them. Workers check for available grading jobs by making periodic requests to the API. If there is a job on the queue, it is sent back immediately as the response to the request.
Reporting results. Once a grader has received a grading job, it sends the results back to the API after it finishes executing the job. When the API receives the results, it marks the job as succeeded or failed and schedules the next batch of jobs accordingly. For example, if the API receives the results for the pre-processing job and the job was successful, it then schedules the student jobs.
Send periodic heartbeats. The API merely updates the worker node's state here.
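The three responsibilities above can be sketched as a single worker-driven loop. `FakeApi` is a hypothetical in-memory stand-in for the API (the real API is reached over HTTP, and its endpoints are not described in this section), and sending a heartbeat on every poll is a simplification; a real worker would likely send heartbeats on their own timer.

```python
import collections

QUEUE_EMPTY = object()  # stand-in for the API's "queue is empty" response


class FakeApi:
    """Hypothetical in-memory stand-in for the API."""

    def __init__(self, jobs):
        self.queue = collections.deque(jobs)
        self.results = {}
        self.heartbeats = 0

    def poll(self):
        return self.queue.popleft() if self.queue else QUEUE_EMPTY

    def report(self, job_id, succeeded):
        self.results[job_id] = succeeded

    def heartbeat(self):
        self.heartbeats += 1


def worker_loop(api, execute, max_polls):
    """Every exchange is initiated by the worker: heartbeat, poll, then report."""
    for _ in range(max_polls):
        api.heartbeat()
        job = api.poll()
        if job is QUEUE_EMPTY:
            continue  # a real worker would sleep before polling again
        api.report(job, execute(job))


api = FakeApi(["job-1", "job-2"])
worker_loop(api, execute=lambda job: job != "job-2", max_polls=3)
```

The API never contacts the worker in this sketch; it only answers the requests the worker makes.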

Failure Detection

The API expects a heartbeat every HEARTBEAT_INTERVAL seconds (specified in the config). If a worker node does not send a heartbeat in 2 * HEARTBEAT_INTERVAL seconds, the API declares it dead. The API checks for dead worker nodes every HEARTBEAT_INTERVAL seconds using periodic callbacks. In case the grader crashes while executing a grading job, the API will declare it dead (since it will stop receiving heartbeats) and will mark that grading job as failed.
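A minimal sketch of the dead-worker check, assuming the API keeps a mapping from worker id to the time of its last heartbeat. The function names, the data layout, and the choice to treat a gap of exactly 2 * HEARTBEAT_INTERVAL as still alive are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL = 10  # seconds; illustrative value, the real one comes from the config


def find_dead_workers(last_heartbeat, now, interval=HEARTBEAT_INTERVAL):
    """Return ids of workers whose last heartbeat is older than 2 * interval."""
    return sorted(
        worker_id
        for worker_id, seen_at in last_heartbeat.items()
        if now - seen_at > 2 * interval
    )


def fail_jobs_of(dead_workers, running_jobs):
    """Mark as failed every job that was running on a dead worker."""
    return {job: "failed" for job, worker in running_jobs.items() if worker in dead_workers}


# Timestamps are plain numbers here; w2 last reported 25 seconds ago.
last_heartbeat = {"w1": 100, "w2": 75}
dead = find_dead_workers(last_heartbeat, now=100)
failed = fail_jobs_of(dead, {"job-a": "w1", "job-b": "w2"})
```

In the design described above, a check like this would run from a periodic callback every HEARTBEAT_INTERVAL seconds.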

Job Queue

Scheduling is done by pushing grading jobs onto a queue once they are ready to be run. The graders poll this queue periodically. The status code of the poll response is set to QUEUE_EMPTY_CODE if the queue is empty. The following properties have to be satisfied for a grading job to be on the queue:

A student-job is on the queue if and only if either:
- no pre-processing job exists for the grading run
- the pre-processing job exists for the grading run and has already been executed and marked as succeeded
The post-processing job is on the queue if and only if all student jobs have been executed and marked as succeeded or failed.

The queue allows for concurrent runs from multiple courses. The job queue can be populated with grading jobs belonging to different grading runs as long as the above properties hold true.
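The queueing properties can be expressed as two small predicates. This is a sketch only: the state strings and function names are hypothetical, and `None` is used here to mean that no pre-processing job is defined for the run.

```python
def student_jobs_ready(pre_state):
    """Student jobs may be queued if there is no pre-processing job
    (pre_state is None) or if it has been executed and marked as succeeded."""
    return pre_state is None or pre_state == "succeeded"


def post_job_ready(student_states):
    """The post-processing job may be queued once every student job has
    finished, whether it succeeded or failed."""
    return all(state in ("succeeded", "failed") for state in student_states)
```

Because both checks depend only on the state of a single grading run, jobs from different runs (and different courses) can share one queue without interfering with each other.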

Authentication

All requests are authorized through auth tokens which are passed in the header of the requests.

A cluster token is determined when the API is started. This cluster token is used to authenticate requests from the graders. Hence all the worker endpoints are authenticated with the cluster token. This cluster token has to be handed to the graders to start them.

On the other hand, each course can specify a list of tokens to authenticate requests pertaining to their course. They can use any of these to make requests for themselves. This prevents courses from making requests for each other and starting grading runs for each other.
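The two kinds of token checks might look like the sketch below. The function names and the `course_tokens` mapping are hypothetical, and the use of `hmac.compare_digest` for timing-safe comparison is a choice made for this example rather than something stated in this document.

```python
import hmac


def token_matches(presented, allowed):
    """Compare the presented token against each allowed token in a timing-safe way."""
    presented_bytes = presented.encode()
    return any(hmac.compare_digest(presented_bytes, t.encode()) for t in allowed)


def authorize_worker(token, cluster_token):
    """Worker endpoints accept only the cluster token."""
    return token_matches(token, [cluster_token])


def authorize_course(token, course_id, course_tokens):
    """Course endpoints accept any token listed for that particular course."""
    return token_matches(token, course_tokens.get(course_id, []))


# Hypothetical token configuration for two courses.
course_tokens = {"cs101": ["token-a", "token-b"], "cs225": ["token-c"]}
```

With this configuration a token belonging to cs101 is rejected for cs225, so one course cannot start grading runs on behalf of another.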
