Architecture

Linda edited this page Aug 4, 2023 · 40 revisions

Main components

"Let(')s audit Learning Analytics" (LaLA) consists mainly of the following classes:

laaudit classes

LaLA class diagram exported from PHPStorm

LaLA creates the following tables in the Moodle database:

laaudit database tables

LaLA database diagram exported from PHPStorm and edited with Miro to show relationships

The sections below describe these components, their relationships, and how they interact with the Moodle Learning Analytics (LA) system.

Model configurations and versions

The original Moodle model (see the Moodle LA API diagram) is re-interpreted by LaLA in two parts: The model configuration (class model_configuration) and the model versions that can be produced with this configuration (class model_version).

Model configuration

When the plugin page is first accessed, LaLA automatically creates a model configuration for each existing Moodle model and stores it in the database. The model_configurations class currently handles this creation logic.

⚠️ LaLA ignores static models that do not use machine learning.

The model configuration saves a loose reference to the Moodle model and copies its properties and settings: target, predictions processor, analysis interval type and indicators. If some properties are not set for the Moodle model, meaningful defaults are chosen. The currently set context IDs that limit the scope of the Moodle model are stored as the default context IDs in the model configuration.

🛠️ In a future version of LaLA, it will be possible to set the scope by context ID when creating a model version. This will help select which data from the Moodle instance to use as training and testing data.

🛠️ LaLA currently always trains a Logistic Regression model, no matter which predictions processor is configured. In a future version, LaLA will be able to use different predictions processors.

Model version

A model version is one possible model created from a model configuration. Multiple model versions can be created from the same configuration, and each trained model will differ slightly because the training data is selected randomly from the overall data and because of the nature of machine learning. Among other things, the model version stores the relative test set size (by default 0.2, as for the Moodle models), the included context IDs (or null if all contexts are in scope), and whether an error occurred when creating the model version.

The model version creation is triggered through the secured endpoint /admin/tool/laaudit/modelversion.php?configid=<configid>. After accessing the endpoint, one is redirected to the new version on the index page.

The model version creation is split into multiple steps, each of which adds a piece of evidence for the model version concerned. The process makes use of object-oriented programming: the evidence types inherit directly or indirectly from an abstract class evidence, and each implements the methods collect(array $options) and store(). For each step in the model version creation, an options array is constructed, the evidence's collect(array $options) method is called, some post-processing for anonymization may follow, and finally the evidence's store() method is called. The collect(array $options) and store() methods are described below under "Evidence". In the model version object, the evidence is stored in a multi-dimensional indexed array (evidence[$evidencetype][$evidenceID]).
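The collect, post-process, store sequence can be pictured with a small sketch. It is written in Python for brevity and is not the plugin's PHP code: the class names mirror the text, but the `add` helper, the `stored` attribute (standing in for the written file), and the sample payload are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
import json


class Evidence(ABC):
    """Illustrative stand-in for the abstract evidence class."""

    def __init__(self, evidence_id):
        self.evidence_id = evidence_id
        self.data = None
        self.stored = None  # stands in for the file written by store()

    @abstractmethod
    def collect(self, options):
        """Each evidence type validates its options and gathers its own data."""

    def store(self):
        # Serialize the collected data; the plugin writes a file instead.
        self.stored = json.dumps(self.data)


class Dataset(Evidence):
    def collect(self, options):
        if "contexts" not in options:
            raise ValueError("missing option: contexts")
        # Hypothetical payload; the plugin would run the analyser here.
        self.data = {"contexts": options["contexts"], "rows": [[1, 0.5, 1]]}


class ModelVersion:
    def __init__(self):
        # evidence[evidence type][evidence ID], as described in the text
        self.evidence = {}

    def add(self, evidence, options, postprocess=None):
        evidence.collect(options)        # 1. collect
        if postprocess is not None:      # 2. optional anonymization step
            evidence.data = postprocess(evidence.data)
        evidence.store()                 # 3. store
        kind = type(evidence).__name__.lower()
        self.evidence.setdefault(kind, {})[evidence.evidence_id] = evidence


version = ModelVersion()
version.add(Dataset(evidence_id=7), {"contexts": []})
```

Keeping store() in the base class while leaving collect() abstract means every step of the pipeline can be driven by the same three calls, whatever the evidence type.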

Model version creation proceeds through the following steps; the evidence is stored after each step.

  1. First, gather_dataset(bool $anonymous = true) triggers the collection of the data from the Moodle platform that will be used for the model version. If necessary, an ID map is created, and the collected data is anonymized with it.
  2. Then, split_training_test_data() triggers the splitting of the previously collected data into training and testing data sets. Note that the data is shuffled first, in order to create a random split.
  3. In the third step, train() triggers the training of a Logistic Regression model using the training data gathered before.
  4. Fourth, predict() triggers the generation of predictions of the trained Logistic Regression model for the test dataset.
  5. The final step is the collection of data related to the dataset gathered in step #1 using the method gather_related_data(bool $anonymous = true). First, LaLA determines recursively which tables relate to the main table (see a more detailed explanation). Then, for each of these tables, the relevant data is collected. If necessary, ID maps are created for each table and each table is anonymized.
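The recursive table discovery in step 5 can be pictured as a graph traversal. The Python sketch below assumes a hypothetical `RELATIONS` map from each table to the tables it references; the table names and the map itself are made up for illustration, and how LaLA actually detects relations is covered by the linked explanation, not by this sketch.

```python
# Hypothetical relation map: table -> tables it references.
RELATIONS = {
    "user_enrolments": ["user", "enrol"],
    "enrol": ["course"],
    "course": ["course_categories"],
    "user": [],
    "course_categories": [],
}


def related_tables(main_table, relations):
    """Return all tables reachable from main_table, excluding itself."""
    found = []
    seen = {main_table}

    def visit(table):
        for other in relations.get(table, []):
            if other not in seen:  # guards against cycles between tables
                seen.add(other)
                found.append(other)
                visit(other)

    visit(main_table)
    return found


tables = related_tables("user_enrolments", RELATIONS)
```

The `seen` set matters in practice because database tables often reference each other in cycles.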

Upon finishing the model version, an event model_version_created (in the event folder) logs who created a new version of which model configuration.

Immutability

The model configuration is immutable, and so is the model version once it is trained. If a Moodle model is updated, a new model configuration is added with properties copied from the updated Moodle model. If a Moodle model is deleted, the model configuration continues to exist. Model versions are not affected by Moodle model updates or deletions. This is to ensure reproducible and trustworthy audits.

Evidence

As described previously, model version creation follows a process in which one or more pieces of evidence are stored after each step. LaLA currently implements two types of evidence (dataset and model), four sub-classes of dataset (training_dataset, test_dataset, predictions_dataset, and related_data), and two anonymized variations of evidence (dataset_anonymized and related_data_anonymized).

Storing (store()) is implemented in the abstract class evidence and saves the collected evidence on the server. The store() method first serializes the collected data into a string and then creates a file from the string at the location /evidence/modelversion<VERSIONID>-evidence<EVIDENCENAME><EVIDENCEID>.<FILETYPE>. The location differs for related data: -TABLENAME is inserted after the original file name and before the dot, so that the file name shows what kind of related data the file contains (e.g. user, course). How the evidence's raw data is serialized is implemented by both direct children of evidence, as well as by related_data.
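As a quick illustration of the naming pattern, the following Python sketch builds such a path. The function name `evidence_filename` and the example values (IDs, the `csv` file type) are assumptions; only the pattern itself comes from the description above.

```python
def evidence_filename(version_id, evidence_name, evidence_id, filetype,
                      tablename=None):
    """Build an evidence file path following the pattern described above."""
    base = f"/evidence/modelversion{version_id}-evidence{evidence_name}{evidence_id}"
    if tablename is not None:
        # Related data: -TABLENAME goes after the file name, before the dot.
        base += f"-{tablename}"
    return f"{base}.{filetype}"


dataset_path = evidence_filename(3, "dataset", 12, "csv")
related_path = evidence_filename(3, "related_data", 15, "csv", tablename="user")
```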

The evidence collection (collect(array $options)) and the validation of the $options array are implemented differently for almost every type of evidence. See the following sections.

Dataset

The dataset requires $options to contain an array of context IDs (can be empty) (contexts), an analyser (type core_analytics\local\analyser\base) and the ID of the original Moodle model (modelid).

The collect(array $options) method uses the analyzer to retrieve all analysables (e.g. student enrolments) and analyze them, calculating indicator and target values.

Training dataset and test dataset

The training and test datasets require $options to contain a data array (data, needs to contain at least one item) and the testsize (a value in the open interval ]0.0, 1.0[).

The collect(array $options) method uses the dataset helper to extract only the rows from the input dataset. The first part of the rows (the size of the part depends on the testsize) is collected as test data, the other part is collected as the training data. The dataset helper re-creates the fitting dataset structure so that the data can be used as input for the model.
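A minimal Python sketch of such a split, assuming the rows are a plain list and that the number of test rows is obtained by rounding (the rounding rule and the `split_rows` name are assumptions, not taken from the plugin):

```python
import random


def split_rows(rows, testsize, seed=None):
    """Shuffle the rows, then take the first part as test data."""
    if not rows:
        raise ValueError("data must contain at least one item")
    if not 0.0 < testsize < 1.0:
        raise ValueError("testsize must lie in ]0.0, 1.0[")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle first for a random split
    n_test = round(len(shuffled) * testsize)  # rounding rule is an assumption
    return shuffled[:n_test], shuffled[n_test:]  # (test rows, training rows)


rows = [[i, i % 2] for i in range(10)]
test_rows, training_rows = split_rows(rows, testsize=0.2, seed=42)
```

With ten rows and the default test size of 0.2, two rows end up in the test set and eight in the training set.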

Model

The model requires $options to contain a predictor (type core_analytics\predictor) and data (cannot be empty).

The collect(array $options) method uses the dataset helper to extract the rows from the input dataset and to separate the rows into x values (the indicator values) and y values (the target values). The number of necessary iterations is retrieved from the predictor and used to create a new, untrained Logistic Regression model. This model is then trained with the x and y values.
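To make the x/y separation and the role of the iteration count concrete, here is a self-contained Python sketch. It assumes that the target is the last value of each row, and it trains with plain batch gradient descent, which is a stand-in and not necessarily the training method of the library LaLA uses. The toy data and all function names are hypothetical.

```python
import math


def split_xy(rows):
    """Assumes the target is the last value in each row."""
    xs = [row[:-1] for row in rows]
    ys = [row[-1] for row in rows]
    return xs, ys


def train_logistic_regression(xs, ys, iterations, learning_rate=0.1):
    """Batch gradient descent on the logistic loss (illustrative stand-in)."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus target
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if z >= 0 else 0


# Toy rows: two indicator values in [-1, 1], then the target.
rows = [[-1.0, -0.5, 0], [-0.5, -1.0, 0], [0.5, 1.0, 1], [1.0, 0.5, 1]]
xs, ys = split_xy(rows)
weights, bias = train_logistic_regression(xs, ys, iterations=200)
```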

Predictions dataset

The predictions dataset requires $options to contain a model (type Phpml\Classification\Linear\LogisticRegression) and a data set (data, cannot be null or empty) as model input.

The collect(array $options) method uses the dataset helper to extract only the rows from the input dataset and to separate x and y values from the rows. The x values are then input into the model to retrieve predictions. Finally, the dataset helper is used to merge the header, sample IDs, target values and predictions into one dataset.
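The merge step might look roughly like the Python sketch below. The dataset layout (a header stored under the key "0", other keys being sample IDs, target as the last value) and the column names of the merged result are assumptions for illustration, and the trained model is replaced by a stub function.

```python
def predictions_dataset(dataset, predict):
    """Merge header, sample IDs, targets and predictions into one dataset."""
    header = dataset["0"]  # assumed: header row is stored under key "0"
    merged = {"0": ["sampleid", header[-1], "prediction"]}
    for sample_id, row in dataset.items():
        if sample_id == "0":
            continue
        x, y = row[:-1], row[-1]  # indicator values and target value
        merged[sample_id] = [sample_id, y, predict(x)]
    return merged


def stub_model(x):
    # Stand-in for the trained model's prediction.
    return 1 if sum(x) >= 0 else 0


dataset = {
    "0": ["indicator_a", "indicator_b", "target"],
    "12-0": [0.5, 1.0, 1],
    "13-0": [-1.0, -0.5, 0],
}
merged = predictions_dataset(dataset, stub_model)
```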

Related data

The related data evidence requires $options to contain a tablename and an array of ids for which data from the selected table should be retrieved.

The collect(array $options) method uses the database helper to get all columns in the selected table and subtracts the columns to be ignored from this list. Then it retrieves the data in the remaining columns for the relevant ids from the selected table.
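The column subtraction and the ID-restricted query can be sketched with an in-memory SQLite database in Python. This only illustrates the idea: the plugin uses Moodle's database layer, and the table layout, the ignored `password` column, and the `collect_related` name are made up for the example.

```python
import sqlite3


def collect_related(conn, tablename, ids, ignored_columns):
    """Fetch the non-ignored columns of the given rows from one table."""
    # Table and column names cannot be bound as parameters, so they are
    # assumed to come from a trusted list, never from user input.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")]
    kept = [c for c in columns if c not in ignored_columns]
    placeholders = ", ".join("?" for _ in ids)
    query = (f"SELECT {', '.join(kept)} FROM {tablename} "
             f"WHERE id IN ({placeholders}) ORDER BY id")
    return kept, conn.execute(query, list(ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, password TEXT, country TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?, ?)",
                 [(1, "x", "DE"), (2, "y", "FR"), (3, "z", "NL")])
kept, rows = collect_related(conn, "user", [1, 3], ignored_columns={"password"})
```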

Privacy components

To ensure privacy, anonymized evidence types (dataset_anonymized, related_data_anonymized) have been implemented, as well as the idmap class. Both anonymized evidence types inherit from their corresponding un-anonymized evidence types (dataset and related_data). They each implement a method to pseudonymize collected data and a method to generate an idmap from the IDs occurring in the respective evidence data (create_idmap(array $dataset)). They also override the collect(array $options) method of their parents.

Anonymized dataset

In the method pseudonomize(array $data, idmap $idmap), the anonymized dataset pseudonymizes the IDs in the collected data by replacing each original sample ID with the pseudonymized sample ID and shuffling the data so that the order does not hint at the identity. The anonymized dataset's collect(array $options) method first calls the parent's collect(array $options) method and then verifies that, if the analyzer processes user data, at least three distinct sample IDs are found in the collected data.
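Both ideas, ID replacement with shuffling and the minimum of three distinct sample IDs, fit in a few lines of Python. The sketch assumes each row starts with its sample ID and uses a plain dictionary in place of the idmap class; the function names are hypothetical.

```python
import random


def pseudonymize(rows, idmap, rng=random):
    """rows: lists starting with a sample ID; idmap: original ID -> pseudonym."""
    replaced = [[idmap[row[0]]] + row[1:] for row in rows]
    rng.shuffle(replaced)  # the order must not hint at the identity
    return replaced


def check_enough_samples(rows, minimum=3):
    """Refuse data that contains too few distinct sample IDs."""
    distinct = {row[0] for row in rows}
    if len(distinct) < minimum:
        raise ValueError("too few distinct sample IDs to anonymize")


rows = [[1, 0.5, 1], [2, -0.5, 0], [3, 1.0, 1]]
check_enough_samples(rows)
anonymous_rows = pseudonymize(rows, {1: 301, 2: 145, 3: 278}, random.Random(1))
```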

Anonymized related data

In the method pseudonomize(array $data, array $idmaps, string $type), the anonymized related data pseudonymizes all occurring IDs. Columns that contain the word "id" are expected to contain an ID. For these columns, the type of ID (e.g. "user") is estimated from the column name, and the $idmaps parameter is queried for an ID map for this ID type. If a fitting ID map is found, the ID in the collected data is replaced with the pseudonymized ID. Finally, the data is shuffled so that the order does not hint at the identity.

The anonymized related data's collect(array $options) method first gets the possible columns for the selected table, then subtracts from this list the columns that are set to be ignored, the columns containing unique values, and the columns containing text (longtext). It then retrieves the data in the remaining columns for the relevant ids from the selected table.

After the data is collected, two checks are applied. If the table relates to user data (contains the string "user"), at least three distinct IDs must occur. Additionally, each column that contains an ID relating to user data (contains the strings "user" and "id") must contain at least three distinct IDs. Otherwise, the evidence collection is aborted and the collected and pseudonymized data is deleted.
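The column-name heuristic can be illustrated with a short Python sketch. The rule used here to estimate the ID type (strip "id" from the column name, fall back to the table's own type for a bare "id" column) is an assumption for illustration; the text does not specify LaLA's exact rule. ID maps are again plain dictionaries.

```python
import random


def guess_id_type(column, table_type):
    """Hypothetical heuristic: 'userid' -> 'user', 'id' -> the table's type."""
    if "id" not in column:
        return None
    stripped = column.replace("id", "").strip("_")
    return stripped or table_type


def pseudonymize_related(rows, idmaps, table_type, rng=random):
    result = []
    for row in rows:
        new_row = dict(row)
        for column, value in row.items():
            id_type = guess_id_type(column, table_type)
            # Replace the ID only if a fitting ID map knows this value.
            if id_type in idmaps and value in idmaps[id_type]:
                new_row[column] = idmaps[id_type][value]
        result.append(new_row)
    rng.shuffle(result)
    return result


rows = [{"id": 1, "userid": 5, "courseid": 9, "status": 0}]
idmaps = {"user_enrolments": {1: 301}, "user": {5: 412}}
anonymous = pseudonymize_related(rows, idmaps, "user_enrolments")
```

In this example `courseid` keeps its original value because no ID map for the type "course" was supplied.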

ID map

The idmap class can be used to create ID maps that contain a list of original ids and a list of pseudonyms. The index determines which original ID maps to which pseudonym. Apart from some getters, the idmap class has the following features.

Firstly, given a list of original ids, it can create a list of pseudonyms. This is done by first shuffling the original ids, then creating an array containing the range of numbers between 100 and 100 * factor * amount_of_original_ids. The offset 100 has been chosen so that each pseudonym has at least three digits. The factor is a random number between three and ten and has been added for more randomness. The resulting array of numbers is then shuffled, and the first part of it (the size of the list of original IDs) is taken and shuffled again.

Additionally, the ID map can be printed (__toString()) and the number of ID pairs counted (count()).
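The pseudonym generation translates almost line by line into Python. The sketch below follows the description, assuming that both ends of the number range are inclusive; the function name `create_pseudonyms` is hypothetical.

```python
import random


def create_pseudonyms(original_ids, rng=None):
    """Return (shuffled original IDs, pseudonyms); equal index = one pair."""
    if rng is None:
        rng = random.Random()
    originals = list(original_ids)
    rng.shuffle(originals)
    factor = rng.randint(3, 10)  # random factor between three and ten
    # Candidates start at 100, so every pseudonym has at least three digits.
    candidates = list(range(100, 100 * factor * len(originals) + 1))
    rng.shuffle(candidates)
    pseudonyms = candidates[:len(originals)]
    rng.shuffle(pseudonyms)
    return originals, pseudonyms


originals, pseudonyms = create_pseudonyms([11, 12, 13, 14])
```

Because the pseudonyms are drawn from a range without repetition, no two original IDs can receive the same pseudonym.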

Security components

To ensure security, three new capabilities tool/laaudit:viewpagecontent, tool/laaudit:downloadevidence and tool/laaudit:createmodelversion are defined in \db\access.php. In \db\install.php, a new role auditor is added and assigned the new capabilities. The plugin page can only be accessed by users who are logged in and who have the capability tool/laaudit:viewpagecontent. Evidence files can only be downloaded by users who are logged in and who have the capability tool/laaudit:downloadevidence. A new model version can only be created by users who are logged in, who provide a session key and who have the capability tool/laaudit:createmodelversion.
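The three access rules can be summarized in a small Python sketch. The capability names come from the text above; the `User` class, the function names, and the way the session key is compared are illustrative assumptions, since the plugin relies on Moodle's own login, capability, and session key checks.

```python
from dataclasses import dataclass

VIEW = "tool/laaudit:viewpagecontent"
DOWNLOAD = "tool/laaudit:downloadevidence"
CREATE = "tool/laaudit:createmodelversion"
AUDITOR_CAPABILITIES = frozenset({VIEW, DOWNLOAD, CREATE})


@dataclass
class User:
    logged_in: bool
    capabilities: frozenset = frozenset()
    sesskey: str = ""


def can_view_page(user):
    return user.logged_in and VIEW in user.capabilities


def can_download_evidence(user):
    return user.logged_in and DOWNLOAD in user.capabilities


def can_create_model_version(user, provided_sesskey):
    # Creating a version additionally requires a valid session key.
    return (user.logged_in
            and CREATE in user.capabilities
            and provided_sesskey != ""
            and provided_sesskey == user.sesskey)


auditor = User(logged_in=True, capabilities=AUDITOR_CAPABILITIES, sesskey="abc123")
guest = User(logged_in=False)
```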

Other components

Additionally, the following features are implemented and can be found in LaLA's source code:

  • Mustache templates (in the templates folder) and output renderers (in the output folder).
  • A plugin page (index.php) that sets some page properties and loads the root renderer. All model configurations, versions and evidence can be found on this page.
  • The addition of the link to the plugin page to the admin menu under analytics (in settings.php) and for auditors on the front page (in lib.php).
  • The ability to serve files is implemented in lib.php.
  • Translatable strings (in /lang/en/).
  • Development branches of the plugin on GitHub additionally contain a test directory for PHPUnit tests. This folder is removed in release branches.