Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a way to pre-filter and pre-sort eager loaded data #10929

Closed
GuySartorelli opened this issue Aug 28, 2023 · 7 comments
Closed

Provide a way to pre-filter and pre-sort eager loaded data #10929

GuySartorelli opened this issue Aug 28, 2023 · 7 comments

Comments

@GuySartorelli
Copy link
Member

GuySartorelli commented Aug 28, 2023

Description

@kinglozzer pointed out in #10874 (comment) that it could be useful (and perhaps more performant) to pre-filter and pre-sort eager loaded data ahead of time, so that the filtering and sorting of that data doesn't have to be done in PHP. This is something I've heard echoed by a few people now when talking about the new eager loading functionality.

There are two ways I can see to implement this outlined below. With both approaches, the relation to be eager loaded becomes the key to the array. This is backwards compatible, because we can add a is_numeric() check on the array keys, and if they are numeric we have no pre-filtering and just use the current implementation, and if they are not numeric, then we have pre-filtering and can do whatever needs to be done to pre-filter the results. In other words, if there is no filtering to be done, then relation chains should just be passed in as values, like they already are.

In either scenario, we should throw exceptions if the relation to be eager loaded is either has_one or belongs_to, since there is no list to be filtered or excluded in that scenario.

Option one: Pass the unfiltered fetched records into a callback for filtering

This option might require heavier unit testing, but requires less code to implement and is more powerful as it gives the full DataList API for filtering, sorting, or whatever else the developer wants to do (e.g. altering the underlying query, etc).

It is similar to the approach originally proposed by KingLozzer, but sets the relation to eagerload as the key in the array with the callback as the value which allows it to be backwards compatible as discussed above.

$groups = $groups->eagerLoad([
    // $products would initially be favourited products for *all* members in $groups
    'Members.FavouriteProducts' => function (EagerLoadedDataList $products) {
        return $products->exclude('Price:GreaterThan', 100);
    }
]);

$groups = $groups->eagerLoad([
    'Members' => function (DataList $members) {
        return $members->filter('FirstName', 'Bob');
    },
    // $products would initially be favourited products for only members named Bob in $groups, due to the filtering above
    'Members.FavouriteProducts' => function (DataList $products) {
        return $products->exclude('Price:GreaterThan', 100);
    }
]);

and in DataList we would just pass the list of fetched rows to the callback like this:

// where $fetchedRows is the unfiltered `DataList` holding all records in the relation for all parents.
if ($filterCallback)) {
    // in the above example, for the 'Members.FavouriteProducts' relation chain, this would be excluding anything with a price greater than 100
    $fetchedRows = $filterCallback($fetchedRows);
    if (!$fetchedRows) {
        // The developer probably just forgot a return statement in their callback
        throw new LogicException('suitable message here');
    }
}

Option two: Define filters/sorts/etc in an array, and apply them as appropriate

This option is very simple, but any functionality has to be given an explicit support - i.e. if we don't explicitly check for and apply functionality for "where" as an array key in the filters array, developers won't be able to use SQL where queries to pre-filter their results.
Because we have to explicitly support functionality, it makes it easier to test - but way less flexible.

They developer would use the API like this:

$groups = $groups->eagerLoad([
    // $products would initially be favourited products for *all* members in $groups
    'Members.FavouriteProducts' => [
        'exclude' => ['Price:GreaterThan' => 100],
        // other keys could include 'filter', 'filterAny', 'excludeAny', 'sort'... maybe also the SQL versions (where, orderBy).
    ],
]);

$groups = $groups->eagerLoad([
    'Members' => [
        'filter' => ['FirstName' => 'Bob'],
    ],
    // $products would initially be favourited products for only members named Bob in $groups, due to the filtering above
    'Members.FavouriteProducts' => [
        'exclude' => ['Price:GreaterThan' => 100],
    ],
]);

and in DataList we would just check for each supported thing in the filter array. e.g.

// where $fetchedRows is the unfiltered `DataList` holding all records in the relation for all parents.
if (array_key_exists('exclude', $filterArray)) {
    // in the above example, for the 'Members.FavouriteProducts' relation chain, this would be passing in `['Price:GreaterThan' => 100]`
    $fetchedRows = $fetchedRows->exclude($filterArray['exclude']);
}

Recommendation

Personally I think it's better to provide more flexibility for something like this, especially given eager loading is intended as an opt-in performance optimisation, so developers should have the freedom to squeeze every bit of performance they can out of it. In other words, I recommend option one, where developers provide a callback for filtering.

Acceptance Criteria

  • Eager loaded relations can be filtered and sorted as part of the fetching process
  • This new pre-filtering/pre-sorting API performs better than filtering/sorting the relations in PHP
  • The two broad options are discussed with at least 3 developers familiar with eager loading to validate which one makes the most sense.
  • Validate that there's no significant performance regression

Note

  • You'll need to time-test the full time from before executing the query to after any iterations through the results, since the db query time alone won't tell the full story

PRs

@emteknetnz
Copy link
Member

emteknetnz commented Oct 5, 2023

With both options I've noticed that you're passing an array instead of the documented methods of passing in strings ->eagerLoad('Relation01', 'Relation02.SubRelation')

Probably I think we could make this a little cleaner if we add a new method for this new functionality e.g.

public function eagerLoadRelation(string $relation, ?callable $callback = null): static { ... }

// ...

$myDataList->eagerLoadRelation('SomeRelation');

$myDataList->eagerLoadRelation('Members.FavouriteProducts', function (EagerLoadedDataList $products) {
    return $products->exclude('Price:GreaterThan', 100);
});

@GuySartorelli
Copy link
Member Author

GuySartorelli commented Oct 5, 2023

I think it would be confusing to have an eagerLoad() method and an eagerLoadRelation() method.
It's not uncommon for us to have a parameter (often it's in config, but some methods do it too) that accepts a string for the basic scenario, and an array when more configuration is required.

I guess we could do it and deprecate the existing method? Feels a little weird to deprecate so soon after implementing it but it wouldn't be the end of the world.
Not sure I like not being able to eager load multiple relations in a single method call though.

@emteknetnz
Copy link
Member

Yeah I'm OK with deprecating the existing (...$args) style method signature, you can never strong type it. We'll put that one down as a bad initial design decision.

@GuySartorelli
Copy link
Member Author

GuySartorelli commented Oct 6, 2023

...$args can be strongly typed, for what it's worth.

@maxime-rainville
Copy link
Contributor

We're not comfortable picking this up right away before getting more context about how people are using the new eager loading API.

If the lack of filtering on the eager loading API is an annoyance to you, please leave a comment with extra details.

@GuySartorelli
Copy link
Member Author

GuySartorelli commented Feb 19, 2024

Benchmarks in the (first) docs PR were done using this code:

<?php
// app/src/MyDataObject.php

use SilverStripe\ORM\DataObject;

class MyDataObject extends DataObject
{
    private static $table_name = 'MyDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $has_many = [
        'MySubDataObjects' => MySubDataObject::class
    ];

    private static $many_many = [
        'MyManyDataObjects' => MyManyDataObject::class,
        'MyManyThroughDataObjects' => [
            'through' => MyThroughDataObject::class,
            'from' => 'MyDataObject',
            'to' => 'MyManyThroughDataObject',
        ],
    ];
}
<?php
// app/src/MyManyDataObject.php

use SilverStripe\ORM\DataObject;

class MyManyDataObject extends DataObject
{
    private static $table_name = 'MyManyDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $belongs_many_many = [
        'MyDataObject' => MyDataObject::class
    ];
}
<?php
// app/src/MyManyThroughDataObject.php

use SilverStripe\ORM\DataObject;

class MyManyThroughDataObject extends DataObject
{
    private static $table_name = 'MyManyThroughDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];
}
<?php
// app/src/MyThroughDataObject.php

use SilverStripe\ORM\DataObject;

class MyThroughDataObject extends DataObject
{
    private static $table_name = 'MyThroughDataObject';

    private static $has_one = [
        'MyDataObject' => MyDataObject::class,
        'MyManyThroughDataObject' => MyManyThroughDataObject::class,
    ];
}
<?php
// app/src/MySubDataObject.php

use SilverStripe\ORM\DataObject;

class MySubDataObject extends DataObject
{
    private static $table_name = 'MySubDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $has_one = [
        'MyDataObject' => MyDataObject::class
    ];
}
<?php
// app/_config.php

use SilverStripe\ORM\DataList;
use SilverStripe\ORM\DB;

// SET UP DB - only need to run this block once (after dev/build)
// DB::query('truncate MyDataObject');
// DB::query('truncate MySubDataObject');
// DB::query('truncate MyManyDataObject');
// DB::query('truncate MyDataObject_MyManyDataObjects');
// DB::query('truncate MyManyThroughDataObject');
// DB::query('truncate MyThroughDataObject');
// $b = [];
// $c = [];
// $d = [];
// $e = [];
// $f = [];
// for ($i = 1; $i <= 100; $i++) {
//     $a[] = "('Title $i')";
//     for ($j = 1; $j <= 100; $j++) {
//         $jj = ($i - 1) * 100 + $j;
//         $b[] = "('$i', 'Title $j')";
//         $c[] = "('Title $j')";
//         $d[] = "('$i', '$jj')";
//         $e[] = "('Title $j')";
//         $f[] = "('$i', '$jj')";
//     }
// }
// $as = implode(',', $a);
// $bs = implode(',', $b);
// $cs = implode(',', $c);
// $ds = implode(',', $d);
// $es = implode(',', $e);
// $fs = implode(',', $f);
// DB::query("insert into MyDataObject (Title) values $as");
// DB::query("insert into MySubDataObject (MyDataObjectID, Title) values $bs");
// DB::query("insert into MyManyDataObject (Title) values $cs");
// DB::query("insert into MyDataObject_MyManyDataObjects (MyDataObjectID, MyManyDataObjectID) values $ds");
// DB::query("insert into MyManyThroughDataObject (Title) values $es");
// DB::query("insert into MyThroughDataObject (MyDataObjectID, MyManyThroughDataObjectID) values $fs");

// Percentages calculated using https://www.calculatorsoup.com/calculators/algebra/percent-change-calculator.php
$s = microtime(true);
// NO EAGER LOADING
// 0.1567, 0.1443, 0.1488, 0.1590 -- 0.1522s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.1560, 0.1577, 0.1626, 0.1600 -- 0.1591s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.3083, 0.3195, 0.3235, 0.3131 -- 0.3161s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }

// WITH EAGERLOADING, FILTER/SORT AFTER THE FACT
// 0.7555, 0.6854, 0.7460, 0.7561 -- 0.7358s (avg)
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.8834, 0.8377, 0.9296, 0.9500 -- 0.9002s (avg)
// foreach (MyDataObject::get()->eagerLoad('MyManyDataObjects') as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 1.0948, 1.0576, 1.0576, 1.0775 -- 1.0719s (avg)
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }

// WITH EAGERLOADING, FILTER/SORT IN DB
// 0.1115, 0.1077, 0.1019, 0.1110 -- 0.1080s (avg) (~29% faster than without eagerloading and ~85% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.1189, 0.1312, 0.1251, 0.1305 -- 0.1264s (avg) (~21% faster than without eagerloading and ~86% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.2425, 0.2572, 0.2496, 0.2551 -- 0.2511s (avg) (~21% faster than without eagerloading and ~77% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }
$t = microtime(true) - $s;
printf('%0.4f', $t);echo"\n";

@GuySartorelli GuySartorelli removed their assignment Feb 19, 2024
@emteknetnz
Copy link
Member

I've expanded on the testing and split apart the benchmarks for filtering and sorting. I've just done a single run for a quick test

tl;dr - it makes a big difference for filtering, though no difference for sorting

// FILTERING

// 0.0877
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1118 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1715
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->toArray();
//}

// 0.3942 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1016 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.4799 
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }

// 0.0474
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.0585 
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.1332 
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }

// SORTING

// 0.1966 
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.2706 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.5855 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->sort('Title', 'DESC')->toArray();
// }

// 0.1586 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.2561 
// foreach (MyDataObject::get()->eagerLoad('MyManyDataObjects') as $do) {
//     $do->MyManyDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.5739 
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->sort('Title', 'DESC')->toArray();
// }

// 0.1602 
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.2581 
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.5705 
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants