Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track and rebuild sitemaps for deleted posts #135

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

kraftbj
Copy link
Contributor

@kraftbj kraftbj commented Apr 23, 2018

Fixes #112

When a post is deleted, the post ID and date is added to an option which is used to augment the existing process of seeking sitemaps to regenerate based on a post's last modified time.

In this instance, we are opting for keeping the post_modified process since it requires less processing as posts are moved—no added filters or functions during that action.

Additionally, this PR adds a specified start time in a couple of places to remove the possibility that a post is modified in the short time between when processing started and when processing ended on rebuilding for modified posts.

@kraftbj
Copy link
Contributor Author

kraftbj commented Apr 23, 2018

Note: This needs unit tests, but I'm afk this week so I wanted to give the team a chance to review this without it being blocked until I wrote tests.

Copy link

@sboisvert sboisvert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments

$sitemap_last_run = get_option( 'msm_sitemap_update_last_run', false );

$date = date( 'Y-m-d H:i:s', ( current_time( 'timestamp', 1 ) - 3600 ) ); // posts changed within the last hour
$date = date( 'Y-m-d H:i:s', ( $start_time - 3600 ) ); // posts changed within the last hour

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a comment to why we're adding the start time. Either inline or with the doc block, whatever makes most sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 414ce0e


if ( $sitemap_last_run ) {
$date = date( 'Y-m-d H:i:s', $sitemap_last_run );
}

$post_types_in = self::get_supported_post_types_in();

$modified_posts = $wpdb->get_results( $wpdb->prepare( "SELECT ID, post_date FROM $wpdb->posts WHERE post_type IN ( {$post_types_in} ) AND post_modified_gmt >= %s LIMIT 1000", $date ) );
$modified_posts = $wpdb->get_results( $wpdb->prepare( "SELECT ID, post_date FROM $wpdb->posts WHERE post_type IN ( {$post_types_in} ) AND post_modified_gmt >= %s LIMIT 1000", $date ), ARRAY_A );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would changing this to ARRAY_A cause problems with

$dates[] = date( 'Y-m-d', strtotime( $post->post_date ) );
where, from what I understand, it's assumed that $post->post_date will work?

@@ -599,7 +626,7 @@ public static function get_last_modified_posts() {
public static function get_post_dates( $posts ) {
$dates = array();
foreach ( $posts as $post ) {
$dates[] = date( 'Y-m-d', strtotime( $post->post_date ) );
$dates[] = date( 'Y-m-d', strtotime( $post['post_date'] ) );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol. I see that's fixed here. I'm still unclear as to why we're changing the behaviour :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When storing the deleted posts for future processing ( https://github.com/Automattic/msm-sitemap/pull/135/files#diff-67d63994628dd99d09072eee489c7487R585 ), we're using an array for ease of processing. It all in all matches the logic already used, but switching the modified post query to output an array means we're able to match the data types.

Since the posts are deleted, we can't pull a WP_Post object without mocking up one, but we can change the query return to use arrays. In effect, I'm looking and using the least common denominator.

Doing this to me seems like a more direct route of combining stored deleted post identifiers and a query return. While we could also just store all modified posts, that introduces more possibilities of race conditions and creates a second, less reliable source of truth on what posts have been modified.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, perhaps we should mention it in the commit message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'll be sure to include that in the merge commit.

@mattoperry
Copy link

Just thinking out loud about some edge cases here @kraftbj ... what happens if there are a very large number of posts deleted? Like perhaps a site is trimming its archive or something and deletes 10s of 1000s of posts? I'm asking in the sense of "would anything bad happen" but also in the sense of would the plugin work under those conditions and regenerate the right sitemaps? Otherwise this looks like a good approach to me.

@kraftbj kraftbj force-pushed the add/delete-handling branch 2 times, most recently from 515e321 to 8df6975 Compare May 3, 2018 02:28
When a post is deleted, the post ID and date is added to an option which is used to augment the existing process of seeking sitemaps to regenerate based on a post's last modified time.

In this instance, we are opting for keeping the post_modified process since it requires less processing as posts are moved—no added filters or functions during that action.
Avoids potential situations where post actions occur exactly during processing being left out of future operations.
Previously, there was an array item for each post deleted. Since the purpose of tracking deleted posts is to know which daily sitemaps to regenerate, we only need to keep one item per day. On sites with a large number of posts, the previous behavior could result in an incredibly large option being saved with excessive number of db writes.
@kraftbj
Copy link
Contributor Author

kraftbj commented May 3, 2018

@mattoperry That's a good point. I reworked it a little in c3fccc1 . Previously if we had 10,000 posts deleted covering 30 days, we would have a 10k item array to inform us of 30 days we needed to regenerate. The revised approach only keeps one array item per date, keeping the option down to 30 under this scenario.

The solution we have right now is 5.5+. What is VIP's policy on doing something like a Composer dependency or just including a library? https://github.com/ramsey/array_column provides a backfill and is what WP-CLI uses to polyfill array_columns.

@kraftbj
Copy link
Contributor Author

kraftbj commented May 3, 2018

Opted to not polyfill and have a suboptimal experience pre-5.5. MSM should be declared "5.5+ recommended", but will still function on 5.2+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants