Add blogpost on flox heuristics #695

dcherian · 2024-07-29T22:18:38Z

https://xarray-dev-git-flox-smart-xarray.vercel.app/blog/flox-smart

vercel · 2024-07-29T22:18:40Z

The latest updates on your projects. Learn more about Vercel for Git

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| xarray-dev | ✅ Ready (Inspect) | Visit Preview | ✅ 7 resolved | Aug 5, 2024 7:30pm |

src/posts/flox-smart/index.md

Co-authored-by: Anderson Banihirwe <[email protected]>

dcherian · 2024-07-29T22:36:18Z

src/posts/flox-smart/index.md

@@ -0,0 +1,157 @@
+---
+title: 'flox: GroupBy, now with smarts!'
+date: '2024-05-31'


Suggested change

date: '2024-05-31'

date: '2024-08-05'

or whenever we decide to merge.

src/posts/flox-smart/index.md

dcherian · 2024-07-30T05:19:26Z

src/posts/flox-smart/index.md

+
+Unlike [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), _containment_ [isn't skewed](http://ekzhu.com/datasketch/lshensemble.html) when one of the sets is much larger than the other.
+
+The steps are as follows:


Is this bit clear and useful? Should I take it out? Or expand it?
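
For anyone following this thread, here is a minimal sketch of the distinction the quoted sentence draws. It assumes the usual definitions, Jaccard as |A ∩ B| / |A ∪ B| and containment of A in B as |A ∩ B| / |A|. The two sets are hypothetical stand-ins for "the chunks in which a group label appears"; they are illustrative only and are not flox's actual data structures or code.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Containment of a in b: fraction of a's elements that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical example: one label occupies 4 chunks, another occupies 100,
# and the small set lies entirely inside the large one.
small = {0, 1, 2, 3}
large = set(range(100))

print(jaccard(small, large))      # low, dominated by the size of the larger set
print(containment(small, large))  # 1.0, the small set is fully contained
print(containment(large, small))  # low, containment is not symmetric
```

Jaccard is pulled down by the size of the larger set even though the smaller one is fully covered, whereas containment reports that coverage directly. Note that containment is asymmetric, so the order of the arguments matters.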

TomNicholas

Nice! Really incredible work you have done here @dcherian

I left a few comments: some nits, some minor suggestions on structure/content.

TomNicholas · 2024-08-06T22:11:11Z

src/posts/flox-smart/index.md

+
+## Avoiding catastrophe
+
+Thus `flox` quickly grew two new modes of computing the groupby reduction.


I would move this sentence down to after you have gone through your two realizations. Otherwise it's confusing when you jump into

> First, `method="blockwise"`

TomNicholas · 2024-08-06T22:12:06Z

src/posts/flox-smart/index.md

+Notice how much cleaner the graph is in this image:
+![map-reduce](https://flox.readthedocs.io/en/latest/_images/new-map-reduce-reindex-True-annotated.svg)
+See our [previous blog post](https://xarray.dev/blog/flox) or the [docs](https://flox.readthedocs.io/en/latest/implementation.html#method-map-reduce) for more.


It's a little confusing that these two images are not for identical problems - the inputs are different.

TomNicholas · 2024-08-06T22:16:54Z

src/posts/flox-smart/index.md

+```python
+mean_mapreduce = ds.groupby("time.month").mean(method="map-reduce")
+mean_cohorts = ds.groupby("time.month").mean() # this is auto-detected!
+```


Suggested change

```python
mean_mapreduce = ds.groupby("time.month").mean(method="map-reduce")
mean_cohorts = ds.groupby("time.month").mean() # this is auto-detected!
```

```python
mean_mapreduce = ds.groupby("time.month").mean(method="map-reduce") # mapreduce is a suboptimal manual choice here
mean_cohorts = ds.groupby("time.month").mean() # cohorts is a better choice - auto-detected!
```

TomNicholas · 2024-08-06T22:17:32Z

src/posts/flox-smart/index.md

+Using the algorithm described below, flox will **automatically** set
+`method="cohorts"` for this dataset unless specified, yielding a 5X decrease in
+memory used and a 2X increase in runtime. Read on to figure out how!
+Read on to figure out how!


Suggested change

Read on to figure out how!

(duplication)

TomNicholas · 2024-08-06T22:18:02Z

src/posts/flox-smart/index.md

+
+_Note that the improvements here are strongly dependent on the details
+of the grouping variable, the chunksize, and even dask's scheduler. In fact, writing this post
+prompted the discovery of a bug in dask's scheduler that should substantially improve the "map-reduce"


TomNicholas · 2024-08-06T22:26:59Z

src/posts/flox-smart/index.md

+
+The improvements described here are strongly dependent on the details
+of the grouping variable, the chunksize, and even dask's scheduler.
+In fact, writing this post prompted the discovery of a bug in dask's scheduler


This is repeated from above

TomNicholas · 2024-08-06T22:28:26Z

src/posts/flox-smart/index.md

+
+Importantly this inference is fast — [~250ms for the US county](https://flox.readthedocs.io/en/latest/implementation.html#example-spatial-grouping) GroupBy problem in our [previous post](https://xarray.dev/blog/flox) where approximately 3000 groups are distributed over 2500 chunks; and ~1.25s for grouping by US watersheds ~87000 groups across 640 chunks.
+
+## What's next?


Your conclusion is a bit rambly, and I think some of the content (especially call-outs to any other groupby nerds who can help) could be in a footnote.

I think you should end with a succinct one-line conclusion that emphasises that most of this complexity can now be safely ignored by users.

TomNicholas · 2024-08-06T22:30:33Z

src/posts/flox-smart/index.md

+Using the algorithm described below, flox will **automatically** set
+`method="cohorts"` for this dataset unless specified, yielding a 5X decrease in
+memory used and a 2X increase in runtime. Read on to figure out how!
+Read on to figure out how!


A significant number of readers will think it's cool that this is now automatic, but not really care about the specifics of how you did it. That section below is also very dense. I think if there is any other information that you want readers to take away from this point onwards (e.g. that Groupers exist) you should telegraph it, and explicitly give them the option to skip forwards to a different subheading guilt-free.

dcherian · 2024-08-06T22:36:36Z

Thanks for the review, @TomNicholas. I'm waiting for some tests on the new dask release before revising: https://docs.dask.org/en/stable/changelog.html#improve-scheduling-efficiency-for-xarray-groupby-reduce-patterns

Add blogpost on flox heuristics

1b261ab

vercel bot had a problem deploying to Preview July 29, 2024 22:18 Failure

Comment out images for now

e04df47

vercel bot had a problem deploying to Preview July 29, 2024 22:20 Failure

andersy005 reviewed Jul 29, 2024

dcherian and others added 3 commits July 29, 2024 16:35

Update src/posts/flox-smart/index.md

503a36b

Co-authored-by: Anderson Banihirwe <[email protected]>

edits

093a75e

Merge branch 'main' into flox-smart

d7eb21d

dcherian commented Jul 29, 2024

vercel bot deployed to Preview July 29, 2024 22:38 View deployment

more edits

44a8f02

dcherian force-pushed the flox-smart branch from 0c6e212 to 44a8f02 Compare July 29, 2024 22:44

vercel bot deployed to Preview July 29, 2024 22:46 View deployment

vercel bot deployed to Preview July 30, 2024 02:39 View deployment

more edits

105c22d

dcherian force-pushed the flox-smart branch from eb64606 to 105c22d Compare July 30, 2024 02:40

vercel bot deployed to Preview July 30, 2024 02:41 View deployment

dcherian commented Jul 30, 2024

dcherian commented Jul 30, 2024

dcherian commented Jul 30, 2024

dcherian commented Jul 30, 2024

more edits

010f01d

vercel bot deployed to Preview July 30, 2024 22:52 View deployment

more edits

1136f29

vercel bot deployed to Preview July 31, 2024 04:01 View deployment

more edits

1383ad8

vercel bot deployed to Preview July 31, 2024 04:02 View deployment

edit

5a2bb65

vercel bot deployed to Preview July 31, 2024 04:04 View deployment

edits

1cc7eb5

vercel bot deployed to Preview July 31, 2024 04:14 View deployment

vercel bot deployed to Preview July 31, 2024 04:17 View deployment

more edits

53e1c3a

dcherian force-pushed the flox-smart branch from 3507c99 to 53e1c3a Compare July 31, 2024 04:18

vercel bot deployed to Preview July 31, 2024 04:20 View deployment

tweak

ff8ccff

vercel bot deployed to Preview July 31, 2024 04:23 View deployment

dcherian and others added 3 commits August 1, 2024 09:57

Update

1c19f7f

Add demo.

5da9295

[pre-commit.ci] auto fixes from pre-commit.com hooks

ad8caea

for more information, see https://pre-commit.ci

vercel bot deployed to Preview August 3, 2024 03:24 View deployment

Fix

2203a34

vercel bot deployed to Preview August 3, 2024 03:29 View deployment

Add assets

ebbd007

vercel bot deployed to Preview August 3, 2024 03:31 View deployment

fix

9448fb2

vercel bot deployed to Preview August 3, 2024 03:34 View deployment

edit

ab715b9

vercel bot deployed to Preview August 3, 2024 03:38 View deployment

dcherian force-pushed the flox-smart branch from 433afa6 to ab715b9 Compare August 3, 2024 03:38

vercel bot deployed to Preview August 3, 2024 03:39 View deployment

Updates

dc559a7

vercel bot deployed to Preview August 5, 2024 19:30 View deployment

TomNicholas reviewed Aug 6, 2024

TomNicholas added the blog Blog post label Sep 13, 2024
