Some performance quick wins for the geopandas implementation #53

theroggy
Contributor

@theroggy theroggy commented Nov 3, 2023

I came across a link to your blog post with some performance comparisons between file formats. Because the performance differences there were not quite what I expected, I got curious and had a look at the code.

This PR should boost performance for .fgb, .gpkg and .shp. I disabled creation of the spatial index on .gpkg because none of the other formats have a spatial index either, and creating it takes quite some time. If you want to do serious spatial analysis using SQL on the .gpkg file, a spatial index can obviously be a huge advantage, but I don't think that is the use case here.
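For readers who want to try the same idea, here is a minimal sketch of what such a write call could look like. It is not the actual diff of this PR: `build_write_kwargs` and `write_layer` are hypothetical helper names, and it assumes a geopandas version whose `to_file` accepts `engine="pyogrio"` and forwards extra keyword arguments to GDAL as creation options (`SPATIAL_INDEX` is a GeoPackage layer creation option).

```python
from pathlib import Path

# Suffix -> GDAL/OGR driver name for the three formats mentioned above.
DRIVERS = {".fgb": "FlatGeobuf", ".gpkg": "GPKG", ".shp": "ESRI Shapefile"}


def build_write_kwargs(path, spatial_index=False):
    """Build keyword arguments for GeoDataFrame.to_file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    kwargs = {"driver": DRIVERS[suffix], "engine": "pyogrio"}
    if suffix == ".gpkg":
        # Assumption: extra kwargs are passed through as GDAL creation options.
        # Building the GeoPackage spatial index is the slow part being skipped.
        kwargs["SPATIAL_INDEX"] = "YES" if spatial_index else "NO"
    return kwargs


def write_layer(gdf, path, spatial_index=False):
    """Write a GeoDataFrame-like object using the pyogrio engine."""
    gdf.to_file(path, **build_write_kwargs(path, spatial_index))
```

With a real `geopandas.GeoDataFrame`, `write_layer(gdf, "buildings.gpkg")` would write without a spatial index, and `spatial_index=True` turns it back on for workloads that query the file with SQL.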

@cholmes
Collaborator

cholmes commented Nov 6, 2023

Awesome! Yeah, after I published the post Kyle Barron pointed out that pyogrio would make things faster. It was on the long list of things to check out, so I really appreciate this PR.

And removing the spatial index does make sense too - I also thought about that after the post. Ideally there'd be an option to create it or not. Since then I've also realized that adding a quadkey column to GeoParquet (like in this code) serves as a decently effective spatial index. So we could use that for a more apples-to-apples comparison when the spatial index is 'on'.
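To make the quadkey idea concrete, here is a rough sketch of how such a column value could be computed. It assumes the Bing Maps-style tile quadkey for a point in Web Mercator tiling; it is not necessarily what the linked code does, and the function names are made up for illustration.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator tiling scheme


def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of tile x/y into a base-4 string, one digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def lonlat_to_quadkey(lon, lat, zoom):
    """Quadkey of the tile containing a lon/lat point (degrees) at the given zoom."""
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    n = 1 << zoom  # number of tiles along each axis
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern/southern edge stay inside the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return tile_to_quadkey(x, y, zoom)
```

Points in the same tile share a quadkey, and a parent tile's quadkey is a prefix of its children's, so sorting rows by this column tends to keep nearby features together in the file. That clustering is what would let a reader skip most of the data for a bounding-box style query.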

@cholmes
Collaborator

cholmes commented Nov 6, 2023

Oh, and feel free to write a blog post with new numbers, I'd definitely promote it and link to it from my original post. I'd write it myself but my side project queue is vast these days so I doubt I'll get to it any time soon.

The other thing I really want is to make a new project that compares both read and write for any format, and isn't limited to this Google buildings processing. Just simple conversions, but with results that are easy to report out. This was a side effort while I was working with a couple of datasets, but I think it'd be awesome as its own project. I'd be happy to pitch in if you start on that, and to figure out a good home (perhaps in https://github.com/geopython).

If you're interested, feel free to contact me on Slack. I'm on the Cloud Native Geo Slack; you can use this invite link.

@cholmes cholmes merged commit f1703b2 into opengeos:main Nov 6, 2023
7 checks passed
@theroggy theroggy deleted the Some-performance-quick-wins-for-the-geopandas-implementation branch November 6, 2023 19:39