Update report.

guiyunbao · Aug 11, 2023 · 3815aa5 · 3815aa5
1 parent 333749d
commit 3815aa5
Show file tree

Hide file tree

Showing 2 changed files with 86 additions and 26 deletions.
diff --git a/README.md b/README.md
@@ -1,13 +1,16 @@
 # Benchmark geospatial query performance on NoSQL(-like) databases
 
+The full paper can be found in [Releases](https://github.com/guiyunbao/geospatial-benchmark/releases). Source paper can be found at [report.typ](./report.typ). You may want to compile it with [typst](https://github.com/typst/typst).
+
 ## Test Environment
 
 - Databases
-  - [MongoDB](https://www.mongodb.com/) Community Edition@latest
-  - [Redis](https://redis.io/) Stack@latest
+  - [MongoDB](https://www.mongodb.com/) Enterprise Server@latest
+  - [Redis](https://redis.io/) (Stack)@latest
 - Programming Language and lib
   - [TypeScript](https://www.typescriptlang.org/)
     - [Mongoose](https://mongoosejs.com/)
+    - [node-redis](https://redis.js.org/)
     - [Redis OM](https://github.com/redis/redis-om-node)
 - Host
   - [GitHub Codespaces](https://docs.github.com/en/codespaces)
@@ -73,4 +76,4 @@ Pick a random location from the dataset, find all locations within certain dista
 
 This project is licensed under the terms of the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
 
-Citation info can be found at [CITATION.cff](./CITATION.cff)
+To cite this report, check [CITATION.cff](./CITATION.cff).
diff --git a/report.typ b/report.typ
@@ -90,26 +90,6 @@ Generally, the benchmark should be as close to the actual use case as possible.
 
 There are some challenges while benchmarking. These challenges expose the real development issues that we may encounter in the future. We will talk about them later.
 
-== Limitation
-
-Due to complexity, serveral limitaions are found during the benchmark.
-
-=== Coordinate System
-
-Redis uses EPSG:3857 #footnote[https://epsg.io/3857], as known as _Pseudo-Mercator_ or _Web Mercator_, as the coordinate system @RedisGeoadd.
-This coordinate system is not including the area near the poles. So trying to insert a point near the poles will fail.
-In this benchmark, *while inserting points into Redis, if a point is near the poles, we will move it to a valid EPSG:3857 latitude*.
-
-It is worth mentioning that this system may not reflect actual ground distance as its projection is not perfect for Earth. *Though some reports say it has a 0.7% error rate @ErrorInProjection, we will ignore the error as it is far beyond our purpose.*
-
-MongoDB uses EPSG:4326 #footnote[https://epsg.io/4326], as known as _WGS84_ which is used in _GPS_. It's not affected by those previously mentioned issues.
-
-=== Query
-
-In Redis, querying around a geo point requires radius as one of the parameters. If the radius is too large, the query will be slow. *For simulating actual use cases, we will use 100 kilometers as the maximum radius while querying in Redis.*
-
-*All queries will use a valid EPSG:3857 coordinate.*
-
 == Test Data
 
 === iNaturalist 2017
@@ -121,7 +101,7 @@ We found a dataset with the geolocation that we needed for the benchmark called
 
 === Generated Data
 
-To simulate some edge cases, we generated 3 different type of datasets.
+To simulate some edge cases, we generated 3 different types of datasets.
 
 ==== Random
 
@@ -133,7 +113,7 @@ Generated points with Fibonacci Sphere Algorithm @FibonacciSphereAlgo. It is a m
 
 ==== Cluster
 
-Every 50 points will be consider as a cluster and will be placed around a random center on earch with short radius. All points will be generated in the range of EPSG:4326.
+Every 50 points will be considered as a cluster and will be placed around a random center on earch with short radius. All points will be generated in the range of EPSG:4326.
 
 == Test Environment
 
@@ -148,11 +128,88 @@ GitHub Actions #footnote[https://github.com/features/actions] is used for benchm
 - Redis \@ latest #footnote[https://hub.docker.com/_/redis]
   - Stack Server \@ latest #footnote[https://hub.docker.com/r/redis/redis-stack-server]
 
+=== Tools
+
+Source code used in the benchmark can be found in guiyunbao/geospatial-benchmark #footnote[https://github.com/guiyunbao/geospatial-benchmark] repository.
+
+=== Test cases
+
+==== Case A
+
+Pick a random point on earth, find the closest location in the dataset.
+
+This case is the most common use case for geospatial query. It is used to find the closest location to the user.
+
+==== Case B
+
+Pick a random location from the dataset, find all locations within certain distance.
+
+This case is used to find all locations within a certain distance from the user. Also measure the bandwidth of the database as the result may be large.
+
+==== Case C
+
+Pick a random location from the dataset, find all locations within certain distance, order by distance.
+
+This case is used to measure the performance of sorting by distance.
+
+== Limitation
+
+Due to complexity, several limitations are found during the benchmark. All patch that we made to ensure the test suite run correctly will be listed below.
+
+=== Coordinate System
+
+Redis uses EPSG:3857 #footnote[https://epsg.io/3857], as known as _Pseudo-Mercator_ or _Web Mercator_, as the coordinate system @RedisGeoadd.
+This coordinate system is not including the area near the poles. So trying to insert a point near the poles will fail.
+In this benchmark, *while inserting points into Redis, if a point is near the poles, we will move it to a valid EPSG:3857 latitude*.
+
+It is worth mentioning that this system may not reflect actual ground distance as its projection is not perfect for Earth. *Though some reports say it has a 0.7% error rate @ErrorInProjection, we will ignore the error as it is far beyond our purpose.*
+
+MongoDB uses EPSG:4326 #footnote[https://epsg.io/4326], as known as _WGS84_ which is used in _GPS_. It's not affected by the issue.
+
+=== Query
+
+In Redis, querying around a geo point requires radius as one of the parameters. If the radius is too large, the query will be slow. *For simulating actual use cases, we will use 100 kilometers as the maximum radius while querying in Redis.*
+
+*All queries will use a valid EPSG:3857 coordinate.*
+
+In Redis Stack, there is no way to evaluate distance between two points. And it is not possible to order them during query. So *Redis Stack will calulate the distance and order them in the application layer in Test Case C.*
+
 == Test Result
 
-= Analyze
+Due to limited time, we do not make a chart or numberic analysis here. But all raw data can be found in GitHub Action logs #footnote[https://github.com/guiyunbao/geospatial-benchmark/actions/workflows/dispatch.yml]. You can also fork the repository and run the benchmark by yourself.
+
+*Redis is the fastest in all cases*, followed by MongoDB and Redis Stack. *Most of the time, MongoDB is better than Redis Stack.*
+
+=== Analyze by Test Cases
+
+In most test case, Redis is dominating the benchmark. It is at least 5x faster than MongoDB, 10x faster then Redis Stack.
+And it seems that sorting by distance did not impact performance in all databases. Even though Redis Stack is sorting them at the application layer.
+
+=== Analyze by Datasets
+
+Manually generated datasets are used to simulate some edge cases. They exposed that Redis family can not handle area around poles properly.
+
+==== iNaturalist 2017 and Cluster
+
+Both datasets simulate the real world use cases. PoI may be clustered in some area. And the dataset may be large. All databases are suffering a bandwidth issue in Test Case B and C, ops/sec is significantly decreasing.
+
+==== Random
+
+Randomly distributed points are not a common case. But still worth to test. Databases can not take advantage by grouping or other optimization. So the performance is not as good as other datasets. At a large scale, this is actually like testing on datasets with cities on earth as a point. Do not see any difference on all databases.
+
+==== Grid
+
+Grid is a special case. It is evenly distributed on the surface of a sphere. All Test Case would return a much smaller results which could reflect the actual querying speed. At this point, Redis Stack sometimes faster than MongoDB.
 
 = Conclusion
 
+Raw Redis is the fastest among all databases. With geospatial as index, it runs almost 10x faster than MongoDB, and 20x faster than Redis Stack. But it is not suitable for complex queries. As raw Redis could not set a secondary or more indexes. Neither querying with the value it stored. To achieve same experience like other relational database. Developer have to maintain the foreign key in the application layer which could be expensive and complex. It may also cause data inconsistency and over-use of memory.
+
+MongoDB is somehow a balance to the performance and the complexity. It is not as fast as raw Redis, but it is still fast enough for most use cases. Its design is more like a relational database. MongoDB supports secondary indexes and querying with the value it stored. It also supports complex queries like aggregation. There is a In-Memory Storage Engine #footnote[https://www.mongodb.com/docs/manual/core/inmemory/#in-memory-storage-engine] in Enterprise Server but it is not enable by default. It also requires a setup of a replica set to persist the data. That is why it is not cover in this benchmark.
+
+Redis Stack could be a viable choice in the future, but not for now. It is still at a early prototype stage in our opinion. Though it extends the core features of Redis, it becomes not as fast as raw Redis. It tries to implements a document-based database, but its ecosystem is frustrating. At least for Redis OM for JavaScript. The OM can not provide a proper type annotation when querying with the value it stored. Which is implements well in Mongoose.
+
+
+
 #pagebreak(weak: true)
 #bibliography("cites.yml")