Bringing too much data back to the driver (collect and friends)
+
A common anti-pattern in Apache Spark is using collect() and then processing records on the driver. There are a few different reasons why folks tend to do this and we can work through some alternatives:
+
+
Label items in ascending order
+
ZipWithIndex
+
+
+
Index items in order
+
Compute the size of each partition and use this to assign indexes.
+
+
+
In order processing
+
Compute a partition at a time (this is annoying to do, sorry).
+
+
+
Writing out to a format not supported by Spark
+
Use foreachPartition or implement your own DataSink.
+
+
+
Need to aggregate everything into a single record
+
Call reduce or treeReduce
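The "index items in order" alternative above can be sketched roughly as follows. This is a minimal illustration, assuming `rdd` is a PySpark RDD (or anything exposing `mapPartitionsWithIndex` and `collect`); the function name `index_in_order` is hypothetical. Only one small (partition id, size) pair per partition travels to the driver.

```python
from itertools import accumulate


def index_in_order(rdd):
    """Assign a global, ordered index to each record without collecting the data."""
    # Pass 1: only (partition id, record count) pairs come back to the driver.
    sizes = dict(
        rdd.mapPartitionsWithIndex(
            lambda pid, records: [(pid, sum(1 for _ in records))]
        ).collect()
    )
    # Starting index of each partition = total size of all earlier partitions.
    ordered_pids = sorted(sizes)
    running_totals = accumulate(sizes[pid] for pid in ordered_pids)
    offsets = dict(zip(ordered_pids, [0, *running_totals]))
    # Pass 2: label records on the executors using the small offsets table.
    return rdd.mapPartitionsWithIndex(
        lambda pid, records: (
            (offsets[pid] + i, record) for i, record in enumerate(records)
        )
    )
```

The result is an RDD of (index, record) pairs, so any further processing can stay distributed.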
+
+
+
+
Sometimes you really do need to bring the data back to the driver (e.g., updating model weights). In those cases, especially if you process the data sequentially, you can limit the amount of data coming back to the driver at one time. toLocalIterator gives you back an iterator which only needs to fetch one partition at a time (although in Python this may be pipelined for efficiency). By default toLocalIterator will launch a Spark job for each partition, so if you know you will eventually need all of the data it makes sense to do a persist + a count (async or otherwise) so you don't block as long between partitions.
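A minimal sketch of the persist + count + toLocalIterator pattern described above, assuming `rdd` is a PySpark RDD. The helper name `process_on_driver` and the `handle_record` callback are hypothetical.

```python
def process_on_driver(rdd, handle_record):
    """Stream records to the driver one partition at a time."""
    # Materialize once so each per-partition fetch reads cached data
    # instead of recomputing the lineage.
    rdd.persist()
    rdd.count()
    try:
        for record in rdd.toLocalIterator():
            handle_record(record)
    finally:
        # Release the cached blocks once the driver-side loop is done.
        rdd.unpersist()
```

Here the count is synchronous; whether an asynchronous variant is worthwhile depends on how much other work the driver can do in the meantime.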
+
This doesn't mean every call to collect() is bad. If the amount of data being returned is under ~1 GB it's probably OK, although it will limit parallelism.
+
Beware that broadcast joins put unnecessary pressure on the driver. Before a table is broadcast to the executors, its data is first brought back to the driver, so you might run into driver OOMs.
+
Broadcast only smaller tables; this is usually recommended for tables under 10 MB, which is also roughly the default threshold. That said, we can comfortably broadcast much larger datasets as long as they fit in both executor and driver memory. Remember that if there are multiple broadcast joins in the same stage, you need enough room for all of those datasets in memory.
+You can configure the broadcast threshold using spark.sql.autoBroadcastJoinThreshold, or increase the driver memory by setting spark.driver.memory to a higher value.
+
Make sure that you have more memory on your driver than the sum of all your broadcasted data in any stage plus all the other overheads that the driver deals with!
+
Tables getting broadcasted even when broadcast is disabled
+
You expect broadcasting to stop after you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but Spark still tries to broadcast the bigger table and fails with a broadcast error, and the physical plan contains a BroadcastNestedLoopJoin.
+
+
Check for sub queries in your code using NOT IN
+
+
Example :
+
select * from TableA where id not in (select id from TableB)
+
+
This typically results in a forced BroadcastNestedLoopJoin even when the broadcast setting is disabled.
+If the data being processed is large enough, this results in broadcast errors when Spark attempts to broadcast the table.
+
+
Rewrite the query using NOT EXISTS or a regular LEFT JOIN instead of NOT IN. Note that the two forms only return the same rows when the compared columns contain no NULLs, so check for NULLs before relying on the rewrite.
+
+
Example:
+
select * from TableA where not exists (select 1 from TableB where TableA.id = TableB.id)
+
+
The rewritten query will use SortMergeJoin, which resolves driver memory errors caused by forced broadcasts.
When your compile-time class path differs from the runtime class path, you may encounter errors that signal that a class or method could not be found (e.g., NoClassDefFoundError, NoSuchMethodError).
+
java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.subTypeValidator(Lcom/fasterxml/jackson/databind/cfg/MapperConfig;)Lcom/fasterxml/jackson/databind/jsontype/PolymorphicTypeValidator;
+ at com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.buildTypeDeserializer(AvroTypeResolverBuilder.java:43)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findTypeDeserializer(BasicDeserializerFactory.java:1598)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findPropertyContentTypeDeserializer(BasicDeserializerFactory.java:1766)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.resolveMemberAndTypeAnnotations(BasicDeserializerFactory.java:2092)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.constructCreatorProperty(BasicDeserializerFactory.java:1069)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addExplicitPropertyCreator(BasicDeserializerFactory.java:703)
+ at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:476)
+ ...
+
This may be due to packaging a fat JAR with dependency versions that are in conflict with those provided by the Spark environment. When there are multiple versions of the same library in the runtime class path under the same package, Java's class loader hierarchy kicks in, which can lead to unintended behaviors.
+
There are a few options to get around this.
+
+
1. Identify the version of the problematic library within your Spark environment and pin the dependency to that version in your build file. To identify the version used in your Spark environment, in the Spark UI go to the Environment tab, scroll down to Classpath Entries, and find the corresponding library.
+
2. Exclude the transitive dependency on the problematic library from imported libraries in your build file.
+
3. Shade the problematic library under a different package.
+
+
If options (1) and (2) result in more dependency conflicts, it may be that the version of the problematic library in the Spark environment is incompatible with your application code. Therefore, it makes sense to shade the problematic library so that your application can run with a version of the library isolated from the rest of the Spark environment.
+
If you are using the shadow plugin in Gradle, you can shade using:
+
Container OOMs can be difficult to debug as the container running the problematic code is killed, and sometimes not all of the log information is available.
+
Non-JVM language users (such as Python) are most likely to encounter issues with container OOMs. This is because the JVM is generally configured to not use more memory than the container it is running in.
+
Everything which isn't inside the JVM (Tensorflow, Python, bash, etc.) is considered "overhead". A first step with a container OOM is often increasing spark.executor.memoryOverhead and spark.driver.memoryOverhead to leave more memory for non-Java processes.
+
Python users can set spark.executor.pyspark.memory to limit the Python VM to a certain amount of memory. This amount of memory is then added to the overhead.
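To make the sizing concrete, here is a rough sketch of how the pieces above add up into the memory requested for one executor container. The 10% overhead factor and 384 MB minimum are assumed defaults (check the configuration reference for your Spark version and cluster manager), and the helper name is hypothetical.

```python
def container_request_mb(
    executor_memory_mb,
    overhead_mb=None,
    pyspark_memory_mb=0,
    overhead_factor=0.10,   # assumed default overhead factor
    min_overhead_mb=384,    # assumed minimum overhead
):
    """Estimate the memory requested for one executor container, in MB."""
    if overhead_mb is None:
        # No explicit spark.executor.memoryOverhead: derive it from the heap size.
        overhead_mb = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    # JVM heap + non-JVM overhead + memory reserved for Python workers.
    return executor_memory_mb + overhead_mb + pyspark_memory_mb
```

For example, an 8192 MB executor with an explicit 2048 MB overhead and 1024 MB of PySpark memory requests 11264 MB; anything the non-JVM processes use beyond their share of that is what gets the container killed.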
spark.sql.AnalysisException: Correlated column is not allowed in predicate
+
SPARK-35080 introduces a check for correlated subqueries with aggregates, which may previously have returned incorrect results.
+Starting in Spark 2.4.8, these queries raise an org.apache.spark.sql.AnalysisException instead.
create or replace view t1(c) as values ('a'), ('b');
+create or replace view t2(c) as values ('ab'), ('abc'), ('bc');
+
+select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1;
+
Instead you should do an explicit join and then perform your aggregation:
+
create or replace view t1(c) as values ('a'), ('b');
+create or replace view t2(c) as values ('ab'), ('abc'), ('bc');
+
+create or replace view t3 as select t1.c from t2 INNER JOIN t1 ON t1.c = substring(t2.c, 1, 1);
+
+select c, count(*) from t3 group by c;
+
Similarly:
+
create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
+create or replace view t2(c) as values (6);
+
+select c, (select count(*) from t1 where a + b = c) from t2;
+
Can be rewritten as:
+
create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
+create or replace view t2(c) as values (6);
+
+create or replace view t3 as select t2.c from t2 INNER JOIN t1 ON t2.c = t1.a + t1.b;
+
+select c, count(*) from t3 group by c;
+
Likewise in Scala and Python use an explicit .join and then perform your aggregation on the joined result.
+Now Spark can compute correct results thus avoiding the exception.
Result size larger than spark.driver.maxResultSize error OR Kryo serialization failed: Buffer overflow.
+
ex:
+
You typically run into this error for one of the following reasons.
+
+
You are sending a large result set to the driver using SELECT (in SQL) or collect (in DataFrames/Datasets/RDDs): Apply a limit if your intention is to spot check a few rows, since you won't be able to go through the full set when the row count is very high. Writing the results to a temporary table in your schema and querying the new table is an alternative if you need to query the results multiple times with a specific set of filters.
+
You are broadcasting a table that is too big. Spark downloads all the rows of a table that needs to be broadcast to the driver before it starts shipping them to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table, so that Spark uses a shuffle hash or sort-merge join instead.
+
You have a sort in your SQL/DataFrame: Spark internally uses range partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from the input partitions and sending them to the driver to compute range boundaries. This error then falls into one of the scenarios below.
+ a. You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending more bytes than spark.driver.maxResultSize allows. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size causes an imbalance in partition ranges (e.g., skew in a subsequent stage or non-uniform output files). If using the latter option, be sure spark.driver.maxResultSize is less than spark.driver.memory.
+ b. You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, the total bytes for the sampled rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. A recommended way to resolve this issue is to combine the splits for the table with many map tasks (increase spark.(path).(db).(table).target-size). Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver, as it needs to keep track of metadata for all these tasks/partitions.
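The two sort scenarios above come down to the same back-of-the-envelope product. The sketch below uses the rough partitions*sampleSize estimate from the text multiplied by an average row size; it is not Spark's exact sampling logic, and both function names are hypothetical.

```python
def sort_sample_bytes(num_partitions, sample_size_per_partition, avg_row_bytes):
    """Rough estimate of the bytes sent to the driver when sampling for a sort."""
    return num_partitions * sample_size_per_partition * avg_row_bytes


def exceeds_max_result_size(
    num_partitions, sample_size_per_partition, avg_row_bytes, max_result_size_bytes
):
    """True if the estimated sample would not fit under spark.driver.maxResultSize."""
    estimate = sort_sample_bytes(
        num_partitions, sample_size_per_partition, avg_row_bytes
    )
    return estimate > max_result_size_bytes
```

With 80,000 partitions, 20 sampled rows per partition, and 1 KiB rows, the estimate is about 1.6 GB, which would exceed a 1 GiB limit. Either fewer partitions (scenario b) or a smaller sample or narrower rows (scenario a) bring it back under.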
Result size larger than spark.driver.maxResultSize error
+
ex:
+
You typically run into this error for one of the following reasons.
+
+
You are sending a large result set to the driver using SELECT (in SQL) or collect (in DataFrames/Datasets/RDDs): Apply a limit if your intention is to spot check a few rows, since you won't be able to go through the full set when the row count is very high. Writing the results to a temporary table in your schema and querying the new table is an alternative if you need to query the results multiple times with a specific set of filters. (See Collect best practices.)
+
You are broadcasting a table that is too big. Spark downloads all the rows of a table that needs to be broadcast to the driver before it starts shipping them to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table, so that Spark uses a shuffle hash or sort-merge join instead. Note that Spark broadcasts a table referenced in a join if the size of the table is less than spark.sql.autoBroadcastJoinThreshold (100 MB by default at Netflix). You can raise this config to include larger tables in broadcasts or lower it to exclude certain tables. You can also set it to -1 to disable broadcast joins.
+
You have a sort in your SQL/DataFrame: Spark internally uses range partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from the input partitions and sending them to the driver to compute range boundaries. This error then falls into one of the scenarios below.
+ a. You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending more bytes than spark.driver.maxResultSize allows. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size causes an imbalance in partition ranges (e.g., skew in a subsequent stage or non-uniform output files).
+ b. You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, the total bytes for the sampled rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. A recommended way to resolve this issue is to combine the splits for the table with many map tasks (increase spark.netflix.(db).(table).target-size). Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver, as it needs to keep track of metadata for all these tasks/partitions.
If you see java.lang.OutOfMemoryError: in the driver log/stderr, it is most likely from the driver JVM running out of memory. This article has the memory config for increasing the driver memory. One reason you could run into this error is
+if you are reading from a table with too many splits (S3 files) and overwhelming the driver with a lot of metadata.
+
Another cause of driver out of memory errors is when the number of partitions is too high and you trigger a sort or shuffle where Spark samples the data, but then runs out of memory while collecting the sample. To solve this, repartition to a lower number of partitions, or if you're using RDDs, coalesce is a more efficient option (in DataFrames coalesce can have an impact upstream in the query plan).
+
A less common, but still semi-frequent, cause of driver out of memory errors is an excessive number of tasks retained in the UI. This can be controlled by reducing spark.ui.retainedTasks (default 100k).
+
Note that it is very rare to run into this error. You may see it when you are using too many filters (in your SQL/DataFrame/Dataset). The workaround is to increase the Spark driver JVM stack size by setting the config below to something higher than the default:
+
+
spark.driver.extraJavaOptions: "-Xss512M" #Sets the stack size to 512 MB
+
By far the most common cause of executor out of disk errors is a mis-configuration of Spark's temporary directories.
+
You should set spark.local.dir to a directory with lots of local storage available. If you are on YARN this will be overridden by the LOCAL_DIRS environment variable on the workers.
+
Kubernetes users may wish to add a large emptyDir for Spark to use for temporary storage.
+
Another common cause is keeping RDDs/DataFrames/Datasets in scope after they are no longer needed. This tends to happen more often with notebooks, as more things are placed in the global scope where they are not automatically cleaned up. A solution is to break your code into more functions so that things go out of scope, or to explicitly set no-longer-needed RDDs/DataFrames/Datasets to None/null.
Executor out of memory issues can come from many sources. To narrow down the cause of the error, there are a few important places to look: the Spark Web UI, the executor log, the driver log, and (if applicable) the cluster manager (e.g. YARN/K8s) log/UI.
+
Container OOM
+
If the driver log indicates Container killed by YARN for exceeding memory limits for the applicable executor, or if (on K8s) the Spark UI shows the reason for the executor loss as "OOMKill" / exit code 137, then it's likely your program is exceeding the amount of memory assigned to it. This doesn't normally happen with pure JVM code, but instead when calling PySpark or JNI libraries (or using off-heap storage).
+
PySpark users are the most likely to encounter container OOMs. If you have a PySpark UDF in the stage you should check out Python UDF OOM to eliminate that potential cause. Another potential issue to investigate is whether you have key skew, as trying to load too large a partition in Python can result in an OOM. If you are using a library, like Tensorflow, which results in
+
Missing Files / File Not Found / Reading past RLE/BitPacking stream
+
Missing files are a relatively rare error in Spark. Most commonly they are caused by non-atomic operations in the data writer and will go away when you re-run your query/job.
+
On the other hand, Reading past RLE/BitPacking stream or other file read errors tend to be non-transient.
+If the error is not transient, it may mean that the metadata store (e.g. Hive or Iceberg) is pointing to a file that does not exist or has a bad format. You can clean up Iceberg tables using Iceberg Table Cleanup from holden's spark-misc-utils, but be careful and talk with whoever produced the table to make sure that it's OK.
+
If you get a failed to read parquet file while you are not trying to read a parquet file, it's likely that you are using the wrong metastore.
+
FetchFailed exceptions are mainly due to misconfiguration of spark.sql.shuffle.partitions:
+
+
Too few shuffle partitions: Having too few shuffle partitions means you could have a shuffle block that is larger than the limit (Integer.MaxValue = ~2 GB) or an OOM (exit code 143). The symptom can also be long-running tasks where the blocks are large but have not reached the limit. A quick fix is to increase the shuffle/reducer parallelism by increasing spark.sql.shuffle.partitions (stock Apache Spark defaults to 200; your platform may override this).
+
Too many shuffle partitions: Too many shuffle partitions can put stress on the shuffle service and lead to errors like network timeouts. Note that the shuffle service is shared by all the jobs running on the cluster, so it is possible that someone else's job with high shuffle activity could cause errors for your job. It is worth checking whether there is a pattern of these failures for your job to confirm whether the issue is with your job or not. Also note that the more shuffle partitions you have, the more likely you are to see this issue.
+
+
Tell me more.
+
FetchFailed exceptions fall into the four categories below:
+
+
Ran out of heap memory(OOM) on an Executor
+
Ran out of overhead memory on an Executor
+
Shuffle block greater than 2 GB
+
Network TimeOut.
+
+
Ran out of heap memory(OOM) on an Executor
+
This error indicates that the executor hosting the shuffle block has crashed due to a Java OOM. The most likely cause is a misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions. Note that if you have skew from a single key (in a join or group by), increasing this property won't resolve the issue. Please refer to key-skew for related workarounds.
+
Errors that you normally see in the executor/task logs:
+
+
ExecutorLostFailure due to Exit code 143
+
ExecutorLostFailure due to Executor Heartbeat timed out.
+
+
Ran out of overhead memory on an Executor
+
This error indicates that the executor hosting the shuffle block has crashed after running out of off-heap (overhead) memory. Increasing spark.executor.memoryOverhead (spark.yarn.executor.memoryOverhead in older YARN deployments) should prevent this specific exception.
+
Error that you normally see in the executor/task logs:
+
+
ExecutorLostFailure, # GB of # GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
+
+
Shuffle block greater than 2 GB
+
The most likely cause is a misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions (this increases the number of blocks and reduces the block size). Note that if you have skew from a single key (in a join or group by), increasing this property won't resolve the issue. Please refer to key-skew for related workarounds.
+
Error that you normally see in the executor/task logs:
+
+
Too Large Frame
+
Frame size exceeding
+
size exceeding Integer.MaxValue(~2GB)
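As a rough way to pick a value when increasing the shuffle partitions, one can work backwards from the total shuffle size (visible in the Spark UI) to a target amount of data per partition. The sketch below is only a starting point: the 128 MiB target is an illustrative assumption rather than a Spark default, the function name is hypothetical, and it does nothing for single-key skew.

```python
import math


def suggest_shuffle_partitions(
    total_shuffle_bytes,
    target_partition_bytes=128 * 1024 * 1024,  # hypothetical target, not a Spark default
):
    """Suggest a partition count that keeps the average partition near the target size."""
    return max(1, math.ceil(total_shuffle_bytes / target_partition_bytes))
```

For 1 TiB of shuffle data and the 128 MiB target this suggests 8192 partitions, which keeps the average partition far below the ~2 GB block limit.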
+
+
Network Timeout
+
The most likely cause of this exception is high shuffle activity (high network load) in your job. Reducing the shuffle partitions (spark.sql.shuffle.partitions) would mitigate this issue. You can also reduce the network load by modifying the shuffle config. (todo: add details)
+
Error that you normally see in the executor/task logs:
+
+
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
+
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx
+
Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx
+
Spark SQL AnalysisException covers a wide variety of potential issues, ranging from ambiguous columns to more esoteric items like subquery issues. A good first step is making sure that your SQL is valid and your brackets are where you intend by putting your query through a SQL pretty-printer. After that, hopefully the details of the AnalysisException error will guide you to one of the sub-nodes in the error graph.
To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. Sometimes a stage with even partitioning is still slow.
+
There are a few common possible causes of slow stages when the partitioning is even. If your tasks are too short (e.g. finishing in under a few minutes), you likely have too many partitions/tasks. If your tasks are taking just the right amount of time but your jobs are slow, you may not have enough executors. If your tasks are taking a long time, you may have records that are too large, not enough partitions/tasks, or just slow functions. Another sign of not enough tasks can be excessive spill to disk.
+
If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. Insufficient partitioning can be fixed by increasing the number of partitions (e.g. repartition(5000) or change spark.sql.shuffle.partitions).
+
Another cause of too large partitioning can be non-splittable compression formats, like gzip, that can be worked around with tools like splittablegzip.
Iceberg does not perform validation on the files specified, so it will let you create a table pointing to non-supported formats, e.g. CSV data, but will fail at query time. In this case you need to use a different metastore (e.g. Hive).
+
If the data is stored in a supported format, it is also possible you have an invalid Iceberg table.
+
diff --git a/details/failure-executor-large-record/index.html b/details/failure-executor-large-record/index.html
new file mode 100644
index 0000000..ee2c106
--- /dev/null
+++ b/details/failure-executor-large-record/index.html
@@ -0,0 +1,381 @@
+
+
+
+
+
+
+
+
+
+
+ Large record problems can show up in a few different ways. - Spark Advanced Topics
+
+
+
+
+
+
+
+
+
+
Large record problems can show up in a few different ways.
+
For particularly large records you may find an executor out of memory exception, otherwise you may find slow stages.
+
You can get a Kryo serialization error (for SQL) or a Java serialization error (for RDDs). In addition, if a given column in a row is too large you may encounter an IllegalArgumentException: Cannot grow BufferHolder by size, because the size after growing exceeds size limitation 2147483632.
+
Some common causes of records that are too big are groupByKey in RDD land, UDAFs or list aggregations (like collect_list) in Spark SQL, and highly compressed or sparse records without a sparse serialization.
+
For sparse records check out AltEncoder in spark-misc-utils (https://github.com/holdenk/spark-misc-utils).
+
If you are uncertain where exactly the too-big record is coming from after looking at the executor logs, you can try to separate the failing stage into distinct parts of the code by using persist at the DISK_ONLY level to introduce cuts into the graph.
+
If your exception is happening with a Python UDF, it's possible that the individual records themselves are not too large, but the batch size used by Spark is set too high for the size of your records. You can try turning down the batch size.
+
There are multiple use cases where you might want to measure performance for different transformations in your spark job, in which case you have to materialize the transformations by calling an explicit action. If you encounter an exception during the write phase that appears unrelated, one technique is to force computation earlier of the DataFrame or RDD to narrow down the true cause of the exception.
+
Forcing computation on RDDs is relatively simple: all you need to do is call count() and Spark will evaluate the RDD.
+
Forcing computation on DataFrames is more complex. Calling an action like count() on a DataFrame might not necessarily work because the optimizer will likely ignore unnecessary transformations. In order to compute the row count, Spark does not have to execute all transformations. The Spark optimizer can simplify the query plan in such a way that the actual transformation that you need to measure will be skipped because it is simply not needed for finding out the final count. In order to make sure all the transformations are called, we need to force Spark to compute them using other ways.
+
Here are some options to force Spark to compute all transformations of a DataFrame:
+
+
df.rdd.count() : convert to an RDD and perform a count
+
df.foreach(_ => ()) : do-nothing foreach
+
Write to an output table (not recommended for performance benchmarking since the execution time will be impacted heavily by the actual writing process)
+
If using Spark 3.0 and above, benchmarking is simplified by the "noop" write format, which forces computation of all transformations without actually writing any output.
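The "noop" option from the list above might be wrapped for benchmarking as in this minimal sketch, assuming `df` is a PySpark DataFrame on Spark 3.0 or later. The helper name `force_compute` is hypothetical.

```python
import time


def force_compute(df):
    """Execute every transformation of a DataFrame and return the elapsed seconds."""
    start = time.perf_counter()
    # The "noop" format runs the full plan but discards the rows,
    # so the timing is not dominated by an actual write.
    df.write.format("noop").mode("overwrite").save()
    return time.perf_counter() - start
```

Calling it on the intermediate DataFrames of a failing job, from the earliest to the latest, can also show which step first raises the exception.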
+
Key or partition skew is a frequent problem in Spark. Key skew can result in everything from slowly running jobs (with stragglers), to failing jobs.
+
What is data skew?
+
+
+
Skew is usually caused during a transformation, when one partition ends up with a lot more data than the others. Bumping up memory could resolve an OOM error but does not solve the underlying problem.
+
+
+
When partitions are unbalanced by an order of magnitude, the largest partition becomes the bottleneck.
+
+
+
How to identify skew
+
+
If one task took much longer to complete than the other tasks, it's usually a sign of skew. In the Spark UI, under Summary Metrics for completed tasks, a Max duration that is significantly higher than the Median usually indicates skew.
+
+
+
Things to consider
+
+
Mitigating skew has a cost (e.g. a repartition), so skew can be ignored unless the duration or input size is significantly higher in magnitude and severely impacts job time.
+
+
Mitigation strategies
+
+
+
Increasing executor memory to prevent OOM exceptions -> This is a short-term solution if you want to unblock yourself, but it does not address the underlying issue. Sometimes this is not an option because you are already running at the maximum allowable memory settings.
+
+
+
Salting is a way to balance partitions by introducing a salt/dummy key for the skewed partitions. Here is a sample workbook and an example of salting in content performance show completion pipeline, where the whole salting operation is parametrized with a JOIN_BUCKETS variable which helps with maintenance of this job.
+
+
+
+
Isolate the data for the skewed key, broadcast it for processing (e.g. join) and then union back the results
+
+
+
Adaptive Query Execution, a framework introduced in Spark 3.0, enables Spark to dynamically identify skew. Under the hood, adaptive query execution splits (and replicates if needed) skewed (large) partitions. If you are unable to upgrade to Spark 3.0, you can build the solution into your code by using the salting/partitioning technique listed above.
+
+
+
Using approximate functions / probabilistic data structures
+
+
+
Using approximate distinct counts (Hyperloglog) can help get around skew if absolute precision isn't important.
+
+
+
Approximate data structures like Tdigest can help with quantile computations.
+If you need exact quantiles, check out the example in High Performance Spark
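The salting strategy from the list above can be illustrated without Spark. This is a plain-Python model of a two-stage salted count, not the pipeline referenced in the text; `num_buckets` plays the role of a parameter like JOIN_BUCKETS, and the function name is hypothetical.

```python
import random
from collections import Counter


def salted_count(keys, num_buckets, seed=0):
    """Count occurrences per key in two stages using a salted key."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (key, salt) so a hot key is spread over num_buckets groups.
    partial = Counter((key, rng.randrange(num_buckets)) for key in keys)
    # Stage 2: drop the salt and combine the much smaller partial results.
    totals = Counter()
    for (key, _salt), count in partial.items():
        totals[key] += count
    return partial, totals
```

The first stage is where the heavy work happens, and no single (key, salt) group has to hold all of the hot key's records. The second stage only merges at most `num_buckets` partial counts per key.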
+
Certain types of aggregations and windows can result in partitioning the data on a particular key.
+
Partial Aggregation is a key concept when handling large amounts of data in Spark. Full aggregation means that all of the data for one key must be together on the same node and then it can be aggregated, whereas partial aggregation allows Spark to start the aggregation "map-side" (e.g. before the shuffle) and then combine these "partial" aggregations together.
+
In RDD world the classic "full" aggregation is groupByKey and partial aggregation is reduceByKey.
+
In DataFrames/Datasets, Scala UDAFs implement partial aggregation, but the basic PySpark pandas/Arrow UDAFs do not support partial aggregation.
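The difference between full and partial aggregation can be modeled in plain Python, with each inner list standing in for one partition. This is only an illustration of the data movement, not Spark's implementation; both function names are hypothetical, and the second value returned is the number of records that would cross the shuffle.

```python
from collections import defaultdict


def full_aggregation(partitions):
    """groupByKey style: every record is shuffled, then summed per key."""
    shuffled = defaultdict(list)
    for part in partitions:
        for key, value in part:
            shuffled[key].append(value)
    records_shuffled = sum(len(values) for values in shuffled.values())
    return {key: sum(values) for key, values in shuffled.items()}, records_shuffled


def partial_aggregation(partitions):
    """reduceByKey style: combine inside each partition first, then merge partials."""
    partials = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # map-side combine, before the shuffle
        partials.append(local)
    records_shuffled = sum(len(local) for local in partials)
    totals = defaultdict(int)
    for local in partials:
        for key, value in local.items():
            totals[key] += value
    return dict(totals), records_shuffled
```

Both functions return the same sums, but the partial version ships at most one record per key per partition, which is why it copes far better with large or skewed keys.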
+
Out of memory exceptions with Python user-defined functions are especially likely, as Spark doesn't do a good job of managing memory between the JVM and the Python VM. Together, the two processes can exceed container memory limits.
+
Grouped Map / Co-Grouped
+
The Grouped & Co-Grouped UDFs are especially likely to cause out-of-memory exceptions in PySpark when combined with key skew.
+Unlike most built in Spark aggregations, PySpark user-defined-aggregates do not support partial aggregation. This means that all of the data for a single key must fit in memory. If possible try and use an equivalent built-in aggregation, write a Scala aggregation supporting partial aggregates, or switch to an RDD and use reduceByKey.
+
This limitation applies regardless of whether you are using Arrow or "vanilla" UDAFs.
+
Arrow / Pandas / Vectorized UDFs
+
If you are using PySpark's not-so-new Arrow-based UDFs (sometimes called pandas UDFs or vectorized UDFs), record batching can cause issues. You can configure spark.sql.execution.arrow.maxRecordsPerBatch, which defaults to 10k records per batch. If your records are large, this default may very well be the source of your out of memory exceptions.
+
Note: setting spark.sql.execution.arrow.maxRecordsPerBatch too-small will result in reduced performance and reduced ability to vectorize operations over the data frames.
+
mapInPandas / mapInArrow
+
If you use mapInPandas or mapInArrow (proposed in 3.3+) it's important to note that Spark will serialize entire records, not just the columns needed by your UDF. If you encounter OOMs here because of record sizes, one option is to minimize the amount of data being serialized in each record. Select only the minimal data needed to perform the UDF + a key to rejoin with the target dataset.
+
We're used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. Note that the number of partitions can be optionally specified when using the read API.
+
How to decide on a partition column or partition key?
+
+
+
Does the key have relatively low cardinality?
+1k distinct values are better than 1M distinct values.
+Consider a numeric, date, or timestamp column.
+
+
+
Does the key have enough data in each partition?
+
+
1 GB is a good goal.
+
+
+
+
Does the key have too much data in each partition?
+The data for each partition must fit in the memory of a single task and avoid spilling to disk.
+
+
+
Does the key have evenly distributed data in each partition?
+If some partitions have orders of magnitude more data than others, those larger partitions have the potential to spill to disk, OOM, or simply consume excess resources in comparison to the partitions with median amounts of data. You don't want to size executors for the bloated partition. If none of the columns or keys has a particularly even distribution, then create a new column at the expense of saving a new version of the table/RDD/DF. A frequent approach here is to create a new column using a hash based on existing columns.
+
+
+
Does the key allow for fewer wide transformations?
+Wide transformations are more costly than narrow transformations.
+
+
+
Does the number of partitions approximate 2-3x the number of allocated cores on the executors?
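The hash-based column suggested above for unevenly distributed keys can be sketched in plain Python. This is only an illustration under assumptions: the `hash_bucket` helper, the column names, and the bucket count of 8 are hypothetical, and `zlib.crc32` is used simply because it is deterministic across runs.

```python
import zlib
from collections import Counter


def hash_bucket(values, num_buckets):
    """Derive a stable bucket id from one or more existing column values."""
    joined = "\x1f".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(joined) % num_buckets


# Illustrative rows of (country, user_id) where one country dominates.
rows = [("US", user_id) for user_id in range(900)]
rows += [("NZ", user_id) for user_id in range(100)]

# Partitioning on country alone puts 900 of 1000 rows in one partition.
by_country = Counter(country for country, _ in rows)

# Partitioning on a hash of (country, user_id) spreads the same rows out.
by_bucket = Counter(hash_bucket((country, user_id), 8) for country, user_id in rows)

print(by_country.most_common(1), by_bucket.most_common(1))
```

The trade-off noted in the text still applies: the derived column has to be materialized, and readers lose the ability to prune on the original key alone.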
There are three main types and causes of bad partitioning in Spark. Partitioning is often the limiting factor for parallelism in Spark jobs.
+
The most common (and most difficult to fix) bad partitioning in Spark is skewed partitioning. With skew, the problem is not the number of partitions, but that the data is not evenly distributed amongst the partitions. The most frequent cause of skewed partitioning is "key-skew." This happens frequently since humans and machines both tend to cluster, resulting in skew (e.g., NYC and null).
+
The other type of skewed partitioning comes from "input partitioned" data which is not evenly partitioned. With input partitioned data, the RDD or DataFrame doesn't have a particular partitioner; it just matches however the data is stored on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. This input partitioned data can also be skewed due to key-skew if the data is written out partitioned on a skewed key.
+
Insufficient partitioning is similar to input skewed partitioning, except instead of skew there just are not enough partitions. Similarly, you can increase the number of partitions (e.g., repartition(5000) or change spark.sql.shuffle.partitions).
+
To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. Sometimes a stage with even partitioning is still slow.
+
If the max task duration is still substantially shorter than the stage's overall duration, this is often a sign of an insufficient number of executors. Spark can run (at most) spark.executor.cores * spark.dynamicAllocation.maxExecutors tasks in parallel (and in practice this will be lower since some tasks will be speculatively executed and some executors will fail). Try increasing maxExecutors and seeing if your job speeds up.
+
+
Note
+
Setting spark.executor.cores * spark.dynamicAllocation.maxExecutors in excess of cluster capacity can result in the job waiting in PENDING state. So, try increasing maxExecutors within the limitations of the cluster resources and check if the job runtime is faster given the same input data.
+
+
If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. See Bad Partitioning.
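As a rough sketch of the arithmetic above, the helper below estimates how many "waves" of tasks a stage needs. The function names and example numbers are hypothetical, and the sketch assumes each task uses one core (the `task_cpus` parameter mirrors spark.task.cpus) and ignores speculation and executor failures.

```python
import math


def max_parallel_tasks(executor_cores, max_executors, task_cpus=1):
    """Upper bound on concurrently running tasks."""
    return (executor_cores // task_cpus) * max_executors


def task_waves(num_tasks, executor_cores, max_executors, task_cpus=1):
    """How many rounds of tasks the stage needs at full parallelism."""
    slots = max_parallel_tasks(executor_cores, max_executors, task_cpus)
    return math.ceil(num_tasks / slots)


# Illustrative numbers only: 1000 evenly sized tasks, 4 cores per executor.
slots = max_parallel_tasks(executor_cores=4, max_executors=50)
waves = task_waves(num_tasks=1000, executor_cores=4, max_executors=50)
waves_more_executors = task_waves(num_tasks=1000, executor_cores=4, max_executors=100)
print(slots, waves, waves_more_executors)
```

If the stage duration is roughly the max task duration times the number of waves, adding executors reduces the waves; if there is already only one wave, only repartitioning shortens the stage.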
+
There can be many reasons executors are slow; here are a few things you can look into:
+
+
Performance distribution among tasks in the same stage: In Spark UI - Stages - Summary Metrics, check if there's an uneven distribution of duration / input size. If so, there may be data skew or uneven partition splits. See uneven partitioning.
+
Task size: In Spark UI - Stages - Summary Metrics, check the input/output size of tasks. If individual input or output tasks are larger than a few hundred megabytes, you may need more partitions. Try increasing spark.sql.shuffle.partitions, decreasing spark.sql.files.maxPartitionBytes, or making a repartition call.
+
GC: Check that GC time is a small fraction of the task duration. If it's more than a few percent, try increasing executor memory and see if it makes any difference. If adding memory does not help, look for optimizations in your code for that stage.
+
How do I know if and when my job is waiting for cluster resources?
+
Sometimes the cluster manager may choke or otherwise be unable to allocate resources, and we don't have a good way of detecting this situation. This makes it difficult for the user to debug and to tell apart from Spark not scaling up correctly.
+
As of Spark 3.4, an executor will note when and for how long it waits for cluster resources. Check the JVM metrics for this information.
+
Reference link:
+
https://issues.apache.org/jira/browse/SPARK-36664
+
Iceberg/Parquet provides 3 layers of data pruning/filtering, so it is recommended to make the most of them by applying them as far upstream in your ETL as possible.
+
+
Partition Pruning: Applying a filter on a partition column means that Spark can prune all the partitions that are not needed (e.g., utc_date, utc_hour, etc.). Refer to this section for some examples.
+
Column Pruning: Parquet, a columnar format, allows us to read specific columns from a row group without having to read the entire row. By selecting only the fields that you need for your job/SQL (instead of "select *"), you can avoid bringing in unnecessary data only to drop it in subsequent stages.
+
Predicate Push Down: It is also recommended to use filters on non-partition columns, as this allows Spark to exclude specific row groups while reading data from S3. For example, filter on account_id is not null if you know that you would be dropping the NULL account_ids eventually.
+
+
See also filter not pushed down, aggregation not pushed down(todo: add details), Bad storage partitioning(todo: add details).
+
Not enough Read/Map Tasks
+
If your map stage is taking longer than expected, and you are sure that you are not reading more data than needed, then you may be reading the data with a small number of tasks. You can increase the number of map tasks by decreasing the target split size. Note that if you are constrained by resources (map tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors.
+
Too many Read/Map Tasks
+
If you have a large number of map tasks in your stage, you could run into driver memory related errors, as the task metadata could overwhelm the driver. This could also put stress on the shuffle (on the map side), as more map tasks create more shuffle blocks. It is recommended to keep the task count for a stage under 80k. You can decrease the number of map tasks by increasing the target split size (todo: add detail) for an Iceberg table. (Note: For a non-Iceberg table, the property is spark.sql.files.maxPartitionBytes, and it is set at the job level and not at the table level.)
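The relationship between split size and task count can be estimated with simple arithmetic. The sketch below is an approximation under assumptions: it treats the input as perfectly splittable and ignores file boundaries and per-file open costs, and the helper names and the 10 TiB input are hypothetical.

```python
import math

MIB = 1024 * 1024
TIB = 1024 * 1024 * MIB


def estimated_map_tasks(input_bytes, split_bytes):
    """Approximate number of read tasks for a given target split size."""
    return math.ceil(input_bytes / split_bytes)


def min_split_bytes_for(input_bytes, max_tasks):
    """Smallest split size that keeps the task count at or below max_tasks."""
    return math.ceil(input_bytes / max_tasks)


input_bytes = 10 * TIB
tasks_at_128_mib = estimated_map_tasks(input_bytes, 128 * MIB)
tasks_at_512_mib = estimated_map_tasks(input_bytes, 512 * MIB)
split_for_80k = min_split_bytes_for(input_bytes, 80_000)
print(tasks_at_128_mib, tasks_at_512_mib, split_for_80k)
```

With these illustrative numbers, a 128 MiB split lands just above the 80k guideline, while a 512 MiB split brings the stage well under it.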
+
Slow Transformations
+
Slow-running map tasks can have many causes; some common ones include:
+
+
Regex : You have RegEx in your transformation. Refer to RegEx tips for tuning.
+
udf: Make sure you are sending only the data that you need in UDF and tune UDF for performance. Refer to Slow UDF for more details.
+
Json: TBD
+
+
All these transformations may run into skew issues if you have a single row/column that is bloated. You could prevent this by checking the payload size before calling the transformation as a single row/column could potentially slow down the entire stage.
+
Skewed Map Tasks or Uneven partitioning
+
The most common (and most difficult to fix) bad partitioning in Spark is that of skewed partitioning. The data is not evenly distributed amongst the partitions.
+
+
+
Uneven partitioning due to Key-skew : The most frequent cause of skewed partitioning is that of "key-skew." This happens frequently since humans and machines both tend to cluster resulting in skew (e.g. NYC and null).
+
+
+
Uneven partitioning due to input layout: We are used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. In cases where the RDD or DataFrame doesn't have a particular partitioner, data is partitioned according to the storage on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. Spark is often able to avoid input layout issues by combining and splitting inputs (when input formats are "splittable"), but not all input formats give Spark this freedom. One common example is gzip; there is a work-around for "splittable gzip", but this comes at the cost of decompressing the entire file multiple times.
+
+
+
Record Skew: A single bloated row/record could be the root cause of a slow map task. The easiest way to identify this is by checking your string fields that have a JSON payload (e.g., a bug in a client could write a lot of data). You can identify the culprit by checking the max(size/length) of the field in your upstream table. For CL, snapshot is a candidate for a bloated field.
+
+
+
Task Skew: This is only applicable to tables with a non-splittable file format (like TEXT, zip); Parquet files should never run into this issue. Task skew is where one of the tasks gets more rows than the others, which is possible if the upstream table has a single large file in a non-splittable format.
+
Processing more data than necessary will typically slow down the job.
+If the input table is partitioned, then applying filters on the partition columns can restrict the input volume Spark needs to scan.
+
A simple equality filter gets pushed down to the batch scan and enables Spark to only scan the files where dateint = 20211101 of a sample table partitioned on dateint and hour.
The query plan shows that Spark in this case scans the whole table and filters only in a later step.
+
+
Filter is dynamic via a join
+
In a more complex job we might restrict the data based on joining to another table. If the filtering criteria are not static, they won't be pushed down to the scan. So in the example below the two table scans happen independently, and min(dateint) calculated in the CTE won't have an effect on the second scan.
+
with dates as
  (select min(dateint) dateint
   from jlantos.sample_table)

select *
from jlantos.sample_table st
join dates d on st.dateint = d.dateint
+
The default shuffle parallelism for our Spark cluster is 500, and it may not be enough for larger datasets. If you don't see skew and most/all of the tasks are taking really long to finish a reduce stage, you can improve the overall runtime by increasing spark.sql.shuffle.partitions.
+
Note that if you are constrained by resources (reduce tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors.
+
Too many shuffle tasks
+
While having too many shuffle tasks has no direct effect on the stage duration, it could slow the stage down if there are multiple retries during the shuffle stage due to shuffle fetch failures. Note that the higher the shuffle partitions, the more chances of running into FetchFailure exceptions.
+
Skewed Shuffle Tasks
+
Partitioning problems are often the limitation of parallelism for most Spark jobs.
+
There are two primary types of bad partitioning: skewed partitioning (where the partitions are not equal in size/work) and even but non-ideal partitioning (where the partitions are equal in size/work but there are too few or too many of them). If your tasks are taking roughly equivalent times to complete then you likely have even partitioning, and if they are taking unequal times to complete then you may have skewed or uneven partitioning.
Join: Skew is natural in most of our data sets due to the nature of the data. Both Hash join and Sort-Merge join can run into skew issues if you have a lot of data for one or more keys on either side of the join. Check Skewed Joins for handling skewed joins with an example.
+
+
+
Aggregation/Group By: All aggregate functions (UDAFs) using SQL/DataFrames/Datasets implement partial aggregation (the combiner in MR), so you would only run into skew if you are using non-algebraic functions like distinct and percentiles, which can't be computed partially. See Partial vs Full aggregates.
+
+
+
Sort/Repartition/Coalesce before write: It is recommended to introduce an additional stage for Sort, Repartition, or Coalesce before the write stage to write an optimal number of S3 files into your target table. Check Skewed Write for more details.
+
+
+
Slow Aggregation
+
The non-algebraic functions below can slow down the reduce stage if you have too many values/rows for a given key.
+
+
Count Distinct: Use HyperLogLog (HLL) based sketches for cardinality if you just need approximate counts for trends and don't need the exact counts. HLL can estimate with a standard error of 2%.
+
Percentiles: Use approx_percentile or t-digest sketches, which speed up the computation for a small accuracy trade-off.
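For a sense of what a 2% standard error costs, the usual HyperLogLog estimate of relative standard error is about 1.04 / sqrt(m), where m is the number of registers (a power of two). The sketch below only works through that arithmetic; the function names are hypothetical and this is not how any particular Spark or sketch library is configured.

```python
import math


def hll_standard_error(precision_bits):
    """Approximate relative standard error with 2**precision_bits registers."""
    registers = 2 ** precision_bits
    return 1.04 / math.sqrt(registers)


def min_precision_for(target_error):
    """Smallest register-index width whose standard error meets the target."""
    precision_bits = 4
    while hll_standard_error(precision_bits) > target_error:
        precision_bits += 1
    return precision_bits


precision_for_two_percent = min_precision_for(0.02)
error_at_that_precision = hll_standard_error(precision_for_two_percent)
print(precision_for_two_percent, error_at_that_precision)
```

The point of the arithmetic is that a few thousand small registers per key are enough for this accuracy, regardless of how many rows the key has.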
+
+
Spill To Disk
+
Spark executors will start using "disk" once they exceed the Spark memory fraction of executor memory. This by itself is not an issue, but too much "spill to disk" will slow down the stage/job. You can overcome this by either increasing the executor memory or tweaking the job/stage to consume less memory (e.g., a Sort-Merge join requires a lot less memory than a Hash join).
+
The Spark functions regexp_extract and regexp_replace can transform data using regular expressions.
+The regular expression pattern follows the Java regex syntax.
Certain values in the dataset can cause regexp_extract with a certain regex pattern to run very slowly.
+See https://stackoverflow.com/questions/5011672/java-regular-expression-running-very-slow.
+
Match Special Character in PySpark
+
You will need 4 backslashes to match any special character,
+2 required by Python string escaping and 2 by Java regex parsing.
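The Python half of that escaping can be checked locally. The sketch below uses Python's own re module as a stand-in for the Java regex engine, which is an assumption made purely for illustration; both engines read the two-character sequence of two backslashes as "match one literal backslash". The variable names are hypothetical.

```python
import re

# In a normal (non-raw) Python literal, four backslashes become two characters.
escaped_pattern = "\\\\"
# A raw string needs only two backslashes to produce the same two characters.
raw_pattern = r"\\"

# This value contains exactly one literal backslash: C:\temp
sample_value = "C:\\temp"

# The two-character pattern is a regex that matches one literal backslash.
matches = re.findall(escaped_pattern, sample_value)
print(len(escaped_pattern), escaped_pattern == raw_pattern, len(matches))
```

Using a raw string for the pattern is often the easier way to keep track of how many backslashes actually reach the regex engine.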
Skewed joins happen frequently as some locations (NYC), data (null), and titles (Mr. Farts - Farting Around The House) are more popular than other types of data.
+
To a certain degree, the Spark 3.3 query engine has improvements to handle skewed joins, so a first step should be attempting to upgrade to the most recent version of Spark.
+
Broadcast joins are ideal for handling skewed joins, but they only work when one table is smaller than the other. A general, albeit hacky, solution is to isolate the data for the skewed key, broadcast it for processing (e.g., join), and then union back the results.
+
Other techniques include introducing some type of salting and doing multi-stage joins.
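Salting can be sketched in plain Python to show why it helps. This is a toy illustration and not Spark code: the helper names, the data, and NUM_SALTS are hypothetical. The large side gets a random salt appended to its key, the small side is replicated once per salt value, and the join then runs on the combined (key, salt) pair.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical; in practice tuned to the degree of skew


def salt_large_side(rows, num_salts, seed=42):
    """Append a random salt to each key on the skewed side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def replicate_small_side(rows, num_salts):
    """Copy each row of the small side once per possible salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


large = [("nyc", i) for i in range(1000)] + [("sf", i) for i in range(10)]
small = [("nyc", "New York"), ("sf", "San Francisco")]

plain = hash_join(large, small)
salted_large = salt_large_side(large, NUM_SALTS)
salted = hash_join(salted_large, replicate_small_side(small, NUM_SALTS))
unsalted = [(key, lv, rv) for (key, _salt), lv, rv in salted]

rows_per_key = Counter(key for key, _ in large)
rows_per_salted_key = Counter(key for key, _ in salted_large)
print(max(rows_per_key.values()), max(rows_per_salted_key.values()))
```

The join result is unchanged once the salt is dropped, but the hottest join key now holds only a fraction of the rows, at the cost of replicating the small side NUM_SALTS times.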
+
Writes can be slow depending on the stage preceding write(), the target table partition scheme, and the write parallelism (spark.sql.shuffle.partitions).
+The goal of this article is to go through the options below and find the best transformation for writing optimally sized files into the target table/partition.
+
When to use Sort
+
A global sort in Spark internally uses range-partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from input partitions and sending them to the driver for computing range boundaries.
+
Use global sort
+
+
If you are writing multiple partitions (especially heterogeneous partitions) as part of your write(), since it can estimate the number of files/tasks for a given target table partition based on the number of sample rows it observes.
+
If you want to enable predicate-push-down on a set of target table fields for down stream consumption.
+
+
Tips:
+1. You can increase the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to improve the estimates if you are not seeing an optimal number of files per partition.
+2. You can also introduce salt to the sort keys to increase the number of write tasks if the sort keys' cardinality is less than spark.sql.shuffle.partitions. Example
+
When to use Repartition
+
Repartition (hash partitioning) distributes rows in a round-robin manner to produce a uniform distribution across the tasks. A repartition just before the write would produce uniform files, and all write tasks should take about the same time.
+
Use repartition
+
+
If you are writing into a single partition or a non-partitioned table and want to get uniform file sizes.
+
If you want to produce a specific number of files. For example, using repartition(100) would generate up to 100 files.
+
+
When to use Coalesce
+
Coalesce tries to combine files without invoking a shuffle and is useful when you are going from a higher parallelism to a lower parallelism. Use Coalesce:
+
+
If you are writing a very small number of files and the file size is relatively small.
+
+
Note that Coalesce(N) is not an optimal way to merge files, as it tries to combine multiple files (until it reaches the target number of files 'N') without taking size into account, and you could run into (org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0) if the combined size exceeds the available memory.
+
When you have an event log from an earlier "good run"
+
You can compare the slow and the fast runs.
+For this you can even use your local pyspark and calculate the ratio between the slow and fast runs for each stage metric:
+
# Helper methods (just copy-paste them).
# These assume the pyspark shell, where `sql` is already bound to spark.sql.

def createEventView(eventLogFile, eventViewName):
    # Expose the raw JSON event log as a temporary view.
    sql("CREATE OR REPLACE TEMPORARY VIEW {} USING org.apache.spark.sql.json OPTIONS (path '{}')".format(eventViewName, eventLogFile))


def createStageMetricsView(eventViewName, stageMetricsViewName):
    # One row per accumulable (metric) of each completed stage.
    sql("CREATE OR REPLACE TEMPORARY VIEW {} AS select `Submission Time`, `Completion Time`, `Stage ID`, t3.col.* from (select `Stage Info`.* from {} where Event='SparkListenerStageCompleted') lateral view explode(Accumulables) t3".format(stageMetricsViewName, eventViewName))


def showDiffInStage(fastStagesTable, slowStagesTable, stageID):
    # Join the metrics of one stage from both runs and show the slow/fast ratio.
    sql("select {fastStages}.Name, {fastStages}.Value as Fast, {slowStages}.Value as Slow, {slowStages}.Value / {fastStages}.Value as `Slow / Fast` from {fastStages} INNER JOIN {slowStages} ON {fastStages}.ID = {slowStages}.ID where {fastStages}.`Stage ID` = {stageID} and {slowStages}.`Stage ID` = {stageID}".format(fastStages=fastStagesTable, slowStages=slowStagesTable, stageID=stageID)).show(40, False)


# Creating the views from the event logs (just an example, you have to specify your own paths)

createEventView("<path_to_the_fast_run_event_log>", "FAST_EVENTS")
createStageMetricsView("FAST_EVENTS", "FAST_STAGE_METRICS")

createEventView("<path_to_the_slow_run_event_log>", "SLOW_EVENTS")
createStageMetricsView("SLOW_EVENTS", "SLOW_STAGE_METRICS")

>>> sql("SELECT DISTINCT `Stage ID` from FAST_STAGE_METRICS").show()
+--------+
|Stage ID|
+--------+
|       0|
|       1|
|       2|
+--------+

>>> sql("SELECT DISTINCT `Stage ID` from SLOW_STAGE_METRICS").show()
+--------+
|Stage ID|
+--------+
|       0|
|       1|
|       2|
+--------+

>>> showDiffInStage("FAST_STAGE_METRICS", "SLOW_STAGE_METRICS", 2)
+-------------------------------------------+-------------+-------------+------------------+
|Name                                       |Fast         |Slow         |Slow / Fast       |
+-------------------------------------------+-------------+-------------+------------------+
|scan time total (min, med, max)            |1095931      |1628308      |1.485776020570638 |
|internal.metrics.executorRunTime           |7486648      |12990126     |1.735105750931525 |
|duration total (min, med, max)             |7017645      |12322243     |1.7558943206731032|
|internal.metrics.jvmGCTime                 |220325       |1084412      |4.921874503574266 |
|internal.metrics.output.bytesWritten       |34767744411  |34767744411  |1.0               |
|internal.metrics.input.recordsRead         |149652381    |149652381    |1.0               |
|internal.metrics.executorDeserializeCpuTime|5666230304   |7760682789   |1.3696377260771504|
|internal.metrics.resultSize                |625598       |626415       |1.0013059504665935|
|internal.metrics.executorCpuTime           |6403420405851|8762799691603|1.3684560963069305|
|internal.metrics.input.bytesRead           |69488204276  |69488204276  |1.0               |
|number of output rows                      |149652381    |149652381    |1.0               |
|internal.metrics.resultSerializationTime   |36           |72           |2.0               |
|internal.metrics.output.recordsWritten     |149652381    |149652381    |1.0               |
|internal.metrics.executorDeserializeTime   |6024         |11954        |1.9843957503320053|
+-------------------------------------------+-------------+-------------+------------------+
+
When there is no event log from a good run
+
Steps:
+
+
Navigate to Spark UI using spark history URL
+
Click on Stages and sort the stages(click on Duration) in descending order to find the longest running stage.
+
+
+
Now let's figure out if the slow stage is a Map or Reduce/Shuffle
+
Once you identify the slow stage, check the fields "Input", "Output", "Shuffle Read", and "Shuffle Write" of the slow stage and use the grid below to identify the stage type and the corresponding ETL action.
+
| Input | Output | Shuffle Read | Shuffle Write | MR Stage | ETL Action           |
|-------|--------|--------------|---------------|----------|----------------------|
| X     |        |              | X             | Map      | Read                 |
| X     | X      |              |               | Map      | Read/Write           |
| X     |        |              |               | Map      | Sort Estimate        |
|       |        | X            |               | Map      | Sort Estimate        |
|       |        | X            | X             | Reduce   | Join/Agg/Repartition |
|       | X      | X            |               | Reduce   | Write                |
+
+
+
Go to Map if the slow stage is from a Map operation.
+Go to Reduce if the slow stage is from a Reduce/Shuffle operation.
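The grid above can also be expressed as a small lookup, which may be handy when scanning many stages. This is a sketch that simply encodes the rows of the grid; the function name and the byte-count arguments are hypothetical, and any combination not listed in the grid returns None.

```python
# (input, output, shuffle_read, shuffle_write) -> (MR stage, ETL action)
STAGE_GRID = {
    (True, False, False, True): ("Map", "Read"),
    (True, True, False, False): ("Map", "Read/Write"),
    (True, False, False, False): ("Map", "Sort Estimate"),
    (False, False, True, False): ("Map", "Sort Estimate"),
    (False, False, True, True): ("Reduce", "Join/Agg/Repartition"),
    (False, True, True, False): ("Reduce", "Write"),
}


def classify_stage(input_bytes, output_bytes, shuffle_read_bytes, shuffle_write_bytes):
    """Map the four Spark UI size fields of a stage to a row of the grid."""
    sizes = (input_bytes, output_bytes, shuffle_read_bytes, shuffle_write_bytes)
    key = tuple(size > 0 for size in sizes)
    return STAGE_GRID.get(key)


# Illustrative stages: one that reads input and writes shuffle data,
# and one that reads shuffle data and writes output.
read_stage = classify_stage(5_000_000, 0, 0, 1_000_000)
write_stage = classify_stage(0, 2_000_000, 1_000_000, 0)
print(read_stage, write_stage)
```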
+
Using the default file output committer with S3a results in double data writes (sad times!).
+Use a newer cloud committer such as the "S3 magic committer" or a committer specialized for your Hadoop cluster.
Sometimes a partitioning approach works fine for a small dataset, but can cause a surprisingly large number of partitions for a slightly larger dataset. Check out The Small File Problem in the context of HDFS.
Too Big DAG (or when iterative algorithms go bump in the night)
+
Spark uses lazy evaluation and creates a DAG (directed acyclic graph) of the operations needed to compute a piece of data. Even if the data is persisted or cached, Spark will keep this DAG in memory on the driver so that if an executor fails it can re-create this data later. This is more likely to cause problems with iterative algorithms that create RDDs or DataFrames on each iteration based on the previous iteration, like ALS. Some signs of a DAG getting too big are:
+
+
Iterative algorithm becoming slower on each iteration
+
Driver OOM
+
Executor out-of-disk-error
+
+
If your job hasn't crashed, an easy way to check is by looking at the Spark Web UI and seeing what the DAG visualization looks like. If the DAG takes a measurable length of time to load (minutes), or fills a few screens, it's likely "too big." Just because a DAG "looks" small, though, doesn't mean that it isn't an issue; medium-sized-looking DAGs with lots of shuffle files can cause executor out-of-disk issues too.
+
Working around this can be complicated, but there are some tools to simplify it. The first is Spark's checkpointing which allows Spark to "forget" the DAG so far by writing the data out to a persistent storage like S3 or HDFS. The second is manually doing what checkpointing does, that is on your own writing the data out and loading it back in.
+
Unfortunately, if you work in a notebook environment this might not be enough to solve your problem. While this will introduce a "cut" in the DAG, if the old RDDs or DataFrames/Datasets are still in scope they will still continue to reside in memory on the driver, and any shuffle files will continue to reside on the disks of the workers. To work around this it's important to explicitly clean up your old RDDs/DataFrames by setting their references to None/null.
+
If you still run into executor out of disk space errors, you may need to look at the approach taken in Spark's ALS algorithm of triggering eager shuffle cleanups, but this is an advanced feature and can lead to non-recoverable errors.
+
User-defined functions in Spark are black boxes to Spark and can limit performance. When possible, look for built-in alternatives.
+
One important exception is that if you have multiple functions which must be done in Python, the advice changes a little bit. Since moving data from the JVM to Python is expensive, if you can chain together multiple Python UDFs on the same column, Spark is able to pipeline these together into a single copy to/from Python.
+
Write failures can sometimes mask other problems. A good first step is to insert a cache or persist right before the write step.
+
Iceberg table writes can sometimes fail after upgrading to a new version as the partitioning of the table bubbles further up. Range-based partitioning (used by default with sorted tables) can result in a small number of partitions when there is not much key distance.
+
One option is, as with a manual sort in Spark, to add some extra higher-cardinality columns to the sort order of your Iceberg table.
+
You can go back to pre-Spark 3 behaviour by instead inserting your own manual sort and setting the write mode to none.
+
Welcome to the Spark Advanced Topics working group documentation.
+This documentation is in the early stages.
+We have been working on a flowchart to help you solve your current problems.
+The documentation is collected under "details" (see above).
+ var tokenMetadata = lunr.utils.clone(metadata) || {}
+ tokenMetadata["position"] = [sliceStart, sliceLength]
+ tokenMetadata["index"] = tokens.length
+
+ tokens.push(
+ new lunr.Token (
+ str.slice(sliceStart, sliceEnd),
+ tokenMetadata
+ )
+ )
+ }
+
+ sliceStart = sliceEnd + 1
+ }
+
+ }
+
+ return tokens
+}
+
+/**
+ * The separator used to split a string into tokens. Override this property to change the behaviour of
+ * `lunr.tokenizer` when tokenizing strings. By default this splits on whitespace and hyphens.
+ *
+ * @static
+ * @see lunr.tokenizer
+ */
+lunr.tokenizer.separator = /[\s\-]+/
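To make the default separator concrete, here is a minimal sketch that applies the same pattern with plain `String.prototype.split`. It does not call lunr itself: `splitOnSeparator` is a hypothetical helper, the lower-casing step is an assumption of this sketch, and unlike `lunr.tokenizer` it attaches no position metadata.

```javascript
// Same pattern as lunr.tokenizer.separator: one or more whitespace or hyphen characters.
const separator = /[\s\-]+/

// Hypothetical helper (not part of lunr): lower-case, split, drop empty pieces.
function splitOnSeparator (str) {
  return str.toLowerCase().split(separator).filter(function (s) {
    return s.length > 0
  })
}

const parts = splitOnSeparator('Driver-side  collect calls')
console.log(parts) // [ 'driver', 'side', 'collect', 'calls' ]
```

Overriding the separator (for example, to keep hyphenated words together) would amount to swapping the regular expression used here.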
+/*!
+ * lunr.Pipeline
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * lunr.Pipelines maintain an ordered list of functions to be applied to all
+ * tokens in documents entering the search index and queries being run against
+ * the index.
+ *
+ * An instance of lunr.Index created with the lunr shortcut will contain a
+ * pipeline with a stop word filter and an English language stemmer. Extra
+ * functions can be added before or after either of these functions or these
+ * default functions can be removed.
+ *
+ * When run the pipeline will call each function in turn, passing a token, the
+ * index of that token in the original list of all tokens and finally a list of
+ * all the original tokens.
+ *
+ * The output of functions in the pipeline will be passed to the next function
+ * in the pipeline. To exclude a token from entering the index the function
+ * should return undefined, the rest of the pipeline will not be called with
+ * this token.
+ *
+ * For serialisation of pipelines to work, all functions used in an instance of
+ * a pipeline should be registered with lunr.Pipeline. Registered functions can
+ * then be loaded. If trying to load a serialised pipeline that uses functions
+ * that are not registered an error will be thrown.
+ *
+ * If not planning on serialising the pipeline then registering pipeline functions
+ * is not necessary.
+ *
+ * @constructor
+ */
+lunr.Pipeline = function () {
+ this._stack = []
+}
+
+lunr.Pipeline.registeredFunctions = Object.create(null)
+
+/**
+ * A pipeline function maps lunr.Token to lunr.Token. A lunr.Token contains the token
+ * string as well as all known metadata. A pipeline function can mutate the token string
+ * or mutate (or add) metadata for a given token.
+ *
+ * A pipeline function can indicate that the passed token should be discarded by returning
+ * null, undefined or an empty string. This token will not be passed to any downstream pipeline
+ * functions and will not be added to the index.
+ *
+ * Multiple tokens can be returned by returning an array of tokens. Each token will be passed
+ * to any downstream pipeline functions and all returned tokens will be added to the index.
+ *
+ * Any number of pipeline functions may be chained together using a lunr.Pipeline.
+ *
+ * @interface lunr.PipelineFunction
+ * @param {lunr.Token} token - A token from the document being processed.
+ * @param {number} i - The index of this token in the complete list of tokens for this document/field.
+ * @param {lunr.Token[]} tokens - All tokens for this document/field.
+ * @returns {(?lunr.Token|lunr.Token[])}
+ */
+
+/**
+ * Register a function with the pipeline.
+ *
+ * Functions that are used in the pipeline should be registered if the pipeline
+ * needs to be serialised, or a serialised pipeline needs to be loaded.
+ *
+ * Registering a function does not add it to a pipeline, functions must still be
+ * added to instances of the pipeline for them to be used when running a pipeline.
+ *
+ * @param {lunr.PipelineFunction} fn - The function to register.
+ * @param {String} label - The label to register this function with
+ */
+lunr.Pipeline.registerFunction = function (fn, label) {
+ if (label in this.registeredFunctions) {
+ lunr.utils.warn('Overwriting existing registered function: ' + label)
+ }
+
+ fn.label = label
+ lunr.Pipeline.registeredFunctions[fn.label] = fn
+}
+
+/**
+ * Warns if the function is not registered as a Pipeline function.
+ *
+ * @param {lunr.PipelineFunction} fn - The function to check for.
+ * @private
+ */
+lunr.Pipeline.warnIfFunctionNotRegistered = function (fn) {
+ var isRegistered = fn.label && (fn.label in this.registeredFunctions)
+
+ if (!isRegistered) {
+ lunr.utils.warn('Function is not registered with pipeline. This may cause problems when serialising the index.\n', fn)
+ }
+}
+
+/**
+ * Loads a previously serialised pipeline.
+ *
+ * All functions to be loaded must already be registered with lunr.Pipeline.
+ * If any function from the serialised data has not been registered then an
+ * error will be thrown.
+ *
+ * @param {Object} serialised - The serialised pipeline to load.
+ * @returns {lunr.Pipeline}
+ */
+lunr.Pipeline.load = function (serialised) {
+ var pipeline = new lunr.Pipeline
+
+ serialised.forEach(function (fnName) {
+ var fn = lunr.Pipeline.registeredFunctions[fnName]
+
+ if (fn) {
+ pipeline.add(fn)
+ } else {
+ throw new Error('Cannot load unregistered function: ' + fnName)
+ }
+ })
+
+ return pipeline
+}
+
+/**
+ * Adds new functions to the end of the pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction[]} functions - Any number of functions to add to the pipeline.
+ */
+lunr.Pipeline.prototype.add = function () {
+ var fns = Array.prototype.slice.call(arguments)
+
+ fns.forEach(function (fn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(fn)
+ this._stack.push(fn)
+ }, this)
+}
+
+/**
+ * Adds a single function after a function that already exists in the
+ * pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction} existingFn - A function that already exists in the pipeline.
+ * @param {lunr.PipelineFunction} newFn - The new function to add to the pipeline.
+ */
+lunr.Pipeline.prototype.after = function (existingFn, newFn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(newFn)
+
+ var pos = this._stack.indexOf(existingFn)
+ if (pos == -1) {
+ throw new Error('Cannot find existingFn')
+ }
+
+ pos = pos + 1
+ this._stack.splice(pos, 0, newFn)
+}
+
+/**
+ * Adds a single function before a function that already exists in the
+ * pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction} existingFn - A function that already exists in the pipeline.
+ * @param {lunr.PipelineFunction} newFn - The new function to add to the pipeline.
+ */
+lunr.Pipeline.prototype.before = function (existingFn, newFn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(newFn)
+
+ var pos = this._stack.indexOf(existingFn)
+ if (pos == -1) {
+ throw new Error('Cannot find existingFn')
+ }
+
+ this._stack.splice(pos, 0, newFn)
+}
+
+/**
+ * Removes a function from the pipeline.
+ *
+ * @param {lunr.PipelineFunction} fn The function to remove from the pipeline.
+ */
+lunr.Pipeline.prototype.remove = function (fn) {
+ var pos = this._stack.indexOf(fn)
+ if (pos == -1) {
+ return
+ }
+
+ this._stack.splice(pos, 1)
+}
+
+/**
+ * Runs the current list of functions that make up the pipeline against the
+ * passed tokens.
+ *
+ * @param {Array} tokens The tokens to run through the pipeline.
+ * @returns {Array}
+ */
+lunr.Pipeline.prototype.run = function (tokens) {
+ var stackLength = this._stack.length
+
+ for (var i = 0; i < stackLength; i++) {
+ var fn = this._stack[i]
+ var memo = []
+
+ for (var j = 0; j < tokens.length; j++) {
+ var result = fn(tokens[j], j, tokens)
+
+ if (result === null || result === void 0 || result === '') continue
+
+ if (Array.isArray(result)) {
+ for (var k = 0; k < result.length; k++) {
+ memo.push(result[k])
+ }
+ } else {
+ memo.push(result)
+ }
+ }
+
+ tokens = memo
+ }
+
+ return tokens
+}
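The discard and fan-out rules of the run loop can be sketched in isolation. The version below is a simplified stand-in that works on plain strings rather than `lunr.Token` instances; `runPipeline`, `dropShort` and `splitUnderscore` are hypothetical names used only for illustration.

```javascript
// Simplified stand-in for the run loop, operating on plain strings
// instead of lunr.Token instances (an assumption made for brevity).
function runPipeline (stack, tokens) {
  for (const fn of stack) {
    const memo = []
    tokens.forEach(function (token, j) {
      const result = fn(token, j, tokens)
      // null, undefined and '' all mean "discard this token"
      if (result === null || result === undefined || result === '') return
      // an array result fans out into several tokens
      if (Array.isArray(result)) memo.push(...result)
      else memo.push(result)
    })
    tokens = memo
  }
  return tokens
}

// Hypothetical pipeline functions
const dropShort = function (t) { return t.length > 2 ? t : undefined }
const splitUnderscore = function (t) { return t.includes('_') ? t.split('_') : t }

const out = runPipeline([dropShort, splitUnderscore], ['to', 'local_iterator', 'collect'])
console.log(out) // [ 'local', 'iterator', 'collect' ]
```

Each function sees the output of the previous one, so the order of the stack matters: here short tokens are dropped before any splitting happens.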
+
+/**
+ * Convenience method for passing a string through a pipeline and getting
+ * strings out. This method takes care of wrapping the passed string in a
+ * token and mapping the resulting tokens back to strings.
+ *
+ * @param {string} str - The string to pass through the pipeline.
+ * @param {?object} metadata - Optional metadata to associate with the token
+ * passed to the pipeline.
+ * @returns {string[]}
+ */
+lunr.Pipeline.prototype.runString = function (str, metadata) {
+ var token = new lunr.Token (str, metadata)
+
+ return this.run([token]).map(function (t) {
+ return t.toString()
+ })
+}
+
+/**
+ * Resets the pipeline by removing any existing processors.
+ *
+ */
+lunr.Pipeline.prototype.reset = function () {
+ this._stack = []
+}
+
+/**
+ * Returns a representation of the pipeline ready for serialisation.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @returns {Array}
+ */
+lunr.Pipeline.prototype.toJSON = function () {
+ return this._stack.map(function (fn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(fn)
+
+ return fn.label
+ })
+}
+/*!
+ * lunr.Vector
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * A vector is used to construct the vector space of documents and queries. These
+ * vectors support operations to determine the similarity between two documents or
+ * a document and a query.
+ *
+ * Normally no parameters are required for initializing a vector, but in the case of
+ * loading a previously dumped vector the raw elements can be provided to the constructor.
+ *
+ * For performance reasons vectors are implemented with a flat array, where an element's
+ * index is immediately followed by its value. E.g. [index, value, index, value]. This
+ * allows the underlying array to be as sparse as possible and still offer decent
+ * performance when being used for vector calculations.
+ *
+ * @constructor
+ * @param {Number[]} [elements] - The flat list of element index and element value pairs.
+ */
+lunr.Vector = function (elements) {
+ this._magnitude = 0
+ this.elements = elements || []
+}
+
+
+/**
+ * Calculates the position within the vector to insert a given index.
+ *
+ * This is used internally by insert and upsert. If there are duplicate indexes then
+ * the position is returned as if the value for that index were to be updated, but it
+ * is the caller's responsibility to check whether there is a duplicate at that index.
+ *
+ * @param {Number} index - The index at which the element should be inserted.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.positionForIndex = function (index) {
+ // For an empty vector the tuple can be inserted at the beginning
+ if (this.elements.length == 0) {
+ return 0
+ }
+
+ var start = 0,
+ end = this.elements.length / 2,
+ sliceLength = end - start,
+ pivotPoint = Math.floor(sliceLength / 2),
+ pivotIndex = this.elements[pivotPoint * 2]
+
+ while (sliceLength > 1) {
+ if (pivotIndex < index) {
+ start = pivotPoint
+ }
+
+ if (pivotIndex > index) {
+ end = pivotPoint
+ }
+
+ if (pivotIndex == index) {
+ break
+ }
+
+ sliceLength = end - start
+ pivotPoint = start + Math.floor(sliceLength / 2)
+ pivotIndex = this.elements[pivotPoint * 2]
+ }
+
+ if (pivotIndex == index) {
+ return pivotPoint * 2
+ }
+
+ if (pivotIndex > index) {
+ return pivotPoint * 2
+ }
+
+ if (pivotIndex < index) {
+ return (pivotPoint + 1) * 2
+ }
+}
+
+/**
+ * Inserts an element at an index within the vector.
+ *
+ * Does not allow duplicates, will throw an error if there is already an entry
+ * for this index.
+ *
+ * @param {Number} insertIdx - The index at which the element should be inserted.
+ * @param {Number} val - The value to be inserted into the vector.
+ */
+lunr.Vector.prototype.insert = function (insertIdx, val) {
+ this.upsert(insertIdx, val, function () {
+ throw "duplicate index"
+ })
+}
+
+/**
+ * Inserts or updates an existing index within the vector.
+ *
+ * @param {Number} insertIdx - The index at which the element should be inserted.
+ * @param {Number} val - The value to be inserted into the vector.
+ * @param {function} fn - A function that is called for updates, the existing value and the
+ * requested value are passed as arguments
+ */
+lunr.Vector.prototype.upsert = function (insertIdx, val, fn) {
+ this._magnitude = 0
+ var position = this.positionForIndex(insertIdx)
+
+ if (this.elements[position] == insertIdx) {
+ this.elements[position + 1] = fn(this.elements[position + 1], val)
+ } else {
+ this.elements.splice(position, 0, insertIdx, val)
+ }
+}
+
+/**
+ * Calculates the magnitude of this vector.
+ *
+ * @returns {Number}
+ */
+lunr.Vector.prototype.magnitude = function () {
+ if (this._magnitude) return this._magnitude
+
+ var sumOfSquares = 0,
+ elementsLength = this.elements.length
+
+ for (var i = 1; i < elementsLength; i += 2) {
+ var val = this.elements[i]
+ sumOfSquares += val * val
+ }
+
+ return this._magnitude = Math.sqrt(sumOfSquares)
+}
+
+/**
+ * Calculates the dot product of this vector and another vector.
+ *
+ * @param {lunr.Vector} otherVector - The vector to compute the dot product with.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.dot = function (otherVector) {
+ var dotProduct = 0,
+ a = this.elements, b = otherVector.elements,
+ aLen = a.length, bLen = b.length,
+ aVal = 0, bVal = 0,
+ i = 0, j = 0
+
+ while (i < aLen && j < bLen) {
+ aVal = a[i], bVal = b[j]
+ if (aVal < bVal) {
+ i += 2
+ } else if (aVal > bVal) {
+ j += 2
+ } else if (aVal == bVal) {
+ dotProduct += a[i + 1] * b[j + 1]
+ i += 2
+ j += 2
+ }
+ }
+
+ return dotProduct
+}
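The merge-style walk used by the dot product is easier to follow on raw arrays. The sketch below assumes the flat `[index, value, index, value, ...]` layout described for `lunr.Vector`, sorted by index; `sparseDot` and the sample vectors are hypothetical and exist only for this illustration.

```javascript
// Both inputs are flat [index, value, index, value, ...] arrays sorted by index.
function sparseDot (a, b) {
  let dot = 0
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) {
      i += 2 // index only present in a, skip it
    } else if (a[i] > b[j]) {
      j += 2 // index only present in b, skip it
    } else {
      dot += a[i + 1] * b[j + 1] // shared index, multiply the values
      i += 2
      j += 2
    }
  }
  return dot
}

const docVector = [1, 2, 4, 3, 7, 5] // index 1 -> 2, index 4 -> 3, index 7 -> 5
const queryVector = [4, 10, 7, 1, 9, 6] // index 4 -> 10, index 7 -> 1, index 9 -> 6
console.log(sparseDot(docVector, queryVector)) // 35
```

Only indexes 4 and 7 appear in both vectors, so the result is 3 * 10 + 5 * 1. The walk touches each element at most once, which is why the sorted flat layout stays cheap for sparse data.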
+
+/**
+ * Calculates the similarity between this vector and another vector.
+ *
+ * @param {lunr.Vector} otherVector - The other vector to calculate the
+ * similarity with.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.similarity = function (otherVector) {
+ return this.dot(otherVector) / this.magnitude() || 0
+}
+
+/**
+ * Converts the vector to an array of the elements within the vector.
+ *
+ * @returns {Number[]}
+ */
+lunr.Vector.prototype.toArray = function () {
+ var output = new Array (this.elements.length / 2)
+
+ for (var i = 1, j = 0; i < this.elements.length; i += 2, j++) {
+ output[j] = this.elements[i]
+ }
+
+ return output
+}
+
+/**
+ * A JSON serializable representation of the vector.
+ *
+ * @returns {Number[]}
+ */
+lunr.Vector.prototype.toJSON = function () {
+ return this.elements
+}
+/* eslint-disable */
+/*!
+ * lunr.stemmer
+ * Copyright (C) 2020 Oliver Nightingale
+ * Includes code from - http://tartarus.org/~martin/PorterStemmer/js.txt
+ */
+
+/**
+ * lunr.stemmer is an English language stemmer; this is a JavaScript
+ * implementation of the PorterStemmer taken from http://tartarus.org/~martin
+ *
+ * @static
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token - The string to stem
+ * @returns {lunr.Token}
+ * @see {@link lunr.Pipeline}
+ * @function
+ */
+lunr.stemmer = (function(){
+ var step2list = {
+ "ational" : "ate",
+ "tional" : "tion",
+ "enci" : "ence",
+ "anci" : "ance",
+ "izer" : "ize",
+ "bli" : "ble",
+ "alli" : "al",
+ "entli" : "ent",
+ "eli" : "e",
+ "ousli" : "ous",
+ "ization" : "ize",
+ "ation" : "ate",
+ "ator" : "ate",
+ "alism" : "al",
+ "iveness" : "ive",
+ "fulness" : "ful",
+ "ousness" : "ous",
+ "aliti" : "al",
+ "iviti" : "ive",
+ "biliti" : "ble",
+ "logi" : "log"
+ },
+
+ step3list = {
+ "icate" : "ic",
+ "ative" : "",
+ "alize" : "al",
+ "iciti" : "ic",
+ "ical" : "ic",
+ "ful" : "",
+ "ness" : ""
+ },
+
+ c = "[^aeiou]", // consonant
+ v = "[aeiouy]", // vowel
+ C = c + "[^aeiouy]*", // consonant sequence
+ V = v + "[aeiou]*", // vowel sequence
+
+ mgr0 = "^(" + C + ")?" + V + C, // [C]VC... is m>0
+ meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$", // [C]VC[V] is m=1
+ mgr1 = "^(" + C + ")?" + V + C + V + C, // [C]VCVC... is m>1
+ s_v = "^(" + C + ")?" + v; // vowel in stem
+
+ var re_mgr0 = new RegExp(mgr0);
+ var re_mgr1 = new RegExp(mgr1);
+ var re_meq1 = new RegExp(meq1);
+ var re_s_v = new RegExp(s_v);
+
+ var re_1a = /^(.+?)(ss|i)es$/;
+ var re2_1a = /^(.+?)([^s])s$/;
+ var re_1b = /^(.+?)eed$/;
+ var re2_1b = /^(.+?)(ed|ing)$/;
+ var re_1b_2 = /.$/;
+ var re2_1b_2 = /(at|bl|iz)$/;
+ var re3_1b_2 = new RegExp("([^aeiouylsz])\\1$");
+ var re4_1b_2 = new RegExp("^" + C + v + "[^aeiouwxy]$");
+
+ var re_1c = /^(.+?[^aeiou])y$/;
+ var re_2 = /^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/;
+
+ var re_3 = /^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/;
+
+ var re_4 = /^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/;
+ var re2_4 = /^(.+?)(s|t)(ion)$/;
+
+ var re_5 = /^(.+?)e$/;
+ var re_5_1 = /ll$/;
+ var re3_5 = new RegExp("^" + C + v + "[^aeiouwxy]$");
+
+ var porterStemmer = function porterStemmer(w) {
+ var stem,
+ suffix,
+ firstch,
+ re,
+ re2,
+ re3,
+ re4;
+
+ if (w.length < 3) { return w; }
+
+ firstch = w.substr(0,1);
+ if (firstch == "y") {
+ w = firstch.toUpperCase() + w.substr(1);
+ }
+
+ // Step 1a
+ re = re_1a
+ re2 = re2_1a;
+
+ if (re.test(w)) { w = w.replace(re,"$1$2"); }
+ else if (re2.test(w)) { w = w.replace(re2,"$1$2"); }
+
+ // Step 1b
+ re = re_1b;
+ re2 = re2_1b;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ re = re_mgr0;
+ if (re.test(fp[1])) {
+ re = re_1b_2;
+ w = w.replace(re,"");
+ }
+ } else if (re2.test(w)) {
+ var fp = re2.exec(w);
+ stem = fp[1];
+ re2 = re_s_v;
+ if (re2.test(stem)) {
+ w = stem;
+ re2 = re2_1b_2;
+ re3 = re3_1b_2;
+ re4 = re4_1b_2;
+ if (re2.test(w)) { w = w + "e"; }
+ else if (re3.test(w)) { re = re_1b_2; w = w.replace(re,""); }
+ else if (re4.test(w)) { w = w + "e"; }
+ }
+ }
+
+ // Step 1c - replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry -> cri, by -> by, say -> say)
+ re = re_1c;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ w = stem + "i";
+ }
+
+ // Step 2
+ re = re_2;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ suffix = fp[2];
+ re = re_mgr0;
+ if (re.test(stem)) {
+ w = stem + step2list[suffix];
+ }
+ }
+
+ // Step 3
+ re = re_3;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ suffix = fp[2];
+ re = re_mgr0;
+ if (re.test(stem)) {
+ w = stem + step3list[suffix];
+ }
+ }
+
+ // Step 4
+ re = re_4;
+ re2 = re2_4;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ re = re_mgr1;
+ if (re.test(stem)) {
+ w = stem;
+ }
+ } else if (re2.test(w)) {
+ var fp = re2.exec(w);
+ stem = fp[1] + fp[2];
+ re2 = re_mgr1;
+ if (re2.test(stem)) {
+ w = stem;
+ }
+ }
+
+ // Step 5
+ re = re_5;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ re = re_mgr1;
+ re2 = re_meq1;
+ re3 = re3_5;
+ if (re.test(stem) || (re2.test(stem) && !(re3.test(stem)))) {
+ w = stem;
+ }
+ }
+
+ re = re_5_1;
+ re2 = re_mgr1;
+ if (re.test(w) && re2.test(w)) {
+ re = re_1b_2;
+ w = w.replace(re,"");
+ }
+
+ // and turn initial Y back to y
+
+ if (firstch == "y") {
+ w = firstch.toLowerCase() + w.substr(1);
+ }
+
+ return w;
+ };
+
+ return function (token) {
+ return token.update(porterStemmer);
+ }
+})();
+
+lunr.Pipeline.registerFunction(lunr.stemmer, 'stemmer')
+/*!
+ * lunr.stopWordFilter
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * lunr.generateStopWordFilter builds a stopWordFilter function from the provided
+ * list of stop words.
+ *
+ * The built in lunr.stopWordFilter is built using this generator and can be used
+ * to generate custom stopWordFilters for applications or non English languages.
+ *
+ * @function
+ * @param {Array} stopWords - The list of stop words to build the filter from
+ * @returns {lunr.PipelineFunction}
+ * @see lunr.Pipeline
+ * @see lunr.stopWordFilter
+ */
+lunr.generateStopWordFilter = function (stopWords) {
+ var words = stopWords.reduce(function (memo, stopWord) {
+ memo[stopWord] = stopWord
+ return memo
+ }, {})
+
+ return function (token) {
+ if (token && words[token.toString()] !== token.toString()) return token
+ }
+}
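A string-based analogue of the generator may help show how a custom stop word list behaves. This sketch swaps the object lookup used above for a `Set` and works on plain strings rather than `lunr.Token` instances; `makeStopWordFilter` is a hypothetical name, not part of lunr.

```javascript
// Hypothetical string-based analogue of lunr.generateStopWordFilter.
function makeStopWordFilter (stopWords) {
  const words = new Set(stopWords)
  return function (token) {
    if (token && !words.has(token)) return token
    // falls through and returns undefined for stop words
  }
}

const filter = makeStopWordFilter(['the', 'to', 'a'])
const kept = ['bring', 'the', 'data', 'to', 'driver'].filter(function (t) {
  return filter(t) !== undefined
})
console.log(kept) // [ 'bring', 'data', 'driver' ]
```

Returning `undefined` is what tells a pipeline to drop the token, which is why the filter has no explicit `else` branch.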
+
+/**
+ * lunr.stopWordFilter is an English language stop word list filter, any words
+ * contained in the list will not be passed through the filter.
+ *
+ * This is intended to be used in the Pipeline. If the token does not pass the
+ * filter then undefined will be returned.
+ *
+ * @function
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token - A token to check for being a stop word.
+ * @returns {lunr.Token}
+ * @see {@link lunr.Pipeline}
+ */
+lunr.stopWordFilter = lunr.generateStopWordFilter([
+ 'a',
+ 'able',
+ 'about',
+ 'across',
+ 'after',
+ 'all',
+ 'almost',
+ 'also',
+ 'am',
+ 'among',
+ 'an',
+ 'and',
+ 'any',
+ 'are',
+ 'as',
+ 'at',
+ 'be',
+ 'because',
+ 'been',
+ 'but',
+ 'by',
+ 'can',
+ 'cannot',
+ 'could',
+ 'dear',
+ 'did',
+ 'do',
+ 'does',
+ 'either',
+ 'else',
+ 'ever',
+ 'every',
+ 'for',
+ 'from',
+ 'get',
+ 'got',
+ 'had',
+ 'has',
+ 'have',
+ 'he',
+ 'her',
+ 'hers',
+ 'him',
+ 'his',
+ 'how',
+ 'however',
+ 'i',
+ 'if',
+ 'in',
+ 'into',
+ 'is',
+ 'it',
+ 'its',
+ 'just',
+ 'least',
+ 'let',
+ 'like',
+ 'likely',
+ 'may',
+ 'me',
+ 'might',
+ 'most',
+ 'must',
+ 'my',
+ 'neither',
+ 'no',
+ 'nor',
+ 'not',
+ 'of',
+ 'off',
+ 'often',
+ 'on',
+ 'only',
+ 'or',
+ 'other',
+ 'our',
+ 'own',
+ 'rather',
+ 'said',
+ 'say',
+ 'says',
+ 'she',
+ 'should',
+ 'since',
+ 'so',
+ 'some',
+ 'than',
+ 'that',
+ 'the',
+ 'their',
+ 'them',
+ 'then',
+ 'there',
+ 'these',
+ 'they',
+ 'this',
+ 'tis',
+ 'to',
+ 'too',
+ 'twas',
+ 'us',
+ 'wants',
+ 'was',
+ 'we',
+ 'were',
+ 'what',
+ 'when',
+ 'where',
+ 'which',
+ 'while',
+ 'who',
+ 'whom',
+ 'why',
+ 'will',
+ 'with',
+ 'would',
+ 'yet',
+ 'you',
+ 'your'
+])
+
+lunr.Pipeline.registerFunction(lunr.stopWordFilter, 'stopWordFilter')
+/*!
+ * lunr.trimmer
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * lunr.trimmer is a pipeline function for trimming non word
+ * characters from the beginning and end of tokens before they
+ * enter the index.
+ *
+ * This implementation may not work correctly for non-Latin
+ * characters and should either be removed or adapted for use
+ * with languages with non-Latin characters.
+ *
+ * @static
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token The token to pass through the filter
+ * @returns {lunr.Token}
+ * @see lunr.Pipeline
+ */
+lunr.trimmer = function (token) {
+ return token.update(function (s) {
+ return s.replace(/^\W+/, '').replace(/\W+$/, '')
+ })
+}
+
+lunr.Pipeline.registerFunction(lunr.trimmer, 'trimmer')
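The two regular expressions in the trimmer can be tried on plain strings to see both the intended effect and the non-Latin caveat from the comment. `trimNonWord` is a hypothetical helper that applies the same replacements without a `lunr.Token` wrapper.

```javascript
// Same two replacements as lunr.trimmer, applied to a plain string.
function trimNonWord (s) {
  return s.replace(/^\W+/, '').replace(/\W+$/, '')
}

console.log(trimNonWord('"collect()",')) // collect

// \W treats every character outside [A-Za-z0-9_] as a non-word character,
// so accented letters at the edges of a token are stripped as well.
console.log(trimNonWord('\u00e9t\u00e9')) // t
```

The second call shows why the trimmer is usually replaced or removed for languages whose words begin or end with characters outside the ASCII word class.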
+/*!
+ * lunr.TokenSet
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * A token set is used to store the unique list of all tokens
+ * within an index. Token sets are also used to represent an
+ * incoming query to the index, this query token set and index
+ * token set are then intersected to find which tokens to look
+ * up in the inverted index.
+ *
+ * A token set can hold multiple tokens, as in the case of the
+ * index token set, or it can hold a single token as in the
+ * case of a simple query token set.
+ *
+ * Additionally token sets are used to perform wildcard matching.
+ * Leading, contained and trailing wildcards are supported, and
+ * from this edit distance matching can also be provided.
+ *
+ * Token sets are implemented as a minimal finite state automaton,
+ * where both common prefixes and suffixes are shared between tokens.
+ * This helps to reduce the space used for storing the token set.
+ *
+ * @constructor
+ */
+lunr.TokenSet = function () {
+ this.final = false
+ this.edges = {}
+ this.id = lunr.TokenSet._nextId
+ lunr.TokenSet._nextId += 1
+}
+
+/**
+ * Keeps track of the next, auto increment, identifier to assign
+ * to a new tokenSet.
+ *
+ * TokenSets require a unique identifier to be correctly minimised.
+ *
+ * @private
+ */
+lunr.TokenSet._nextId = 1
+
+/**
+ * Creates a TokenSet instance from the given sorted array of words.
+ *
+ * @param {String[]} arr - A sorted array of strings to create the set from.
+ * @returns {lunr.TokenSet}
+ * @throws Will throw an error if the input array is not sorted.
+ */
+lunr.TokenSet.fromArray = function (arr) {
+ var builder = new lunr.TokenSet.Builder
+
+ for (var i = 0, len = arr.length; i < len; i++) {
+ builder.insert(arr[i])
+ }
+
+ builder.finish()
+ return builder.root
+}
+
+/**
+ * Creates a token set from a query clause.
+ *
+ * @private
+ * @param {Object} clause - A single clause from lunr.Query.
+ * @param {string} clause.term - The query clause term.
+ * @param {number} [clause.editDistance] - The optional edit distance for the term.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromClause = function (clause) {
+ if ('editDistance' in clause) {
+ return lunr.TokenSet.fromFuzzyString(clause.term, clause.editDistance)
+ } else {
+ return lunr.TokenSet.fromString(clause.term)
+ }
+}
+
+/**
+ * Creates a token set representing a single string with a specified
+ * edit distance.
+ *
+ * Insertions, deletions, substitutions and transpositions are each
+ * treated as an edit distance of 1.
+ *
+ * Increasing the allowed edit distance will have a dramatic impact
+ * on the performance of both creating and intersecting these TokenSets.
+ * It is advised to keep the edit distance less than 3.
+ *
+ * @param {string} str - The string to create the token set from.
+ * @param {number} editDistance - The allowed edit distance to match.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromFuzzyString = function (str, editDistance) {
+ var root = new lunr.TokenSet
+
+ var stack = [{
+ node: root,
+ editsRemaining: editDistance,
+ str: str
+ }]
+
+ while (stack.length) {
+ var frame = stack.pop()
+
+ // no edit
+ if (frame.str.length > 0) {
+ var char = frame.str.charAt(0),
+ noEditNode
+
+ if (char in frame.node.edges) {
+ noEditNode = frame.node.edges[char]
+ } else {
+ noEditNode = new lunr.TokenSet
+ frame.node.edges[char] = noEditNode
+ }
+
+ if (frame.str.length == 1) {
+ noEditNode.final = true
+ }
+
+ stack.push({
+ node: noEditNode,
+ editsRemaining: frame.editsRemaining,
+ str: frame.str.slice(1)
+ })
+ }
+
+ if (frame.editsRemaining == 0) {
+ continue
+ }
+
+ // insertion
+ if ("*" in frame.node.edges) {
+ var insertionNode = frame.node.edges["*"]
+ } else {
+ var insertionNode = new lunr.TokenSet
+ frame.node.edges["*"] = insertionNode
+ }
+
+ if (frame.str.length == 0) {
+ insertionNode.final = true
+ }
+
+ stack.push({
+ node: insertionNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str
+ })
+
+ // deletion
+ // can only do a deletion if we have enough edits remaining
+ // and if there are characters left to delete in the string
+ if (frame.str.length > 1) {
+ stack.push({
+ node: frame.node,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str.slice(1)
+ })
+ }
+
+ // deletion
+ // just removing the last character from the str
+ if (frame.str.length == 1) {
+ frame.node.final = true
+ }
+
+ // substitution
+ // can only do a substitution if we have enough edits remaining
+ // and if there are characters left to substitute
+ if (frame.str.length >= 1) {
+ if ("*" in frame.node.edges) {
+ var substitutionNode = frame.node.edges["*"]
+ } else {
+ var substitutionNode = new lunr.TokenSet
+ frame.node.edges["*"] = substitutionNode
+ }
+
+ if (frame.str.length == 1) {
+ substitutionNode.final = true
+ }
+
+ stack.push({
+ node: substitutionNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str.slice(1)
+ })
+ }
+
+ // transposition
+ // can only do a transposition if there are edits remaining
+ // and there are enough characters to transpose
+ if (frame.str.length > 1) {
+ var charA = frame.str.charAt(0),
+ charB = frame.str.charAt(1),
+ transposeNode
+
+ if (charB in frame.node.edges) {
+ transposeNode = frame.node.edges[charB]
+ } else {
+ transposeNode = new lunr.TokenSet
+ frame.node.edges[charB] = transposeNode
+ }
+
+ if (frame.str.length == 1) {
+ transposeNode.final = true
+ }
+
+ stack.push({
+ node: transposeNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: charA + frame.str.slice(2)
+ })
+ }
+ }
+
+ return root
+}
+
+/**
+ * Creates a TokenSet from a string.
+ *
+ * The string may contain one or more wildcard characters (*)
+ * that will allow wildcard matching when intersecting with
+ * another TokenSet.
+ *
+ * @param {string} str - The string to create a TokenSet from.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromString = function (str) {
+ var node = new lunr.TokenSet,
+ root = node
+
+ /*
+ * Iterates through all characters within the passed string
+ * appending a node for each character.
+ *
+ * When a wildcard character is found then a self
+ * referencing edge is introduced to continually match
+ * any number of any characters.
+ */
+ for (var i = 0, len = str.length; i < len; i++) {
+ var char = str[i],
+ final = (i == len - 1)
+
+ if (char == "*") {
+ node.edges[char] = node
+ node.final = final
+
+ } else {
+ var next = new lunr.TokenSet
+ next.final = final
+
+ node.edges[char] = next
+ node = next
+ }
+ }
+
+ return root
+}
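The self-referencing wildcard edge is the subtle part of this construction. The sketch below rebuilds the same structure with plain objects and walks it with a small matcher. Both `buildFromString` and `matches` are hypothetical: lunr itself matches by intersecting two token sets rather than by walking a word like this, so treat the matcher as an illustration of the automaton only.

```javascript
// Plain-object stand-in for lunr.TokenSet nodes (an assumption for brevity).
function buildFromString (str) {
  const root = { final: false, edges: {} }
  let node = root
  for (let i = 0; i < str.length; i++) {
    const char = str[i]
    const final = i === str.length - 1
    if (char === '*') {
      node.edges[char] = node // self-referencing edge: stay here for any character
      node.final = final
    } else {
      const next = { final: final, edges: {} }
      node.edges[char] = next
      node = next
    }
  }
  return root
}

// Hypothetical matcher: follow literal edges and wildcard edges in parallel.
function matches (root, word) {
  let states = [root]
  for (const ch of word) {
    const nextStates = []
    for (const s of states) {
      if (s.edges[ch]) nextStates.push(s.edges[ch])
      if (s.edges['*']) nextStates.push(s.edges['*'])
    }
    states = nextStates
    if (states.length === 0) return false
  }
  return states.some(function (s) { return s.final })
}

const prefixSet = buildFromString('col*')
console.log(matches(prefixSet, 'collect')) // true
console.log(matches(prefixSet, 'col')) // true
console.log(matches(prefixSet, 'count')) // false
```

Because a trailing wildcard marks the node it loops on as final, `col*` also matches the bare prefix `col`.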
+
+/**
+ * Converts this TokenSet into an array of strings
+ * contained within the TokenSet.
+ *
+ * This is not intended to be used on a TokenSet that
+ * contains wildcards, in these cases the results are
+ * undefined and are likely to cause an infinite loop.
+ *
+ * @returns {string[]}
+ */
+lunr.TokenSet.prototype.toArray = function () {
+ var words = []
+
+ var stack = [{
+ prefix: "",
+ node: this
+ }]
+
+ while (stack.length) {
+ var frame = stack.pop(),
+ edges = Object.keys(frame.node.edges),
+ len = edges.length
+
+ if (frame.node.final) {
+ /* In Safari, at this point the prefix is sometimes corrupted, see:
+ * https://github.com/olivernn/lunr.js/issues/279 Calling any
+ * String.prototype method forces Safari to "cast" this string to what
+ * it's supposed to be, fixing the bug. */
+ frame.prefix.charAt(0)
+ words.push(frame.prefix)
+ }
+
+ for (var i = 0; i < len; i++) {
+ var edge = edges[i]
+
+ stack.push({
+ prefix: frame.prefix.concat(edge),
+ node: frame.node.edges[edge]
+ })
+ }
+ }
+
+ return words
+}
+
+/**
+ * Generates a string representation of a TokenSet.
+ *
+ * This is intended to allow TokenSets to be used as keys
+ * in objects, largely to aid the construction and minimisation
+ * of a TokenSet. As such it is not designed to be a human
+ * friendly representation of the TokenSet.
+ *
+ * @returns {string}
+ */
+lunr.TokenSet.prototype.toString = function () {
+ // NOTE: Using Object.keys here as this.edges is very likely
+ // to enter 'hash-mode' with many keys being added
+ //
+ // avoiding a for-in loop here as it leads to the function
+ // being de-optimised (at least in V8). From some simple
+ // benchmarks the performance is comparable, but allowing
+ // V8 to optimize may mean easy performance wins in the future.
+
+ if (this._str) {
+ return this._str
+ }
+
+ var str = this.final ? '1' : '0',
+ labels = Object.keys(this.edges).sort(),
+ len = labels.length
+
+ for (var i = 0; i < len; i++) {
+ var label = labels[i],
+ node = this.edges[label]
+
+ str = str + label + node.id
+ }
+
+ return str
+}
+
+/**
+ * Returns a new TokenSet that is the intersection of
+ * this TokenSet and the passed TokenSet.
+ *
+ * This intersection will take into account any wildcards
+ * contained within the TokenSet.
+ *
+ * @param {lunr.TokenSet} b - Another TokenSet to intersect with.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.prototype.intersect = function (b) {
+ var output = new lunr.TokenSet,
+ frame = undefined
+
+ var stack = [{
+ qNode: b,
+ output: output,
+ node: this
+ }]
+
+ while (stack.length) {
+ frame = stack.pop()
+
+ // NOTE: As with the #toString method, we are using
+ // Object.keys and a for loop instead of a for-in loop
+ // as both of these objects enter 'hash' mode, causing
+ // the function to be de-optimised in V8
+ var qEdges = Object.keys(frame.qNode.edges),
+ qLen = qEdges.length,
+ nEdges = Object.keys(frame.node.edges),
+ nLen = nEdges.length
+
+ for (var q = 0; q < qLen; q++) {
+ var qEdge = qEdges[q]
+
+ for (var n = 0; n < nLen; n++) {
+ var nEdge = nEdges[n]
+
+ if (nEdge == qEdge || qEdge == '*') {
+ var node = frame.node.edges[nEdge],
+ qNode = frame.qNode.edges[qEdge],
+ final = node.final && qNode.final,
+ next = undefined
+
+ if (nEdge in frame.output.edges) {
+ // an edge already exists for this character
+ // no need to create a new node, just set the finality
+ // bit unless this node is already final
+ next = frame.output.edges[nEdge]
+ next.final = next.final || final
+
+ } else {
+ // no edge exists yet, must create one
+ // set the finality bit and insert it
+ // into the output
+ next = new lunr.TokenSet
+ next.final = final
+ frame.output.edges[nEdge] = next
+ }
+
+ stack.push({
+ qNode: qNode,
+ output: next,
+ node: node
+ })
+ }
+ }
+ }
+ }
+
+ return output
+}
+lunr.TokenSet.Builder = function () {
+ this.previousWord = ""
+ this.root = new lunr.TokenSet
+ this.uncheckedNodes = []
+ this.minimizedNodes = {}
+}
+
+lunr.TokenSet.Builder.prototype.insert = function (word) {
+ var node,
+ commonPrefix = 0
+
+ if (word < this.previousWord) {
+ throw new Error ("Out of order word insertion")
+ }
+
+ for (var i = 0; i < word.length && i < this.previousWord.length; i++) {
+ if (word[i] != this.previousWord[i]) break
+ commonPrefix++
+ }
+
+ this.minimize(commonPrefix)
+
+ if (this.uncheckedNodes.length == 0) {
+ node = this.root
+ } else {
+ node = this.uncheckedNodes[this.uncheckedNodes.length - 1].child
+ }
+
+ for (var i = commonPrefix; i < word.length; i++) {
+ var nextNode = new lunr.TokenSet,
+ char = word[i]
+
+ node.edges[char] = nextNode
+
+ this.uncheckedNodes.push({
+ parent: node,
+ char: char,
+ child: nextNode
+ })
+
+ node = nextNode
+ }
+
+ node.final = true
+ this.previousWord = word
+}
+
+lunr.TokenSet.Builder.prototype.finish = function () {
+ this.minimize(0)
+}
+
+lunr.TokenSet.Builder.prototype.minimize = function (downTo) {
+ for (var i = this.uncheckedNodes.length - 1; i >= downTo; i--) {
+ var node = this.uncheckedNodes[i],
+ childKey = node.child.toString()
+
+ if (childKey in this.minimizedNodes) {
+ node.parent.edges[node.char] = this.minimizedNodes[childKey]
+ } else {
+ // Cache the key for this node since
+ // we know it can't change anymore
+ node.child._str = childKey
+
+ this.minimizedNodes[childKey] = node.child
+ }
+
+ this.uncheckedNodes.pop()
+ }
+}
+/*!
+ * lunr.Index
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * An index contains the built index of all documents and provides a query interface
+ * to the index.
+ *
+ * Usually instances of lunr.Index will not be created using this constructor, instead
+ * lunr.Builder should be used to construct new indexes, or lunr.Index.load should be
+ * used to load previously built and serialized indexes.
+ *
+ * @constructor
+ * @param {Object} attrs - The attributes of the built search index.
+ * @param {Object} attrs.invertedIndex - An index of term/field to document reference.
+ * @param {Object} attrs.fieldVectors - Field vectors
+ * @param {lunr.TokenSet} attrs.tokenSet - A set of all corpus tokens.
+ * @param {string[]} attrs.fields - The names of indexed document fields.
+ * @param {lunr.Pipeline} attrs.pipeline - The pipeline to use for search terms.
+ */
+lunr.Index = function (attrs) {
+ this.invertedIndex = attrs.invertedIndex
+ this.fieldVectors = attrs.fieldVectors
+ this.tokenSet = attrs.tokenSet
+ this.fields = attrs.fields
+ this.pipeline = attrs.pipeline
+}
+
+/**
+ * A result contains details of a document matching a search query.
+ * @typedef {Object} lunr.Index~Result
+ * @property {string} ref - The reference of the document this result represents.
+ * @property {number} score - A number between 0 and 1 representing how similar this document is to the query.
+ * @property {lunr.MatchData} matchData - Contains metadata about this match including which term(s) caused the match.
+ */
+
+/**
+ * Although lunr provides the ability to create queries using lunr.Query, it also provides a simple
+ * query language which itself is parsed into an instance of lunr.Query.
+ *
+ * For programmatically building queries it is advised to directly use lunr.Query, the query language
+ * is best used for human entered text rather than program generated text.
+ *
+ * At its simplest a query can be a single term, e.g. `hello`. Multiple terms are also supported
+ * and will be combined with OR, e.g. `hello world` will match documents that contain either 'hello'
+ * or 'world', though those that contain both will rank higher in the results.
+ *
+ * Wildcards can be included in terms to match one or more unspecified characters, these wildcards can
+ * be inserted anywhere within the term, and more than one wildcard can exist in a single term. Adding
+ * wildcards will increase the number of documents that will be found but can also have a negative
+ * impact on query performance, especially with wildcards at the beginning of a term.
+ *
+ * Terms can be restricted to specific fields, e.g. `title:hello`, only documents with the term
+ * hello in the title field will match this query. Using a field not present in the index will lead
+ * to an error being thrown.
+ *
+ * Modifiers can also be added to terms, lunr supports edit distance and boost modifiers on terms. A term
+ * boost will make documents matching that term score higher, e.g. `foo^5`. Edit distance is also supported
+ * to provide fuzzy matching, e.g. `hello~2` will match documents containing terms within an edit distance of 2 of 'hello'.
+ * Avoid large values for edit distance to improve query performance.
+ *
+ * Each term also supports a presence modifier. By default a term's presence in a document is optional, however
+ * this can be changed to either required or prohibited. For a term's presence to be required in a document the
+ * term should be prefixed with a '+', e.g. `+foo bar` is a search for documents that must contain 'foo' and
+ * optionally contain 'bar'. Conversely a leading '-' sets the term's presence to prohibited, i.e. it must not
+ * appear in a document, e.g. `-foo bar` is a search for documents that do not contain 'foo' but may contain 'bar'.
+ *
+ * To escape special characters the backslash character '\' can be used, this allows searches to include
+ * characters that would normally be considered modifiers, e.g. `foo\~2` will search for a term "foo~2" instead
+ * of attempting to apply an edit distance of 2 to the search term "foo".
+ *
+ * @typedef {string} lunr.Index~QueryString
+ * @example <caption>Simple single term query</caption>
+ * hello
+ * @example <caption>Multiple term query</caption>
+ * hello world
+ * @example <caption>term scoped to a field</caption>
+ * title:hello
+ * @example <caption>term with a boost of 10</caption>
+ * hello^10
+ * @example <caption>term with an edit distance of 2</caption>
+ * hello~2
+ * @example <caption>terms with presence modifiers</caption>
+ * -foo +bar baz
+ */
+
+/**
+ * Performs a search against the index using lunr query syntax.
+ *
+ * Results will be returned sorted by their score, with the most relevant results
+ * returned first. For details on how the score is calculated, please see
+ * the {@link https://lunrjs.com/guides/searching.html#scoring|guide}.
+ *
+ * For more programmatic querying use lunr.Index#query.
+ *
+ * @param {lunr.Index~QueryString} queryString - A string containing a lunr query.
+ * @throws {lunr.QueryParseError} If the passed query string cannot be parsed.
+ * @returns {lunr.Index~Result[]}
+ */
+lunr.Index.prototype.search = function (queryString) {
+ return this.query(function (query) {
+ var parser = new lunr.QueryParser(queryString, query)
+ parser.parse()
+ })
+}
+
+/**
+ * A query builder callback provides a query object to be used to express
+ * the query to perform on the index.
+ *
+ * @callback lunr.Index~queryBuilder
+ * @param {lunr.Query} query - The query object to build up.
+ * @this lunr.Query
+ */
+
+/**
+ * Performs a query against the index using the yielded lunr.Query object.
+ *
+ * If performing programmatic queries against the index, this method is preferred
+ * over lunr.Index#search so as to avoid the additional query parsing overhead.
+ *
+ * A query object is yielded to the supplied function which should be used to
+ * express the query to be run against the index.
+ *
+ * Note that although this function takes a callback parameter it is _not_ an
+ * asynchronous operation, the callback is just yielded a query object to be
+ * customized.
+ *
+ * @param {lunr.Index~queryBuilder} fn - A function that is used to build the query.
+ * @returns {lunr.Index~Result[]}
+ */
+lunr.Index.prototype.query = function (fn) {
+ // for each query clause
+ // * process terms
+ // * expand terms from token set
+ // * find matching documents and metadata
+ // * get document vectors
+ // * score documents
+
+ var query = new lunr.Query(this.fields),
+ matchingFields = Object.create(null),
+ queryVectors = Object.create(null),
+ termFieldCache = Object.create(null),
+ requiredMatches = Object.create(null),
+ prohibitedMatches = Object.create(null)
+
+ /*
+ * To support field level boosts a query vector is created per
+ * field. An empty vector is eagerly created to support negated
+ * queries.
+ */
+ for (var i = 0; i < this.fields.length; i++) {
+ queryVectors[this.fields[i]] = new lunr.Vector
+ }
+
+ fn.call(query, query)
+
+ for (var i = 0; i < query.clauses.length; i++) {
+ /*
+ * Unless the pipeline has been disabled for this term, which is
+ * the case for terms with wildcards, we need to pass the clause
+ * term through the search pipeline. A pipeline returns an array
+ * of processed terms. Pipeline functions may expand the passed
+ * term, which means we may end up performing multiple index lookups
+ * for a single query term.
+ */
+ var clause = query.clauses[i],
+ terms = null,
+ clauseMatches = lunr.Set.empty
+
+ if (clause.usePipeline) {
+ terms = this.pipeline.runString(clause.term, {
+ fields: clause.fields
+ })
+ } else {
+ terms = [clause.term]
+ }
+
+ for (var m = 0; m < terms.length; m++) {
+ var term = terms[m]
+
+ /*
+ * Each term returned from the pipeline needs to use the same query
+ * clause object, e.g. the same boost and/or edit distance. The
+ * simplest way to do this is to re-use the clause object but mutate
+ * its term property.
+ */
+ clause.term = term
+
+ /*
+ * From the term in the clause we create a token set which will then
+ * be used to intersect the index's token set to get a list of terms
+ * to look up in the inverted index.
+ */
+ var termTokenSet = lunr.TokenSet.fromClause(clause),
+ expandedTerms = this.tokenSet.intersect(termTokenSet).toArray()
+
+ /*
+ * If a term marked as required does not exist in the tokenSet it is
+ * impossible for the search to return any matches. We set all the field
+ * scoped required matches set to empty and stop examining any further
+ * clauses.
+ */
+ if (expandedTerms.length === 0 && clause.presence === lunr.Query.presence.REQUIRED) {
+ for (var k = 0; k < clause.fields.length; k++) {
+ var field = clause.fields[k]
+ requiredMatches[field] = lunr.Set.empty
+ }
+
+ break
+ }
+
+ for (var j = 0; j < expandedTerms.length; j++) {
+ /*
+ * For each term get the posting and termIndex, this is required for
+ * building the query vector.
+ */
+ var expandedTerm = expandedTerms[j],
+ posting = this.invertedIndex[expandedTerm],
+ termIndex = posting._index
+
+ for (var k = 0; k < clause.fields.length; k++) {
+ /*
+ * For each field that this query term is scoped by (by default
+ * all fields are in scope) we need to get all the document refs
+ * that have this term in that field.
+ *
+ * The posting is the entry in the invertedIndex for the matching
+ * term from above.
+ */
+ var field = clause.fields[k],
+ fieldPosting = posting[field],
+ matchingDocumentRefs = Object.keys(fieldPosting),
+ termField = expandedTerm + "/" + field,
+ matchingDocumentsSet = new lunr.Set(matchingDocumentRefs)
+
+ /*
+ * if the presence of this term is required ensure that the matching
+ * documents are added to the set of required matches for this clause.
+ *
+ */
+ if (clause.presence == lunr.Query.presence.REQUIRED) {
+ clauseMatches = clauseMatches.union(matchingDocumentsSet)
+
+ if (requiredMatches[field] === undefined) {
+ requiredMatches[field] = lunr.Set.complete
+ }
+ }
+
+ /*
+ * if the presence of this term is prohibited ensure that the matching
+ * documents are added to the set of prohibited matches for this field,
+ * creating that set if it does not yet exist.
+ */
+ if (clause.presence == lunr.Query.presence.PROHIBITED) {
+ if (prohibitedMatches[field] === undefined) {
+ prohibitedMatches[field] = lunr.Set.empty
+ }
+
+ prohibitedMatches[field] = prohibitedMatches[field].union(matchingDocumentsSet)
+
+ /*
+ * Prohibited matches should not be part of the query vector used for
+ * similarity scoring and no metadata should be extracted so we continue
+ * to the next field
+ */
+ continue
+ }
+
+ /*
+ * The query field vector is populated using the termIndex found for
+ * the term and a unit value with the appropriate boost applied.
+ * Using upsert because there could already be an entry in the vector
+ * for the term we are working with. In that case we just add the scores
+ * together.
+ */
+ queryVectors[field].upsert(termIndex, clause.boost, function (a, b) { return a + b })
+
+ /**
+ * If we've already seen this term, field combo then we've already collected
+ * the matching documents and metadata, no need to go through all that again
+ */
+ if (termFieldCache[termField]) {
+ continue
+ }
+
+ for (var l = 0; l < matchingDocumentRefs.length; l++) {
+ /*
+ * All metadata for this term/field/document triple
+ * are then extracted and collected into an instance
+ * of lunr.MatchData ready to be returned in the query
+ * results
+ */
+ var matchingDocumentRef = matchingDocumentRefs[l],
+ matchingFieldRef = new lunr.FieldRef (matchingDocumentRef, field),
+ metadata = fieldPosting[matchingDocumentRef],
+ fieldMatch
+
+ if ((fieldMatch = matchingFields[matchingFieldRef]) === undefined) {
+ matchingFields[matchingFieldRef] = new lunr.MatchData (expandedTerm, field, metadata)
+ } else {
+ fieldMatch.add(expandedTerm, field, metadata)
+ }
+
+ }
+
+ termFieldCache[termField] = true
+ }
+ }
+ }
+
+ /**
+ * If the presence was required we need to update the requiredMatches field sets.
+ * We do this after all fields for the term have collected their matches because
+ * the clause term's presence is required in _any_ of the fields, not _all_ of the
+ * fields.
+ */
+ if (clause.presence === lunr.Query.presence.REQUIRED) {
+ for (var k = 0; k < clause.fields.length; k++) {
+ var field = clause.fields[k]
+ requiredMatches[field] = requiredMatches[field].intersect(clauseMatches)
+ }
+ }
+ }
+
+ /**
+ * Need to combine the field scoped required and prohibited
+ * matching documents into a global set of required and prohibited
+ * matches
+ */
+ var allRequiredMatches = lunr.Set.complete,
+ allProhibitedMatches = lunr.Set.empty
+
+ for (var i = 0; i < this.fields.length; i++) {
+ var field = this.fields[i]
+
+ if (requiredMatches[field]) {
+ allRequiredMatches = allRequiredMatches.intersect(requiredMatches[field])
+ }
+
+ if (prohibitedMatches[field]) {
+ allProhibitedMatches = allProhibitedMatches.union(prohibitedMatches[field])
+ }
+ }
+
+ var matchingFieldRefs = Object.keys(matchingFields),
+ results = [],
+ matches = Object.create(null)
+
+ /*
+ * If the query is negated (contains only prohibited terms)
+ * we need to get _all_ fieldRefs currently existing in the
+ * index. This is only done when we know that the query is
+ * entirely prohibited terms to avoid any cost of getting all
+ * fieldRefs unnecessarily.
+ *
+ * Additionally, blank MatchData must be created to correctly
+ * populate the results.
+ */
+ if (query.isNegated()) {
+ matchingFieldRefs = Object.keys(this.fieldVectors)
+
+ for (var i = 0; i < matchingFieldRefs.length; i++) {
+ var matchingFieldRef = matchingFieldRefs[i]
+ var fieldRef = lunr.FieldRef.fromString(matchingFieldRef)
+ matchingFields[matchingFieldRef] = new lunr.MatchData
+ }
+ }
+
+ for (var i = 0; i < matchingFieldRefs.length; i++) {
+ /*
+ * Currently we have document fields that match the query, but we
+ * need to return documents. The matchData and scores are combined
+ * from multiple fields belonging to the same document.
+ *
+ * Scores are calculated by field, using the query vectors created
+ * above, and combined into a final document score using addition.
+ */
+ var fieldRef = lunr.FieldRef.fromString(matchingFieldRefs[i]),
+ docRef = fieldRef.docRef
+
+ if (!allRequiredMatches.contains(docRef)) {
+ continue
+ }
+
+ if (allProhibitedMatches.contains(docRef)) {
+ continue
+ }
+
+ var fieldVector = this.fieldVectors[fieldRef],
+ score = queryVectors[fieldRef.fieldName].similarity(fieldVector),
+ docMatch
+
+ if ((docMatch = matches[docRef]) !== undefined) {
+ docMatch.score += score
+ docMatch.matchData.combine(matchingFields[fieldRef])
+ } else {
+ var match = {
+ ref: docRef,
+ score: score,
+ matchData: matchingFields[fieldRef]
+ }
+ matches[docRef] = match
+ results.push(match)
+ }
+ }
+
+ /*
+ * Sort the results objects by score, highest first.
+ */
+ return results.sort(function (a, b) {
+ return b.score - a.score
+ })
+}
+
+/**
+ * Prepares the index for JSON serialization.
+ *
+ * The schema for this JSON blob will be described in a
+ * separate JSON schema file.
+ *
+ * @returns {Object}
+ */
+lunr.Index.prototype.toJSON = function () {
+ var invertedIndex = Object.keys(this.invertedIndex)
+ .sort()
+ .map(function (term) {
+ return [term, this.invertedIndex[term]]
+ }, this)
+
+ var fieldVectors = Object.keys(this.fieldVectors)
+ .map(function (ref) {
+ return [ref, this.fieldVectors[ref].toJSON()]
+ }, this)
+
+ return {
+ version: lunr.version,
+ fields: this.fields,
+ fieldVectors: fieldVectors,
+ invertedIndex: invertedIndex,
+ pipeline: this.pipeline.toJSON()
+ }
+}
+
+/**
+ * Loads a previously serialized lunr.Index
+ *
+ * @param {Object} serializedIndex - A previously serialized lunr.Index
+ * @returns {lunr.Index}
+ */
+lunr.Index.load = function (serializedIndex) {
+ var attrs = {},
+ fieldVectors = {},
+ serializedVectors = serializedIndex.fieldVectors,
+ invertedIndex = Object.create(null),
+ serializedInvertedIndex = serializedIndex.invertedIndex,
+ tokenSetBuilder = new lunr.TokenSet.Builder,
+ pipeline = lunr.Pipeline.load(serializedIndex.pipeline)
+
+ if (serializedIndex.version != lunr.version) {
+ lunr.utils.warn("Version mismatch when loading serialised index. Current version of lunr '" + lunr.version + "' does not match serialized index '" + serializedIndex.version + "'")
+ }
+
+ for (var i = 0; i < serializedVectors.length; i++) {
+ var tuple = serializedVectors[i],
+ ref = tuple[0],
+ elements = tuple[1]
+
+ fieldVectors[ref] = new lunr.Vector(elements)
+ }
+
+ for (var i = 0; i < serializedInvertedIndex.length; i++) {
+ var tuple = serializedInvertedIndex[i],
+ term = tuple[0],
+ posting = tuple[1]
+
+ tokenSetBuilder.insert(term)
+ invertedIndex[term] = posting
+ }
+
+ tokenSetBuilder.finish()
+
+ attrs.fields = serializedIndex.fields
+
+ attrs.fieldVectors = fieldVectors
+ attrs.invertedIndex = invertedIndex
+ attrs.tokenSet = tokenSetBuilder.root
+ attrs.pipeline = pipeline
+
+ return new lunr.Index(attrs)
+}
+/*!
+ * lunr.Builder
+ * Copyright (C) 2020 Oliver Nightingale
+ */
+
+/**
+ * lunr.Builder performs indexing on a set of documents and
+ * returns instances of lunr.Index ready for querying.
+ *
+ * All configuration of the index is done via the builder, the
+ * fields to index, the document reference, the text processing
+ * pipeline and document scoring parameters are all set on the
+ * builder before indexing.
+ *
+ * @constructor
+ * @property {string} _ref - Internal reference to the document reference field.
+ * @property {string[]} _fields - Internal reference to the document fields to index.
+ * @property {object} invertedIndex - The inverted index maps terms to document fields.
+ * @property {object} fieldTermFrequencies - Keeps track of term frequencies for each document field.
+ * @property {object} fieldLengths - Keeps track of the length of each document field added to the index.
+ * @property {lunr.tokenizer} tokenizer - Function for splitting strings into tokens for indexing.
+ * @property {lunr.Pipeline} pipeline - The pipeline performs text processing on tokens before indexing.
+ * @property {lunr.Pipeline} searchPipeline - A pipeline for processing search terms before querying the index.
+ * @property {number} documentCount - Keeps track of the total number of documents indexed.
+ * @property {number} _b - A parameter to control field length normalization. Setting this to 0 disables normalization, 1 fully normalizes field lengths. The default value is 0.75.
+ * @property {number} _k1 - A parameter to control how quickly an increase in term frequency results in term frequency saturation, the default value is 1.2.
+ * @property {number} termIndex - A counter incremented for each unique term, used to identify a term's position in the vector space.
+ * @property {array} metadataWhitelist - A list of metadata keys that have been whitelisted for entry in the index.
+ */
+lunr.Builder = function () {
+ this._ref = "id"
+ this._fields = Object.create(null)
+ this._documents = Object.create(null)
+ this.invertedIndex = Object.create(null)
+ this.fieldTermFrequencies = {}
+ this.fieldLengths = {}
+ this.tokenizer = lunr.tokenizer
+ this.pipeline = new lunr.Pipeline
+ this.searchPipeline = new lunr.Pipeline
+ this.documentCount = 0
+ this._b = 0.75
+ this._k1 = 1.2
+ this.termIndex = 0
+ this.metadataWhitelist = []
+}
+
+/**
+ * Sets the document field used as the document reference. Every document must have this field.
+ * The type of this field in the document should be a string, if it is not a string it will be
+ * coerced into a string by calling toString.
+ *
+ * The default ref is 'id'.
+ *
+ * The ref should _not_ be changed during indexing, it should be set before any documents are
+ * added to the index. Changing it during indexing can lead to inconsistent results.
+ *
+ * @param {string} ref - The name of the reference field in the document.
+ */
+lunr.Builder.prototype.ref = function (ref) {
+ this._ref = ref
+}
+
+/**
+ * A function that is used to extract a field from a document.
+ *
+ * Lunr expects a field to be at the top level of a document, if however the field
+ * is deeply nested within a document an extractor function can be used to extract
+ * the right field for indexing.
+ *
+ * @callback fieldExtractor
+ * @param {object} doc - The document being added to the index.
+ * @returns {?(string|object|object[])} obj - The object that will be indexed for this field.
+ * @example <caption>Extracting a nested field</caption>
+ * function (doc) { return doc.nested.field }
+ */
+
+/**
+ * Adds a field to the list of document fields that will be indexed. Every document being
+ * indexed should have this field. Null values for this field in indexed documents will
+ * not cause errors but will limit the chance of that document being retrieved by searches.
+ *
+ * All fields should be added before adding documents to the index. Adding fields after
+ * a document has been indexed will have no effect on already indexed documents.
+ *
+ * Fields can be boosted at build time. This allows terms within that field to have more
+ * importance when ranking search results. Use a field boost to specify that matches within
+ * one field are more important than other fields.
+ *
+ * @param {string} fieldName - The name of a field to index in all documents.
+ * @param {object} attributes - Optional attributes associated with this field.
+ * @param {number} [attributes.boost=1] - Boost applied to all terms within this field.
+ * @param {fieldExtractor} [attributes.extractor] - Function to extract a field from a document.
+ * @throws {RangeError} fieldName cannot contain unsupported characters '/'
+ */
+lunr.Builder.prototype.field = function (fieldName, attributes) {
+ if (/\//.test(fieldName)) {
+ throw new RangeError ("Field '" + fieldName + "' contains illegal character '/'")
+ }
+
+ this._fields[fieldName] = attributes || {}
+}
+
+/**
+ * A parameter to tune the amount of field length normalisation that is applied when
+ * calculating relevance scores. A value of 0 will completely disable any normalisation
+ * and a value of 1 will fully normalise field lengths. The default is 0.75. Values of b
+ * will be clamped to the range 0 - 1.
+ *
+ * @param {number} number - The value to set for this tuning parameter.
+ */
+lunr.Builder.prototype.b = function (number) {
+ if (number < 0) {
+ this._b = 0
+ } else if (number > 1) {
+ this._b = 1
+ } else {
+ this._b = number
+ }
+}
+
+/**
+ * A parameter that controls the speed at which a rise in term frequency results in term
+ * frequency saturation. The default value is 1.2. Setting this to a higher value will give
+ * slower saturation levels, a lower value will result in quicker saturation.
+ *
+ * @param {number} number - The value to set for this tuning parameter.
+ */
+lunr.Builder.prototype.k1 = function (number) {
+ this._k1 = number
+}
+
+/**
+ * Adds a document to the index.
+ *
+ * Before adding documents to the index, the index should have been fully set up, with the document
+ * ref and all fields to index already having been specified.
+ *
+ * The document must have a field name as specified by the ref (by default this is 'id') and
+ * it should have all fields defined for indexing, though null or undefined values will not
+ * cause errors.
+ *
+ * Entire documents can be boosted at build time. Applying a boost to a document indicates that
+ * this document should rank higher in search results than other documents.
+ *
+ * @param {object} doc - The document to add to the index.
+ * @param {object} attributes - Optional attributes associated with this document.
+ * @param {number} [attributes.boost=1] - Boost applied to all terms within this document.
+ */
+lunr.Builder.prototype.add = function (doc, attributes) {
+ var docRef = doc[this._ref],
+ fields = Object.keys(this._fields)
+
+ this._documents[docRef] = attributes || {}
+ this.documentCount += 1
+
+ for (var i = 0; i < fields.length; i++) {
+ var fieldName = fields[i],
+ extractor = this._fields[fieldName].extractor,
+ field = extractor ? extractor(doc) : doc[fieldName],
+ tokens = this.tokenizer(field, {
+ fields: [fieldName]
+ }),
+ terms = this.pipeline.run(tokens),
+ fieldRef = new lunr.FieldRef (docRef, fieldName),
+ fieldTerms = Object.create(null)
+
+ this.fieldTermFrequencies[fieldRef] = fieldTerms
+ this.fieldLengths[fieldRef] = 0
+
+ // store the length of this field for this document
+ this.fieldLengths[fieldRef] += terms.length
+
+ // calculate term frequencies for this field
+ for (var j = 0; j < terms.length; j++) {
+ var term = terms[j]
+
+ if (fieldTerms[term] == undefined) {
+ fieldTerms[term] = 0
+ }
+
+ fieldTerms[term] += 1
+
+ // add to inverted index
+ // create an initial posting if one doesn't exist
+ if (this.invertedIndex[term] == undefined) {
+ var posting = Object.create(null)
+ posting["_index"] = this.termIndex
+ this.termIndex += 1
+
+ for (var k = 0; k < fields.length; k++) {
+ posting[fields[k]] = Object.create(null)
+ }
+
+ this.invertedIndex[term] = posting
+ }
+
+ // add an entry for this term/fieldName/docRef to the invertedIndex
+ if (this.invertedIndex[term][fieldName][docRef] == undefined) {
+ this.invertedIndex[term][fieldName][docRef] = Object.create(null)
+ }
+
+ // store all whitelisted metadata about this token in the
+ // inverted index
+ for (var l = 0; l < this.metadataWhitelist.length; l++) {
+ var metadataKey = this.metadataWhitelist[l],
+ metadata = term.metadata[metadataKey]
+
+ if (this.invertedIndex[term][fieldName][docRef][metadataKey] == undefined) {
+ this.invertedIndex[term][fieldName][docRef][metadataKey] = []
+ }
+
+ this.invertedIndex[term][fieldName][docRef][metadataKey].push(metadata)
+ }
+ }
+
+ }
+}
+
+/**
+ * Calculates the average length of each field across all indexed documents.
+ *
+ * @private
+ */
+lunr.Builder.prototype.calculateAverageFieldLengths = function () {
+
+ var fieldRefs = Object.keys(this.fieldLengths),
+ numberOfFields = fieldRefs.length,
+ accumulator = {},
+ documentsWithField = {}
+
+ for (var i = 0; i < numberOfFields; i++) {
+ var fieldRef = lunr.FieldRef.fromString(fieldRefs[i]),
+ field = fieldRef.fieldName
+
+ documentsWithField[field] || (documentsWithField[field] = 0)
+ documentsWithField[field] += 1
+
+ accumulator[field] || (accumulator[field] = 0)
+ accumulator[field] += this.fieldLengths[fieldRef]
+ }
+
+ var fields = Object.keys(this._fields)
+
+ for (var i = 0; i < fields.length; i++) {
+ var fieldName = fields[i]
+ accumulator[fieldName] = accumulator[fieldName] / documentsWithField[fieldName]
+ }
+
+ this.averageFieldLength = accumulator
+}
+
+/**
+ * Builds a vector space model of every document using lunr.Vector
+ *
+ * @private
+ */
+lunr.Builder.prototype.createFieldVectors = function () {
+ var fieldVectors = {},
+ fieldRefs = Object.keys(this.fieldTermFrequencies),
+ fieldRefsLength = fieldRefs.length,
+ termIdfCache = Object.create(null)
+
+ for (var i = 0; i < fieldRefsLength; i++) {
+ var fieldRef = lunr.FieldRef.fromString(fieldRefs[i]),
+ fieldName = fieldRef.fieldName,
+ fieldLength = this.fieldLengths[fieldRef],
+ fieldVector = new lunr.Vector,
+ termFrequencies = this.fieldTermFrequencies[fieldRef],
+ terms = Object.keys(termFrequencies),
+ termsLength = terms.length
+
+
+ var fieldBoost = this._fields[fieldName].boost || 1,
+ docBoost = this._documents[fieldRef.docRef].boost || 1
+
+ for (var j = 0; j < termsLength; j++) {
+ var term = terms[j],
+ tf = termFrequencies[term],
+ termIndex = this.invertedIndex[term]._index,
+ idf, score, scoreWithPrecision
+
+ if (termIdfCache[term] === undefined) {
+ idf = lunr.idf(this.invertedIndex[term], this.documentCount)
+ termIdfCache[term] = idf
+ } else {
+ idf = termIdfCache[term]
+ }
+
+ score = idf * ((this._k1 + 1) * tf) / (this._k1 * (1 - this._b + this._b * (fieldLength / this.averageFieldLength[fieldName])) + tf)
+ score *= fieldBoost
+ score *= docBoost
+ scoreWithPrecision = Math.round(score * 1000) / 1000
+ // Converts 1.23456789 to 1.235.
+ // Reducing the precision so that the vectors take up less
+ // space when serialised. Doing it now so that they behave
+ // the same before and after serialisation. Also, this is
+ // the fastest approach to reducing a number's precision in
+ // JavaScript.
+
+ fieldVector.insert(termIndex, scoreWithPrecision)
+ }
+
+ fieldVectors[fieldRef] = fieldVector
+ }
+
+ this.fieldVectors = fieldVectors
+}
+
+/**
+ * Creates a token set of all tokens in the index using lunr.TokenSet
+ *
+ * @private
+ */
+lunr.Builder.prototype.createTokenSet = function () {
+ this.tokenSet = lunr.TokenSet.fromArray(
+ Object.keys(this.invertedIndex).sort()
+ )
+}
+
+/**
+ * Builds the index, creating an instance of lunr.Index.
+ *
+ * This completes the indexing process and should only be called
+ * once all documents have been added to the index.
+ *
+ * @returns {lunr.Index}
+ */
+lunr.Builder.prototype.build = function () {
+ this.calculateAverageFieldLengths()
+ this.createFieldVectors()
+ this.createTokenSet()
+
+ return new lunr.Index({
+ invertedIndex: this.invertedIndex,
+ fieldVectors: this.fieldVectors,
+ tokenSet: this.tokenSet,
+ fields: Object.keys(this._fields),
+ pipeline: this.searchPipeline
+ })
+}
+
+/**
+ * Applies a plugin to the index builder.
+ *
+ * A plugin is a function that is called with the index builder as its context.
+ * Plugins can be used to customise or extend the behaviour of the index
+ * in some way. A plugin is just a function that encapsulates the custom
+ * behaviour that should be applied when building the index.
+ *
+ * The plugin function will be called with the index builder as its first argument;
+ * additional arguments can also be passed when calling use. The function will be
+ * called with the index builder as its context.
+ *
+ * @param {Function} fn - The plugin to apply.
+ */
+lunr.Builder.prototype.use = function (fn) {
+ var args = Array.prototype.slice.call(arguments, 1)
+ args.unshift(this)
+ fn.apply(this, args)
+}
+/**
+ * Contains and collects metadata about a matching document.
+ * A single instance of lunr.MatchData is returned as part of every
+ * lunr.Index~Result.
+ *
+ * @constructor
+ * @param {string} term - The term this match data is associated with
+ * @param {string} field - The field in which the term was found
+ * @param {object} metadata - The metadata recorded about this term in this field
+ * @property {object} metadata - A cloned collection of metadata associated with this document.
+ * @see {@link lunr.Index~Result}
+ */
+lunr.MatchData = function (term, field, metadata) {
+ var clonedMetadata = Object.create(null),
+ metadataKeys = Object.keys(metadata || {})
+
+ // Cloning the metadata to prevent the original
+ // being mutated during match data combination.
+ // Metadata is kept in an array within the inverted
+ // index so cloning the data can be done with
+ // Array#slice
+ for (var i = 0; i < metadataKeys.length; i++) {
+ var key = metadataKeys[i]
+ clonedMetadata[key] = metadata[key].slice()
+ }
+
+ this.metadata = Object.create(null)
+
+ if (term !== undefined) {
+ this.metadata[term] = Object.create(null)
+ this.metadata[term][field] = clonedMetadata
+ }
+}
+
+/**
+ * An instance of lunr.MatchData will be created for every term that matches a
+ * document. However only one instance is required in a lunr.Index~Result. This
+ * method combines metadata from another instance of lunr.MatchData with this
+ * object's metadata.
+ *
+ * @param {lunr.MatchData} otherMatchData - Another instance of match data to merge with this one.
+ * @see {@link lunr.Index~Result}
+ */
+lunr.MatchData.prototype.combine = function (otherMatchData) {
+ var terms = Object.keys(otherMatchData.metadata)
+
+ for (var i = 0; i < terms.length; i++) {
+ var term = terms[i],
+ fields = Object.keys(otherMatchData.metadata[term])
+
+ if (this.metadata[term] == undefined) {
+ this.metadata[term] = Object.create(null)
+ }
+
+ for (var j = 0; j < fields.length; j++) {
+ var field = fields[j],
+ keys = Object.keys(otherMatchData.metadata[term][field])
+
+ if (this.metadata[term][field] == undefined) {
+ this.metadata[term][field] = Object.create(null)
+ }
+
+ for (var k = 0; k < keys.length; k++) {
+ var key = keys[k]
+
+ if (this.metadata[term][field][key] == undefined) {
+ this.metadata[term][field][key] = otherMatchData.metadata[term][field][key]
+ } else {
+ this.metadata[term][field][key] = this.metadata[term][field][key].concat(otherMatchData.metadata[term][field][key])
+ }
+
+ }
+ }
+ }
+}
+
+/**
+ * Add metadata for a term/field pair to this instance of match data.
+ *
+ * @param {string} term - The term this match data is associated with
+ * @param {string} field - The field in which the term was found
+ * @param {object} metadata - The metadata recorded about this term in this field
+ */
+lunr.MatchData.prototype.add = function (term, field, metadata) {
+ if (!(term in this.metadata)) {
+ this.metadata[term] = Object.create(null)
+ this.metadata[term][field] = metadata
+ return
+ }
+
+ if (!(field in this.metadata[term])) {
+ this.metadata[term][field] = metadata
+ return
+ }
+
+ var metadataKeys = Object.keys(metadata)
+
+ for (var i = 0; i < metadataKeys.length; i++) {
+ var key = metadataKeys[i]
+
+ if (key in this.metadata[term][field]) {
+ this.metadata[term][field][key] = this.metadata[term][field][key].concat(metadata[key])
+ } else {
+ this.metadata[term][field][key] = metadata[key]
+ }
+ }
+}
+/**
+ * A lunr.Query provides a programmatic way of defining queries to be performed
+ * against a {@link lunr.Index}.
+ *
+ * Prefer constructing a lunr.Query using the {@link lunr.Index#query} method
+ * so the query object is pre-initialized with the right index fields.
+ *
+ * @constructor
+ * @property {lunr.Query~Clause[]} clauses - An array of query clauses.
+ * @property {string[]} allFields - An array of all available fields in a lunr.Index.
+ */
+lunr.Query = function (allFields) {
+ this.clauses = []
+ this.allFields = allFields
+}
+
+/**
+ * Constants for indicating what kind of automatic wildcard insertion will be used when constructing a query clause.
+ *
+ * This allows wildcards to be added to the beginning and end of a term without having to manually do any string
+ * concatenation.
+ *
+ * The wildcard constants can be bitwise combined to select both leading and trailing wildcards.
+ *
+ * @constant
+ * @default
+ * @property {number} wildcard.NONE - The term will have no wildcards inserted, this is the default behaviour
+ * @property {number} wildcard.LEADING - Prepend the term with a wildcard, unless a leading wildcard already exists
+ * @property {number} wildcard.TRAILING - Append a wildcard to the term, unless a trailing wildcard already exists
+ * @see lunr.Query~Clause
+ * @see lunr.Query#clause
+ * @see lunr.Query#term
+ * @example
+ * query.term('foo', {
+ * wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
+ * })
+ */
+
+lunr.Query.wildcard = new String ("*")
+lunr.Query.wildcard.NONE = 0
+lunr.Query.wildcard.LEADING = 1
+lunr.Query.wildcard.TRAILING = 2
+
+/**
+ * Constants for indicating what kind of presence a term must have in matching documents.
+ *
+ * @constant
+ * @enum {number}
+ * @see lunr.Query~Clause
+ * @see lunr.Query#clause
+ * @see lunr.Query#term
+ * @example <caption>query term with required presence</caption>
+ * query.term('foo', { presence: lunr.Query.presence.REQUIRED })
+ */
+lunr.Query.presence = {
+ /**
+ * Term's presence in a document is optional, this is the default value.
+ */
+ OPTIONAL: 1,
+
+ /**
+ * Term's presence in a document is required, documents that do not contain
+ * this term will not be returned.
+ */
+ REQUIRED: 2,
+
+ /**
+ * Term's presence in a document is prohibited, documents that do contain
+ * this term will not be returned.
+ */
+ PROHIBITED: 3
+}
+
+/**
+ * A single clause in a {@link lunr.Query} contains a term and details on how to
+ * match that term against a {@link lunr.Index}.
+ *
+ * @typedef {Object} lunr.Query~Clause
+ * @property {string[]} fields - The fields in an index this clause should be matched against.
+ * @property {number} [boost=1] - Any boost that should be applied when matching this clause.
+ * @property {number} [editDistance] - Whether the term should have fuzzy matching applied, and how fuzzy the match should be.
+ * @property {boolean} [usePipeline] - Whether the term should be passed through the search pipeline.
+ * @property {number} [wildcard=lunr.Query.wildcard.NONE] - Whether the term should have wildcards appended or prepended.
+ * @property {number} [presence=lunr.Query.presence.OPTIONAL] - The term's presence in any matching documents.
+ */
+
+/**
+ * Adds a {@link lunr.Query~Clause} to this query.
+ *
+ * Unless the clause contains the fields to be matched all fields will be matched. In addition
+ * a default boost of 1 is applied to the clause.
+ *
+ * @param {lunr.Query~Clause} clause - The clause to add to this query.
+ * @see lunr.Query~Clause
+ * @returns {lunr.Query}
+ */
+lunr.Query.prototype.clause = function (clause) {
+ if (!('fields' in clause)) {
+ clause.fields = this.allFields
+ }
+
+ if (!('boost' in clause)) {
+ clause.boost = 1
+ }
+
+ if (!('usePipeline' in clause)) {
+ clause.usePipeline = true
+ }
+
+ if (!('wildcard' in clause)) {
+ clause.wildcard = lunr.Query.wildcard.NONE
+ }
+
+ if ((clause.wildcard & lunr.Query.wildcard.LEADING) && (clause.term.charAt(0) != lunr.Query.wildcard)) {
+ clause.term = "*" + clause.term
+ }
+
+ if ((clause.wildcard & lunr.Query.wildcard.TRAILING) && (clause.term.slice(-1) != lunr.Query.wildcard)) {
+ clause.term = "" + clause.term + "*"
+ }
+
+ if (!('presence' in clause)) {
+ clause.presence = lunr.Query.presence.OPTIONAL
+ }
+
+ this.clauses.push(clause)
+
+ return this
+}
+
+/**
+ * A negated query is one in which every clause has a presence of
+ * prohibited. These queries require some special processing to return
+ * the expected results.
+ *
+ * @returns {boolean}
+ */
+lunr.Query.prototype.isNegated = function () {
+ for (var i = 0; i < this.clauses.length; i++) {
+ if (this.clauses[i].presence != lunr.Query.presence.PROHIBITED) {
+ return false
+ }
+ }
+
+ return true
+}
+
+/**
+ * Adds a term to the current query. Under the covers this will add a {@link lunr.Query~Clause}
+ * to the list of clauses that make up this query.
+ *
+ * The term is used as is, i.e. no tokenization will be performed by this method. Instead conversion
+ * to a token or token-like string should be done before calling this method.
+ *
+ * The term will be converted to a string by calling `toString`. Multiple terms can be passed as an
+ * array, each term in the array will share the same options.
+ *
+ * @param {object|object[]} term - The term(s) to add to the query.
+ * @param {object} [options] - Any additional properties to add to the query clause.
+ * @returns {lunr.Query}
+ * @see lunr.Query#clause
+ * @see lunr.Query~Clause
+ * @example <caption>adding a single term to a query</caption>
+ * query.term("foo")
+ * @example <caption>adding a single term to a query and specifying search fields, term boost and automatic trailing wildcard</caption>
';
+}
+
+function displayResults (results) {
+ var search_results = document.getElementById("mkdocs-search-results");
+ while (search_results.firstChild) {
+ search_results.removeChild(search_results.firstChild);
+ }
+ if (results.length > 0){
+ for (var i=0; i < results.length; i++){
+ var result = results[i];
+ var html = formatResult(result.location, result.title, result.summary);
+ search_results.insertAdjacentHTML('beforeend', html);
+ }
+ } else {
+ var noResultsText = search_results.getAttribute('data-no-results-text');
+ if (!noResultsText) {
+ noResultsText = "No results found";
+ }
+ search_results.insertAdjacentHTML('beforeend', '<p>' + noResultsText + '</p>');
+ }
+}
+
+function doSearch () {
+ var query = document.getElementById('mkdocs-search-query').value;
+ if (query.length > min_search_length) {
+ if (!window.Worker) {
+ displayResults(search(query));
+ } else {
+ searchWorker.postMessage({query: query});
+ }
+ } else {
+ // Clear results for short queries
+ displayResults([]);
+ }
+}
+
+function initSearch () {
+ var search_input = document.getElementById('mkdocs-search-query');
+ if (search_input) {
+ search_input.addEventListener("keyup", doSearch);
+ }
+ var term = getSearchTermFromLocation();
+ if (term) {
+ search_input.value = term;
+ doSearch();
+ }
+}
+
+function onWorkerMessage (e) {
+ if (e.data.allowSearch) {
+ initSearch();
+ } else if (e.data.results) {
+ var results = e.data.results;
+ displayResults(results);
+ } else if (e.data.config) {
+ min_search_length = e.data.config.min_search_length-1;
+ }
+}
+
+if (!window.Worker) {
+ console.log('Web Worker API not supported');
+ // load index in main thread
+ $.getScript(joinUrl(base_url, "search/worker.js")).done(function () {
+ console.log('Loaded worker');
+ init();
+ window.postMessage = function (msg) {
+ onWorkerMessage({data: msg});
+ };
+ }).fail(function (jqxhr, settings, exception) {
+ console.error('Could not load worker.js');
+ });
+} else {
+ // Wrap search in a web worker
+ var searchWorker = new Worker(joinUrl(base_url, "search/worker.js"));
+ searchWorker.postMessage({init: true});
+ searchWorker.onmessage = onWorkerMessage;
+}
diff --git a/search/search_index.json b/search/search_index.json
new file mode 100644
index 0000000..fa6af85
--- /dev/null
+++ b/search/search_index.json
@@ -0,0 +1 @@
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Spark Advanced Topics Working Group Documentation Welcome to the Spark Advanced Topics working group documentation. This documentation is in the early stages. We have been working on a flowchart to help you solve your current problems. The documentation is collected under \"details\" (see above). Other resources Some other resources that may be useful include High Performance Spark by Holden Karau and Rachel Warren (note: some bias as a co-author), as well as the excellent on-line The Internals of Apache Spark and The Internals of Spark SQL by Jacek Laskowski.","title":"Spark Advanced Topics Working Group Documentation"},{"location":"#spark-advanced-topics-working-group-documentation","text":"Welcome to the Spark Advanced Topics working group documentation. This documentation is in the early stages. We have been working on a flowchart to help you solve your current problems. The documentation is collected under \"details\" (see above).","title":"Spark Advanced Topics Working Group Documentation"},{"location":"#other-resources","text":"Some other resources that may be useful include High Performance Spark by Holden Karau and Rachel Warren (note: some bias as a co-author), as well as the excellent on-line The Internals of Apache Spark and The Internals of Spark SQL by Jacek Laskowski.","title":"Other resources"},{"location":"details/best-pratice-collect/","text":"Bringing too much data back to the driver (collect and friends) A common anti-pattern in Apache Spark is using collect() and then processing records on the driver. There are a few different reasons why folks tend to do this and we can work through some alternatives: Label items in ascending order ZipWithIndex Index items in order Compute the size of each partition use this to assign indexes. 
In order processing Compute a partition at a time (this is annoying to do, sorry). Writing out to a format not supported by Spark Use foreachPartition or implement your own DataSink. Need to aggregate everything into a single record Call reduce or treeReduce Sometimes you really do need to bring the data back to the driver for some reason (e.g., updating model weights). In those cases, especially if you process the data sequentially, you can limit the amount of data coming back to the driver at one time. toLocalIterator gives you back an iterator which will only need to fetch a partition at a time (although in Python this may be pipelined for efficiency). By default toLocalIterator will launch a Spark job for each partition, so if you know you will eventually need all of the data it makes sense to do a persist + a count (async or otherwise) so you don't block as long between partitions. This doesn't mean every call to collect() is bad; if the amount of data being returned is under ~1 GB, it's probably OK, although it will limit parallelism.","title":"Bringing too much data back to the driver (collect and friends)"},{"location":"details/best-pratice-collect/#bringing-too-much-data-back-to-the-driver-collect-and-friends","text":"A common anti-pattern in Apache Spark is using collect() and then processing records on the driver. There are a few different reasons why folks tend to do this and we can work through some alternatives: Label items in ascending order ZipWithIndex Index items in order Compute the size of each partition and use this to assign indexes. In order processing Compute a partition at a time (this is annoying to do, sorry). Writing out to a format not supported by Spark Use foreachPartition or implement your own DataSink. Need to aggregate everything into a single record Call reduce or treeReduce Sometimes you really do need to bring the data back to the driver for some reason (e.g., updating model weights). 
In those cases, especially if you process the data sequentially, you can limit the amount of data coming back to the driver at one time. toLocalIterator gives you back an iterator which will only need to fetch a partition at a time (although in Python this may be pipelined for efficiency). By default toLocalIterator will launch a Spark job for each partition, so if you know you will eventually need all of the data it makes sense to do a persist + a count (async or otherwise) so you don't block as long between partitions. This doesn't mean every call to collect() is bad; if the amount of data being returned is under ~1 GB, it's probably OK, although it will limit parallelism.","title":"Bringing too much data back to the driver (collect and friends)"},{"location":"details/big-broadcast-join/","text":"Too big broadcast joins Beware that broadcast joins put unnecessary pressure on the driver. Before the tables are broadcast to all the executors, the data is brought back to the driver and then broadcast to the executors. So you might run into driver OOMs. Broadcast only smaller tables; this is usually recommended for tables under 10 MB. Although that is mostly the default, we can comfortably broadcast much larger datasets as long as they fit in the executor and driver memories. Remember that if there are multiple broadcast joins in the same stage, you need to have enough room for all those datasets in memory. You can configure the broadcast threshold using spark.sql.autoBroadcastJoinThreshold or increase the driver memory by setting spark.driver.memory to a higher value. Make sure that you have more memory on your driver than the sum of all your broadcasted data in any stage plus all the other overheads that the driver deals with!","title":"Too big broadcast joins"},{"location":"details/big-broadcast-join/#too-big-broadcast-joins","text":"Beware that broadcast joins put unnecessary pressure on the driver. 
Before the tables are broadcasted to all the executors, the data is brought back to the driver and then broadcasted to executors. So you might run into driver OOMs. Broadcast smaller tables but this is usually recommended for < 10 Mb tables. Although that is mostly the default, we can comfortably broadcast much larger datasets as long as they fit in the executor and driver memories. Remember if there are multiple broadcast joins in the same stage, you need to have enough room for all those datasets in memory. You can configure the broadcast threshold using spark.sql.autoBroadcastJoinThreshold or increase the driver memory by setting spark.driver.memory to a higher value Make sure that you need more memory on your driver than the sum of all your broadcasted data in any stage plus all the other overheads that the driver deals with!","title":"Too big broadcast joins"},{"location":"details/broadcast-with-disable/","text":"Tables getting broadcasted even when broadcast is disabled You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Spark tries to broadcast the bigger table and fails with a broadcast error. And you observe that the query plan has BroadcastNestedLoopJoin in the physical plan. Check for sub queries in your code using NOT IN Example : select * from TableA where id not in (select id from TableB) This typically results in a forced BroadcastNestedLoopJoin even when the broadcast setting is disabled. 
If the data being processed is large enough, this results in broadcast errors when Spark attempts to broadcast the table Rewrite query using not exists or a regular LEFT JOIN instead of not in Example: select * from TableA where not exists (select 1 from TableB where TableA.id = TableB.id) The query will use SortMergeJoin and will resolve any Driver memory errors because of forced broadcasts Relevant links External Resource","title":"Tables getting broadcasted even when broadcast is disabled"},{"location":"details/broadcast-with-disable/#tables-getting-broadcasted-even-when-broadcast-is-disabled","text":"You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Spark tries to broadcast the bigger table and fails with a broadcast error. And you observe that the query plan has BroadcastNestedLoopJoin in the physical plan. Check for sub queries in your code using NOT IN Example : select * from TableA where id not in (select id from TableB) This typically results in a forced BroadcastNestedLoopJoin even when the broadcast setting is disabled. If the data being processed is large enough, this results in broadcast errors when Spark attempts to broadcast the table Rewrite query using not exists or a regular LEFT JOIN instead of not in Example: select * from TableA where not exists (select 1 from TableB where TableA.id = TableB.id) The query will use SortMergeJoin and will resolve any Driver memory errors because of forced broadcasts","title":"Tables getting broadcasted even when broadcast is disabled"},{"location":"details/broadcast-with-disable/#relevant-links","text":"External Resource","title":"Relevant links"},{"location":"details/class-or-method-not-found/","text":"Class or method not found When your compile-time class path differs from the runtime class path, you may encounter errors that signal that a class or method could not be found (e.g., NoClassDefFoundError, NoSuchMethodError). 
java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.subTypeValidator(Lcom/fasterxml/jackson/databind/cfg/MapperConfig;)Lcom/fasterxml/jackson/databind/jsontype/PolymorphicTypeValidator; at com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.buildTypeDeserializer(AvroTypeResolverBuilder.java:43) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findTypeDeserializer(BasicDeserializerFactory.java:1598) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findPropertyContentTypeDeserializer(BasicDeserializerFactory.java:1766) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.resolveMemberAndTypeAnnotations(BasicDeserializerFactory.java:2092) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.constructCreatorProperty(BasicDeserializerFactory.java:1069) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addExplicitPropertyCreator(BasicDeserializerFactory.java:703) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:476) ... This may be due to packaging a fat JAR with dependency versions that are in conflict with those provided by the Spark environment. When there are multiple versions of the same library in the runtime class path under the same package, Java's class loader hierarchy kicks in, which can lead to unintended behaviors. There are a few options to get around this. Identify the version of the problematic library within your Spark environment and pin the dependency to that version in your build file. To identify the version used in your Spark environment, in the Spark UI go to the Environment tab, scroll down to Classpath Entries, and find the corresponding library. Exclude the transient dependency of the problematic library from imported libraries in your build file. Shade the problematic library under a different package. 
If options (1) and (2) result in more dependency conflicts, it may be that the version of the problematic library in the Spark environment is incompatible with your application code. Therefore, it makes sense to shade the problematic library so that your application can run with a version of the library isolated from the rest of the Spark environment. If you are using the shadow plugin in Gradle, you can shade using: shadowJar { ... relocate 'com.fasterxml.jackson', 'shaded.fasterxml.jackson' } In this example, Jackson libraries used by your application will be available in the shaded.fasterxml.jackson package at runtime.","title":"Class or method not found"},{"location":"details/class-or-method-not-found/#class-or-method-not-found","text":"When your compile-time class path differs from the runtime class path, you may encounter errors that signal that a class or method could not be found (e.g., NoClassDefFoundError, NoSuchMethodError). java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.subTypeValidator(Lcom/fasterxml/jackson/databind/cfg/MapperConfig;)Lcom/fasterxml/jackson/databind/jsontype/PolymorphicTypeValidator; at com.fasterxml.jackson.dataformat.avro.AvroTypeResolverBuilder.buildTypeDeserializer(AvroTypeResolverBuilder.java:43) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findTypeDeserializer(BasicDeserializerFactory.java:1598) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findPropertyContentTypeDeserializer(BasicDeserializerFactory.java:1766) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.resolveMemberAndTypeAnnotations(BasicDeserializerFactory.java:2092) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.constructCreatorProperty(BasicDeserializerFactory.java:1069) at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addExplicitPropertyCreator(BasicDeserializerFactory.java:703) at 
com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:476) ... This may be due to packaging a fat JAR with dependency versions that are in conflict with those provided by the Spark environment. When there are multiple versions of the same library in the runtime class path under the same package, Java's class loader hierarchy kicks in, which can lead to unintended behaviors. There are a few options to get around this. Identify the version of the problematic library within your Spark environment and pin the dependency to that version in your build file. To identify the version used in your Spark environment, in the Spark UI go to the Environment tab, scroll down to Classpath Entries, and find the corresponding library. Exclude the transient dependency of the problematic library from imported libraries in your build file. Shade the problematic library under a different package. If options (1) and (2) result in more dependency conflicts, it may be that the version of the problematic library in the Spark environment is incompatible with your application code. Therefore, it makes sense to shade the problematic library so that your application can run with a version of the library isolated from the rest of the Spark environment. If you are using the shadow plugin in Gradle, you can shade using: shadowJar { ... relocate 'com.fasterxml.jackson', 'shaded.fasterxml.jackson' } In this example, Jackson libraries used by your application will be available in the shaded.fasterxml.jackson package at runtime.","title":"Class or method not found"},{"location":"details/container-oom/","text":"Container OOMs Container OOMs can be difficult to debug as the container running the problematic code is killed, and sometimes not all of the log information is available. Non-JVM language users (such as Python) are most likely to encounter issues with container OOMs. 
This is because the JVM is generally configured to not use more memory than the container it is running in. Everything that isn't inside the JVM is considered \"overhead\", including TensorFlow, Python, bash, etc. A first step with a container OOM is often increasing spark.executor.memoryOverhead and spark.driver.memoryOverhead to leave more memory for non-Java processes. Python users can set spark.executor.pyspark.memory to limit the Python VM to a certain amount of memory. This amount of memory is then added to the overhead. Python users performing aggregations in Python should also check out the PyUDFOOM page.","title":"Container OOMs"},{"location":"details/container-oom/#container-ooms","text":"Container OOMs can be difficult to debug as the container running the problematic code is killed, and sometimes not all of the log information is available. Non-JVM language users (such as Python) are most likely to encounter issues with container OOMs. This is because the JVM is generally configured to not use more memory than the container it is running in. Everything that isn't inside the JVM is considered \"overhead\", including TensorFlow, Python, bash, etc. A first step with a container OOM is often increasing spark.executor.memoryOverhead and spark.driver.memoryOverhead to leave more memory for non-Java processes. Python users can set spark.executor.pyspark.memory to limit the Python VM to a certain amount of memory. This amount of memory is then added to the overhead. Python users performing aggregations in Python should also check out the PyUDFOOM page.","title":"Container OOMs"},{"location":"details/correlated-column-not-allowed/","text":"spark.sql.AnalysisException: Correlated column is not allowed in predicate SPARK-35080 introduces a check for correlated subqueries with aggregates which may have previously returned incorrect results. Instead, starting in Spark 2.4.8, these queries will raise an org.apache.spark.sql.AnalysisException exception. 
One of the examples of this ( from the JIRA ) is: create or replace view t1(c) as values ('a'), ('b'); create or replace view t2(c) as values ('ab'), ('abc'), ('bc'); select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1; Instead you should do an explicit join and then perform your aggregation: create or replace view t1(c) as values ('a'), ('b'); create or replace view t2(c) as values ('ab'), ('abc'), ('bc'); create or replace view t3 as select t1.c from t2 INNER JOIN t1 ON t1.c = substring(t2.c, 1, 1); select c, count(*) from t3 group by c; Similarly: create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3); create or replace view t2(c) as values (6); select c, (select count(*) from t1 where a + b = c) from t2; Can be rewritten as: create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3); create or replace view t2(c) as values (6); create or replace view t3 as select t2.c from t2 INNER JOIN t1 ON t2.c = t1.a + t1.b; select c, count(*) from t3 group by c; Likewise in Scala and Python use an explicit .join and then perform your aggregation on the joined result. Now Spark can compute correct results thus avoiding the exception. Relevant links: SPARK-35080 JIRA Stackoverflow discussion for PySpark workaround of Correlated Column","title":"spark.sql.AnalysisException: Correlated column is not allowed in predicate"},{"location":"details/correlated-column-not-allowed/#sparksqlanalysisexception-correlated-column-is-not-allowed-in-predicate","text":"SPARK-35080 introduces a check for correlated subqueries with aggregates which may have previously return incorect results. Instead, starting in Spark 2.4.8, these queries will raise an org.apache.spark.sql.AnalysisException exception. 
One example of this (from the JIRA) is: create or replace view t1(c) as values ('a'), ('b'); create or replace view t2(c) as values ('ab'), ('abc'), ('bc'); select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1; Instead you should do an explicit join and then perform your aggregation: create or replace view t1(c) as values ('a'), ('b'); create or replace view t2(c) as values ('ab'), ('abc'), ('bc'); create or replace view t3 as select t1.c from t2 INNER JOIN t1 ON t1.c = substring(t2.c, 1, 1); select c, count(*) from t3 group by c; Similarly: create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3); create or replace view t2(c) as values (6); select c, (select count(*) from t1 where a + b = c) from t2; can be rewritten as: create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3); create or replace view t2(c) as values (6); create or replace view t3 as select t2.c from t2 INNER JOIN t1 ON t2.c = t1.a + t1.b; select c, count(*) from t3 group by c; Likewise, in Scala and Python, use an explicit .join and then perform your aggregation on the joined result. Spark can then compute correct results, avoiding the exception.","title":"spark.sql.AnalysisException: Correlated column is not allowed in predicate"},{"location":"details/correlated-column-not-allowed/#relevant-links","text":"SPARK-35080 JIRA Stack Overflow discussion for PySpark workaround of Correlated Column","title":"Relevant links:"},{"location":"details/driver-max-result-size/","text":"Result size larger than spark.driver.maxResultSize error OR Kryo serialization failed: Buffer overflow. ex: You typically run into this error for one of the following reasons. You are sending a large result set to the driver using SELECT (in SQL) or COLLECT (in dataframes/dataset/RDD): Apply a limit if your intention is to spot check a few rows, as you won't be able to go through the full set of rows if you have a really high number of rows. 
Writing the results to a temporary table in your schema and querying the new table would be an alternative if you need to query the results multiple times with a specific set of filters. You are broadcasting a table that is too big. Spark downloads all the rows for a table that needs to be broadcasted to the driver before it starts shipping to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table so Spark would use a shuffle hash or sort-merge join. You have a sort in your SQL/Dataframe: Spark internally uses range-partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. This error can further fall into one of the below scenarios. a. You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending bytes larger than spark.driver.maxResultSize. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size is causing an imbalance in partition ranges (for example, skew in a subsequent stage or non-uniform output files). If using the latter option, be sure spark.driver.maxResultSize is less than spark.driver.memory. b. You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, your total bytes from the number of rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. 
A recommended way to resolve this issue is by combining the splits for the table (increase spark.(path).(db).(table).target-size) with high map tasks. Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver as it needs to keep track of metadata for all these tasks/partitions. External resources: - Apache Spark job fails with maxResultSize exception","title":"Result size larger than spark.driver.maxResultSize error OR Kryo serialization failed: Buffer overflow."},{"location":"details/driver-max-result-size/#result-size-larger-than-sparkdrivermaxresultsize-error-or-kryo-serialization-failed-buffer-overflow","text":"ex: You typically run into this error for one of the following reasons. You are sending a large result set to the driver using SELECT (in SQL) or COLLECT (in dataframes/dataset/RDD): Apply a limit if your intention is to spot check a few rows, as you won't be able to go through the full set of rows if you have a really high number of rows. Writing the results to a temporary table in your schema and querying the new table would be an alternative if you need to query the results multiple times with a specific set of filters. You are broadcasting a table that is too big. Spark downloads all the rows for a table that needs to be broadcasted to the driver before it starts shipping to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table so Spark would use a shuffle hash or sort-merge join. You have a sort in your SQL/Dataframe: Spark internally uses range-partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. This error can further fall into one of the below scenarios. a. 
You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending bytes larger than spark.driver.maxResultSize. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size is causing an imbalance in partition ranges (for example, skew in a subsequent stage or non-uniform output files). If using the latter option, be sure spark.driver.maxResultSize is less than spark.driver.memory. b. You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, your total bytes from the number of rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. A recommended way to resolve this issue is by combining the splits for the table (increase spark.(path).(db).(table).target-size) with high map tasks. Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver as it needs to keep track of metadata for all these tasks/partitions. External resources: - Apache Spark job fails with maxResultSize exception","title":"Result size larger than spark.driver.maxResultSize error OR Kryo serialization failed: Buffer overflow."},{"location":"details/error-driver-max-result-size/","text":"Result size larger than spark.driver.maxResultSize error ex: You typically run into this error for one of the following reasons. You are sending a large result set to the driver using SELECT (in SQL) or COLLECT (in dataframes/dataset/RDD): Apply a limit if your intention is to spot check a few rows, as you won't be able to go through the full set of rows if you have a really high number of rows. 
Writing the results to a temporary table in your schema and querying the new table would be an alternative if you need to query the results multiple times with a specific set of filters. ( Collect best practices ) You are broadcasting a table that is too big. Spark downloads all the rows for a table that needs to be broadcasted to the driver before it starts shipping to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table so Spark would use a shuffle hash or sort-merge join. Note that Spark broadcasts a table referenced in a join if the size of the table is less than spark.sql.autoBroadcastJoinThreshold (100 MB by default at Netflix). You can change this config to include larger tables in broadcast or reduce the threshold if you want to exclude certain tables. You can also set this to -1 if you want to disable broadcast joins. You have a sort in your SQL/Dataframe: Spark internally uses range-partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. This error can further fall into one of the below scenarios. a. You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending bytes larger than spark.driver.maxResultSize. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size is causing an imbalance in partition ranges (for example, skew in a subsequent stage or non-uniform output files). b. 
You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, your total bytes from the number of rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. A recommended way to resolve this issue is by combining the splits for the table (increase spark.netflix.(db).(table).target-size) with high map tasks. Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver as it needs to keep track of metadata for all these tasks/partitions. Broadcast join related articles SQL Broadcast Join Hints Tables getting broadcasted even when broadcast is disabled","title":"Result size larger than spark.driver.maxResultSize error"},{"location":"details/error-driver-max-result-size/#result-size-larger-than-sparkdrivermaxresultssize-error","text":"ex: You typically run into this error for one of the following reasons. You are sending a large result set to the driver using SELECT (in SQL) or COLLECT (in dataframes/dataset/RDD): Apply a limit if your intention is to spot check a few rows, as you won't be able to go through the full set of rows if you have a really high number of rows. Writing the results to a temporary table in your schema and querying the new table would be an alternative if you need to query the results multiple times with a specific set of filters. ( Collect best practices ) You are broadcasting a table that is too big. Spark downloads all the rows for a table that needs to be broadcasted to the driver before it starts shipping to the executors. So if you are broadcasting a table that is larger than spark.driver.maxResultSize, you will run into this error. You can overcome this by either increasing spark.driver.maxResultSize or not broadcasting the table so Spark would use a shuffle hash or sort-merge join. 
Note that Spark broadcasts a table referenced in a join if the size of the table is less than spark.sql.autoBroadcastJoinThreshold (100 MB by default at Netflix). You can change this config to include larger tables in broadcast or reduce the threshold if you want to exclude certain tables. You can also set this to -1 if you want to disable broadcast joins. You have a sort in your SQL/Dataframe: Spark internally uses range-partitioning to assign sort keys to a partition range. This involves collecting sample rows (reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. This error can further fall into one of the below scenarios. a. You have wide/bloated rows in your table: In this case, you are not sending a lot of rows to the driver, but you are sending bytes larger than spark.driver.maxResultSize. The recommendation here is to lower the default sample size by setting the Spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to something lower than 20. You can also increase spark.driver.maxResultSize if lowering the sample size is causing an imbalance in partition ranges (for example, skew in a subsequent stage or non-uniform output files). b. You have too many Spark partitions from the previous stage: In this case, you have a large number of map tasks while reading from a table. Since Spark has to collect sample rows from every partition, your total bytes from the number of rows (partitions*sampleSize) could be larger than spark.driver.maxResultSize. A recommended way to resolve this issue is by combining the splits for the table (increase spark.netflix.(db).(table).target-size) with high map tasks. 
Note that having a large number of map tasks (>80k) will cause other OOM issues on the driver as it needs to keep track of metadata for all these tasks/partitions.","title":"Result size larger than spark.driver.maxResultSize error"},{"location":"details/error-driver-max-result-size/#broadcast-join-related-articles","text":"SQL Broadcast Join Hints Tables getting broadcasted even when broadcast is disabled","title":"Broadcast join related articles"},{"location":"details/error-driver-out-of-memory/","text":"Driver ran out of memory If you see java.lang.OutOfMemoryError: in the driver log/stderr, it is most likely from the driver JVM running out of memory. This article has the memory config for increasing the driver memory. One reason you could run into this error is if you are reading from a table with too many splits (S3 files) and overwhelming the driver with a lot of metadata. Another cause for driver out of memory errors is when the number of partitions is too high and you trigger a sort or shuffle where Spark samples the data, but then runs out of memory while collecting the sample. To solve this, repartition to a lower number of partitions, or if you're using RDDs, coalesce is a more efficient option (in DataFrames coalesce can have an impact upstream in the query plan). A less common, but still semi-frequent, occurrence of driver out of memory is an excessive number of tasks in the UI. This can be controlled by reducing spark.ui.retainedTasks (default 100k).","title":"Driver ran out of memory"},{"location":"details/error-driver-out-of-memory/#driver-ran-out-of-memory","text":"If you see java.lang.OutOfMemoryError: in the driver log/stderr, it is most likely from the driver JVM running out of memory. This article has the memory config for increasing the driver memory. One reason you could run into this error is if you are reading from a table with too many splits (S3 files) and overwhelming the driver with a lot of metadata. 
Another cause for driver out of memory errors is when the number of partitions is too high and you trigger a sort or shuffle where Spark samples the data, but then runs out of memory while collecting the sample. To solve this, repartition to a lower number of partitions, or if you're using RDDs, coalesce is a more efficient option (in DataFrames coalesce can have an impact upstream in the query plan). A less common, but still semi-frequent, occurrence of driver out of memory is an excessive number of tasks in the UI. This can be controlled by reducing spark.ui.retainedTasks (default 100k).","title":"Driver ran out of memory"},{"location":"details/error-driver-stack-overflow/","text":"Driver ran out of memory Note that it is very rare to run into this error. You may see this error when you are using too many filters (in your sql/dataframe/dataset). A workaround is to increase the Spark driver JVM stack size by setting the config below to something higher than the default: spark.driver.extraJavaOptions: \"-Xss512M\" #Sets the stack size to 512 MB","title":"Driver ran out of memory"},{"location":"details/error-driver-stack-overflow/#driver-ran-out-of-memory","text":"Note that it is very rare to run into this error. You may see this error when you are using too many filters (in your sql/dataframe/dataset). A workaround is to increase the Spark driver JVM stack size by setting the config below to something higher than the default: spark.driver.extraJavaOptions: \"-Xss512M\" #Sets the stack size to 512 MB","title":"Driver ran out of memory"},{"location":"details/error-executor-out-of-disk/","text":"Executor out of disk error By far the most common cause of executor out of disk errors is a misconfiguration of Spark's temporary directories. You should set spark.local.dir to a directory with lots of local storage available. If you are on YARN this will be overridden by the LOCAL_DIRS environment variable on the workers. Kubernetes users may wish to add a large emptyDir for Spark to use for temporary storage. 
Another common cause is having no-longer-needed RDDs/DataFrames/Datasets in scope. This tends to happen more often with notebooks as more things are placed in the global scope where they are not automatically cleaned up. A solution to this is breaking your code into more functions so that things go out of scope, or explicitly setting no-longer-needed RDDs/DataFrames/Datasets to None/null. On the other hand, if you have an iterative algorithm you should investigate whether you may have too big of a DAG.","title":"Executor out of disk error"},{"location":"details/error-executor-out-of-disk/#executor-out-of-disk-error","text":"By far the most common cause of executor out of disk errors is a misconfiguration of Spark's temporary directories. You should set spark.local.dir to a directory with lots of local storage available. If you are on YARN this will be overridden by the LOCAL_DIRS environment variable on the workers. Kubernetes users may wish to add a large emptyDir for Spark to use for temporary storage. Another common cause is having no-longer-needed RDDs/DataFrames/Datasets in scope. This tends to happen more often with notebooks as more things are placed in the global scope where they are not automatically cleaned up. A solution to this is breaking your code into more functions so that things go out of scope, or explicitly setting no-longer-needed RDDs/DataFrames/Datasets to None/null. On the other hand, if you have an iterative algorithm you should investigate whether you may have too big of a DAG.","title":"Executor out of disk error"},{"location":"details/error-executor-out-of-memory/","text":"Executor ran out of memory Executor out of memory issues can come from many sources. To narrow down the cause of the error, there are a few important places to look: the Spark Web UI, the executor log, the driver log, and (if applicable) the cluster manager (e.g. YARN/K8s) log/UI. 
Container OOM If the driver log indicates Container killed by YARN for exceeding memory limits for the applicable executor, or if (on K8s) the Spark UI shows the reason for the executor loss as \"OOMKill\" / exit code 137, then it's likely your program is exceeding the amount of memory assigned to it. This doesn't normally happen with pure JVM code, but instead when calling PySpark or JNI libraries (or using off-heap storage). PySpark users are the most likely to encounter container OOMs. If you have a PySpark UDF in the stage you should check out Python UDF OOM to eliminate that potential cause. Another potential issue to investigate is whether you have key skew, as trying to load too large a partition in Python can result in an OOM. If you are using a library, like TensorFlow, which results in","title":"Executor ran out of memory"},{"location":"details/error-executor-out-of-memory/#executor-ran-out-of-memory","text":"Executor out of memory issues can come from many sources. To narrow down the cause of the error, there are a few important places to look: the Spark Web UI, the executor log, the driver log, and (if applicable) the cluster manager (e.g. YARN/K8s) log/UI.","title":"Executor ran out of memory"},{"location":"details/error-executor-out-of-memory/#container-oom","text":"If the driver log indicates Container killed by YARN for exceeding memory limits for the applicable executor, or if (on K8s) the Spark UI shows the reason for the executor loss as \"OOMKill\" / exit code 137, then it's likely your program is exceeding the amount of memory assigned to it. This doesn't normally happen with pure JVM code, but instead when calling PySpark or JNI libraries (or using off-heap storage). PySpark users are the most likely to encounter container OOMs. If you have a PySpark UDF in the stage you should check out Python UDF OOM to eliminate that potential cause. 
Another potential issue to investigate is whether you have key skew, as trying to load too large a partition in Python can result in an OOM. If you are using a library, like TensorFlow, which results in","title":"Container OOM"},{"location":"details/error-invalid-file/","text":"Missing Files / File Not Found / Reading past RLE/BitPacking stream Missing files are a relatively rare error in Spark. Most commonly they are caused by non-atomic operations in the data writer and will go away when you re-run your query/job. On the other hand, Reading past RLE/BitPacking stream or other file read errors tend to be non-transient. If the error is not transient it may mean that the metadata store (e.g. Hive or Iceberg) is pointing to a file that does not exist or has a bad format. You can clean up Iceberg tables using Iceberg Table Cleanup from holden's spark-misc-utils, but be careful and talk with whoever produced the table to make sure that it's ok. If you get a failed to read parquet file while you are not trying to read a parquet file, it's likely that you are using the wrong metastore .","title":"Missing Files / File Not Found / Reading past RLE/BitPacking stream"},{"location":"details/error-invalid-file/#missing-files-file-not-found-reading-past-rlebitpacking-stream","text":"Missing files are a relatively rare error in Spark. Most commonly they are caused by non-atomic operations in the data writer and will go away when you re-run your query/job. On the other hand, Reading past RLE/BitPacking stream or other file read errors tend to be non-transient. If the error is not transient it may mean that the metadata store (e.g. Hive or Iceberg) is pointing to a file that does not exist or has a bad format. You can clean up Iceberg tables using Iceberg Table Cleanup from holden's spark-misc-utils, but be careful and talk with whoever produced the table to make sure that it's ok. 
If you get a failed to read parquet file while you are not trying to read a parquet file, it's likely that you are using the wrong metastore .","title":"Missing Files / File Not Found / Reading past RLE/BitPacking stream"},{"location":"details/error-job/","text":"Error Most of the errors should fall into below 4 categories. Drill-down to individual sections to isolate your error/exception. SQL Analysis Exception Memory Error Shuffle Error Other Error","title":"Error"},{"location":"details/error-job/#error","text":"Most of the errors should fall into below 4 categories. Drill-down to individual sections to isolate your error/exception. SQL Analysis Exception Memory Error Shuffle Error Other Error","title":"Error"},{"location":"details/error-memory/","text":"Memory Errors Driver Spark driver ran out of memory maxResultSize exceeded stackOverflowError Executor Spark executor ran out of memory Executor out of disk error","title":"Memory Errors"},{"location":"details/error-memory/#memory-errors","text":"","title":"Memory Errors"},{"location":"details/error-memory/#driver","text":"","title":"Driver"},{"location":"details/error-memory/#spark-driver-ran-out-of-memory","text":"","title":"Spark driver ran out of memory"},{"location":"details/error-memory/#maxresultsize-exceeded","text":"","title":"maxResultSize exceeded"},{"location":"details/error-memory/#stackoverflowerror","text":"","title":"stackOverflowError"},{"location":"details/error-memory/#executor","text":"","title":"Executor"},{"location":"details/error-memory/#spark-executor-ran-out-of-memory","text":"","title":"Spark executor ran out of memory"},{"location":"details/error-memory/#executor-out-of-disk-error","text":"","title":"Executor out of disk error"},{"location":"details/error-other/","text":"Other errors Failed to read non-parquet file Executor Failure from large record Class or method not found Invalid/Missing Files Too Big DAG","title":"Other 
errors"},{"location":"details/error-other/#other-errors","text":"Failed to read non-parquet file Executor Failure from large record Class or method not found Invalid/Missing Files Too Big DAG","title":"Other errors"},{"location":"details/error-shuffle/","text":"Fetch Failed exceptions No time to read, help me now. FetchFailed exceptions are mainly due to misconfiguration of spark.sql.shuffle.partitions : Too few shuffle partitions: Having too few shuffle partitions means you could have a shuffle block that is larger than the limit (Integer.MaxValue=~2GB) or an OOM (exit code 143). The symptom for this can also be long-running tasks where the blocks are large but have not reached the limit. A quick fix is to increase the shuffle/reducer parallelism by increasing spark.sql.shuffle.partitions (default is 500). Too many shuffle partitions: Too many shuffle partitions could put stress on the shuffle service and cause errors like network timeouts. Note that the shuffle service is a shared service for all the jobs running on the cluster, so it is possible that someone else's job with high shuffle activity could cause errors for your job. It is worth checking to see if there is a pattern of these failures for your job to confirm if it is an issue with your job or not. Also note that the higher the shuffle partitions, the more likely that you would see this issue. Tell me more. FetchFailed Exceptions can be bucketed into the 4 categories below: Ran out of heap memory(OOM) on an Executor Ran out of overhead memory on an Executor Shuffle block greater than 2 GB Network TimeOut. Ran out of heap memory(OOM) on an Executor This error indicates that the executor hosting the shuffle block has crashed due to Java OOM. The most likely cause for this is misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions. Note that if you have skew from a single key (in join, group by), increasing this property wouldn't resolve the issue. 
Please refer to key-skew for related workarounds. Errors that you normally see in the executor/task logs: ExecutorLostFailure due to Exit code 143 ExecutorLostFailure due to Executor Heartbeat timed out. Ran out of overhead memory on an Executor This error indicates that the executor hosting the shuffle block has crashed due to running out of off-heap (overhead) memory. Increasing spark.yarn.executor.memoryOverhead should prevent this specific exception. Error that you normally see in the executor/task logs: ExecutorLostFailure, # GB of # GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead Shuffle block greater than 2 GB The most likely cause for this is misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions (this increases the number of blocks and reduces the block size). Note that if you have skew from a single key (in join, group by), increasing this property wouldn't resolve the issue. Please refer to key-skew for related workarounds. Error that you normally see in the executor/task logs: Too Large Frame Frame size exceeding Integer.MaxValue (~2GB) Network Timeout The most likely cause for this exception is high shuffle activity (high network load) in your job. Reducing the shuffle partitions ( spark.sql.shuffle.partitions ) would mitigate this issue. You can also reduce the network load by modifying the shuffle config. 
(todo: add details) Errors that you normally see in the executor/task logs: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx","title":"Fetch Failed exceptions"},{"location":"details/error-shuffle/#fetch-failed-exceptions","text":"","title":"Fetch Failed exceptions"},{"location":"details/error-shuffle/#no-time-to-read-help-me-now","text":"FetchFailed exceptions are mainly due to misconfiguration of spark.sql.shuffle.partitions : Too few shuffle partitions: Having too few shuffle partitions means you could have a shuffle block that is larger than the limit (Integer.MaxValue=~2GB) or an OOM (exit code 143). The symptom for this can also be long-running tasks where the blocks are large but have not reached the limit. A quick fix is to increase the shuffle/reducer parallelism by increasing spark.sql.shuffle.partitions (default is 500). Too many shuffle partitions: Too many shuffle partitions could put stress on the shuffle service and cause errors like network timeouts. Note that the shuffle service is a shared service for all the jobs running on the cluster, so it is possible that someone else's job with high shuffle activity could cause errors for your job. It is worth checking to see if there is a pattern of these failures for your job to confirm if it is an issue with your job or not. 
Also note that the higher the shuffle partitions, the more likely that you would see this issue.","title":"No time to read, help me now."},{"location":"details/error-shuffle/#tell-me-more","text":"FetchFailed Exceptions can be bucketed into the 4 categories below: Ran out of heap memory(OOM) on an Executor Ran out of overhead memory on an Executor Shuffle block greater than 2 GB Network TimeOut.","title":"Tell me more."},{"location":"details/error-shuffle/#ran-out-of-heap-memoryoom-on-an-executor","text":"This error indicates that the executor hosting the shuffle block has crashed due to Java OOM. The most likely cause for this is misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions. Note that if you have skew from a single key (in join, group by), increasing this property wouldn't resolve the issue. Please refer to key-skew for related workarounds. Errors that you normally see in the executor/task logs: ExecutorLostFailure due to Exit code 143 ExecutorLostFailure due to Executor Heartbeat timed out.","title":"Ran out of heap memory(OOM) on an Executor"},{"location":"details/error-shuffle/#ran-out-of-overhead-memory-on-an-executor","text":"This error indicates that the executor hosting the shuffle block has crashed due to running out of off-heap (overhead) memory. Increasing spark.yarn.executor.memoryOverhead should prevent this specific exception. Error that you normally see in the executor/task logs: ExecutorLostFailure, # GB of # GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead","title":"Ran out of overhead memory on an Executor"},{"location":"details/error-shuffle/#shuffle-block-greater-than-2-gb","text":"The most likely cause for this is misconfiguration of spark.sql.shuffle.partitions. A workaround is to increase the shuffle partitions (this increases the number of blocks and reduces the block size). Note that if you have skew from a single key (in join, group by), increasing this property wouldn't resolve the issue. 
Please refer to key-skew for related workarounds. Error that you normally see in the executor/task logs: Too Large Frame Frame size exceeding Integer.MaxValue (~2GB)","title":"Shuffle block greater than 2 GB"},{"location":"details/error-shuffle/#network-timeout","text":"The most likely cause for this exception is high shuffle activity (high network load) in your job. Reducing the shuffle partitions ( spark.sql.shuffle.partitions ) would mitigate this issue. You can also reduce the network load by modifying the shuffle config. (todo: add details) Errors that you normally see in the executor/task logs: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx","title":"Network Timeout"},{"location":"details/error-sql-analysis/","text":"spark.sql.AnalysisException Spark SQL AnalysisException covers a wide variety of potential issues, ranging from ambiguous columns to more esoteric items like subquery issues. A good first step is making sure that your SQL is valid and your brackets are where you intend by putting your query through a SQL pretty-printer. After that hopefully the details of the AnalysisException error will guide you to one of the sub-nodes in the error graph. Known issues Correlated column is not allowed in predicate","title":"spark.sql.AnalysisException"},{"location":"details/error-sql-analysis/#sparksqlanalysisexception","text":"Spark SQL AnalysisException covers a wide variety of potential issues, ranging from ambiguous columns to more esoteric items like subquery issues. A good first step is making sure that your SQL is valid and your brackets are where you intend by putting your query through a SQL pretty-printer. 
After that hopefully the details of the AnalysisException error will guide you to one of the sub-nodes in the error graph.","title":"spark.sql.AnalysisException"},{"location":"details/error-sql-analysis/#known-issues","text":"Correlated column is not allowed in predicate","title":"Known issues"},{"location":"details/even_partitioning_still_slow/","text":"Even Partitioning Yet Still Slow To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. Sometimes a stage with even partitioning is still slow. There are a few common causes of slow stages when the partitioning is even. If your tasks are too short (e.g. finishing in under a few minutes), likely you have too many partitions/tasks. If your tasks are taking just the right amount of time but your jobs are slow you may not have enough executors. If your tasks are taking a long time you may have too large records, not enough partitions/tasks, or just slow functions. Another sign of not enough tasks can be excessive spill to disk. If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. Insufficient partitioning can be fixed by increasing the number of partitions (e.g. repartition(5000) or change spark.sql.shuffle.partitions ). Another cause of overly large partitions can be non-splittable compression formats, like gzip, that can be worked around with tools like splittablegzip . Finally consider the possibility the records are too large.","title":"Even Partitioning Yet Still Slow"},{"location":"details/even_partitioning_still_slow/#even-partitioning-yet-still-slow","text":"To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. 
Sometimes a stage with even partitioning is still slow. There are a few common possible causes when the partitioning is even for slow stages. If your tasks are too short (e.g. finishing in under a few minutes), likely you have too many partitions/tasks. If your tasks are taking just the right amount of time but your jobs are slow you may not have enough executors. If your tasks are taking a long time you may have too large records, not enough partitions/tasks, or just slow functions. Another sign of not enough tasks can be excessive spill to disk. If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. Insufficient partitioning can be fixed by increasing the number of partitions (e.g. repartition(5000) or change spark.sql.shuffle.partitions ). Another cause of too large partitioning can be non-splittable compression formats, like gzip, that can be worked around with tools like splittablegzip . Finally consider the possibility the records are too large.","title":"Even Partitioning Yet Still Slow"},{"location":"details/failed-to-read-non-parquet-file/","text":"Failed to read non-parquet file Iceberg does not perform validation on the files specified, so it will let you create a table pointing to non-supported formats, e.g. CSV data, but will fail at query time. In this case you need to use a different metastore (e.g. Hive ) If the data is stored in a supported format, it is also possible you have an invalid iceberg table.","title":"Failed to read non-parquet file"},{"location":"details/failed-to-read-non-parquet-file/#failed-to-read-non-parquet-file","text":"Iceberg does not perform validation on the files specified, so it will let you create a table pointing to non-supported formats, e.g. CSV data, but will fail at query time. In this case you need to use a different metastore (e.g. 
Hive ) If the data is stored in a supported format, it is also possible you have an invalid iceberg table.","title":"Failed to read non-parquet file"},{"location":"details/failure-executor-large-record/","text":"Large record problems can show up in a few different ways. For particularly large records you may find an executor out of memory exception, otherwise you may find slow stages. You can get a Kryo serialization (for SQL) or Java serialization error (for RDD). In addition if a given column in a row is too large you may encounter an IllegalArgumentException: Cannot grow BufferHolder by size, because the size after growing exceeds size limitation 2147483632 . Some common causes of too big records are groupByKey in RDD land, UDAFs or list aggregations (like collect_list ) in Spark SQL, highly compressed or sparse records without a sparse serialization. For sparse records check out AltEncoder in spark-misc-utils (https://github.com/holdenk/spark-misc-utils). If you are uncertain of where exactly the too big record is coming from after looking at the executor logs, you can try to separate the failing stage into distinct parts of the code by using persist at the DISK_ONLY level to introduce cuts into the graph. If your exception is happening with a Python UDF, it's possible that the individual records themselves might not be too large, but the batch-size used by Spark is set too high for the size of your records. You can try turning down the batch size.","title":"Large record problems can show up in a few different ways."},{"location":"details/failure-executor-large-record/#large-record-problems-can-show-up-in-a-few-different-ways","text":"For particularly large records you may find an executor out of memory exception, otherwise you may find slow stages. You can get a Kryo serialization (for SQL) or Java serialization error (for RDD). 
In addition if a given column in a row is too large you may encounter an IllegalArgumentException: Cannot grow BufferHolder by size, because the size after growing exceeds size limitation 2147483632 . Some common causes of too big records are groupByKey in RDD land, UDAFs or list aggregations (like collect_list ) in Spark SQL, highly compressed or sparse records without a sparse serialization. For sparse records check out AltEncoder in spark-misc-utils (https://github.com/holdenk/spark-misc-utils). If you are uncertain of where exactly the too big record is coming from after looking at the executor logs, you can try to separate the failing stage into distinct parts of the code by using persist at the DISK_ONLY level to introduce cuts into the graph. If your exception is happening with a Python UDF, it's possible that the individual records themselves might not be too large, but the batch-size used by Spark is set too high for the size of your records. You can try turning down the batch size.","title":"Large record problems can show up in a few different ways."},{"location":"details/forced-computations/","text":"Force computations There are multiple use cases where you might want to measure performance for different transformations in your spark job, in which case you have to materialize the transformations by calling an explicit action. If you encounter an exception during the write phase that appears unrelated, one technique is to force computation earlier of the DataFrame or RDD to narrow down the true cause of the exception. Forcing computation on RDDs is relatively simple, all you need to do is call count() and Spark will evaluate the RDD. Forcing computation on DataFrames is more complex. Calling an action like count() on a DataFrame might not necessarily work because the optimizer will likely ignore unnecessary transformations. In order to compute the row count, Spark does not have to execute all transformations. 
The Spark optimizer can simplify the query plan in such a way that the actual transformation that you need to measure will be skipped because it is simply not needed for finding out the final count. In order to make sure all the transformations are called, we need to force Spark to compute them using other ways. Here are some options to force Spark to compute all transformations of a DataFrame: df.rdd.count() : convert to an RDD and perform a count df.foreach (_ => ()) : do-nothing foreach Write to an output table (not recommended for performance benchmarking since the execution time will be impacted heavily by the actual writing process) If using Spark 3.0 and above, benchmarking is simplified by supporting a \"noop\" write format which will force compute all transformations without having to write it. df.write .mode(\"overwrite\") .format(\"noop\") .save()","title":"Force computations"},{"location":"details/forced-computations/#force-computations","text":"There are multiple use cases where you might want to measure performance for different transformations in your spark job, in which case you have to materialize the transformations by calling an explicit action. If you encounter an exception during the write phase that appears unrelated, one technique is to force computation earlier of the DataFrame or RDD to narrow down the true cause of the exception. Forcing computation on RDDs is relatively simple, all you need to do is call count() and Spark will evaluate the RDD. Forcing computation on DataFrames is more complex. Calling an action like count() on a DataFrame might not necessarily work because the optimizer will likely ignore unnecessary transformations. In order to compute the row count, Spark does not have to execute all transformations. The Spark optimizer can simplify the query plan in such a way that the actual transformation that you need to measure will be skipped because it is simply not needed for finding out the final count. 
In order to make sure all the transformations are called, we need to force Spark to compute them in other ways. Here are some options to force Spark to compute all transformations of a DataFrame: df.rdd.count() : convert to an RDD and perform a count df.foreach (_ => ()) : do-nothing foreach Write to an output table (not recommended for performance benchmarking since the execution time will be impacted heavily by the actual writing process) If using Spark 3.0 and above, benchmarking is simplified by supporting a \"noop\" write format which will force compute all transformations without having to write it. df.write .mode(\"overwrite\") .format(\"noop\") .save()","title":"Force computations"},{"location":"details/key-skew/","text":"Key/Partition Skew Key or partition skew is a frequent problem in Spark. Key skew can result in everything from slowly running jobs (with stragglers), to failing jobs. What is data skew? Usually caused during a transformation when the data in one partition ends up being a lot more than the others, bumping up memory could resolve an OOM error but does not solve the underlying problem When processing partitions are unbalanced by an order of magnitude, the largest partition becomes the bottleneck How to identify skew If one task took much longer to complete than the other tasks, it's usually a sign of Skew. On the Spark UI under Summary Metrics for completed tasks if the Max duration is higher by a significant magnitude from the Median it usually represents Skew, e.g.: Things to consider Mitigating skew has a cost (e.g. repartition) hence it's ignorable unless the duration or input size is significantly higher in magnitude severely impacting job time Mitigation strategies Increasing executor memory to prevent OOM exceptions -> This is a short-term solution if you want to unblock yourself but does not address the underlying issue. Sometimes this is not an option when you are already running at the max memory settings allowable. 
Salting is a way to balance partitions by introducing a salt/dummy key for the skewed partitions. Here is a sample workbook and an example of salting in content performance show completion pipeline, where the whole salting operation is parametrized with a JOIN_BUCKETS variable which helps with maintenance of this job. Isolate the data for the skewed key, broadcast it for processing (e.g. join) and then union back the results Adaptive Query Execution is a new framework with Spark 3.0, it enables Spark to dynamically identify skew. Under the hood adaptive query execution splits (and replicates if needed) skewed (large) partitions. If you are unable to upgrade to Spark 3.0, you can build the solution into the code by using the Salting/Partitioning technique listed above. Using approximate functions/ probabilistic data structure Using approximate distinct counts (Hyperloglog) can help get around skew if absolute precision isn't important. Approximate data structures like Tdigest can help with quantile computations. If you need exact quantiles, check out the example in High Performance Spark Certain types of aggregations and windows can result in partitioning the data on a particular key.","title":"Key/Partition Skew"},{"location":"details/key-skew/#keypartition-skew","text":"Key or partition skew is a frequent problem in Spark. 
Key skew can result in everything from slowly running jobs (with stragglers), to failing jobs.","title":"Key/Partition Skew"},{"location":"details/key-skew/#what-is-data-skew","text":"Usually caused during a transformation when the data in one partition ends up being a lot more than the others, bumping up memory could resolve an OOM error but does not solve the underlying problem When processing partitions are unbalanced by an order of magnitude, the largest partition becomes the bottleneck","title":"What is data skew?"},{"location":"details/key-skew/#how-to-identify-skew","text":"If one task took much longer to complete than the other tasks, it's usually a sign of Skew. On the Spark UI under Summary Metrics for completed tasks if the Max duration is higher by a significant magnitude from the Median it usually represents Skew, e.g.: Things to consider Mitigating skew has a cost (e.g. repartition) hence it's ignorable unless the duration or input size is significantly higher in magnitude severely impacting job time","title":"How to identify skew"},{"location":"details/key-skew/#mitigation-strategies","text":"Increasing executor memory to prevent OOM exceptions -> This is a short-term solution if you want to unblock yourself but does not address the underlying issue. Sometimes this is not an option when you are already running at the max memory settings allowable. Salting is a way to balance partitions by introducing a salt/dummy key for the skewed partitions. Here is a sample workbook and an example of salting in content performance show completion pipeline, where the whole salting operation is parametrized with a JOIN_BUCKETS variable which helps with maintenance of this job. Isolate the data for the skewed key, broadcast it for processing (e.g. join) and then union back the results Adaptive Query Execution is a new framework with Spark 3.0, it enables Spark to dynamically identify skew. 
Under the hood adaptive query execution splits (and replicates if needed) skewed (large) partitions. If you are unable to upgrade to Spark 3.0, you can build the solution into the code by using the Salting/Partitioning technique listed above. Using approximate functions/ probabilistic data structure Using approximate distinct counts (Hyperloglog) can help get around skew if absolute precision isn't important. Approximate data structures like Tdigest can help with quantile computations. If you need exact quantiles, check out the example in High Performance Spark Certain types of aggregations and windows can result in partitioning the data on a particular key.","title":"Mitigation strategies"},{"location":"details/notenoughexecs/","text":"Not enough execs","title":"Notenoughexecs"},{"location":"details/notenoughexecs/#not-enough-execs","text":"","title":"Not enough execs"},{"location":"details/partial_aggregates/","text":"Partial v.s. Full Aggregates Partial Aggregation is a key concept when handling large amounts of data in Spark. Full aggregation means that all of the data for one key must be together on the same node and then it can be aggregated, whereas partial aggregation allows Spark to start the aggregation \"map-side\" (e.g. before the shuffle) and then combine these \"partial\" aggregations together. In RDD world the classic \"full\" aggregation is groupByKey and partial aggregation is reduceByKey . In DataFrame/Datasets, Scala UDAFs implement partial aggregation but the basic PySpark Panda's/Arrow UDAFs do not support partial aggregation.","title":"Partial v.s. Full Aggregates"},{"location":"details/partial_aggregates/#partial-vs-full-aggregates","text":"Partial Aggregation is a key concept when handling large amounts of data in Spark. Full aggregation means that all of the data for one key must be together on the same node and then it can be aggregated, whereas partial aggregation allows Spark to start the aggregation \"map-side\" (e.g. 
before the shuffle) and then combine these \"partial\" aggregations together. In RDD world the classic \"full\" aggregation is groupByKey and partial aggregation is reduceByKey . In DataFrame/Datasets, Scala UDAFs implement partial aggregation but the basic PySpark pandas/Arrow UDAFs do not support partial aggregation.","title":"Partial v.s. Full Aggregates"},{"location":"details/pyudfoom/","text":"PySpark UDF / UDAF OOM Out of memory exceptions with Python user-defined-functions are especially likely as Spark doesn't do a good job of managing memory between the JVM and Python VM. Together this can result in exceeding container memory limits . Grouped Map / Co-Grouped The Grouped & Co-Grouped UDFs are especially likely to cause out-of-memory exceptions in PySpark when combined with key skew . Unlike most built-in Spark aggregations, PySpark user-defined-aggregates do not support partial aggregation. This means that all of the data for a single key must fit in memory. If possible try to use an equivalent built-in aggregation, write a Scala aggregation supporting partial aggregates, or switch to an RDD and use reduceByKey . This limitation applies regardless of whether you are using Arrow or \"vanilla\" UDAFs. Arrow / Pandas / Vectorized UDFS If you are using PySpark's not-so-new Arrow-based UDFs (sometimes called pandas UDFs or vectorized UDFs ), record batching can cause issues. You can configure spark.sql.execution.arrow.maxRecordsPerBatch , which defaults to 10k records per batch. If your records are large this default may very well be the source of your out of memory exceptions. Note: setting spark.sql.execution.arrow.maxRecordsPerBatch too small will result in reduced performance and reduced ability to vectorize operations over the data frames. mapInPandas / mapInArrow If you use mapInPandas or mapInArrow (proposed in 3.3+) it's important to note that Spark will serialize entire records, not just the columns needed by your UDF. 
If you encounter OOMs here because of record sizes, one option is to minimize the amount of data being serialized in each record. Select only the minimal data needed to perform the UDF + a key to rejoin with the target dataset.","title":"PySpark UDF / UDAF OOM"},{"location":"details/pyudfoom/#pyspark-udf-udaf-oom","text":"Out of memory exceptions with Python user-defined-functions are especially likely as Spark doesn't do a good job of managing memory between the JVM and Python VM. Together this can result in exceeding container memory limits .","title":"PySpark UDF / UDAF OOM"},{"location":"details/pyudfoom/#grouped-map-co-grouped","text":"The Grouped & Co-Grouped UDFs are especially likely to cause out-of-memory exceptions in PySpark when combined with key skew . Unlike most built in Spark aggregations, PySpark user-defined-aggregates do not support partial aggregation. This means that all of the data for a single key must fit in memory. If possible try and use an equivalent built-in aggregation, write a Scala aggregation supporting partial aggregates, or switch to an RDD and use reduceByKey . This limitation applies regardless of whether you are using Arrow or \"vanilla\" UDAFs.","title":"Grouped Map / Co-Grouped"},{"location":"details/pyudfoom/#arrow-pandas-vectorized-udfs","text":"If you are using PySpark's not-so-new Arrow based UDFS (sometimes called pandas UDFS or vectorized UDFs ), record batching can cause issues. You can configure spark.sql.execution.arrow.maxRecordsPerBatch , which defaults to 10k records per batch. If your records are large this default may very well be the source of your out of memory exceptions. 
Note: setting spark.sql.execution.arrow.maxRecordsPerBatch too-small will result in reduced performance and reduced ability to vectorize operations over the data frames.","title":"Arrow / Pandas / Vectorized UDFS"},{"location":"details/pyudfoom/#mapinpandas-mapinarrow","text":"If you use mapInPandas or mapInArrow (proposed in 3.3+) it's important to note that Spark will serialize entire records, not just the columns needed by your UDF. If you encounter OOMs here because of record sizes, one option is to minimize the amount of data being serialized in each record. Select only the minimal data needed to perform the UDF + a key to rejoin with the target dataset.","title":"mapInPandas / mapInArrow"},{"location":"details/read-partition-issue/","text":"Partition at read time We're used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. Note that the number of partitions can be optionally specified when using the read API. How to decide on a partition column or partition key? Does the key have relatively low cardinality? 1k distinct values are better than 1M distinct values. Consider a numeric, date, or timestamp column. Does the key have enough data in each partition? 1Gb is a good goal. Does the key have too much data in each partition? The data must fit on a single task in memory and avoid spilling to disk. Does the key have evenly distributed data in each partition? If some partitions have orders of magnitude more data than others, those larger partitions have the potential to spill to disk, OOM, or simply consume excess resources in comparison to the partitions with median amounts of data. You don't want to size executors for the bloated partition. If none of the columns or keys has a particularly even distribution, then create a new column at the expense of saving a new version of the table/RDD/DF. 
A frequent approach here is to create a new column using a hash based on existing columns. Does the key allow for fewer wide transformations? Wide transformations are more costly than narrow transformations. Does the number of partitions approximate 2-3x the number of allocated cores on the executors? Reference links Learning Spark High Performance Spark","title":"Partition at read time"},{"location":"details/read-partition-issue/#partition-at-read-time","text":"We're used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. Note that the number of partitions can be optionally specified when using the read API. How to decide on a partition column or partition key? Does the key have relatively low cardinality? 1k distinct values are better than 1M distinct values. Consider a numeric, date, or timestamp column. Does the key have enough data in each partition? 1Gb is a good goal. Does the key have too much data in each partition? The data must fit on a single task in memory and avoid spilling to disk. Does the key have evenly distributed data in each partition? If some partitions have orders of magnitude more data than others, those larger partitions have the potential to spill to disk, OOM, or simply consume excess resources in comparison to the partitions with median amounts of data. You don't want to size executors for the bloated partition. If none of the columns or keys has a particularly even distribution, then create a new column at the expense of saving a new version of the table/RDD/DF. A frequent approach here is to create a new column using a hash based on existing columns. Does the key allow for fewer wide transformations? Wide transformations are more costly than narrow transformations. 
Does the number of partitions approximate 2-3x the number of allocated cores on the executors?","title":"Partition at read time"},{"location":"details/read-partition-issue/#reference-links","text":"Learning Spark High Performance Spark","title":"Reference links"},{"location":"details/revise-bad_partitioning/","text":"Bad Partitioning There are three main types and causes of bad partitioning in Spark. Partitioning is often the limitation of parallelism for most Spark jobs. The most common (and most difficult to fix) bad partitioning in Spark is that of skewed partitioning. With key-skew the problem is not the number of partitions, but that the data is not evenly distributed amongst the partitions. The most frequent cause of skewed partitioning is that of \"key-skew\". This happens frequently since humans and machines both tend to cluster resulting in skew (e.g. NYC and null ). The other type of skewed partitioning comes from \"input partitioned\" data which is not evenly partitioned. With input partitioned data, the RDD or DataFrame doesn't have a particular partitioner; it just matches however the data is stored on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. This input partitioned data can also be skewed due to key-skew if the data is written out partitioned on a skewed key. Insufficient partitioning is similar to input skewed partitioning, except instead of skew there just are not enough partitions. Similarly, you can increase the number of partitions (e.g. repartition(5000) or change spark.sql.shuffle.partitions ).","title":"Bad Partitioning"},{"location":"details/revise-bad_partitioning/#bad-partitioning","text":"There are three main types and causes of bad partitioning in Spark. Partitioning is often the limitation of parallelism for most Spark jobs. The most common (and most difficult to fix) bad partitioning in Spark is that of skewed partitioning. 
With key-skew the problem is not the number of partitions, but that the data is not evenly distributed amongst the partitions. The most frequent cause of skewed partitioning is that of \"key-skew\". This happens frequently since humans and machines both tend to cluster resulting in skew (e.g. NYC and null ). The other type of skewed partitioning comes from \"input partitioned\" data which is not evenly partitioned. With input partitioned data, the RDD or DataFrame doesn't have a particular partitioner; it just matches however the data is stored on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. This input partitioned data can also be skewed due to key-skew if the data is written out partitioned on a skewed key. Insufficient partitioning is similar to input skewed partitioning, except instead of skew there just are not enough partitions. Similarly, you can increase the number of partitions (e.g. repartition(5000) or change spark.sql.shuffle.partitions ).","title":"Bad Partitioning"},{"location":"details/revise-even_partitioning_still_slow/","text":"Even Partitioning Yet Still Slow To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. Sometimes a stage with even partitioning is still slow. If the max task duration is still substantially shorter than the stage's overall duration, this is often a sign of an insufficient number of executors. Spark can run (at most) spark.executor.cores * spark.dynamicAllocation.maxExecutors tasks in parallel (and in practice this will be lower since some tasks will be speculatively executed and some executors will fail). Try increasing the maxExecutors and seeing if your job speeds up. Note Setting spark.executor.cores * spark.dynamicAllocation.maxExecutors in excess of cluster capacity can result in the job waiting in PENDING state. 
So, try increasing maxExecutors within the limitations of the cluster resources and check if the job runtime is faster given the same input data. If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. See Bad Partitioning .","title":"Even Partitioning Yet Still Slow"},{"location":"details/revise-even_partitioning_still_slow/#even-partitioning-yet-still-slow","text":"To see if a stage is evenly partitioned, take a look at the Spark WebUI --> Stage tab and look at the distribution of data sizes and durations of the completed tasks. Sometimes a stage with even partitioning is still slow. If the max task duration is still substantially shorter than the stage's overall duration, this is often a sign of an insufficient number of executors. Spark can run (at most) spark.executor.cores * spark.dynamicAllocation.maxExecutors tasks in parallel (and in practice this will be lower since some tasks will be speculatively executed and some executors will fail). Try increasing the maxExecutors and seeing if your job speeds up. Note Setting spark.executor.cores * spark.dynamicAllocation.maxExecutors in excess of cluster capacity can result in the job waiting in PENDING state. So, try increasing maxExecutors within the limitations of the cluster resources and check if the job runtime is faster given the same input data. If the data is evenly partitioned but the max task duration is longer than desired for the stage, increasing the number of executors will not help and you'll need to re-partition the data. 
See Bad Partitioning .","title":"Even Partitioning Yet Still Slow"},{"location":"details/slow-executor/","text":"Slow executor There can be many reasons executors are slow; here are a few things you can look into: Performance distribution among tasks in the same stage: In Spark UI - Stages - Summary Metric: check if there's uneven distribution of duration / input size. If true, there may be data skews or uneven partition splits. See uneven partitioning . Task size: In Spark UI - Stages - Summary Metrics, check the input/output size of tasks. If individual input or output tasks are larger than a few hundred megabytes, you may need more partitions. Try increasing spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes or consider making a repartition call. GC: Check if GC time is a small fraction of duration, if it's more than a few percents, try increasing executor memory and see if any difference. If adding memory is not helping, you can now see if any optimization can be done in your code for that stage.","title":"Slow executor"},{"location":"details/slow-executor/#slow-executor","text":"There can be many reasons executors are slow; here are a few things you can look into: Performance distribution among tasks in the same stage: In Spark UI - Stages - Summary Metric: check if there's uneven distribution of duration / input size. If true, there may be data skews or uneven partition splits. See uneven partitioning . Task size: In Spark UI - Stages - Summary Metrics, check the input/output size of tasks. If individual input or output tasks are larger than a few hundred megabytes, you may need more partitions. Try increasing spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes or consider making a repartition call. GC: Check if GC time is a small fraction of duration, if it's more than a few percents, try increasing executor memory and see if any difference. 
If adding memory is not helping, you can now see if any optimization can be done in your code for that stage.","title":"Slow executor"},{"location":"details/slow-job-slow-cluster/","text":"Slow Cluster How do I know if and when my job is waiting for cluster resources? Sometimes the cluster manager may choke or otherwise not be able to allocate resources, and we don't have a good way of detecting this situation, making it difficult for the user to debug and tell apart from Spark not scaling up correctly. As of Spark 3.4, an executor will note when and for how long it waits for cluster resources. Check the JVM metrics for this information. Reference link: https://issues.apache.org/jira/browse/SPARK-36664","title":"Slow job slow cluster"},{"location":"details/slow-job-slow-cluster/#slow-cluster","text":"How do I know if and when my job is waiting for cluster resources? Sometimes the cluster manager may choke or otherwise not be able to allocate resources, and we don't have a good way of detecting this situation, making it difficult for the user to debug and tell apart from Spark not scaling up correctly. As of Spark 3.4, an executor will note when and for how long it waits for cluster resources. Check the JVM metrics for this information.","title":"Slow Cluster"},{"location":"details/slow-job-slow-cluster/#reference-link","text":"https://issues.apache.org/jira/browse/SPARK-36664","title":"Reference link:"},{"location":"details/slow-job/","text":"Slow job A Spark job can be slow for various reasons; here are a couple of them: Slow stage(s): Go to Slow Stage section to identify the slow stage. In most cases, a job is slow because one or more of the stages are slow. Too big DAG: Go to TooBigDAG section for more details on this topic","title":"Slow job"},{"location":"details/slow-job/#slow-job","text":"A Spark job can be slow for various reasons; here are a couple of them: Slow stage(s): Go to Slow Stage section to identify the slow stage. 
In most cases, a job is slow because one or more of the stages are slow. Too big DAG: Go to TooBigDAG section for more details on this topic","title":"Slow job"},{"location":"details/slow-map/","text":"Slow Map Below is a list of reasons why your map stage might be slow. Note that this is not an exhaustive list but covers most of the scenarios. flowchart LR SlowMap[Slow Read / Map] SlowMap --> SLOWEXEC[Slow executor] SlowMap --> EVENPART_SLOW[Even partitioning] SlowMap --> SkewedMapTasks[Skewed Map Tasks and uneven partitioning] EVENPART_SLOW --> MissingSourcePredicates[Reading more data than needed] EVENPART_SLOW --> TooFewMapTasks[Not enough Read/Map Tasks] EVENPART_SLOW --> TooManyMapTasks[Too many Read/Map Tasks] EVENPART_SLOW --> SlowTransformations[Slow Transformations] EVENPART_SLOW --> UDFSLOWNESS[Slow UDF] SkewedMapTasks --> RecordSkew[Record Skew] SkewedMapTasks --> TaskSkew[Task skew] TaskSkew --> READPARTITIONISSUES[Read partition issues] MissingSourcePredicates --> FILTERNOTPUSHED[Filter not pushed] click EVENPART_SLOW \"../../details/even_partitioning_still_slow\" click SLOWEXEC \"../../details/slow-executor\" click SkewedMapTasks \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click RecordSkew \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click TaskSkew \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click MissingSourcePredicates \"../../details/slow-map/#reading-more-data-than-needed\" click UDFSLOWNESS \"../../details/udfslow\" click LARGERECORDS \"../../details/failure-executor-large-record\" click TooFewMapTasks \"../../details/slow-map/#not-enough-readmap-tasks\" click TooManyMapTasks \"../../details/slow-map/#too-many-readmap-tasks\" click SlowTransformations \"../../details/slow-map/#slow-transformations\" click FILTERNOTPUSHED \"../../details/slow-partition_filter_pushdown\" click SLOWEXEC \"../../details/slow-executor\" click READPARTITIONISSUES 
\"../../details/read-partition-issue\" Reading more data than needed Iceberg/Parquet provides 3 layers of data pruning/filtering, so it is recommended to make the most of it by utilizing them as upstream in your ETL as possible. Partition Pruning : Applying a filter on a partition column would mean the Spark can prune all the partitions that are not needed (ex: utc_date, utc_hour etc.). Refer to this section for some examples. Column Pruning : Parquet, a columnar format, allows us to read specific columns from a row group without having to read the entire row. By selecting the fields that you only need for your job/sql(instead of \"select *\"), you can avoid bringing unnecessary data only to drop it in the subsequent stages. Predicate Push Down: It is also recommended to use filters on non-partition columns as this would allow Spark to exclude specific row groups while reading data from S3. For ex: account_id is not null if you know that you would be dropping the NULL account_ids eventually. See also filter not pushed down , aggregation not pushed down(todo: add details), Bad storage partitioning(todo: add details). Not enough Read/Map Tasks If your map stage is taking longer, and you are sure that you are not reading more data than needed, then you may be reading the data with small no. of tasks. You can increase the no. of map tasks by decreasing target split size. Note that if you are constrained by the resources(map tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors Too many Read/Map Tasks If you have large no. of map tasks in your stage, you could run into driver memory related errors as the task metadata could overwhelm the driver. This also could put a stress on shuffle(on map side) as more map tasks would create more shuffle blocks. It is recommended to keep the task count for a stage under 80k. You can decrease the no. 
of map tasks by increasing target split size (todo: add detail) for an Iceberg table. (Note: For a non-Iceberg table, the property is spark.sql.files.maxPartitionBytes, and it is set at the job level, not at the table level.) Slow Transformations Map tasks can also be slow because of the transformations they run. Some common causes include: Regex : You have RegEx in your transformation. Refer to RegEx tips for tuning. udf: Make sure you are sending only the data that you need in UDF and tune UDF for performance. Refer to Slow UDF for more details. Json: TBD All these transformations may run into skew issues if you have a single row/column that is bloated. You could prevent this by checking the payload size before calling the transformation as a single row/column could potentially slow down the entire stage. Skewed Map Tasks or Uneven partitioning The most common (and most difficult to fix) form of bad partitioning in Spark is skewed partitioning, where the data is not evenly distributed amongst the partitions. Uneven partitioning due to Key-skew : The most frequent cause of skewed partitioning is \"key-skew.\" This happens frequently since humans and machines both tend to cluster, resulting in skew (e.g. NYC and null ). Uneven partitioning due to input layout: We are used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. In cases where the RDD or Dataframe doesn't have a particular partitioner, data is partitioned according to the storage on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. Spark is often able to avoid input layout issues by combining and splitting inputs (when input formats are \"splittable\"), but not all input formats give Spark this freedom. 
One common example is gzip ; there is a work-around for \"splittable gzip\", but it comes at the cost of decompressing the entire file multiple times. Record Skew : A single bloated row/record could be the root cause of a slow map task. The easiest way to identify this is by checking your string fields that hold JSON payloads (for example, a bug in a client could write a lot of data). You can identify the culprit by checking the max(size/length) of the field in your upstream table. For CL, snapshot is a candidate for a bloated field. Task Skew : This is only applicable to tables with a non-splittable file format (like TEXT, zip); Parquet files should never run into this issue. Task skew is where one of the tasks got more rows than others, and it is possible if the upstream table has a single file that is large and has the non-splittable format.","title":"Slow Map"},{"location":"details/slow-map/#slow-map","text":"Below is a list of reasons why your map stage might be slow. Note that this is not an exhaustive list but covers most of the scenarios. 
flowchart LR SlowMap[Slow Read / Map] SlowMap --> SLOWEXEC[Slow executor] SlowMap --> EVENPART_SLOW[Even partitioning] SlowMap --> SkewedMapTasks[Skewed Map Tasks and uneven partitioning] EVENPART_SLOW --> MissingSourcePredicates[Reading more data than needed] EVENPART_SLOW --> TooFewMapTasks[Not enough Read/Map Tasks] EVENPART_SLOW --> TooManyMapTasks[Too many Read/Map Tasks] EVENPART_SLOW --> SlowTransformations[Slow Transformations] EVENPART_SLOW --> UDFSLOWNESS[Slow UDF] SkewedMapTasks --> RecordSkew[Record Skew] SkewedMapTasks --> TaskSkew[Task skew] TaskSkew --> READPARTITIONISSUES[Read partition issues] MissingSourcePredicates --> FILTERNOTPUSHED[Filter not pushed] click EVENPART_SLOW \"../../details/even_partitioning_still_slow\" click SLOWEXEC \"../../details/slow-executor\" click SkewedMapTasks \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click RecordSkew \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click TaskSkew \"../../details/slow-map/#skewed-map-tasks-or-uneven-partitioning\" click MissingSourcePredicates \"../../details/slow-map/#reading-more-data-than-needed\" click UDFSLOWNESS \"../../details/udfslow\" click LARGERECORDS \"../../details/failure-executor-large-record\" click TooFewMapTasks \"../../details/slow-map/#not-enough-readmap-tasks\" click TooManyMapTasks \"../../details/slow-map/#too-many-readmap-tasks\" click SlowTransformations \"../../details/slow-map/#slow-transformations\" click FILTERNOTPUSHED \"../../details/slow-partition_filter_pushdown\" click SLOWEXEC \"../../details/slow-executor\" click READPARTITIONISSUES \"../../details/read-partition-issue\"","title":"Slow Map"},{"location":"details/slow-map/#reading-more-data-than-needed","text":"Iceberg/Parquet provides 3 layers of data pruning/filtering, so it is recommended to make the most of it by utilizing them as upstream in your ETL as possible. 
Partition Pruning : Applying a filter on a partition column would mean the Spark can prune all the partitions that are not needed (ex: utc_date, utc_hour etc.). Refer to this section for some examples. Column Pruning : Parquet, a columnar format, allows us to read specific columns from a row group without having to read the entire row. By selecting the fields that you only need for your job/sql(instead of \"select *\"), you can avoid bringing unnecessary data only to drop it in the subsequent stages. Predicate Push Down: It is also recommended to use filters on non-partition columns as this would allow Spark to exclude specific row groups while reading data from S3. For ex: account_id is not null if you know that you would be dropping the NULL account_ids eventually. See also filter not pushed down , aggregation not pushed down(todo: add details), Bad storage partitioning(todo: add details).","title":"Reading more data than needed"},{"location":"details/slow-map/#not-enough-readmap-tasks","text":"If your map stage is taking longer, and you are sure that you are not reading more data than needed, then you may be reading the data with small no. of tasks. You can increase the no. of map tasks by decreasing target split size. Note that if you are constrained by the resources(map tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors","title":"Not enough Read/Map Tasks"},{"location":"details/slow-map/#too-many-readmap-tasks","text":"If you have large no. of map tasks in your stage, you could run into driver memory related errors as the task metadata could overwhelm the driver. This also could put a stress on shuffle(on map side) as more map tasks would create more shuffle blocks. It is recommended to keep the task count for a stage under 80k. You can decrease the no. of map tasks by increasing target split size (todo: add detail) for an Iceberg table. 
(Note: For a non-Iceberg table, the property is spark.sql.files.maxPartitionBytes, and it is set at the job level, not at the table level.)","title":"Too many Read/Map Tasks"},{"location":"details/slow-map/#slow-transformations","text":"Map tasks can also be slow because of the transformations they run. Some common causes include: Regex : You have RegEx in your transformation. Refer to RegEx tips for tuning. udf: Make sure you are sending only the data that you need in UDF and tune UDF for performance. Refer to Slow UDF for more details. Json: TBD All these transformations may run into skew issues if you have a single row/column that is bloated. You could prevent this by checking the payload size before calling the transformation as a single row/column could potentially slow down the entire stage.","title":"Slow Transformations"},{"location":"details/slow-map/#skewed-map-tasks-or-uneven-partitioning","text":"The most common (and most difficult to fix) form of bad partitioning in Spark is skewed partitioning, where the data is not evenly distributed amongst the partitions. Uneven partitioning due to Key-skew : The most frequent cause of skewed partitioning is \"key-skew.\" This happens frequently since humans and machines both tend to cluster, resulting in skew (e.g. NYC and null ). Uneven partitioning due to input layout: We are used to thinking of partitioning after a shuffle, but partitioning problems can occur at read time as well. This often happens when the layout of the data on disk is not well suited to our computation. In cases where the RDD or Dataframe doesn't have a particular partitioner, data is partitioned according to the storage on disk. Uneven input partitioned data can be fixed with an explicit repartition/shuffle. Spark is often able to avoid input layout issues by combining and splitting inputs (when input formats are \"splittable\"), but not all input formats give Spark this freedom. 
One common example is gzip ; there is a work-around for \"splittable gzip\", but it comes at the cost of decompressing the entire file multiple times. Record Skew : A single bloated row/record could be the root cause of a slow map task. The easiest way to identify this is by checking your string fields that hold JSON payloads (for example, a bug in a client could write a lot of data). You can identify the culprit by checking the max(size/length) of the field in your upstream table. For CL, snapshot is a candidate for a bloated field. Task Skew : This is only applicable to tables with a non-splittable file format (like TEXT, zip); Parquet files should never run into this issue. Task skew is where one of the tasks got more rows than others, and it is possible if the upstream table has a single file that is large and has the non-splittable format.","title":"Skewed Map Tasks or Uneven partitioning"},{"location":"details/slow-partition_filter_pushdown/","text":"Partition Filters Processing more data than necessary will typically slow down the job. If the input table is partitioned then applying filters on the partition columns can restrict the input volume Spark needs to scan. A simple equality filter gets pushed down to the batch scan and enables Spark to only scan the files where dateint = 20211101 of a sample table partitioned on dateint and hour . select * from jlantos.sample_table where dateint = 20211101 limit 100 Examples when the filter does not get pushed down The filter contains an expression If instead of a particular date we'd like to load data from the 1st of any month we might rewrite the above query such as: select * from jlantos.sample_table where dateint % 100 = 1 limit 100 The query plan shows that Spark in this case scans the whole table and filters only in a later step. Filter is dynamic via a join In a more complex job we might restrict the data based on joining to another table. 
If the filtering criteria is not static it won't be pushed down to the scan. So in the example below the two table scans happen independently, and min(dateint) calculated in the CTE won't have an effect on the second scan. with dates as (select min(dateint) dateint from jlantos.sample_table) select * from jlantos.sample_table st join dates d on st.dateint = d.dateint","title":"Partition Filters"},{"location":"details/slow-partition_filter_pushdown/#partition-filters","text":"Processing more data than necessary will typically slow down the job. If the input table is partitioned then applying filters on the partition columns can restrict the input volume Spark needs to scan. A simple equality filter gets pushed down to the batch scan and enables Spark to only scan the files where dateint = 20211101 of a sample table partitioned on dateint and hour . select * from jlantos.sample_table where dateint = 20211101 limit 100","title":"Partition Filters"},{"location":"details/slow-partition_filter_pushdown/#examples-when-the-filter-does-not-get-pushed-down","text":"","title":"Examples when the filter does not get pushed down"},{"location":"details/slow-partition_filter_pushdown/#the-filter-contains-an-expression","text":"If instead of a particular date we'd like to load data from the 1st of any month we might rewrite the above query such as: select * from jlantos.sample_table where dateint % 100 = 1 limit 100 The query plan shows that Spark in this case scans the whole table and filters only in a later step.","title":"The filter contains an expression"},{"location":"details/slow-partition_filter_pushdown/#filter-is-dynamic-via-a-join","text":"In a more complex job we might restrict the data based on joining to another table. If the filtering criteria is not static it won't be pushed down to the scan. So in the example below the two table scans happen independently, and min(dateint) calculated in the CTE won't have an effect on the second scan. 
with dates as (select min(dateint) dateint from jlantos.sample_table) select * from jlantos.sample_table st join dates d on st.dateint = d.dateint","title":"Filter is dynamic via a join"},{"location":"details/slow-reduce/","text":"Slow Reduce Below is a list of reasons why your reduce stage might be slow. Note that this is not an exhaustive list but covers most of the scenarios. Not Enough Shuffle Tasks Too many shuffle tasks Skewed Shuffle Tasks Spill To Disk Not Enough Shuffle Tasks The default shuffle parallelism for our Spark cluster is 500, and it may not be enough for larger datasets. If you don't see skew and most/all of the tasks are taking really long to finish a reduce stage, you can improve the overall runtime by increasing the spark.sql.shuffle.partitions . Note that if you are constrained by the resources(reduce tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors Too many shuffle tasks While having too many shuffle tasks has no direct effect on the stage duration, it could slow the stage down if there are multiple retries during the shuffle stage due to shuffle fetch failures. Note that the higher the shuffle partitions, the more chances of running into FetchFailure exceptions. Skewed Shuffle Tasks Partitioning problems are often the limitation of parallelism for most Spark jobs. There are two primary types of bad partitioning, skewed partitioning (where the partitions are not equal in size/work) or even but non-ideal number partitioning (where the partitions are equal in size/work). If your tasks are taking roughly equivalent times to complete then you likely have even partitioning, and if they are taking unequal times to complete then you may have skewed or uneven partitioning. What is skew and how to identify skew . 
Skew is typically from one of the below stages: Join: Skew is natural in most of our data sets due to the nature of the data. Both Hash join and Sort-Merge join can run into skew issue if you have a lot of data for one or more keys on either side of the join. Check Skewed Joins for handling skewed joins with example. Aggregation/Group By: All aggregate functions(UDAFs) using SQL/dataframes/Datasets implement partial aggregation(combiner in MR) so you would only run into a skew if you are using non-algebraic functions like distinct and percentiles which can't be computed partially. Partial vs Full aggregates Sort/Repartition/Coalesce before write: It is recommended to introduce an additional stage for Sort or Repartition or Coalesce before the write stage to write optimal no. of S3 files into your target table. Check Skewed Write for more details. Slow Aggregation The non-algebraic functions below can slow down the reduce stage if you have too many values/rows for a given key. Count Distinct: Use HyperLogLog(HLL) based sketches for cardinality if you just need the approx counts for trends and don't need the exact counts. HLL can estimate with a standard error of 2%. Percentiles: Use approx_percentile or t-digest sketches which would speed up the computation for a small accuracy trade-off. Spill To Disk Spark executors will start using \"disk\" once they exceed the spark memory fraction of executor memory. This in itself is not an issue, but too much \"spill to disk\" will slow down the stage/job. You can overcome this by either increasing the executor memory or tweaking the job/stage to consume less memory (for example, a Sort-Merge join requires a lot less memory than a Hash join).","title":"Slow reduce"},{"location":"details/slow-reduce/#slow-reduce","text":"Below is a list of reasons why your reduce stage might be slow. Note that this is not an exhaustive list but covers most of the scenarios. 
Not Enough Shuffle Tasks Too many shuffle tasks Skewed Shuffle Tasks Spill To Disk","title":"Slow Reduce"},{"location":"details/slow-reduce/#not-enough-shuffle-tasks","text":"The default shuffle parallelism for our Spark cluster is 500, and it may not be enough for larger datasets. If you don't see skew and most/all of the tasks are taking really long to finish a reduce stage, you can improve the overall runtime by increasing the spark.sql.shuffle.partitions . Note that if you are constrained by the resources(reduce tasks are just waiting for resources and not in RUNNING status), you would have to request more executors for your job by increasing spark.dynamicAllocation.maxExecutors","title":"Not Enough Shuffle Tasks"},{"location":"details/slow-reduce/#too-many-shuffle-tasks","text":"While having too many shuffle tasks has no direct effect on the stage duration, it could slow the stage down if there are multiple retries during the shuffle stage due to shuffle fetch failures. Note that the higher the shuffle partitions, the more chances of running into FetchFailure exceptions.","title":"Too many shuffle tasks"},{"location":"details/slow-reduce/#skewed-shuffle-tasks","text":"Partitioning problems are often the limitation of parallelism for most Spark jobs. There are two primary types of bad partitioning, skewed partitioning (where the partitions are not equal in size/work) or even but non-ideal number partitioning (where the partitions are equal in size/work). If your tasks are taking roughly equivalent times to complete then you likely have even partitioning, and if they are taking unequal times to complete then you may have skewed or uneven partitioning. What is skew and how to identify skew . Skew is typically from one of the below stages: Join: Skew is natural in most of our data sets due to the nature of the data. Both Hash join and Sort-Merge join can run into skew issue if you have a lot of data for one or more keys on either side of the join. 
Check Skewed Joins for handling skewed joins with example. Aggregation/Group By: All aggregate functions(UDAFs) using SQL/dataframes/Datasets implement partial aggregation(combiner in MR) so you would only run into a skew if you are using non-algebraic functions like distinct and percentiles which can't be computed partially. Partial vs Full aggregates Sort/Repartition/Coalesce before write: It is recommended to introduce an additional stage for Sort or Repartition or Coalesce before the write stage to write optimal no. of S3 files into your target table. Check Skewed Write for more details.","title":"Skewed Shuffle Tasks"},{"location":"details/slow-reduce/#slow-aggregation","text":"The non-algebraic functions below can slow down the reduce stage if you have too many values/rows for a given key. Count Distinct: Use HyperLogLog(HLL) based sketches for cardinality if you just need the approx counts for trends and don't need the exact counts. HLL can estimate with a standard error of 2%. Percentiles: Use approx_percentile or t-digest sketches which would speed up the computation for a small accuracy trade-off.","title":"Slow Aggregation"},{"location":"details/slow-reduce/#spill-to-disk","text":"Spark executors will start using \"disk\" once they exceed the spark memory fraction of executor memory. This in itself is not an issue, but too much \"spill to disk\" will slow down the stage/job. You can overcome this by either increasing the executor memory or tweaking the job/stage to consume less memory (for example, a Sort-Merge join requires a lot less memory than a Hash join).","title":"Spill To Disk"},{"location":"details/slow-regex-tips/","text":"Regular Expression Tips Spark function regexp_extract and regexp_replace can transform data using regular expressions. The regular expression pattern follows Java regex pattern . 
Task Running Very Slowly Stack trace shows: java.lang.Character.codePointAt(Character.java:4884) java.util.regex.Pattern$CharProperty.match(Pattern.java:3789) java.util.regex.Pattern$Curly.match1(Pattern.java:4307) java.util.regex.Pattern$Curly.match(Pattern.java:4250) java.util.regex.Pattern$GroupHead.match(Pattern.java:4672) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$Start.match(Pattern.java:3475) java.util.regex.Matcher.search(Matcher.java:1248) java.util.regex.Matcher.find(Matcher.java:637) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.RegExpExtract_2$(Unknown Source) Certain values in the dataset cause regexp_extract with a certain regex pattern to run very slowly. See https://stackoverflow.com/questions/5011672/java-regular-expression-running-very-slow. Match Special Character in PySpark You will need 4 backslashes to match any special character, 2 required by Python string escaping and 2 by Java regex parsing. df = spark.sql(\"SELECT regexp_replace('{{template}}', '\\\\\\\\{\\\\\\\\{', '#')\")","title":"Regular Expression Tips"},{"location":"details/slow-regex-tips/#regular-expression-tips","text":"Spark function regexp_extract and regexp_replace can transform data using regular expressions. 
The regular expression pattern follows Java regex pattern .","title":"Regular Expression Tips"},{"location":"details/slow-regex-tips/#task-running-very-slowly","text":"Stack trace shows: java.lang.Character.codePointAt(Character.java:4884) java.util.regex.Pattern$CharProperty.match(Pattern.java:3789) java.util.regex.Pattern$Curly.match1(Pattern.java:4307) java.util.regex.Pattern$Curly.match(Pattern.java:4250) java.util.regex.Pattern$GroupHead.match(Pattern.java:4672) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812) java.util.regex.Pattern$Curly.match0(Pattern.java:4286) java.util.regex.Pattern$Curly.match(Pattern.java:4248) java.util.regex.Pattern$Start.match(Pattern.java:3475) java.util.regex.Matcher.search(Matcher.java:1248) java.util.regex.Matcher.find(Matcher.java:637) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.RegExpExtract_2$(Unknown Source) Certain values in the dataset cause regexp_extract with a certain regex pattern to run very slowly. See https://stackoverflow.com/questions/5011672/java-regular-expression-running-very-slow.","title":"Task Running Very Slowly"},{"location":"details/slow-regex-tips/#match-special-character-in-pyspark","text":"You will need 4 backslashes to match any special character, 2 required by Python string escaping and 2 by Java regex parsing. 
df = spark.sql(\"SELECT regexp_replace('{{template}}', '\\\\\\\\{\\\\\\\\{', '#')\")","title":"Match Special Character in PySpark"},{"location":"details/slow-skewed-join/","text":"Skewed Joins Skewed joins happen frequently as some locations (NYC), data (null), and titles ( Mr. Farts - Farting Around The House ) are more popular than other types of data. To a certain degree Spark 3.3 query engine has improvements to handle skewed joins, so a first step should be attempting to upgrade to the most recent version of Sprk. Broadcast joins are ideal for handling skewed joins, but they only work when one table is smaller than the other. A general, albiet hacky, solution is to isolate the data for the skewed key, broadcast it for processing (e.g. join) and then union back the results. Other technique can include introduce some type of salting and doing multi-stage joins.","title":"Skewed Joins"},{"location":"details/slow-skewed-join/#skewed-joins","text":"Skewed joins happen frequently as some locations (NYC), data (null), and titles ( Mr. Farts - Farting Around The House ) are more popular than other types of data. To a certain degree Spark 3.3 query engine has improvements to handle skewed joins, so a first step should be attempting to upgrade to the most recent version of Sprk. Broadcast joins are ideal for handling skewed joins, but they only work when one table is smaller than the other. A general, albiet hacky, solution is to isolate the data for the skewed key, broadcast it for processing (e.g. join) and then union back the results. Other technique can include introduce some type of salting and doing multi-stage joins.","title":"Skewed Joins"},{"location":"details/slow-skewed-write/","text":"Skewed/Slow Write Writes can be slow depending on the preceding stage of write() , target table partition scheme, and write parallelism( spark.sql.shuffle.partitions ). 
The goal of this article is to go through below options and see the most optimal transformation for writing optimal files in target table/partition. When to use Sort A global sort in Spark internally uses range-partitioning to assign sort keys to a partition range. This involves in collecting sample rows(reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. Use global sort If you are writing multiple partitions(especially heterogeneous partitions) as part of your write() as it can estimate the no. of files/tasks for a given target table partition based on the no. of sample rows it observes. If you want to enable predicate-push-down on a set of target table fields for down stream consumption. Tips: 1. You can increase the spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to improve the estimates if you are not seeing optimal no. of files per partition. 2. You can also introduce salt to sort keys to increase the no. of write tasks if the sort keys cardinality less than the spark.sql.shuffle.partitions . Example When to use Repartition Repartition(hash partitioning) partitions rows in a round-robin manner and to produce uniform distribution across the tasks and a hash partitioning just before the write would produce uniform files and all write tasks should take about the same time. Use repartition If you are writing into a single partition or a non-partitioned table and want to get uniform file sizes. If you want to produce a specific no.o files. for ex: using repartiton(100) would generate up to 100 files. When to use Coalesce Coalesce tries to combine files without invoking a shuffle and useful when you are going from a higher parallelism to lower parallelism. Use Coalesce: If you are writing very small no. of files and the file size is relatively small. Note that, Coalesce(N) is not an optimal way to merge files as it tries to combine multiple files(until it reaches target no. 
of files 'N' ) without taking size into equation, and you could run into (org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0) if the size exceeds.","title":"Skewed/Slow Write"},{"location":"details/slow-skewed-write/#skewedslow-write","text":"Writes can be slow depending on the preceding stage of write() , target table partition scheme, and write parallelism( spark.sql.shuffle.partitions ). The goal of this article is to go through below options and see the most optimal transformation for writing optimal files in target table/partition.","title":"Skewed/Slow Write"},{"location":"details/slow-skewed-write/#when-to-use-sort","text":"A global sort in Spark internally uses range-partitioning to assign sort keys to a partition range. This involves in collecting sample rows(reservoir sampling) from input partitions and sending them to the driver for computing range boundaries. Use global sort If you are writing multiple partitions(especially heterogeneous partitions) as part of your write() as it can estimate the no. of files/tasks for a given target table partition based on the no. of sample rows it observes. If you want to enable predicate-push-down on a set of target table fields for down stream consumption. Tips: 1. You can increase the spark property spark.sql.execution.rangeExchange.sampleSizePerPartition to improve the estimates if you are not seeing optimal no. of files per partition. 2. You can also introduce salt to sort keys to increase the no. of write tasks if the sort keys cardinality less than the spark.sql.shuffle.partitions . Example","title":"When to use Sort"},{"location":"details/slow-skewed-write/#when-to-use-repartition","text":"Repartition(hash partitioning) partitions rows in a round-robin manner and to produce uniform distribution across the tasks and a hash partitioning just before the write would produce uniform files and all write tasks should take about the same time. 
Use repartition If you are writing into a single partition or a non-partitioned table and want to get uniform file sizes. If you want to produce a specific number of files, for example: using repartition(100) would generate up to 100 files.","title":"When to use Repartition"},{"location":"details/slow-skewed-write/#when-to-use-coalesce","text":"Coalesce tries to combine files without invoking a shuffle and is useful when you are going from a higher parallelism to lower parallelism. Use Coalesce: If you are writing a very small number of files and the file size is relatively small. Note that Coalesce(N) is not an optimal way to merge files as it tries to combine multiple files (until it reaches the target number of files 'N') without taking size into account, and you could run into (org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0) if the combined size exceeds the available memory.","title":"When to use Coalesce"},{"location":"details/slow-stage/","text":"Identify the slow stage When you have an event log from an earlier \"good run\" You can compare the slow and the fast runs. 
For this you can even use your local pyspark and calculate a ratio between slow and fast run for each stage metrics: # Helper methods (just copy-paste it) def createEventView(eventLogFile, eventViewName): sql(\"CREATE OR REPLACE TEMPORARY VIEW {} USING org.apache.spark.sql.json OPTIONS (path '{}')\".format(eventViewName, eventLogFile)) def createStageMetricsView(eventViewName, stageMetricsViewName): sql(\"CREATE OR REPLACE TEMPORARY VIEW {} AS select `Submission Time`, `Completion Time`, `Stage ID`, t3.col.* from (select `Stage Info`.* from {} where Event='SparkListenerStageCompleted') lateral view explode(Accumulables) t3\".format(stageMetricsViewName, eventViewName)) def showDiffInStage(fastStagesTable, slowStagesTable, stageID): sql(\"select {fastStages}.Name, {fastStages}.Value as Fast, {slowStages}.Value as Slow, {slowStages}.Value / {fastStages}.Value as `Slow / Fast` from {fastStages} INNER JOIN {slowStages} ON {fastStages}.ID = {slowStages}.ID where {fastStages}.`Stage ID` = {stageID} and {slowStages}.`Stage ID` = {stageID}\".format(fastStages=fastStagesTable, slowStages=slowStagesTable, stageID=stageID)).show(40, False) # Creating the views from the event logs (just an example, you have to specify your own paths) createEventView(\"\", \"FAST_EVENTS\") createStageMetricsView(\"FAST_EVENTS\", \"FAST_STAGE_METRICS\") createEventView(\"\", \"SLOW_EVENTS\") createStageMetricsView(\"SLOW_EVENTS\", \"SLOW_STAGE_METRICS\") >>> sql(\"SELECT DISTINCT `Stage ID` from FAST_STAGE_METRICS\").show() +--------+ |Stage ID| +--------+ | 0| | 1| | 2| +--------+ >>> sql(\"SELECT DISTINCT `Stage ID` from SLOW_STAGE_METRICS\").show() +--------+ |Stage ID| +--------+ | 0| | 1| | 2| +--------+ >>> showDiffInStage(\"FAST_STAGE_METRICS\", \"SLOW_STAGE_METRICS\", 2) +-------------------------------------------+-------------+-------------+------------------+ |Name |Fast |Slow |Slow / Fast | +-------------------------------------------+-------------+-------------+------------------+ 
|scan time total (min, med, max) |1095931 |1628308 |1.485776020570638 | |internal.metrics.executorRunTime |7486648 |12990126 |1.735105750931525 | |duration total (min, med, max) |7017645 |12322243 |1.7558943206731032| |internal.metrics.jvmGCTime |220325 |1084412 |4.921874503574266 | |internal.metrics.output.bytesWritten |34767744411 |34767744411 |1.0 | |internal.metrics.input.recordsRead |149652381 |149652381 |1.0 | |internal.metrics.executorDeserializeCpuTime|5666230304 |7760682789 |1.3696377260771504| |internal.metrics.resultSize |625598 |626415 |1.0013059504665935| |internal.metrics.executorCpuTime |6403420405851|8762799691603|1.3684560963069305| |internal.metrics.input.bytesRead |69488204276 |69488204276 |1.0 | |number of output rows |149652381 |149652381 |1.0 | |internal.metrics.resultSerializationTime |36 |72 |2.0 | |internal.metrics.output.recordsWritten |149652381 |149652381 |1.0 | |internal.metrics.executorDeserializeTime |6024 |11954 |1.9843957503320053| +-------------------------------------------+-------------+-------------+------------------+ When there is no event log from a good run Steps: Navigate to Spark UI using spark history URL Click on Stages and sort the stages(click on Duration ) in descending order to find the longest running stage. Now let's figure out if the slow stage is a Map or Reduce/Shuffle Once you identify the slow stage, check the fields \"Input\", \"Output\", \"Shuffle Read\", \"Shuffle Write\" of the slow stage and use below grid to identify the stage type and the corresponding ETL action. 
----------------------------------------------------------------------------------- | Input | Output | Shuffle Read | Shuffle Write | MR Stage | ETL Action | |------------------------------------------------------------|----------------------| | X | | | X | Map | Read | |------------------------------------------------------------|----------------------| | X | X | | | Map | Read/Write | |------------------------------------------------------------|----------------------| | X | | | | Map | Sort Estimate | |------------------------------------------------------------|----------------------| | | | X | | Map | Sort Estimate | |------------------------------------------------------------|----------------------| | | | X | X | Reduce | Join/Agg/Repartition | |------------------------------------------------------------|----------------------| | | X | X | | Reduce | Write | ------------------------------------------------------------|---------------------- go to Map if the slow stage is from a Map operation. go to Reduce if the slow stage is from a Reduce/Shuffle operation.","title":"Identify the slow stage"},{"location":"details/slow-stage/#identify-the-slow-stage","text":"","title":"Identify the slow stage"},{"location":"details/slow-stage/#when-you-have-an-event-log-from-an-earlier-good-run","text":"You can compare the slow and the fast runs. 
For this you can even use your local pyspark and calculate a ratio between slow and fast run for each stage metrics: # Helper methods (just copy-paste it) def createEventView(eventLogFile, eventViewName): sql(\"CREATE OR REPLACE TEMPORARY VIEW {} USING org.apache.spark.sql.json OPTIONS (path '{}')\".format(eventViewName, eventLogFile)) def createStageMetricsView(eventViewName, stageMetricsViewName): sql(\"CREATE OR REPLACE TEMPORARY VIEW {} AS select `Submission Time`, `Completion Time`, `Stage ID`, t3.col.* from (select `Stage Info`.* from {} where Event='SparkListenerStageCompleted') lateral view explode(Accumulables) t3\".format(stageMetricsViewName, eventViewName)) def showDiffInStage(fastStagesTable, slowStagesTable, stageID): sql(\"select {fastStages}.Name, {fastStages}.Value as Fast, {slowStages}.Value as Slow, {slowStages}.Value / {fastStages}.Value as `Slow / Fast` from {fastStages} INNER JOIN {slowStages} ON {fastStages}.ID = {slowStages}.ID where {fastStages}.`Stage ID` = {stageID} and {slowStages}.`Stage ID` = {stageID}\".format(fastStages=fastStagesTable, slowStages=slowStagesTable, stageID=stageID)).show(40, False) # Creating the views from the event logs (just an example, you have to specify your own paths) createEventView(\"\", \"FAST_EVENTS\") createStageMetricsView(\"FAST_EVENTS\", \"FAST_STAGE_METRICS\") createEventView(\"\", \"SLOW_EVENTS\") createStageMetricsView(\"SLOW_EVENTS\", \"SLOW_STAGE_METRICS\") >>> sql(\"SELECT DISTINCT `Stage ID` from FAST_STAGE_METRICS\").show() +--------+ |Stage ID| +--------+ | 0| | 1| | 2| +--------+ >>> sql(\"SELECT DISTINCT `Stage ID` from SLOW_STAGE_METRICS\").show() +--------+ |Stage ID| +--------+ | 0| | 1| | 2| +--------+ >>> showDiffInStage(\"FAST_STAGE_METRICS\", \"SLOW_STAGE_METRICS\", 2) +-------------------------------------------+-------------+-------------+------------------+ |Name |Fast |Slow |Slow / Fast | +-------------------------------------------+-------------+-------------+------------------+ 
|scan time total (min, med, max) |1095931 |1628308 |1.485776020570638 | |internal.metrics.executorRunTime |7486648 |12990126 |1.735105750931525 | |duration total (min, med, max) |7017645 |12322243 |1.7558943206731032| |internal.metrics.jvmGCTime |220325 |1084412 |4.921874503574266 | |internal.metrics.output.bytesWritten |34767744411 |34767744411 |1.0 | |internal.metrics.input.recordsRead |149652381 |149652381 |1.0 | |internal.metrics.executorDeserializeCpuTime|5666230304 |7760682789 |1.3696377260771504| |internal.metrics.resultSize |625598 |626415 |1.0013059504665935| |internal.metrics.executorCpuTime |6403420405851|8762799691603|1.3684560963069305| |internal.metrics.input.bytesRead |69488204276 |69488204276 |1.0 | |number of output rows |149652381 |149652381 |1.0 | |internal.metrics.resultSerializationTime |36 |72 |2.0 | |internal.metrics.output.recordsWritten |149652381 |149652381 |1.0 | |internal.metrics.executorDeserializeTime |6024 |11954 |1.9843957503320053| +-------------------------------------------+-------------+-------------+------------------+","title":"When you have an event log from an earlier \"good run\""},{"location":"details/slow-stage/#when-there-is-no-event-log-from-a-good-run","text":"Steps: Navigate to Spark UI using spark history URL Click on Stages and sort the stages(click on Duration ) in descending order to find the longest running stage.","title":"When there is no event log from a good run"},{"location":"details/slow-stage/#now-lets-figure-out-if-the-slow-stage-is-a-map-or-reduceshuffle","text":"Once you identify the slow stage, check the fields \"Input\", \"Output\", \"Shuffle Read\", \"Shuffle Write\" of the slow stage and use below grid to identify the stage type and the corresponding ETL action. 
----------------------------------------------------------------------------------- | Input | Output | Shuffle Read | Shuffle Write | MR Stage | ETL Action | |------------------------------------------------------------|----------------------| | X | | | X | Map | Read | |------------------------------------------------------------|----------------------| | X | X | | | Map | Read/Write | |------------------------------------------------------------|----------------------| | X | | | | Map | Sort Estimate | |------------------------------------------------------------|----------------------| | | | X | | Map | Sort Estimate | |------------------------------------------------------------|----------------------| | | | X | X | Reduce | Join/Agg/Repartition | |------------------------------------------------------------|----------------------| | | X | X | | Reduce | Write | ------------------------------------------------------------|---------------------- go to Map if the slow stage is from a Map operation. go to Reduce if the slow stage is from a Reduce/Shuffle operation.","title":"Now let's figure out if the slow stage is a Map or Reduce/Shuffle"},{"location":"details/slow-writes-s3/","text":"Slow writes on S3 Using the default file output committer with S3a results in double data writes (sad times!). Use a newer cloud committer such as the \"S3 magic committer\" or a committer specialized for your hadoop cluster. Alternatively, write to Apache Iceberg , Delta.io , or Apache Hudi . Reference links S3 Magic Committer blog and Hadoop documentation EMRFS S3-optimized Committer","title":"Slow writes on S3"},{"location":"details/slow-writes-s3/#slow-writes-on-s3","text":"Using the default file output committer with S3a results in double data writes (sad times!). Use a newer cloud committer such as the \"S3 magic committer\" or a committer specialized for your hadoop cluster. 
Alternatively, write to Apache Iceberg , Delta.io , or Apache Hudi .","title":"Slow writes on S3"},{"location":"details/slow-writes-s3/#reference-links","text":"S3 Magic Committer blog and Hadoop documentation EMRFS S3-optimized Committer","title":"Reference links"},{"location":"details/slow-writes-too-many-files/","text":"Slow writes due to Too many small files Sometimes a partitioning approach works fine for a small dataset, but can cause a surprisingly large number of partitions for a slightly larger dataset. Check out The Small File Problem in context of HDFS. Relevant links HDFS: The Small File Problem: Partition strategies to avoid IO limitations","title":"Slow writes due to Too many small files"},{"location":"details/slow-writes-too-many-files/#slow-writes-due-to-too-many-small-files","text":"Sometimes a partitioning approach works fine for a small dataset, but can cause a surprisingly large number of partitions for a slightly larger dataset. Check out The Small File Problem in context of HDFS.","title":"Slow writes due to Too many small files"},{"location":"details/slow-writes-too-many-files/#relevant-links","text":"HDFS: The Small File Problem: Partition strategies to avoid IO limitations","title":"Relevant links"},{"location":"details/slow-writes/","text":"Slow Writes The Shuffle Write time is visible as follows: Spark UI --> Stages Tab --> Stages Detail --> Event timeline. Symptom: my spark job is spending more time writing files to disk on shuffle writes. 
Some potential causes: the job is writing too many files the job is writing skewed files the file output committer is not suited for this many writes","title":"Slow Writes"},{"location":"details/toobigdag/","text":"Too Big DAG (or when iterative algorithms go bump in the night) Spark uses lazy evaluation and creates a DAG (directed acyclic graph) of the operations needed to compute a piece of data. Even if the data is persisted or cached, Spark will keep this DAG in memory on the driver so that if an executor fails it can re-create this data later. This is more likely to cause problems with iterative algorithms that create RDDs or DataFrames on each iteration based on the previous iteration, like ALS. Some signs of a DAG getting too big are: Iterative algorithm becoming slower on each iteration Driver OOM Executor out-of-disk-error If your job hasn't crashed, an easy way to check is by looking at the Spark Web UI and seeing what the DAG visualization looks like. If the DAG takes a measurable length of time to load (minutes), or fills a few screens it's likely \"too-big.\" Just because a DAG \"looks\" small though doesn't necessarily mean that it isn't an issue; medium-sized-looking DAGs with lots of shuffle files can cause executor out of disk issues too. Working around this can be complicated, but there are some tools to simplify it. The first is Spark's checkpointing which allows Spark to \"forget\" the DAG so far by writing the data out to a persistent storage like S3 or HDFS. The second is manually doing what checkpointing does, that is, writing the data out and loading it back in yourself. Unfortunately, if you work in a notebook environment this might not be enough to solve your problem. While this will introduce a \"cut\" in the DAG, if the old RDDs or DataFrames/Datasets are still in scope they will still continue to reside in memory on the driver, and any shuffle files will continue to reside on the disks of the workers. 
To work around this it's important to explicitly clean up your old RDDs/DataFrames by setting their references to None/null. If you still run into executor out of disk space errors, you may need to look at the approach taken in Spark's ALS algorithm of triggering eager shuffle cleanups, but this is an advanced feature and can lead to non-recoverable errors.","title":"Too Big DAG (or when iterative algorithms go bump in the night)"},{"location":"details/toobigdag/#too-big-dag-or-when-iterative-algorithms-go-bump-in-the-night","text":"Spark uses lazy evaluation and creates a DAG (directed acyclic graph) of the operations needed to compute a piece of data. Even if the data is persisted or cached, Spark will keep this DAG in memory on the driver so that if an executor fails it can re-create this data later. This is more likely to cause problems with iterative algorithms that create RDDs or DataFrames on each iteration based on the previous iteration, like ALS. Some signs of a DAG getting too big are: Iterative algorithm becoming slower on each iteration Driver OOM Executor out-of-disk-error If your job hasn't crashed, an easy way to check is by looking at the Spark Web UI and seeing what the DAG visualization looks like. If the DAG takes a measurable length of time to load (minutes), or fills a few screens it's likely \"too-big.\" Just because a DAG \"looks\" small though doesn't necessarily mean that it isn't an issue; medium-sized-looking DAGs with lots of shuffle files can cause executor out of disk issues too. Working around this can be complicated, but there are some tools to simplify it. The first is Spark's checkpointing which allows Spark to \"forget\" the DAG so far by writing the data out to a persistent storage like S3 or HDFS. The second is manually doing what checkpointing does, that is, writing the data out and loading it back in yourself. Unfortunately, if you work in a notebook environment this might not be enough to solve your problem. 
While this will introduce a \"cut\" in the DAG, if the old RDDs or DataFrames/Datasets are still in scope they will still continue to reside in memory on the driver, and any shuffle files will continue to reside on the disks of the workers. To work around this it's important to explicitly clean up your old RDDs/DataFrames by setting their references to None/null. If you still run into executor out of disk space errors, you may need to look at the approach taken in Spark's ALS algorithm of triggering eager shuffle cleanups, but this is an advanced feature and can lead to non-recoverable errors.","title":"Too Big DAG (or when iterative algorithms go bump in the night)"},{"location":"details/toofew_tasks/","text":"Too few tasks","title":"Toofew tasks"},{"location":"details/toofew_tasks/#too-few-tasks","text":"","title":"Too few tasks"},{"location":"details/toomany_tasks/","text":"Too many tasks","title":"Toomany tasks"},{"location":"details/toomany_tasks/#too-many-tasks","text":"","title":"Too many tasks"},{"location":"details/udfslow/","text":"Avoid UDFs for the most part User defined functions in Spark are a black box to Spark and can limit performance. When possible look for built-in alternatives. One important exception is that if you have multiple functions which must be done in Python, the advice changes a little bit. Since moving data from the JVM to Python is expensive, if you can chain together multiple Python UDFs on the same column, Spark is able to pipeline these together into a single copy to/from Python.","title":"Avoid UDFs for the most part"},{"location":"details/udfslow/#avoid-udfs-for-the-most-part","text":"User defined functions in Spark are a black box to Spark and can limit performance. When possible look for built-in alternatives. One important exception is that if you have multiple functions which must be done in Python, the advice changes a little bit. 
Since moving data from the JVM to Python is expensive, if you can chain together multiple Python UDFs on the same column, Spark is able to pipeline these together into a single copy to/from Python.","title":"Avoid UDFs for the most part"},{"location":"details/write-fails/","text":"Write Fails Write failures can sometimes mask other problems. A good first step is to insert a cache or persist right before the write step. Iceberg table writes can sometimes fail after upgrading to a new version as the partitioning of the table bubbles further up. Range based partitioning (used by default with sorted tables) can result in a small number of partitions when there is not much key distance. One option is to, as with a manual sort in Spark, add some extra higher cardinality columns to your sort order in your Iceberg table. You can go back to pre-Spark 3 behaviour by instead inserting your own manual sort and setting the write mode to none .","title":"Write Fails"},{"location":"details/write-fails/#write-fails","text":"Write failures can sometimes mask other problems. A good first step is to insert a cache or persist right before the write step. Iceberg table writes can sometimes fail after upgrading to a new version as the partitioning of the table bubbles further up. Range based partitioning (used by default with sorted tables) can result in a small number of partitions when there is not much key distance. One option is to, as with a manual sort in Spark, add some extra higher cardinality columns to your sort order in your Iceberg table. You can go back to pre-Spark 3 behaviour by instead inserting your own manual sort and setting the write mode to none .","title":"Write Fails"},{"location":"flowchart/","text":"Spark Error Flowchart: Note this uses mermaid.js which may take awhile to load. 
graph TD A[Start here] --> B[Slow Running Job] C[I have an exception or error] A --> C click B \"slow\" \"Slow\" click C \"error\" \"Error\"","title":"Index"},{"location":"flowchart/error/","text":"Spark Error Flowchart: Note this uses mermaid.js which may take awhile to load. flowchart LR Error[Error/Exception] Error --> MemoryError[Memory Error] Error --> ShuffleError[Shuffle Error] Error --> SqlAnalysisError[sql.AnalysisException] Error --> WriteFails[WriteFails] Error --> OtherError[Others] Error --> Serialization Serialization --> KyroBuffer[Kyro Buffer Overflow] KyroBuffer --> DriverMaxResultSize MemoryError --> DriverMemory[Driver] MemoryError --> ExecutorMemory[Executor] DriverMemory --> DriverMemoryError[Spark driver ran out of memory] DriverMemory --> DriverMaxResultSize[MaxResultSize exceeded] DriverMemory --> TooBigBroadcastJoin[Too Big Broadcast Join] DriverMemory --> ContainerOOM[Container Out Of Memory] DriverMaxResultSize --> TooBigBroadcastJoin ExecutorMemory --> ExecutorMemoryError[Spark executor ran out of memory] ExecutorMemory --> ExecutorDiskError[Executor out of disk error] ExecutorMemory --> ContainerOOM ExecutorMemory --> LARGERECORDS[Too large record / column+record] click Error \"../../details/error-job\" click MemoryError \"../../details/error-memory\" click DriverMemory \"../../details/error-memory/#driver\" click DriverMemoryError \"../../details/error-driver-out-of-memory\" click DriverMaxResultSize \"../../details/error-driver-max-result-size\" click ExecutorMemory \"../../details/error-memory/#executor\" click ExecutorMemoryError \"../../details/error-executor-out-of-memory\" click ExecutorDiskError \"../../details/error-executor-out-of-disk\" click ShuffleError \"../../details/error-shuffle\" click SqlAnalysisError \"../../details/error-sql-analysis\" click OtherError \"../../details/error-other\" click ContainerOOM \"../../details/container-oom\" click TooBigBroadcastJoin \"../../details/big-broadcast-join\" \"Broadcast Joins\" 
click LARGERECORDS \"../../details/failure-executor-large-record\" click WriteFails \"../../details/write-fails\"","title":"Error"},{"location":"flowchart/shared/","text":"Spark Error Flowchart: Note this uses mermaid.js which may take awhile to load. graph TD OHNOES[Contact support]","title":"Shared"},{"location":"flowchart/slow/","text":"Spark Error Flowchart: Note this uses mermaid.js which may take awhile to load. flowchart LR SlowJob[Slow Job] SlowJob --> SlowStage[Slow Stage] SlowStage --> SlowMap[Slow Read/Map] SlowStage --> SlowReduce[Slow Shuffle/Reducer/Exchange] SlowStage --> SLOWWRITESTOSTORAGE[Slow writes to storage] SlowJob --> TOOBIGDAG[Too Big DAG] SlowJob --> SlowCluster[Slow Cluster] SlowReduce --> PAGGS[Partial aggregates] SlowReduce --> TooFewShuffleTasks[Not Enough Shuffle Tasks] SlowReduce --> TooManyShuffleTasks[Too many shuffle tasks] SlowReduce --> SkewedShuffleTasks[Skewed Shuffle Tasks] SlowReduce --> SpillToDisk[Spill To Disk] SkewedShuffleTasks --> SkewedJoin[Skewed Join] SkewedShuffleTasks --> SkewedAggregation[Aggregation/Group By] click SlowJob \"../../details/slow-job\" click SlowStage \"../../details/slow-stage\" click SlowMap \"../../details/slow-map\" click SlowReduce \"../../details/slow-reduce\" click SlowCluster \"../../details/slow-job-slow-cluster\" click TOOBIGDAG \"../../details/toobigdag\" click TooFewShuffleTasks \"../../details/slow-reduce/#not-enough-shuffle-tasks\" click TooManyShuffleTasks \"../../details/slow-reduce/#too-many-shuffle-tasks\" click SkewedShuffleTasks \"../../details/slow-reduce/#skewed-shuffle-tasks\" click SpillToDisk \"../../details/slow-reduce/#spill-to-disk\" click SkewedJoin \"../../details/slow-skewed-join\" click SkewedAggregation \"../../details/slow-reduce/#skewed-shuffle-tasks\" SLOWWRITESTOSTORAGE[Slow writes to storage] SLOWWRITESTOSTORAGE --> TOOMANYFILES[Slow writes because there are too many files] SLOWWRITESTOSTORAGE --> SkewedWrite[Skewed Write: when to use Sort/Repartition/Coalesce 
before write] SLOWWRITESTOSTORAGE --> S3COMMITTER[Slow writes on S3 depend on the committer] click UDFSLOWNESS \"../../details/udfslow\" click PAGGS \"../../details/partial_aggregates\" click FILTERNOTPUSHED \"../../details/slow-partition_filter_pushdown\" click SLOWSTAGE \"../../details/slow-stage\" click SLOWWRITESTOSTORAGE \"../../details/slow-writes\" click SkewedWrite \"../../details/slow-skewed-write\" click TOOMANYFILES \"../../details/slow-writes-too-many-files\" click S3COMMITTER \"../../details/slow-writes-s3\" click TOOMANY \"../../details/toomany_tasks\" click TOOFEW \"../../details/toofew_tasks\" click NOTENOUGHEXEC \"../../details/notenoughexecs\" click SHUFFLEPARTITIONISSUES \"../../details/slow-reduce\" click READPARTITIONISSUES \"../../details/read-partition-issue\"","title":"Slow"}]}
\ No newline at end of file
diff --git a/search/worker.js b/search/worker.js
new file mode 100644
index 0000000..8628dbc
--- /dev/null
+++ b/search/worker.js
@@ -0,0 +1,133 @@
+var base_path = 'function' === typeof importScripts ? '.' : '/search/';
+var allowSearch = false;
+var index;
+var documents = {};
+var lang = ['en'];
+var data;
+
+function getScript(script, callback) {
+ console.log('Loading script: ' + script);
+ $.getScript(base_path + script).done(function () {
+ callback();
+ }).fail(function (jqxhr, settings, exception) {
+ console.log('Error: ' + exception);
+ });
+}
+
+function getScriptsInOrder(scripts, callback) {
+ if (scripts.length === 0) {
+ callback();
+ return;
+ }
+ getScript(scripts[0], function() {
+ getScriptsInOrder(scripts.slice(1), callback);
+ });
+}
+
+function loadScripts(urls, callback) {
+ if( 'function' === typeof importScripts ) {
+ importScripts.apply(null, urls);
+ callback();
+ } else {
+ getScriptsInOrder(urls, callback);
+ }
+}
+
+function onJSONLoaded () {
+ data = JSON.parse(this.responseText);
+ var scriptsToLoad = ['lunr.js'];
+ if (data.config && data.config.lang && data.config.lang.length) {
+ lang = data.config.lang;
+ }
+ if (lang.length > 1 || lang[0] !== "en") {
+ scriptsToLoad.push('lunr.stemmer.support.js');
+ if (lang.length > 1) {
+ scriptsToLoad.push('lunr.multi.js');
+ }
+ if (lang.includes("ja") || lang.includes("jp")) {
+ scriptsToLoad.push('tinyseg.js');
+ }
+ for (var i=0; i < lang.length; i++) {
+ if (lang[i] != 'en') {
+ scriptsToLoad.push(['lunr', lang[i], 'js'].join('.'));
+ }
+ }
+ }
+ loadScripts(scriptsToLoad, onScriptsLoaded);
+}
+
+function onScriptsLoaded () {
+ console.log('All search scripts loaded, building Lunr index...');
+ if (data.config && data.config.separator && data.config.separator.length) {
+ lunr.tokenizer.separator = new RegExp(data.config.separator);
+ }
+
+ if (data.index) {
+ index = lunr.Index.load(data.index);
+ data.docs.forEach(function (doc) {
+ documents[doc.location] = doc;
+ });
+ console.log('Lunr pre-built index loaded, search ready');
+ } else {
+ index = lunr(function () {
+ if (lang.length === 1 && lang[0] !== "en" && lunr[lang[0]]) {
+ this.use(lunr[lang[0]]);
+ } else if (lang.length > 1) {
+ this.use(lunr.multiLanguage.apply(null, lang)); // spread operator not supported in all browsers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_operator#Browser_compatibility
+ }
+ this.field('title');
+ this.field('text');
+ this.ref('location');
+
+ for (var i=0; i < data.docs.length; i++) {
+ var doc = data.docs[i];
+ this.add(doc);
+ documents[doc.location] = doc;
+ }
+ });
+ console.log('Lunr index built, search ready');
+ }
+ allowSearch = true;
+ postMessage({config: data.config});
+ postMessage({allowSearch: allowSearch});
+}
+
+function init () {
+ var oReq = new XMLHttpRequest();
+ oReq.addEventListener("load", onJSONLoaded);
+ var index_path = base_path + '/search_index.json';
+ if( 'function' === typeof importScripts ){
+ index_path = 'search_index.json';
+ }
+ oReq.open("GET", index_path);
+ oReq.send();
+}
+
+function search (query) {
+ if (!allowSearch) {
+ console.error('Assets for search still loading');
+ return;
+ }
+
+ var resultDocuments = [];
+ var results = index.search(query);
+ for (var i=0; i < results.length; i++){
+ var result = results[i];
+ var doc = documents[result.ref];
+ doc.summary = doc.text.substring(0, 200);
+ resultDocuments.push(doc);
+ }
+ return resultDocuments;
+}
+
+if( 'function' === typeof importScripts ) {
+ onmessage = function (e) {
+ if (e.data.init) {
+ init();
+ } else if (e.data.query) {
+ postMessage({ results: search(e.data.query) });
+ } else {
+ console.error("Worker - Unrecognized message: " + e);
+ }
+ };
+}
diff --git a/sitemap.xml b/sitemap.xml
new file mode 100644
index 0000000..0f8724e
--- /dev/null
+++ b/sitemap.xml
@@ -0,0 +1,3 @@
+
+
+
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
new file mode 100644
index 0000000..cfbe132
Binary files /dev/null and b/sitemap.xml.gz differ