Spark NLP 5.5.2: GGUF Embeddings, New HTML/Email/Word Ingestion, Enhanced OpenVINO Support, a New Q&A Annotator, and More Enhancements & Fixes #14487

maziyarpanahi (Maintainer) · 2024-12-18T17:20:40Z

📢 Spark NLP 5.5.2

We’re thrilled to introduce the latest enhancements and new features in this release of Spark NLP! These additions bring more powerful model inference capabilities, seamless data ingestion methods, and greater flexibility for scaling your NLP workflows.

Upgrade today to take advantage of these new capabilities and improvements. As always, we look forward to your feedback and contributions, and thank you for being part of the Spark NLP community!

🔥 New Features & Enhancements

🚀 Major New Features

OpenVINO Support for Transformers (#14408)
The transformer-based classification annotators (the ClassificationForXXX family) can now use OpenVINO for faster inference on Intel hardware. This covers a wide array of models, such as DeBerta, DistilBert, RoBerta, XlmRoBerta, Albert, and more, for efficient, production-grade NLP at scale.
BLIPForQuestionAnswering Transformer (#14422)
Introducing BLIPForQuestionAnswering, a new image-based question-answering transformer. Simply provide an image and a question, and BLIP will deliver contextually relevant answers. Perfect for use cases in image analysis, e-commerce, and beyond.
AutoGGUFEmbeddings Annotator (#14433)
Seamlessly integrate AutoGGUFModels into your NLP pipeline. The new AutoGGUFEmbeddings annotator provides dense vector embeddings, making it easier than ever to incorporate advanced sentence embeddings into your workflows. We’ve included an end-to-end notebook to help you get started right away.

📜 New Data Ingestions

Parsing HTML to DataFrames (#14449)
Need to analyze web content at scale? Use sparknlp.read().html() to parse local or remote HTML files into structured Spark DataFrames. This new feature makes web-scale data analysis and downstream NLP tasks more accessible and scalable.
Email Content to DataFrames (#14455)
Leverage sparknlp.read().email() to transform email content into organized DataFrames. Analyze communications, extract insights, and enrich your NLP pipelines with minimal effort. (Builds on the HTML reader introduced in #14449.)
Microsoft Word Document Parsing (#14476)
Turn .docx and .doc files into structured Spark DataFrames for streamlined integration into your NLP projects. From enterprise documents to reports, this feature simplifies data preparation and analysis at scale.
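
As a minimal sketch of the two reader calls named above, the helper below dispatches to sparknlp.read().html() or sparknlp.read().email(). The function names read_to_dataframe and load are hypothetical and not part of Spark NLP. The explicit sparknlp.start() call is an assumption about setup, and the Word reader is left out because these notes do not name its method.

```python
from typing import Any

# Reader methods named in these release notes. The Word reader's method name
# is not given in the notes, so it is deliberately omitted here.
SUPPORTED_KINDS = ("html", "email")


def read_to_dataframe(reader: Any, kind: str, path: str) -> Any:
    """Call reader.html(path) or reader.email(path) and return the result.

    `reader` is expected to be the object returned by sparknlp.read().
    This helper is hypothetical; it only illustrates the call shape.
    """
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported kind {kind!r}; expected one of {SUPPORTED_KINDS}")
    return getattr(reader, kind)(path)


def load(kind: str, path: str) -> Any:
    """Read files with Spark NLP (requires spark-nlp 5.5.2 and PySpark)."""
    import sparknlp  # imported lazily so the helper above works without Spark

    sparknlp.start()  # assumption: start a Spark session before reading
    return read_to_dataframe(sparknlp.read(), kind, path)
```

With Spark NLP installed, load("html", "path/to/pages") would return a Spark DataFrame that can be inspected with show() or printSchema(); the path here is a placeholder.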

🐛 Bug Fixes

Microsoft Fabric Integration (#14467)
We’ve added support for Microsoft Fabric to store and retrieve word embeddings efficiently. Leverage your existing infrastructure to scale Spark NLP solutions more effectively.
cuDNN Upgrade Instructions for Databricks (#14451)
Easily upgrade cuDNN on Databricks to accelerate ONNX model inference on GPU, and take advantage of updated installation instructions for a cleaner setup.
Metadata Preservation in ChunkEmbeddings (#14462)
ChunkEmbeddings now retain original metadata, ensuring richer context and more meaningful insights in your downstream tasks.
Default Names and Languages for New Annotators (#14469)
We’ve standardized default names and languages in our seq2seq annotators for better clarity, consistency, and ease of use.

📦 Dependencies

Updated:

Jsoup has been updated from 1.18.1 to 1.18.2 to ensure compatibility and maintain security and performance standards.

New Additions for Email and Document Parsing:

Jakarta Mail (jakarta.mail:jakarta.mail-api:2.1.3): Added to support parsing and processing email content.
Angus Mail (org.eclipse.angus:angus-mail:2.0.3): Complementary mail handling library integrated for more robust email parsing capabilities.
Apache POI (org.apache.poi:poi-ooxml:4.1.2 & org.apache.poi:poi-scratchpad:4.1.2): Introduced for parsing Word documents (.docx and .doc) into structured DataFrames, enabling seamless integration of document-based data into Spark NLP workflows.

📝 Models

We have added more than 50,000 new models and pipelines. The complete list of all 83,000+ models & pipelines in 230+ languages is available on our Models Hub.

❤️ Community support

Slack For live discussion with the Spark NLP community and the team
GitHub Bug reports, feature requests, and contributions
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium Spark NLP articles
JohnSnowLabs official Medium
YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.5.2

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.2

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.2

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.2
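
The coordinates above all follow one pattern, which can be sketched in Python for scripts that pick a build at runtime. The helper names package_coordinate and build_session are hypothetical. Passing the coordinate through the spark.jars.packages setting is standard Spark configuration, but treat the session setup as an assumption rather than the project's recommended configuration.

```python
GROUP_ID = "com.johnsnowlabs.nlp"

# Artifact names as listed in this release's installation section.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant: str = "cpu", version: str = "5.5.2", scala: str = "2.12") -> str:
    """Return the Maven coordinate for --packages or spark.jars.packages."""
    if variant not in ARTIFACTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}")
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{scala}:{version}"


def build_session(variant: str = "cpu"):
    """Create a SparkSession that pulls the chosen Spark NLP package (needs PySpark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed when building a session

    return (
        SparkSession.builder.appName("spark-nlp-5.5.2")
        .config("spark.jars.packages", package_coordinate(variant))
        .getOrCreate()
    )
```

For example, package_coordinate("gpu") yields the same string used in the GPU commands above.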

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.5.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.5.2</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.5.2</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.5.2</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.5.2.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.5.2.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.5.2.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.5.2.jar

What's Changed

Models hub by @maziyarpanahi in #14458
Models hub by @maziyarpanahi in #14470
adding openvino support to all ClassificationForXXX annotators by @ahmedlone127 in #14408
[SPARKNLP-1068] Introducing BLIPForQuestionAnswering transformer by @danilojsl in #14422
[SPARKNLP-1091] AutoGGUFModel embeddings support by @DevinTDHa in #14433
Apache Spark vulnerable Fix by @maziyarpanahi in #14441
[SPARKNLP-1092] Adding support to read HTML files by @danilojsl in #14449
[SPARKNLP-1095] Add installation instructions for ONNX GPU on Databricks by @DevinTDHa in #14451
[SPARKNLP-1093] Adding support to read Email files by @danilojsl in #14455
Small typos by @svlandeg in #14459
Addition chunk metadata to ChunkEmbeddings output by @mehmetbutgul in #14462
[SPARKNLP-1096] Adding support to Microsoft Fabric for WordEmbeddings by @danilojsl in #14467
Default name updates by @ahmedlone127 in #14469
SPARKNLP-1094 Adding Support to Read Word Files by @danilojsl in #14476
ignore html as linguist-vendored by @maziyarpanahi in #14481
Models hub by @maziyarpanahi in #14482
Models hub by @maziyarpanahi in #14485
Spark NLP 5.5.2 Release Candidate by @maziyarpanahi in #14473

New Contributors

@svlandeg made their first contribution in #14459

Full Changelog: 5.5.1...5.5.2

This discussion was created from the release Spark NLP 5.5.2: GGUF Embeddings, New HTML/Email/Word Ingestion, Enhanced OpenVINO Support, a New Q&A Annotator, and More Enhancements & Fixes.
