StabRise

Document processing solutions

Hi there 👋

StabRise - Document Processing Solutions

Our projects

PDF DataSource for Apache Spark

Spark Pdf


Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb


The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

Key features:

  • Read PDF documents into a Spark DataFrame
  • Read PDF files lazily, page by page
  • Support large files, up to 10k pages
  • Support scanned PDF files (via OCR)
  • No need to install Tesseract OCR; it is included in the package
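
As a rough illustration of the features above, here is a minimal PySpark sketch of reading PDFs through the data source. It is not taken from this page: the format name `"pdf"` and the option names `resolution`, `pagePerPartition`, and `reader` are assumptions that should be checked against the spark-pdf documentation, and the helpers `pdf_reader_options` and `read_pdfs` are hypothetical names used only for this example.

```python
from typing import Any, Dict


def pdf_reader_options(
    resolution: int = 200,
    pages_per_partition: int = 2,
    reader: str = "pdfBox",
) -> Dict[str, str]:
    """Build the option map passed to the PDF data source.

    The option names below are assumptions for illustration; verify them
    against the spark-pdf documentation before relying on them.
    """
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "resolution": str(resolution),                  # assumed option name
        "pagePerPartition": str(pages_per_partition),   # assumed option name
        "reader": reader,                               # assumed option name
    }


def read_pdfs(spark: Any, path: str, **kwargs: Any) -> Any:
    """Return a DataFrame for the PDFs at `path` using an existing SparkSession.

    Spark evaluates lazily, so no file is parsed until an action such as
    show() or count() runs on the returned DataFrame.
    """
    options = pdf_reader_options(**kwargs)
    # "pdf" is the assumed short name of the data source.
    return spark.read.format("pdf").options(**options).load(path)
```

With an existing `SparkSession` that has the spark-pdf package on its classpath, a call such as `read_pdfs(spark, "path/to/files/*.pdf", resolution=300)` would return the DataFrame. Keeping the options in a separate function makes them easy to inspect without starting Spark.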

ScaleDP


Source Code: https://github.com/StabRise/scaledp

Home page: https://stabrise.com/scaledp/

Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb


ScaleDP is an open-source library for processing documents with Apache Spark.

Key features:

  • Load PDF documents/Images
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • OCR Images/PDF documents
  • Run NER on text extracted from PDF documents/Images
  • Visualize NER results

De-Identify

De-Identify is a tool for de-identifying and anonymizing data.

Supported formats:

  • text
  • images
  • pdf documents
  • DICOM files

Repositories

  • ScaleDP: ScaleDP is an Open-Source extension of Apache Spark for Document Processing. Python, AGPL-3.0, updated Dec 26, 2024.
  • spark-pdf: PDF DataSource for Apache Spark. Scala, AGPL-3.0, updated Dec 24, 2024.
  • ScaleDP-Tutorials: Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark. Jupyter Notebook, AGPL-3.0, updated Dec 3, 2024.
  • .github: Document processing solutions. Updated Dec 2, 2024.