datalake-lib

A library built on top of Apache Spark to speed up data lake development.

Core concepts

Configuration file

Define all the datasets your ETLs need to interact with.

val raw = "raw"
val curated = "curated"
val config = SimpleConfiguration(
  datalake = DatalakeConf(
    sources = List(
      DatasetConf("raw_data1"    , raw    , "/data1", JSON , OverWrite),
      DatasetConf("raw_data2"    , raw    , "/data2", JSON , OverWrite),
      DatasetConf("curated_data1", curated, "/data1", DELTA, OverWrite),
      DatasetConf("curated_data2", curated, "/data2", DELTA, OverWrite)
    ),
    sparkconf = Map(
      "spark.hadoop.fs.s3a.endpoint" -> "https://example.com"
    )
  )
)

Generate a specific configuration for each environment

val localStorages = List(
  StorageConf(raw    , "~/raw"    , LOCAL),
  StorageConf(curated, "~/curated", LOCAL)
)

val devStorages = List(
  StorageConf(raw    , "s3a://dev-raw"    , S3),
  StorageConf(curated, "s3a://dev-curated", S3)
)

val prodStorages = List(
  StorageConf(raw    , "s3a://prod-raw"    , S3),
  StorageConf(curated, "s3a://prod-curated", S3)
)
val localConf = config.copy(config.datalake.copy(storages = localStorages))
val devConf = config.copy(config.datalake.copy(storages = devStorages))
val prodConf = config.copy(config.datalake.copy(storages = prodStorages))
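The per-environment derivation above relies on the `copy` method of Scala case classes. The following is a minimal, self-contained sketch of that pattern. The case classes are simplified stand-ins (the real `StorageConf`, `DatalakeConf` and `SimpleConfiguration` have more fields), and `withStorages` is a hypothetical helper that is not part of datalake-lib.

```scala
object EnvironmentConfSketch {
  // Simplified stand-ins for illustration only, NOT the library's classes.
  final case class StorageConf(id: String, path: String, filesystem: String)
  final case class DatalakeConf(storages: List[StorageConf] = Nil,
                                sparkconf: Map[String, String] = Map.empty)
  final case class SimpleConfiguration(datalake: DatalakeConf)

  // Shared settings that every environment inherits.
  val base: SimpleConfiguration = SimpleConfiguration(
    DatalakeConf(sparkconf = Map("spark.hadoop.fs.s3a.endpoint" -> "https://example.com"))
  )

  // Hypothetical helper: replaces the storages and keeps everything else unchanged.
  def withStorages(conf: SimpleConfiguration, storages: List[StorageConf]): SimpleConfiguration =
    conf.copy(datalake = conf.datalake.copy(storages = storages))

  val devConf: SimpleConfiguration =
    withStorages(base, List(StorageConf("raw", "s3a://dev-raw", "S3")))
  val prodConf: SimpleConfiguration =
    withStorages(base, List(StorageConf("raw", "s3a://prod-raw", "S3")))

  def main(args: Array[String]): Unit = {
    println(devConf.datalake.storages.map(_.path))
    println(prodConf.datalake.storages.map(_.path))
  }
}
```

Routing every environment through a single helper makes it harder to pass the wrong storage list to the wrong configuration.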

Generate a configuration file in HOCON format

ConfigurationWriter.writeTo("src/test/resources/config/local.conf", localConf)
ConfigurationWriter.writeTo("src/main/resources/config/dev.conf", devConf)
ConfigurationWriter.writeTo("src/main/resources/config/prod.conf", prodConf)

Load the configuration file and make it available in your unit tests or ETLs

implicit val conf = ConfigurationLoader.loadFromResources[SimpleConfiguration]("config/local.conf")
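Declaring the loaded configuration as `implicit` lets any function that asks for an implicit configuration pick it up without it being passed explicitly. The sketch below illustrates only that Scala mechanism. `Conf` and `resolve` are hypothetical stand-ins, not library APIs.

```scala
object ImplicitConfSketch {
  // Stand-in configuration type, NOT the library's SimpleConfiguration.
  final case class Conf(storagePaths: Map[String, String])

  // Hypothetical helper that requires a configuration in implicit scope.
  def resolve(storageId: String, relativePath: String)(implicit conf: Conf): String =
    conf.storagePaths(storageId) + relativePath

  // Once declared implicit, the configuration is supplied automatically.
  implicit val conf: Conf = Conf(Map("raw" -> "s3a://dev-raw"))

  val location: String = resolve("raw", "/data1")

  def main(args: Array[String]): Unit = println(location)
}
```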

Define your own configuration case class

You can also define your own case class, for example to extend the datalake configuration.

Define your case class; it must extend ConfigurationWrapper:

case class ExtraConf(extraOption: String, datalake: DatalakeConf) extends ConfigurationWrapper(datalake)

To write your configuration, use ConfigurationWriter:

ConfigurationWriter.writeTo("src/test/resources/config/local.conf", localConf)

For loading your configuration

implicit val conf = ConfigurationLoader.loadFromResources[ExtraConf]("config/local.conf")
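As a rough illustration of why extending a common wrapper is useful, the sketch below shows code written against the wrapper type accepting a custom configuration. `DatalakeConf` and `ConfigurationWrapper` are simplified stand-ins here, and `endpoint` is a hypothetical helper. The real classes in datalake-lib may differ.

```scala
object CustomConfSketch {
  // Simplified stand-ins for illustration only, NOT the library's classes.
  final case class DatalakeConf(sparkconf: Map[String, String] = Map.empty)

  abstract class ConfigurationWrapper(dl: DatalakeConf) {
    def sparkconf: Map[String, String] = dl.sparkconf
  }

  // Custom configuration that adds one option next to the datalake settings.
  final case class ExtraConf(extraOption: String, datalake: DatalakeConf)
    extends ConfigurationWrapper(datalake)

  // Hypothetical helper that only knows about the wrapper type.
  def endpoint(conf: ConfigurationWrapper): Option[String] =
    conf.sparkconf.get("spark.hadoop.fs.s3a.endpoint")

  val conf: ExtraConf = ExtraConf(
    extraOption = "my-option",
    datalake = DatalakeConf(Map("spark.hadoop.fs.s3a.endpoint" -> "https://example.com"))
  )

  def main(args: Array[String]): Unit = println(endpoint(conf))
}
```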

ETL class

An ETL defines these main functions on top of an entry point run():

Method	Default behavior
reset	Deletes all the files and metadata from the mainDestination of the ETL
extract	Not implemented
sampling	Takes 5% of the data from each source returned by the function extract()
transform	Not implemented
load	Persists all DataFrames returned by the function transform() using the default LoadResolver
publish	Does nothing

These functions are called in order by run(), to which you can pass a list of RunStep values that determine which steps are run or skipped at runtime. For instance, assuming we instantiated an ETL called job:

job.run(RunStep.initial_load) will call reset(), skip sampling() and run all remaining steps
job.run(RunStep.default_load) will skip both reset() and sampling() and run all remaining steps
job.run(RunStep.allSteps) will run all steps

It is also possible to run only certain steps on demand. For more details, see bio.ferlab.datalake.commons.config.RunStep.
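The step selection described above can be sketched as a filter over a fixed pipeline order. This is a self-contained illustration under stated assumptions: the `Step` hierarchy and the `run` function below are stand-ins, not the library's `RunStep` or `ETL.run()` definitions.

```scala
object RunStepSketch {
  // Stand-in for the library's RunStep; names mirror the README, but this
  // hierarchy is an assumption, NOT the real definition.
  sealed trait Step
  case object Reset extends Step
  case object Extract extends Step
  case object Sampling extends Step
  case object Transform extends Step
  case object Load extends Step
  case object Publish extends Step

  // Fixed pipeline order.
  val allSteps: List[Step] = List(Reset, Extract, Sampling, Transform, Load, Publish)
  // Skips sampling only.
  val initialLoad: List[Step] = allSteps.filterNot(_ == Sampling)
  // Skips both reset and sampling.
  val defaultLoad: List[Step] = allSteps.filterNot(s => s == Reset || s == Sampling)

  // Executes the requested steps in pipeline order and returns the steps that ran.
  def run(requested: List[Step])(action: Step => Unit): List[Step] = {
    val executed = allSteps.filter(s => requested.contains(s))
    executed.foreach(action)
    executed
  }

  def main(args: Array[String]): Unit =
    run(defaultLoad)(step => println(s"running $step"))
}
```

Because the filter runs over `allSteps`, steps always execute in pipeline order regardless of the order in which they are requested.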

datalake-commons

Common classes between all modules.

Version Matrix

The following table lists the supported versions of the main dependencies.

Module	Spark Version	Delta Version	Glow Version	Scala Version	Zio Version
datalake-spark3	`3.0.3`	`0.8.0`	`1.0.1`	`2.12`	`1.0.6`
datalake-spark3	`3.1.3`	`1.1.0`	`1.0.1`	`2.12`	`1.0.6`
datalake-spark3	`3.2.2`	`1.2.0`	`1.2.1`	`2.12` `2.13`	`1.0.6`

release

 sbt "publishSigned; sonatypeRelease"
