SparkJsonSchema

This repo contains an example of how you can take text files containing fixed-width fields and read them as Spark DataFrames based on a JSON schema definition file. This is useful for keeping table definitions out of your code and provides a generic framework for processing files with different formats.

Building

This example uses the spark-fixedwidth library from Quartet Health, which is based on databricks-spark-csv: https://github.com/quartethealth/spark-fixedwidth

For this example I built the library and installed it into a local Maven repository, which you can create using the command below. Make sure to update your pom.xml to include the path to this repository before building; a sketch of the pom.xml entries is shown after the command.

mvn install:install-file -DlocalRepositoryPath=repo  -DcreateChecksum=true -Dpackaging=jar -Dfile=spark-fixedwidth-assembly-1.0.jar -DgroupId=com.quartethealth -DartifactId=spark-fixedwidth -Dversion=1.0
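
The pom.xml changes amount to registering the local repository and declaring the dependency. A minimal sketch, assuming the repo directory and the coordinates used in the install command above (the repository id is arbitrary):

<repositories>
  <repository>
    <!-- any unique id works; the url points at the repo directory created above -->
    <id>local-repo</id>
    <url>file://${project.basedir}/repo</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>com.quartethealth</groupId>
    <artifactId>spark-fixedwidth</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>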

With this done and your pom.xml updated, you can run mvn package to build the example JAR file.

Running

An example class, JsonSchemaExample, is included that demonstrates how to convert a fixed-width field text file into Parquet. The job takes three parameters:

  • The location of the data file
  • The location of the JSON schema definition
  • The location to write the resulting Parquet files to

The JSON schema definition contains metadata describing the table, including the name, type and width of the fixed-width columns.

{
  "columns": [
    {"columnName": "id", "columnType": "Integer", "columnWidth": "5"},
    {"columnName": "firstName", "columnType": "String", "columnWidth": "10"},
    {"columnName": "lastName", "columnType": "String", "columnWidth": "10"},
    {"columnName": "gender", "columnType": "String", "columnWidth": "1"},
    {"columnName": "dateOfBirth", "columnType": "String", "columnWidth": "10"}
  ],

  "tableName": "people",
  "comment": "This is a test fixed-width field table",
  "partition": "partition_spec_goes_here"
}
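
To give a sense of what JsonSchemaExample has to do with this file, below is a minimal sketch of turning the JSON into a Spark StructType plus the column-width array that spark-fixedwidth expects. The parseSchema helper and the json4s parsing are my own illustration rather than the repo's actual code; the fixedFile call follows the usage shown in the spark-fixedwidth README.

import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Illustrative helper: build a StructType and the column widths from the
// schema JSON above. json4s ships with Spark, so no extra dependency is needed.
def parseSchema(json: String): (StructType, Array[Int]) = {
  implicit val formats = DefaultFormats
  val columns = (parse(json) \ "columns").extract[List[Map[String, String]]]
  val fields = columns.map { col =>
    val dataType = col("columnType") match {
      case "Integer" => IntegerType
      case _         => StringType // treat anything else as a plain string
    }
    StructField(col("columnName"), dataType, nullable = true)
  }
  (StructType(fields), columns.map(_("columnWidth").toInt).toArray)
}

// Read the fixed-width file with the derived schema and write it out as Parquet.
// fixedFile is added to SQLContext by spark-fixedwidth's implicit conversion;
// in spark-shell, sqlContext is predefined.
import com.quartethealth.spark.fixedwidth.FixedwidthContext
val schemaJson = scala.io.Source.fromFile("schema.json").mkString // the job reads this path from its parameters
val (schema, widths) = parseSchema(schemaJson)
val people = sqlContext.fixedFile("datafile.txt", widths, schema, useHeader = false)
people.write.parquet("/user/cloudera/data/people")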

To run the example, submit the job using the command below. A new directory named /user/cloudera/data/people (i.e. the output directory plus the table name) will be created containing the Parquet files.

spark-submit --master yarn --jars spark-fixedwidth-assembly-1.0.jar \
  --class jhalfpenny.spark.sql.jsonSchema.JsonSchemaExample \
  JsonSchemaTest-0.0.1-SNAPSHOT.jar \
  /user/cloudera/fixed-width/data/datafile.txt \
  /user/cloudera/fixed-width/data/schema.json \
  /user/cloudera/data

We can use spark-shell to examine the Parquet files and confirm that the schema and data are correct.

scala> sqlContext.read.parquet("/user/cloudera/data/people").printSchema
root
 |-- id: integer (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- dateOfBirth: string (nullable = true)
 
scala> sqlContext.read.parquet("/user/cloudera/data/people").show
+---+---------+---------+------+-----------+
| id|firstName| lastName|gender|dateOfBirth|
+---+---------+---------+------+-----------+
|  7|    Nancy|  Bentley|     F| 1989-17-12|
|  8|      Ben|    Smoak|     M| 1983-09-18|
|  9|    Simon| Lansdown|     M| 1985-06-07|
| 10|     Fred|     West|     M| 1992-04-23|
|  1|     John|    Smith|     M| 1970-07-01|
|  2|     Adam|    Jones|     M| 1963-12-11|
|  3|    Bruce|    Davis|     M| 1981-01-17|
|  4|   Sheila|   Fraser|     F| 1984-08-03|
|  5|      Tom|    Berne|     M| 1974-17-12|
|  6| Gertrude|Shoemaker|     F| 1991-03-09|
+---+---------+---------+------+-----------+
