SparkJsonSchema

This repo contains an example of how you can take text files containing fixed-width fields and read them as Spark DataFrames based on a JSON schema definition file. This is useful for keeping table definitions out of your code and provides a generic framework for processing files with different formats.

Building

This example uses the spark-fixedwidth library from Quartet Health, which is based on databricks-spark-csv: https://github.com/quartethealth/spark-fixedwidth

For this example I built the library and installed it into a local Maven repository, which you can create using the command below. Make sure to update your pom.xml to include the path to this repository before building; a sketch of the pom.xml entries is shown after the command.

mvn install:install-file -DlocalRepositoryPath=repo  -DcreateChecksum=true -Dpackaging=jar -Dfile=spark-fixedwidth-assembly-1.0.jar -DgroupId=com.quartethealth -DartifactId=spark-fixedwidth -Dversion=1.0
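
The pom.xml changes amount to registering the local repository and declaring the dependency. A minimal sketch, assuming the repo directory and the coordinates used in the install command above (the repository id is arbitrary):

<repositories>
  <repository>
    <!-- any unique id works; the url points at the repo directory created above -->
    <id>local-repo</id>
    <url>file://${project.basedir}/repo</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>com.quartethealth</groupId>
    <artifactId>spark-fixedwidth</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>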

With this done and your pom.xml updated, you can run mvn package to build the example JAR file.

Running

An example class, JsonSchemaExample, is included that demonstrates how to convert a fixed-width field text file into Parquet. The job takes three parameters:

  • The location of the data file
  • The location of the JSON schema definition
  • The location to write the resulting Parquet files to

The JSON schema definition contains metadata describing the table, including the name, type and width of the fixed-width columns.

{
  "columns": [
    {"columnName": "id", "columnType": "Integer", "columnWidth": "5"},
    {"columnName": "firstName", "columnType": "String", "columnWidth": "10"},
    {"columnName": "lastName", "columnType": "String", "columnWidth": "10"},
    {"columnName": "gender", "columnType": "String", "columnWidth": "1"},
    {"columnName": "dateOfBirth", "columnType": "String", "columnWidth": "10"}
  ],

  "tableName": "people",
  "comment": "This is a test fixed-width field table",
  "partition": "partition_spec_goes_here"
}
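
To give a sense of what JsonSchemaExample has to do with this file, below is a minimal sketch of turning the JSON into a Spark StructType plus the column-width array that spark-fixedwidth expects. The parseSchema helper and the json4s parsing are my own illustration rather than the repo's actual code; the fixedFile call follows the usage shown in the spark-fixedwidth README.

import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Illustrative helper: build a StructType and the column widths from the
// schema JSON above. json4s ships with Spark, so no extra dependency is needed.
def parseSchema(json: String): (StructType, Array[Int]) = {
  implicit val formats = DefaultFormats
  val columns = (parse(json) \ "columns").extract[List[Map[String, String]]]
  val fields = columns.map { col =>
    val dataType = col("columnType") match {
      case "Integer" => IntegerType
      case _         => StringType // treat anything else as a plain string
    }
    StructField(col("columnName"), dataType, nullable = true)
  }
  (StructType(fields), columns.map(_("columnWidth").toInt).toArray)
}

// Read the fixed-width file with the derived schema and write it out as Parquet.
// fixedFile is added to SQLContext by spark-fixedwidth's implicit conversion;
// in spark-shell, sqlContext is predefined.
import com.quartethealth.spark.fixedwidth.FixedwidthContext
val schemaJson = scala.io.Source.fromFile("schema.json").mkString // the job reads this path from its parameters
val (schema, widths) = parseSchema(schemaJson)
val people = sqlContext.fixedFile("datafile.txt", widths, schema, useHeader = false)
people.write.parquet("/user/cloudera/data/people")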

To run the example, submit the job using the command below. A new directory named /user/cloudera/data/people (i.e. the output directory plus the table name) will be created containing the Parquet files.

spark-submit --master yarn --jars spark-fixedwidth-assembly-1.0.jar \
  --class jhalfpenny.spark.sql.jsonSchema.JsonSchemaExample \
  JsonSchemaTest-0.0.1-SNAPSHOT.jar \
  /user/cloudera/fixed-width/data/datafile.txt \
  /user/cloudera/fixed-width/data/schema.json \
  /user/cloudera/data

We can use spark-shell to examine the Parquet files and confirm that the schema and data are correct.

scala> sqlContext.read.parquet("/user/cloudera/data/people").printSchema
root
 |-- id: integer (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- dateOfBirth: string (nullable = true)
 
scala> sqlContext.read.parquet("/user/cloudera/data/people").show
+---+---------+---------+------+-----------+
| id|firstName| lastName|gender|dateOfBirth|
+---+---------+---------+------+-----------+
|  7|    Nancy|  Bentley|     F| 1989-17-12|
|  8|      Ben|    Smoak|     M| 1983-09-18|
|  9|    Simon| Lansdown|     M| 1985-06-07|
| 10|     Fred|     West|     M| 1992-04-23|
|  1|     John|    Smith|     M| 1970-07-01|
|  2|     Adam|    Jones|     M| 1963-12-11|
|  3|    Bruce|    Davis|     M| 1981-01-17|
|  4|   Sheila|   Fraser|     F| 1984-08-03|
|  5|      Tom|    Berne|     M| 1974-17-12|
|  6| Gertrude|Shoemaker|     F| 1991-03-09|
+---+---------+---------+------+-----------+
