Skip to content

Examples: Load Multiple Files

Quang Hoang Xuan edited this page Aug 29, 2021 · 10 revisions

Purpose: Load multiple Excel files into single data frame

Dataset

Spark Excel supports loading multiple excel files with glob pattern as well as Key=Value structured folder. For example, in test's resource folder, there is an example ca_dataset:

src/test/resources/spreadsheets/ca_dataset/
└── 2019
    ├── Quarter=1
    │   └── ca_03.xlsx
    ├── Quarter=2
    │   ├── ca_04.xlsx
    │   ├── ca_05.xlsx
    │   └── ca_06.xlsx
    ├── Quarter=3
    │   ├── ca_07.xlsx
    │   ├── ca_08.xlsx
    │   └── ca_09.xlsx
    └── Quarter=4
        ├── ca_10.xlsx
        ├── ca_11.xlsx
        └── ca_12.xlsx

5 directories, 10 files

Note: Names and IDs in this ca_dataset are not real.

With Glob Pattern

This dataset can be loaded with glob pattern

ROOT="/<please-change>/to/your-path"

df = spark.read.format("excel") \
   .option("header", True) \
   .option("inferSchema", True) \
   .load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019/*/*.xlsx")

With glob pattern, we can load subset of data by using a matching pattern

df = spark.read.format("excel") \
   .option("header", True) \
   .option("inferSchema", True) \
   .load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019/Quarter=4/*.xlsx")

Key=Value Folder Structure

Or, relies on known structure of Key=Value

df = spark.read.format("excel") \
   .option("header", True) \
   .option("inferSchema", True) \
   .load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019")

df.printSchema()
df.groupBy("Quarter").count().show()

Got

root
 |-- Day: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- Standard Package: integer (nullable = true)
 |-- Extra Option 1: integer (nullable = true)
 |-- Extra Option 2: integer (nullable = true)
 |-- Extra Option 3: integer (nullable = true)
 |-- Staff: string (nullable = true)
 |-- Quarter: integer (nullable = true)

+-------+-----+
|Quarter|count|
+-------+-----+
|      1|    7|
|      3|  186|
|      4|  224|
|      2|  161|
+-------+-----+

With Quarter column and its values are populated from directory structure.

References