-
Notifications
You must be signed in to change notification settings - Fork 146
Examples: Input File Name
Quang Hoang Xuan edited this page Aug 21, 2021
·
3 revisions
Purpose: Have extra column with values of original file paths
from pyspark.sql.functions import input_file_name
ROOT="/<please-change>/to/your-path"
df = spark.read.format("excel") \
.option("header", True) \
.option("inferSchema", True) \
.load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019/*/*.xlsx") \
.withColumn("file_name", input_file_name())
df.show(10)
Result
+---+-----+-----------+-----------------+----------------+--------------+--------------+--------------+--------------+--------------------+
|Day|Month|Customer ID| Customer Name|Standard Package|Extra Option 1|Extra Option 2|Extra Option 3| Staff| file_name|
+---+-----+-----------+-----------------+----------------+--------------+--------------+--------------+--------------+--------------------+
| 1| 11| CA767| Vũ Phương Thảo| 14000| null| null| null| Teresa Teng|file:///home/quan...|
| 2| 11| CA768| Lê Thị Trâm| null| null| 2000| null|Marilyn Monroe|file:///home/quan...|
| 2| 11| CA769| Lê Trung| null| null| 1200| null| Teresa Teng|file:///home/quan...|
| 3| 11| CA770| Nguyễn Thảo Hiền| null| null| 1700| null| Teresa Teng|file:///home/quan...|
| 3| 11| CA771| Nguyễn Thu Huệ| null| null| 1800| null| Teresa Teng|file:///home/quan...|
| 3| 11| CA772| Vũ Cẩm Linh| null| null| 200| null| Teresa Teng|file:///home/quan...|
| 3| 11| CA773| Phạm Kim Ngân| 12000| null| 2000| null|Marilyn Monroe|file:///home/quan...|
| 4| 11| CA774|Hoàng Thj Mai Anh| 5000| null| null| null|Marilyn Monroe|file:///home/quan...|
| 4| 11| CA775| Đỗ Phương Anh | null| null| 800| null|Marilyn Monroe|file:///home/quan...|
| 5| 11| CA776| Ngô Mai Quý| null| null| 1500| null| Teresa Teng|file:///home/quan...|
+---+-----+-----------+-----------------+----------------+--------------+--------------+--------------+--------------+--------------------+
only showing top 10 rows
from pyspark.sql.functions import regexp_extract
df.withColumn("file_name", regexp_extract('file_name', '.*/(.*)\.xlsx$', 1)) \
.show(5)
Got
+---+-----+-----------+----------------+----------------+--------------+--------------+--------------+--------------+---------+
|Day|Month|Customer ID| Customer Name|Standard Package|Extra Option 1|Extra Option 2|Extra Option 3| Staff|file_name|
+---+-----+-----------+----------------+----------------+--------------+--------------+--------------+--------------+---------+
| 1| 11| CA767| Vũ Phương Thảo| 14000| null| null| null| Teresa Teng| ca_11|
| 2| 11| CA768| Lê Thị Trâm| null| null| 2000| null|Marilyn Monroe| ca_11|
| 2| 11| CA769| Lê Trung| null| null| 1200| null| Teresa Teng| ca_11|
| 3| 11| CA770|Nguyễn Thảo Hiền| null| null| 1700| null| Teresa Teng| ca_11|
| 3| 11| CA771| Nguyễn Thu Huệ| null| null| 1800| null| Teresa Teng| ca_11|
+---+-----+-----------+----------------+----------------+--------------+--------------+--------------+--------------+---------+
only showing top 5 rows