simplified join #14

Open · wants to merge 6 commits into base: main
README.md: 52 additions & 0 deletions
# First American ETL Pipeline

The `fa-etl.py` Python script in this repository conducts an Extract, Transform, Load (ETL) process for national assessment data from First American.

It automates the conversion of data from `.txt.zip` files into Parquet format, then filters and joins all relevant data, returning a `unified.parquet` file.

## Functionality

Here's a breakdown of its functionality:

1. **Setup Environment**: The script sets up the necessary directories (`staging`, `unzipped`, `unified`, and `raw`) and a log file specified by the user.

2. **Convert to Parquet**: It converts each `.txt.zip` file in the `raw` directory into Parquet format, which is more efficient for processing and storage (see the conversion sketch after this list).

3. **Data Joining**: After conversion, the script joins all relevant Parquet files, keeping only observations that contain both assessed values and sale values, so that only meaningful records are retained (see the join sketch after this list).

4. **Geographic Validation**: It validates and standardizes geographic elements using spatial joins, enhancing the quality and consistency of the data (a sketch appears under **Geographic data** below).
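
Steps 1 and 2 could look roughly like the following minimal sketch. The helper names, the pipe delimiter, and the use of `pandas` with the `pyarrow` engine are assumptions for illustration, not necessarily what `fa-etl.py` actually does:

```python
# Illustrative sketch of steps 1-2; helper names and the pipe delimiter
# are assumptions, not the script's actual implementation.
import logging
import zipfile
from pathlib import Path

import pandas as pd  # assumes pandas + pyarrow are installed


def setup_environment(base_dir: str, log_file: str) -> None:
    """Create the working directories and configure the log file."""
    for name in ("staging", "unzipped", "unified", "raw"):
        Path(base_dir, name).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(filename=log_file, level=logging.INFO)


def convert_to_parquet(zip_path: Path, unzipped_dir: Path, staging_dir: Path) -> None:
    """Unzip one .txt.zip file and rewrite its contents as Parquet."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(unzipped_dir)
    for txt in unzipped_dir.glob("*.txt"):
        df = pd.read_csv(txt, sep="|", low_memory=False)  # delimiter assumed
        df.to_parquet(staging_dir / f"{txt.stem}.parquet", index=False)
        logging.info("Converted %s to Parquet", txt.name)
```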

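Step 3, the join-and-filter stage, might be sketched as below. The join key (`PropertyID`) and the value columns (`AssdTotalValue`, `SaleAmt`) are hypothetical placeholders for whatever the First American schema actually uses:

```python
import pandas as pd


def join_and_filter(annual_path: str, valhist_path: str, out_path: str) -> None:
    """Join the annual file to its value history, keeping only rows that
    carry both an assessed value and a sale value (column names hypothetical)."""
    annual = pd.read_parquet(annual_path)
    valhist = pd.read_parquet(valhist_path)
    unified = annual.merge(valhist, on="PropertyID", how="inner")
    mask = unified["AssdTotalValue"].notna() & unified["SaleAmt"].notna()
    unified[mask].to_parquet(out_path, index=False)
```
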
## Inputs

- **input_dir** (str): Path to the input directory containing the `.txt.zip` files.
- **log_file** (str): Path to the log file where logging information will be saved.
- **annual_file_string** (str): Substring used to identify the annual file (`Prop` or `Annual`).
- **value_history_file_string** (str): Substring used to identify the value history file (`ValHist` or `ValueHistory`).
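
If the script parses these arguments with `argparse` (an assumption; the flag names below simply mirror the example under **How to Run**), the setup might look like:

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="First American ETL pipeline")
    parser.add_argument("--input_dir", required=True,
                        help="Directory containing the .txt.zip files")
    parser.add_argument("--log_file", required=True,
                        help="Path where logging information is saved")
    parser.add_argument("--annual_file_string", required=True,
                        help="Substring identifying the annual file (Prop or Annual)")
    parser.add_argument("--value_history_file_string", required=True,
                        help="Substring identifying the value history file "
                             "(ValHist or ValueHistory)")
    return parser.parse_args()
```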

## Outputs

The script saves the processed data into the following directories:

- **staging**: Contains intermediate Parquet files for all input `.txt.zip` files.
- **unified**: Contains final Parquet files with merged content after data joining.
- **unzipped**: Temporary directory that gets deleted at the end of the script, containing unzipped `.txt` files before conversion.
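
Assuming the working directories are created inside `input_dir` (an assumption; the script may place them elsewhere), the resulting layout looks like:

```text
<input_dir>/
├── raw/        # original .txt.zip files
├── staging/    # intermediate Parquet files
├── unzipped/   # temporary .txt files; deleted at the end of the run
└── unified/    # final merged output (unified.parquet)
```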

## How to Run

You can execute the script from the command line using the provided arguments. Here's an example:

```bash
python fa-etl.py --input_dir <input_dir> --log_file <log_file> --annual_file_string <annual_file_string> --value_history_file_string <value_history_file_string>
```

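For example, with hypothetical paths and the `Annual`/`ValueHistory` naming convention:

```bash
python fa-etl.py \
  --input_dir ./data \
  --log_file ./fa-etl.log \
  --annual_file_string Annual \
  --value_history_file_string ValueHistory
```
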
## Geographic data
The geographic boundary data used for spatial validation is available here: https://uchicago.box.com/s/mmhsg7s9qs6jlov9u4kkt7vdoordt5kv
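
A spatial join for step 4 could be sketched with `geopandas` as below. The library choice, file names, and the boundary layer (e.g., census tracts) are all assumptions; the real script may do this differently:

```python
import geopandas as gpd


def validate_geography(points_path: str, boundaries_path: str, out_path: str) -> None:
    """Attach standardized geographic identifiers to property points via a
    spatial join (library choice and names are assumptions, not the script's)."""
    points = gpd.read_parquet(points_path)        # property-level geometries
    boundaries = gpd.read_file(boundaries_path)   # e.g., census tract polygons
    joined = gpd.sjoin(points, boundaries.to_crs(points.crs),
                       how="left", predicate="within")
    joined.drop(columns="index_right").to_parquet(out_path)
```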



Command to check the size of each file and subdirectory in the current directory, sorted largest first:
```bash
du -h --max-depth=0 * | sort -hr
```
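
Note: on macOS/BSD `du`, `--max-depth` is spelled `-d`, so the equivalent is `du -h -d 0 * | sort -hr` (or simply `du -sh * | sort -hr`, which works on both).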