simplified join #14

Open · wants to merge 6 commits into base: main
README.md: 52 additions & 0 deletions
# First American ETL Pipeline

The `fa-etl.py` Python script in this repository conducts an Extract, Transform, Load (ETL) process for national assessment data from First American.

It automates the conversion of data from `.txt.zip` files into Parquet format, then filters and joins all relevant data, returning a `unified.parquet` file.

## Functionality

Here's a breakdown of its functionality:

1. **Setup Environment**: The script sets up the necessary directories (`staging`, `unzipped`, `unified`, and `raw`) and a log file specified by the user.

2. **Convert to Parquet**: It converts each `.txt.zip` file in the `raw` directory into Parquet format, which is more efficient for processing and storage (see the conversion sketch after this list).

3. **Data Joining**: After conversion, the script joins all relevant Parquet files, keeping only observations that contain both assessed values and sale values, so that only meaningful records are retained (see the join sketch after this list).

4. **Geographic Validation**: It validates and standardizes geographic elements using spatial joins, enhancing the quality and consistency of the data (a sketch appears under **Geographic data** below).
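
Steps 1 and 2 could look roughly like the following minimal sketch. The helper names, the pipe delimiter, and the use of `pandas` with the `pyarrow` engine are assumptions for illustration, not necessarily what `fa-etl.py` actually does:

```python
# Illustrative sketch of steps 1-2; helper names and the pipe delimiter
# are assumptions, not the script's actual implementation.
import logging
import zipfile
from pathlib import Path

import pandas as pd  # assumes pandas + pyarrow are installed


def setup_environment(base_dir: str, log_file: str) -> None:
    """Create the working directories and configure the log file."""
    for name in ("staging", "unzipped", "unified", "raw"):
        Path(base_dir, name).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(filename=log_file, level=logging.INFO)


def convert_to_parquet(zip_path: Path, unzipped_dir: Path, staging_dir: Path) -> None:
    """Unzip one .txt.zip file and rewrite its contents as Parquet."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(unzipped_dir)
    for txt in unzipped_dir.glob("*.txt"):
        df = pd.read_csv(txt, sep="|", low_memory=False)  # delimiter assumed
        df.to_parquet(staging_dir / f"{txt.stem}.parquet", index=False)
        logging.info("Converted %s to Parquet", txt.name)
```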

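Step 3, the join-and-filter stage, might be sketched as below. The join key (`PropertyID`) and the value columns (`AssdTotalValue`, `SaleAmt`) are hypothetical placeholders for whatever the First American schema actually uses:

```python
import pandas as pd


def join_and_filter(annual_path: str, valhist_path: str, out_path: str) -> None:
    """Join the annual file to its value history, keeping only rows that
    carry both an assessed value and a sale value (column names hypothetical)."""
    annual = pd.read_parquet(annual_path)
    valhist = pd.read_parquet(valhist_path)
    unified = annual.merge(valhist, on="PropertyID", how="inner")
    mask = unified["AssdTotalValue"].notna() & unified["SaleAmt"].notna()
    unified[mask].to_parquet(out_path, index=False)
```
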
## Inputs

- **input_dir** (str): Path to the input directory containing the `.txt.zip` files.
- **log_file** (str): Path to the log file where logging information will be saved.
- **annual_file_string** (str): Substring used to identify the annual file (`Prop` or `Annual`).
- **value_history_file_string** (str): Substring used to identify the value history file (`ValHist` or `ValueHistory`).
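
If the script parses these arguments with `argparse` (an assumption; the flag names below simply mirror the example under **How to Run**), the setup might look like:

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="First American ETL pipeline")
    parser.add_argument("--input_dir", required=True,
                        help="Directory containing the .txt.zip files")
    parser.add_argument("--log_file", required=True,
                        help="Path where logging information is saved")
    parser.add_argument("--annual_file_string", required=True,
                        help="Substring identifying the annual file (Prop or Annual)")
    parser.add_argument("--value_history_file_string", required=True,
                        help="Substring identifying the value history file "
                             "(ValHist or ValueHistory)")
    return parser.parse_args()
```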

## Outputs

The script saves the processed data into the following directories:

- **staging**: Contains intermediate Parquet files for all input `.txt.zip` files.
- **unified**: Contains final Parquet files with merged content after data joining.
- **unzipped**: Temporary directory that gets deleted at the end of the script, containing unzipped `.txt` files before conversion.
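
Assuming the working directories are created inside `input_dir` (an assumption; the script may place them elsewhere), the resulting layout looks like:

```text
<input_dir>/
├── raw/        # original .txt.zip files
├── staging/    # intermediate Parquet files
├── unzipped/   # temporary .txt files; deleted at the end of the run
└── unified/    # final merged output (unified.parquet)
```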

## How to Run

You can execute the script from the command line using the provided arguments. Here's an example:

```bash
python fa-etl.py --input_dir <input_dir> --log_file <log_file> --annual_file_string <annual_file_string> --value_history_file_string <value_history_file_string>
```

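For example, with hypothetical paths and the `Annual`/`ValueHistory` naming convention:

```bash
python fa-etl.py \
  --input_dir ./data \
  --log_file ./fa-etl.log \
  --annual_file_string Annual \
  --value_history_file_string ValueHistory
```
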
## Geographic data
The geographic boundary data used for spatial validation is available here: https://uchicago.box.com/s/mmhsg7s9qs6jlov9u4kkt7vdoordt5kv
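
A spatial join for step 4 could be sketched with `geopandas` as below. The library choice, file names, and the boundary layer (e.g., census tracts) are all assumptions; the real script may do this differently:

```python
import geopandas as gpd


def validate_geography(points_path: str, boundaries_path: str, out_path: str) -> None:
    """Attach standardized geographic identifiers to property points via a
    spatial join (library choice and names are assumptions, not the script's)."""
    points = gpd.read_parquet(points_path)        # property-level geometries
    boundaries = gpd.read_file(boundaries_path)   # e.g., census tract polygons
    joined = gpd.sjoin(points, boundaries.to_crs(points.crs),
                       how="left", predicate="within")
    joined.drop(columns="index_right").to_parquet(out_path)
```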



Command to check the size of each file and subdirectory in the current directory, sorted largest first:
```bash
du -h --max-depth=0 * | sort -hr
```
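
Note: on macOS/BSD `du`, `--max-depth` is spelled `-d`, so the equivalent is `du -h -d 0 * | sort -hr` (or simply `du -sh * | sort -hr`, which works on both).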