Automated exploration of the files in a folder structure to extract metadata and assess potential uses of the information.
Features include:
- Recursively collects all files in a directory
- Extracts the format and size of each file
- Counts the number of lines of plain-text files
- Supported formats: txt, csv, tab, dat, excel
- Pending formats: arff, json, xml
- Identifies the separator and quote character of text files
- Loads the files into memory
- Loads the files into a database
- Format conversion
- Logging
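The file discovery described above can be sketched roughly as follows. This is an illustrative snippet, not the library's actual implementation; the function and dictionary keys are hypothetical:

```python
import os

def explore(path):
    """Recursively collect basic metadata for every file under `path`."""
    records = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            record = {
                "file": full,
                "format": ext,
                "size_bytes": os.path.getsize(full),
                "lines": None,
            }
            # Count lines only for plain-text formats
            if ext in {"txt", "csv", "tab", "dat"}:
                with open(full, errors="replace") as fh:
                    record["lines"] = sum(1 for _ in fh)
            records.append(record)
    return records
```

A list of records like this is easy to turn into the summary dataframe the reckon phase produces.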
It works in three phases:
- Phase 1: Reckon of the files. Generates a dataframe with a summary of the files.
- Phase 2: Execution. Generates Python code to load the files into memory.
- Phase 3: [Optional] Sends the files to a database.
Read the documentation to learn how to use it, or check out the notebook example.
You need to import the auto_fe.py file and call it as follows:

```python
from afes import afe
df_files = afe.reckon_phase('<YOUR_FILE_PATH>')
```
Check out the example.py file and run it from a terminal with Python as follows, or use a Jupyter notebook:

```shell
python example.py
```

The `reckon_phase` function will generate an Excel file called `files_explored.xlsx` with the results of the exploration.
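One piece of the reckon phase, detecting the separator and quote character of a text file, can be approximated with the standard library's `csv.Sniffer`. This is a sketch under that assumption; the library may use its own heuristics:

```python
import csv

def sniff_dialect(path, sample_size=4096):
    """Guess the delimiter and quote character of a delimited text file."""
    with open(path, newline="") as fh:
        sample = fh.read(sample_size)
    # Restrict the candidates to common separators
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter, dialect.quotechar
```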
Using the dataframe `df_files` generated in the reckon phase, the function `generate_python_code()` will generate Python code to load the files using pandas.

```python
afe.generate_python_code(df_files)
```

By default the code is printed to standard output and also written to the `code.txt` file.
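The emitted code is ordinary pandas calls. A generated load statement could be built along these lines; this is an illustrative sketch with a hypothetical helper, not the library's actual generator:

```python
def generate_read_call(path, sep, quotechar='"'):
    """Build a pandas read_csv statement like the ones written to code.txt."""
    # Derive a variable name from the file name, e.g. data/sales.csv -> df_sales
    var = path.rsplit("/", 1)[-1].split(".")[0]
    return f"df_{var} = pd.read_csv({path!r}, sep={sep!r}, quotechar={quotechar!r})"
```

For example, `generate_read_call("data/sales.csv", ",")` yields a line starting with `df_sales = pd.read_csv(...)`.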
Using the dataframe `df_files` generated in the reckon phase, the function `pandas_profile_files(df_files)` will load the files and run a pandas-profiling report on each one.

```python
afe.pandas_profile_files(df_files)
```

By default, it processes the files in ascending size order, starting with the smallest. It creates the reports and exports them in HTML format, storing them in the directory where the code is running, or in a given directory via the `output_path = '<YOUR_OUTPUT_PATH>'` argument.
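The smallest-first ordering can be sketched as follows; an illustrative snippet (the real function works from the `df_files` dataframe rather than a plain list of paths):

```python
import os

def order_by_size(paths):
    """Return the files sorted by size in ascending order, smallest first."""
    return sorted(paths, key=os.path.getsize)
```

Processing small files first gives quick feedback before the long-running reports start.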
Automatic load of data from plain files to a database:

```python
afe.load_datasets_to_database(df, "section")
```

Where `"section"` is the name of a section in the `databases.ini` file that defines the connection parameters:

```ini
[<section>]
db_engine = postgres
host = <IP_OR_HOSTNAME>
schema = <DATABASE_SCHEMA>
catalog = <DATABASE_CATALOG>
user = <DATABASE_USER>
password = <DATABASE_PASSWORD>
port = <DATABASE_PORT>
```
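A section like the one above can be read with the standard library's `configparser` and turned into a Postgres connection URL. This is a sketch following the key names in the template; the actual loader may assemble the connection differently:

```python
import configparser

def connection_url(ini_path, section):
    """Read one [section] of databases.ini and build a Postgres connection URL."""
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    s = cfg[section]
    # SQLAlchemy-style URL: postgresql://user:password@host:port/catalog
    return (
        f"postgresql://{s['user']}:{s['password']}"
        f"@{s['host']}:{s['port']}/{s['catalog']}"
    )
```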
Currently only Postgres is supported, but open an issue if you want to use it with other databases.
- Open an issue to request more functionality or to give feedback.