fix: dtypes pickle format name
hmatalonga committed Aug 25, 2019
1 parent 2756ec5 commit 51b0940
Showing 3 changed files with 12 additions and 5 deletions.
6 changes: 2 additions & 4 deletions Dockerfile
@@ -5,7 +5,7 @@
# tags from Docker Hub.
FROM python:3.7-slim

-LABEL Name=dataset-converter Version=0.1.0
+LABEL Name=dataset-converter Version=0.1.1
LABEL maintainer="Hugo Matalonga <[email protected]>"

ARG UID=1000
@@ -21,8 +21,6 @@ RUN addgroup --system --gid ${GID} user \
WORKDIR /home/user
COPY ./app/requirements.txt /home/user

# Using pip:
RUN python3 -m pip install --upgrade --no-cache-dir --compile pip
RUN python3 -m pip install --no-cache-dir --compile -r requirements.txt

ADD ./entrypoint.sh /usr/local/bin
Expand All @@ -34,4 +32,4 @@ RUN chown -R user:user /home/user
USER user
ENV PATH=${PATH}:/home/user/.local/bin

-CMD ["/usr/local/bin/entrypoint.sh"]
+ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
9 changes: 9 additions & 0 deletions README.md
@@ -28,6 +28,15 @@ $ docker-compose up
```
- It looks for all `.yml` config files; for each configured dataset, it produces an optimized parquet file and a pickle file containing the pandas dtypes. The generated files are located in the `./data` folder.

### Output files
For each config file found, the converter keeps the `name` set in the config and creates the following files:

#### ${name}.dtypes.pickle
Contains a Python dict that maps each column to its dtype (`column: dtype`).

#### ${name}.parquet.7z
A parquet binary file, compressed in 7z format, created from the processed dataframe.
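
As a rough illustration of the `${name}.dtypes.pickle` file described above, the sketch below writes a column-to-dtype dict with `pickle` and reads it back. The column names, the dtype strings, and the helper names `write_dtypes` and `read_dtypes` are assumptions made for this example; the real file presumably stores pandas dtype objects, in which case pandas must be importable when the file is unpickled.

```python
import os
import pickle
import tempfile

# Hypothetical mapping: the real file holds whatever dtypes the
# converter recorded for each column of the dataset.
dtypes = {"id": "int64", "price": "float32", "city": "category"}


def write_dtypes(mapping, path):
    """Serialize the column -> dtype dict to a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(mapping, fh)


def read_dtypes(path):
    """Load the column -> dtype dict back from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dtypes.pickle")
    write_dtypes(dtypes, path)
    restored = read_dtypes(path)

for column, dtype in restored.items():
    print("{}: {}".format(column, dtype))
```

With pandas installed, a dict like this can be passed to `DataFrame.astype` to reapply the column types after loading the parquet file. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.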

## Plugins

A plugin system is available, which makes it possible to call additional procedures to modify the dataset files.
2 changes: 1 addition & 1 deletion app/app.py
@@ -72,7 +72,7 @@ def load_tasks(df, plugins, category):
 def export_files(df, name, compression):
     filename = name.split('.')[0]

-    filepath = os.path.join(data_path, filename + '.dtypes.p')
+    filepath = os.path.join(data_path, filename + '.dtypes.pickle')
     print('Creating dtypes file -> {}'.format(filepath))
     save_dtypes(cache_dtypes(df), filepath)
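
The naming logic in this hunk can be tried in isolation. The sketch below is not part of the commit: `dtypes_path` is a hypothetical helper, and `data_path` is passed as a parameter here, whereas the diff appears to read it from module scope.

```python
import os


def dtypes_path(data_path, name):
    """Build the dtypes pickle path the same way the hunk above does."""
    # Keep only the text before the FIRST dot of the config name.
    filename = name.split('.')[0]
    return os.path.join(data_path, filename + '.dtypes.pickle')


print(dtypes_path('data', 'sales.yml'))
print(dtypes_path('data', 'sales.2019.yml'))
```

Because `split('.')[0]` cuts at the first dot, a name such as `sales.2019.yml` also maps to `sales.dtypes.pickle`. If dotted names need to be preserved, `os.path.splitext` would be one alternative, though that would be a change in behavior rather than what the commit does.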
