This repository contains Python example code for working with raw data delivered by Parse.ly's fully-managed Data Pipeline product (http://parse.ly/data-pipeline). It is a suite of tools, mostly usable from the command line, that make it easy to evaluate and integrate Parse.ly's raw data.
Customers can use this repository to:
- gain batch and streaming access to the raw data that Parse.ly collects from their sites; streaming access is provided via Amazon Kinesis Streams, and batch access via Amazon S3 (see the streaming sketch after this list)
- generate schemas and DDL for common data warehousing tools, such as Redshift, BigQuery, and Apache Spark
- create data samples that can be evaluated using in-memory analyst tools such as Excel or RStudio (xlsx/csv samples)
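For the streaming side, a consumer can be as simple as the boto3 sketch below. This is a minimal illustration only: the stream name, region, and event fields shown are placeholder assumptions (Parse.ly provisions the actual Kinesis stream for your account), and the bundled `stream` module is the supported way to consume it.

```python
"""Minimal sketch: poll a Parse.ly Data Pipeline Kinesis stream with boto3.

The stream name, region, and event fields are placeholders; use the values
provisioned for your account (or the bundled stream module instead).
"""
import json
import time

import boto3

STREAM_NAME = "parsely-dw-example"  # placeholder; Parse.ly supplies the real name
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Read from the first shard only; a production consumer would enumerate all
# shards (or use the Kinesis Client Library) and checkpoint its position.
shards = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="LATEST",
)["ShardIterator"]

while iterator is not None:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        event = json.loads(record["Data"])  # each record holds one JSON event
        print(event.get("action"), event.get("url"))
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under Kinesis read-throughput limits
```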
To make use of Parse.ly raw data, you must be a customer of Parse.ly's Data Pipeline product. To gain access for your Parse.ly account, please contact Parse.ly directly at http://help.parsely.com.
You can download this repository by cloning it from GitHub, e.g.
$ git clone https://github.com/Parsely/parsely_raw_data.git
Or, you can install it into an environment with pip, e.g.
$ pip install parsely_raw_data
The files in this module are named for the services they interface with. You can run the modules directly to use the command-line tools they provide, or import them into your own Python scripts.
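As an illustration of scripting against the raw data yourself, the sketch below reads a day of archived events straight from S3 with boto3, bypassing the bundled `s3` module. The bucket name, key prefix, and the assumption that batch files are gzipped newline-delimited JSON are placeholders for illustration; check your account's actual delivery settings.

```python
"""Minimal sketch: read archived Data Pipeline events from S3 with boto3.

The bucket, prefix, and file format below are assumptions for illustration;
the bundled s3 module is the supported way to do this.
"""
import gzip
import json

import boto3

BUCKET = "parsely-dw-example"   # placeholder; Parse.ly supplies the real bucket
PREFIX = "events/2017/01/01/"   # placeholder key layout

s3 = boto3.client("s3")

# Iterate over every archive object under the prefix, decompress it, and
# parse one JSON event per line.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(body).splitlines():
            event = json.loads(line)
            print(event.get("action"), event.get("url"))
```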
If you have the project installed with pip, you can use the following console scripts anywhere:
- `parsely_redshift` = `parsely_raw_data.redshift`
- `parsely_bigquery` = `parsely_raw_data.bigquery`
- `parsely_s3` = `parsely_raw_data.s3`
- `parsely_stream` = `parsely_raw_data.stream`
- `parsely_schema` = `parsely_raw_data.docgen`
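For example, running `parsely_stream` from your shell is equivalent to running `python -m parsely_raw_data.stream`.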
Alternatively, you can clone the repo and run each module from within the repo directory:
cd <path_to_parsely_raw_data_repo_directory>
- `python -m parsely_raw_data.samples`: Generate data samples in CSV and XLSX format
- `python -m parsely_raw_data.s3`: Fetch archived event data from the Parse.ly S3 bucket
- `python -m parsely_raw_data.stream`: Consume a Parse.ly Kinesis Stream of real-time event data
- `python -m parsely_raw_data.schema`: Inspect schemas for Redshift, BigQuery, and Spark
- `python -m parsely_raw_data.redshift`: Create an Amazon Redshift table for events and load data
- `python -m parsely_raw_data.bigquery`: Create a Google BigQuery table for events and load data
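Each of these is a standalone command-line tool; running a module with `--help` should print the options it supports (stream names, bucket names, credentials, and so on), though the exact flags vary by module.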
These are the steps to follow when releasing a new version of this library:
- Increment the version number in `__init__.py` according to semantic versioning rules
- `git commit -m 'increment version'`
- `git tag x.x.x`, where `x.x.x` is the new version number
- `git push origin master --tags`
- Create a new release for the new tag in GitHub, noting any relevant changes
- Push to PyPI with `python setup.py sdist upload`
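Note that `python setup.py sdist upload` sends the package over the legacy upload path; a common alternative is to build with `python setup.py sdist` and then upload with `twine upload dist/*`.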
- `python -m parsely_redshift_etl`: A schedulable command to complete the full ETL into the Redshift star schema using dbt.
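Because it is designed to be scheduled, this command is typically run on a recurring basis from a scheduler such as cron or Airflow.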