singer-io/tap-s3-csv


tap-s3-csv

This is a Singer tap that reads data from files located inside a given S3 bucket and produces JSON-formatted data following the Singer spec.

How to use it

tap-s3-csv works together with any Singer target to move data from S3 to that target's destination.

Install and Run

First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.

It's recommended to use a virtualenv:

 python3 -m venv ~/.virtualenvs/tap-s3-csv
 source ~/.virtualenvs/tap-s3-csv/bin/activate
 pip install -U pip setuptools
 pip install -e '.[dev]'

Configuration

Here is an example of a basic config, followed by a rundown of each property:

{
    "start_date": "2017-11-02T00:00:00Z",
    "account_id": "1234567890",
    "role_name": "role_with_bucket_access",
    "bucket": "my-bucket",
    "external_id": "my_optional_secret_external_id",
    "tables": "[{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table\\\\/.*\\\\.csv\",\"table_name\":\"my_table\",\"key_properties\":\"id\",\"date_overrides\":\"created_at\",\"delimiter\":\",\"}]",
    "request_timeout": 300
}
  • start_date: This is the datetime that the tap will use to look for newly updated or created files, based on the modified timestamp of the file.
  • account_id: This is your AWS account ID.
  • role_name: In order to access a bucket, the tap uses boto3 to assume a role in your AWS account. If you have your AWS account credentials set up locally, you can specify this as a role which your local user has access to assume, and boto3 should by default pick up your AWS keys from the local environment.
  • bucket: The name of the bucket to search for files under.
  • external_id: (optional in some setups) When running the tap locally, you should be able to omit this property. It exists so that the tap can access buckets in accounts where the user has no access to the account itself but can assume a role in that account through a shared secret. In that case, this is that secret.
  • tables: An escaped JSON string that the tap uses to search for files and to emit records from those files as "tables". It is validated by a voluptuous-based configuration checker.
  • request_timeout: (optional) The maximum time a request should wait for a response. The default is 300 seconds.
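
Because tables is a JSON string nested inside the JSON config, it has to be decoded twice before the table definitions are usable. The following Python sketch illustrates that with the standard json module. CONFIG_TEXT is a hypothetical, trimmed copy of the example above, and the code is not taken from the tap itself.

```python
import json

# Hypothetical config text, trimmed from the example above.
# A raw string keeps the backslashes exactly as they appear in the file.
CONFIG_TEXT = r'''
{
    "start_date": "2017-11-02T00:00:00Z",
    "bucket": "my-bucket",
    "tables": "[{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table\\\\/.*\\\\.csv\",\"table_name\":\"my_table\",\"key_properties\":\"id\",\"date_overrides\":\"created_at\",\"delimiter\":\",\"}]"
}
'''

# First decode: the config file itself. "tables" is still a plain string here.
config = json.loads(CONFIG_TEXT)

# Second decode: the string held in "tables" becomes a list of table objects.
tables = json.loads(config["tables"])

for table in tables:
    # After both decodes, the regex carries single backslashes again.
    print(table["table_name"], "->", table["search_pattern"])
```

After both decoding steps, the four backslashes written in the config collapse to a single backslash in the regular expression.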

Below are the additional properties to add to the config when running this tap with a proxy AWS account as middleware:

    "proxy_account_id": "221133445566",
    "proxy_role_name": "proxy_role_with_bucket_access"

The proxy AWS account acts as middleware.

  • proxy_account_id: This is the proxy AWS account ID.
  • proxy_role_name: This is the IAM role in the proxy account. The product AWS account is allowed to assume this role and then uses it to access the S3 bucket in your account.

The tables field consists of one or more objects, JSON encoded as an array and escaped using backslashes (e.g., \" for " and \\ for \), that describe how to find files and emit records. A more detailed (and unescaped) example is below:

[
    {
        "search_prefix": "exports",
        "search_pattern": "my_table\\/.*\\.csv",
        "table_name": "my_table",
        "key_properties": "id",
        "date_overrides": "created_at",
        "delimiter": ","
    },
    ...
]
  • search_prefix: This is a prefix to apply after the bucket, but before the file search pattern, to allow you to find files in "directories" below the bucket.
  • search_pattern: This is an escaped regular expression that the tap uses to find files under the bucket and prefix. Because it is an escaped string inside an escaped string, any backslashes in the regex need to be double-escaped in the config.
  • table_name: This value is a string of your choosing. It names the stream under which records from matching files are emitted.
  • key_properties: These are the "primary keys" of the CSV files, to be used by the target for deduplication and primary key definitions downstream in the destination.
  • date_overrides: Specifies field names in the files that are supposed to be parsed as a datetime. The tap doesn't attempt to automatically determine if a field is a datetime, so this will make it explicit in the discovered schema.
  • delimiter: This allows you to specify a custom delimiter, such as \t or |, if that applies to your files.
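
Writing the double-escaped tables string by hand is error-prone, so one option is to generate it. The sketch below builds the table definitions as ordinary Python data and lets json.dumps do the escaping. It also shows which object keys the example pattern would select, under the assumption that the pattern is searched against the full object key; the tap's actual matching logic may differ. The sample keys and variable names are hypothetical.

```python
import json
import re

# Table definitions as plain Python data; field names follow the list above.
table_specs = [
    {
        "search_prefix": "exports",
        "search_pattern": r"my_table\/.*\.csv",  # the regex as you would normally write it
        "table_name": "my_table",
        "key_properties": "id",
        "date_overrides": "created_at",
        "delimiter": ",",
    }
]

# First encode: the array becomes the string stored in the "tables" property.
tables_value = json.dumps(table_specs)

# Second encode: serialising the whole config escapes that string once more.
config = {"bucket": "my-bucket", "tables": tables_value}
config_text = json.dumps(config, indent=4)
print(config_text)

# Assumption: the pattern is applied with re.search to keys under the prefix.
spec = table_specs[0]
matcher = re.compile(spec["search_pattern"])
sample_keys = [  # hypothetical object keys
    "exports/my_table/2018-01.csv",
    "exports/other/2018-01.csv",
    "exports/my_table/readme.txt",
]
matching = [
    key for key in sample_keys
    if key.startswith(spec["search_prefix"]) and matcher.search(key)
]
print(matching)
```

Generating the config this way keeps the regex readable in one place and leaves both levels of escaping to the JSON encoder.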

A sample configuration is available in config.sample.json.


Copyright © 2018 Stitch
