@shellcromancer @Bin-security @chencaoverkada for feedback and additional thoughts
Thanks @jshlbrd. I can imagine that custom file formats are needed in some use cases. Some users might want to write events in Parquet or another columnar format for storage and query efficiency. Giving users the flexibility to support other formats would be a good idea. As for the object names, we worked around the problem by using the
This was merged to main in #93, will be in the next release (v0.9.0), and has updated documentation.
Spinning off the conversation in #89, there are some opportunities to add new settings to the AWS S3 sink so that object creation is more customizable. These ideas might not be acted on, but are worth discussing.
Custom File Formats
The sink currently always creates objects that are compressed with gzip. This seems to be the standard for most cloud storage services and is a good default, but if we want to support alternative formats, the choice has to be exposed to users as a setting. Some formats are more complex than others in their requirements (e.g., schema files, compression settings), so I propose that we consider adding a new setting called format (name TBD): a nested struct containing the format type, referenced by common name (e.g., gzip, bzip2), and format-specific settings.
If needed, this can use the common config. I'm not sure whether a factory would be required, but whoever works on it can figure that out.
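To make the proposal concrete, here is a rough sketch of what a nested format setting could look like once decoded from JSON. Everything in it is hypothetical: the `SinkConfig` and `Format` types, the `type` and `settings` keys, and the `extension` helper are illustrative names, not Substation's actual config. The only behavior taken from the discussion is that gzip stays the default.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Format is a hypothetical nested setting: the format type referenced by
// common name, plus free-form, format-specific settings.
type Format struct {
	Type     string                 `json:"type"`
	Settings map[string]interface{} `json:"settings"`
}

// SinkConfig is a hypothetical subset of the S3 sink's settings.
type SinkConfig struct {
	Bucket string `json:"bucket"`
	Prefix string `json:"prefix"`
	Format Format `json:"format"`
}

// parseSinkConfig decodes the config and falls back to gzip, the sink's
// current behavior, when no format type is given.
func parseSinkConfig(raw []byte) (SinkConfig, error) {
	var cfg SinkConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return SinkConfig{}, err
	}
	if cfg.Format.Type == "" {
		cfg.Format.Type = "gzip"
	}
	return cfg, nil
}

// extension maps a format type to an object file extension and rejects
// names this sketch does not know about.
func extension(formatType string) (string, error) {
	switch formatType {
	case "gzip":
		return "gz", nil
	case "bzip2":
		return "bz2", nil
	default:
		return "", fmt.Errorf("unsupported format type: %q", formatType)
	}
}

func main() {
	raw := []byte(`{"bucket": "example-bucket", "format": {"type": "bzip2", "settings": {"level": 9}}}`)
	cfg, err := parseSinkConfig(raw)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	ext, err := extension(cfg.Format.Type)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Printf("type=%s extension=%s settings=%v\n", cfg.Format.Type, ext, cfg.Format.Settings)
}
```

Keeping the format-specific settings as a free-form map means simple formats need no extra keys, while more complex ones (schema files, compression levels) can validate their own settings later.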
Custom Object Names
The sink currently always creates objects with this pattern: [prefix : optional]/[year]/[month]/[day]/[uuid].gz. Most cloud services that I have seen don't care what the object is called so long as the object is in the correct format, but there are some cases where it matters (e.g., AWS Glue).
Ignoring the file extension (which will change with the addition of custom file formats), here are some examples of how other systems make this customizable:
filename_* options
Of these options, Vector's approach is the safest for the user while still allowing for flexibility. Substation isn't built for function interpolation, and even if it were, that route lets users footgun themselves in ways that we shouldn't support. If we adopted similar options, the naming pattern could become [prefix : optional]/[time format: optional]/[uuid: optional].[extension]. To protect users from a footgun, the time format and UUID should be enabled by default, with a "nil" setting available to turn each one off.
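As a minimal sketch of that naming pattern, the Go below builds an object key with the time format and UUID on by default and an explicit way to turn each off. The `ObjectName` type, its field names, the use of nil pointers for "not set", the UTC timestamps, and the default layout `2006/01/02` (mirroring today's [year]/[month]/[day]) are all assumptions for illustration, not the sink's real settings.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"strings"
	"time"
)

// ObjectName holds hypothetical naming settings. A nil pointer means "use the
// default"; an empty layout or a false flag disables that part of the name.
type ObjectName struct {
	Prefix     string
	TimeFormat *string // Go time layout; nil => "2006/01/02", "" => disabled
	UUID       *bool   // nil => enabled
	Extension  string
}

// exampleTime is a fixed timestamp so the example below is reproducible.
var exampleTime = time.Date(2023, time.March, 7, 12, 0, 0, 0, time.UTC)

// newUUID returns a random (version 4) UUID using only the standard library.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// buildKey assembles [prefix]/[time format]/[uuid].[extension], skipping any
// part that is unset or disabled.
func buildKey(n ObjectName, now time.Time, id string) (string, error) {
	var parts []string
	if n.Prefix != "" {
		parts = append(parts, strings.Trim(n.Prefix, "/"))
	}
	layout := "2006/01/02" // default, assumed to match the current pattern
	if n.TimeFormat != nil {
		layout = *n.TimeFormat
	}
	if layout != "" {
		parts = append(parts, now.UTC().Format(layout))
	}
	if n.UUID == nil || *n.UUID {
		parts = append(parts, id)
	}
	if len(parts) == 0 {
		return "", errors.New("object name is empty: set a prefix, time format, or UUID")
	}
	return strings.Join(parts, "/") + "." + n.Extension, nil
}

func main() {
	id, err := newUUID()
	if err != nil {
		fmt.Println("uuid error:", err)
		return
	}
	key, err := buildKey(ObjectName{Prefix: "logs", Extension: "gz"}, exampleTime, id)
	if err != nil {
		fmt.Println("key error:", err)
		return
	}
	fmt.Println(key)
}
```

With the UUID disabled, every object written under the same time format resolves to the same key and overwrites the previous one, which is the footgun that keeping both parts on by default is meant to avoid.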