[DRAFT] core: signal blob implementation #9326

Draft: wants to merge 19 commits into base: master

@edsiper (Member) commented Sep 2, 2024

This pull request is an active work-in-progress implementation of the Blob signal.

Blob

There are use cases where it is desirable to move very large files with Fluent Bit. Most of the time these are binary files such as videos, AI models, or similar.

We implement the handling of binary files through the new signal called Blob which has the following design principles:

  • Zero copy: do not buffer the big file as normal chunks. Instead use a different approach.
  • in_blob: implementation of an input plugin that can look up and queue binary files through the Blob signal implementation API.
  • output plugins: certain output plugins can be extended to support the handling of Blob files; the first one covered in this pull request is out_azure_blob.
  • output plugins will be able to do multi-part uploads and handle connection resumes if needed.

Note that the code base in this branch is in active development and might be updated with breaking changes.

Components and status

High-level list of things to implement in this pull request:

in_blob

  • input plugin structure
  • files lookup with glob patterns
  • database file
    • support database to register files being processed
    • [ ] list cleanup on restart

core

  • signal registration
  • chunk blob reference implementation
  • chunk content validation: event mechanism to evict inconsistent files at start
  • scheduler: extend functionality to support timers that run under a coroutine context

out_azure_blob

  • add support for blob signal
  • implement database files for blob files (resume)
  • add support for multi-part upload
    • implement a timer callback to upload file parts.

Other changes in out_azure_blob:

  • The plugin can handle Logs and Blob signals.
  • When a Blob is received, it stores the file references in the database file and also creates a list of parts of that file for later upload.
  • A new callback in the plugin is in charge of uploading file parts. The big change has been in the Fluent Bit scheduler, which now allows timer-based callbacks that run under a coroutine context (previously this did not exist). This last enhancement allows sharing async I/O and introduces a new way to process data in output plugins.

Database example

sqlite> select * from out_azure_blob_files ;
id     path                                           size          created
-----  ---------------------------------------------  ------------  ------------
2      /home/edsiper/logs/blob/sample-photo.avif      370495        1725995504
3      /home/edsiper/logs/blob/second.bin             15            1725995505
sqlite> select * from out_azure_blob_parts ;
id  file_id  part_id  uploaded  in_progress  offset_start  offset_end
--  -------  -------  --------  -----------  ------------  ----------
39  2        0        0         0            0             10000
40  2        1        0         0            10000         20000
41  2        2        0         0            20000         30000
42  2        3        0         0            30000         40000
43  2        4        0         0            40000         50000
44  2        5        0         0            50000         60000
45  2        6        0         0            60000         70000
46  2        7        0         0            70000         80000
47  2        8        0         0            80000         90000
48  2        9        0         0            90000         100000
49  2        10       0         0            100000        110000
50  2        11       0         0            110000        120000
51  2        12       0         0            120000        130000
52  2        13       0         0            130000        140000
53  2        14       0         0            140000        150000
54  2        15       0         0            150000        160000
55  2        16       0         0            160000        170000
56  2        17       0         0            170000        180000
57  2        18       0         0            180000        190000
58  2        19       0         0            190000        200000
59  2        20       0         0            200000        210000
60  2        21       0         0            210000        220000
61  2        22       0         0            220000        230000
62  2        23       0         0            230000        240000
63  2        24       0         0            240000        250000
64  2        25       0         0            250000        260000
65  2        26       0         0            260000        270000
66  2        27       0         0            270000        280000
67  2        28       0         0            280000        290000
68  2        29       0         0            290000        300000
69  2        30       0         0            300000        310000
70  2        31       0         0            310000        320000
71  2        32       0         0            320000        330000
72  2        33       0         0            330000        340000
73  2        34       0         0            340000        350000
74  2        35       0         0            350000        360000
75  2        36       0         0            360000        370000
76  2        37       0         0            370000        370495
77  3        0        0         0            0             15
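
The part rows above look like consecutive byte ranges where each `offset_end` equals the next `offset_start`, with a shorter final part. The following is a minimal Python sketch of that splitting, not the plugin's C code. The function name `split_into_parts` is hypothetical, and the 10000-byte part size is only inferred from the table (the sample configs in this pull request use `part_size: 4M`).

```python
def split_into_parts(file_size, part_size):
    """Return (part_id, offset_start, offset_end) tuples covering the file.

    Assumption: ranges are contiguous and the last part is truncated to the
    file size, which is what the sample rows above suggest.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    offset = 0
    part_id = 0
    while offset < file_size:
        end = min(offset + part_size, file_size)
        parts.append((part_id, offset, end))
        offset = end
        part_id += 1
    return parts


# Sizes taken from the out_azure_blob_files rows above.
photo_parts = split_into_parts(370495, 10000)
small_parts = split_into_parts(15, 10000)
print(len(photo_parts), photo_parts[-1], small_parts)
```

With the sizes from the table, this yields 38 parts for the 370495-byte file (part_id 0 to 37, the last one covering 370000 to 370495) and a single part for the 15-byte file, which matches the rows listed.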

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper edsiper marked this pull request as draft September 2, 2024 18:21
@edsiper edsiper added this to the Fluent Bit v3.2.0 milestone Sep 2, 2024
Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: Eduardo Silva <[email protected]>
A recent patch series adds support for processing and routing large binary files through
a zero-copy strategy. This new in_blob plugin scans a path on the file system and registers
the files that match the pattern.

service:
  flush: 1
  log_level: info

pipeline:
  inputs:
    - name: blob
      path: '~/logs/blob/*'
      database_file: blob.db

  outputs:
    - name: stdout
      match: '*'

    - name:           azure_blob
      match:          '*'
      path:           kubernetes
      container_name: blobs
      auto_create_container: on
      database_file: azure.db
      part_size: 4M
      upload_parts_timeout: 1s
      workers: 10
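
The scan-and-register behaviour described in this commit message can be illustrated with a rough Python sketch (the plugin itself is written in C). Everything named here is an assumption for illustration: the `files` table, its columns, and the `register_new_files` helper are hypothetical, since the actual in_blob database schema is not shown in this pull request description.

```python
import glob
import os
import sqlite3


def register_new_files(db, pattern):
    """Record files matching `pattern` that are not yet in the database.

    Returns the list of newly registered paths. The table layout is a
    hypothetical stand-in for whatever in_blob stores in `database_file`.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY, path TEXT UNIQUE, size INTEGER)"
    )
    added = []
    # expanduser() resolves a leading '~', as in the sample path '~/logs/blob/*'
    for path in sorted(glob.glob(os.path.expanduser(pattern))):
        if not os.path.isfile(path):
            continue
        seen = db.execute(
            "SELECT 1 FROM files WHERE path = ?", (path,)
        ).fetchone()
        if seen:
            continue  # already registered on a previous scan
        db.execute(
            "INSERT INTO files (path, size) VALUES (?, ?)",
            (path, os.path.getsize(path)),
        )
        added.append(path)
    db.commit()
    return added
```

Calling the helper twice with the same pattern registers each file only once, which is the property a database file would provide across restarts.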

Signed-off-by: Eduardo Silva <[email protected]>
The recent changes in Fluent Bit allow processing Blob signal types, which
represent large binary files. When a blob arrives at the plugin, it is enqueued,
processed in parts, and uploaded as a Block Blob. Part sizes are configurable,
and part state survives a service restart.

example usage:

service:
  flush: 1
  log_level: info

pipeline:
  inputs:
    - name: blob
      path: '~/logs/blob/*'
      database_file: blob.db

  outputs:
    - name: stdout
      match: '*'

    - name:           azure_blob
      match:          '*'
      path:           kubernetes
      container_name: blobs
      auto_create_container: on
      database_file: azure.db
      part_size: 4M
      upload_parts_timeout: 1s
      workers: 10
      account_name: abcdefghijk
      shared_key: asdkljaskldjaskldjaskldjasioduasoudaskldjaskld
      tls: on
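
One way to picture how part state could survive a restart is a loop that claims the next pending row, uploads it, and records the outcome. The Python sketch below is an illustration only, not the plugin's implementation. It reuses the `out_azure_blob_parts` column names from the database example in the pull request description; the function names and the `upload` callable are hypothetical, and the real plugin drives this from a scheduler timer in C.

```python
import sqlite3


def next_pending_part(db):
    """Claim the next part that is neither uploaded nor in progress."""
    row = db.execute(
        "SELECT id, file_id, part_id, offset_start, offset_end"
        " FROM out_azure_blob_parts"
        " WHERE uploaded = 0 AND in_progress = 0"
        " ORDER BY file_id, part_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute(
        "UPDATE out_azure_blob_parts SET in_progress = 1 WHERE id = ?",
        (row[0],),
    )
    db.commit()
    return row


def finish_part(db, row_id, ok):
    """Mark a part uploaded on success, or release it for retry on failure."""
    db.execute(
        "UPDATE out_azure_blob_parts"
        " SET uploaded = ?, in_progress = 0 WHERE id = ?",
        (1 if ok else 0, row_id),
    )
    db.commit()


def upload_pending(db, upload):
    """Upload pending parts in order; stop at the first failure.

    `upload(file_id, offset_start, offset_end)` is a hypothetical callable
    returning True on success. Returns the number of parts uploaded.
    """
    done = 0
    while True:
        part = next_pending_part(db)
        if part is None:
            return done
        row_id, file_id, _part_id, start, end = part
        ok = upload(file_id, start, end)
        finish_part(db, row_id, ok)
        if not ok:
            return done
        done += 1
```

Because progress is stored per part, a later call (for example after a restart or a failed request) continues from the first part that is still marked as not uploaded instead of starting the file over.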

Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: leonardo-albertovich <[email protected]>
Signed-off-by: leonardo-albertovich <[email protected]>
@patrick-stephens (Contributor)

Interesting. I did a POC a while back using Fluent Bit to essentially rsync files for "reasons", but that had issues with the line-driven approach not guaranteeing ordering, which I presume this would now.

@leonardo-albertovich (Collaborator)

Yes, you are right @patrick-stephens
