Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Data Sync Plugin for external database synchronization #75

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

onyedikachi-david
Copy link

@onyedikachi-david onyedikachi-david commented Jan 24, 2025

Purpose

Fixes: #72
/claim #72
Implement data synchronization functionality between external data sources and StarbaseDB's internal SQLite database. This plugin enables close-to-edge replica creation by automatically pulling and synchronizing data from external sources (like PostgreSQL) at configurable intervals.

Tasks

  • Implement core Data Sync Plugin

    • Configurable sync intervals via sync_interval option
    • Selective table synchronization via tables option
    • Incremental updates using created_at and id columns
    • PostgreSQL to SQLite type mapping
    • Metadata tracking for sync state
  • Add plugin configuration support

    • Wrangler.toml configuration options
    • Environment variable support for sensitive data
    • Plugin registration system integration
  • Implement sync mechanisms

    • Pull-based synchronization
    • Schema auto-detection and mapping
    • Incremental data fetching
    • Error handling and retry logic
  • Create comprehensive demo and documentation

    • Testing environment with Docker PostgreSQL
    • Example endpoints for monitoring
    • Query hook demonstrations
    • Security best practices

Verify

  1. Start a PostgreSQL instance:
docker run --name starbasedb-postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=demo -p 5432:5432 -d postgres:15
  1. Configure the plugin in wrangler.toml:
[plugins.data-sync]
sync_interval = 300 # 5 minutes
tables = ["users", "products"]
  1. Set up environment variables for database credentials

  2. Test synchronization:

  • Monitor sync status: curl http://localhost:8787/sync-status
  • View synced data: curl http://localhost:8787/sync-data
  • Check sync metadata: curl http://localhost:8787/debug

Before

StarbaseDB instances could only:

  • Use internal SQLite database
  • Connect to external databases for direct queries
  • No automatic data synchronization
  • No edge-replica capabilities

After

image image

StarbaseDB now supports:

  • Automatic data synchronization from external sources
  • Configurable sync intervals and table selection
  • Incremental updates based on timestamps and IDs
  • Type mapping between PostgreSQL and SQLite
  • Close-to-edge replica functionality
  • Monitoring and debugging capabilities

Copy link
Member

@Brayden Brayden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put a few thoughts on the pull request. I still need to run through the demo and test, just starting with a code review.

One question I have: is there any way to abstract the Postgres specific code out so we could easily support other types of databases easily in the future? For example the column mappings between Postgres <> SQLite would obviously only work if the user connects a Postgres database to it, and would break likely for MySQL. Maybe there is a high level plugin like DataSyncPlugin and then another plugin you can pass inside of it that can override functions that would be database specific like:

new DataSyncPlugin(syncSource: PostgresSync | MySQLSync | etc)

And maybe PostgresSync overrides some default class implementation and all the database specific logic (e.g. calling to the information_schema & column type mapping) could exist in there?

Thoughts?

2. For each configured table:
- Retrieves the table schema from the external database
- Creates a corresponding table in the internal database
- Periodically checks for new or updated records based on `created_at` timestamp and `id`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a mechanism we can put in place for tables that don't contain a created_at or id column? Perhaps as part of the sync config where they define an array of tables they want to sync they can also say what key they want to use to annotate latest entry.

1. The plugin creates a metadata table in the internal database to track sync state
2. For each configured table:
- Retrieves the table schema from the external database
- Creates a corresponding table in the internal database
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when a Postgres table contains both a schema and table name (e.g. users.profile) and SQLite only supports tables without schemas. Would the name of the table become users.profile? Would we want the users moving forward to query that table with the ${schema}.${table} name notation moving forward?

I assume for any Postgres public schema tables we would just create them with simply their table name (e.g. ${table}) without a schema prefix, correct?

Lastly, if the user did decide to do public.users would we have a beforeQuery hook that was smart enough in this plugin to know we could omit public. from it as that table is in our SQLite root?

// Create metadata table if it doesn't exist
await this.dataSource?.rpc.executeQuery({
sql: `
CREATE TABLE IF NOT EXISTS data_sync_metadata (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far for our plugins we've been following the tmp_ prefix for our table names so users know what might have been created from Starbase versus what they've created themselves. Would you mind updating to contain that prefix so it should be:

CREATE TABLE IF NOT EXISTS tmp_data_sync_metadata (

@onyedikachi-david
Copy link
Author

Put a few thoughts on the pull request. I still need to run through the demo and test, just starting with a code review.

One question I have: is there any way to abstract the Postgres specific code out so we could easily support other types of databases easily in the future? For example the column mappings between Postgres <> SQLite would obviously only work if the user connects a Postgres database to it, and would break likely for MySQL. Maybe there is a high level plugin like DataSyncPlugin and then another plugin you can pass inside of it that can override functions that would be database specific like:

new DataSyncPlugin(syncSource: PostgresSync | MySQLSync | etc)

And maybe PostgresSync overrides some default class implementation and all the database specific logic (e.g. calling to the information_schema & column type mapping) could exist in there?

Thoughts?

It's best actually, let me see how to implement it.

@onyedikachi-david
Copy link
Author

onyedikachi-david commented Jan 27, 2025

@Brayden

Refactored the data sync plugin to:

  • Abstract database-specific code into DatabaseSyncSource class
  • Add tmp_ prefix to all synced tables
  • Support custom schemas with proper table name mapping
  • Add flexible sync configuration (custom columns, batch sizes)
  • Improve type mapping and validation
  • Add comprehensive error handling and logging
  • Removed demo test; not working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replicate data from external source to internal source with a Plugin
2 participants