Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

github - add new workflows with the folders upsert #9562

Merged
merged 4 commits into from
Dec 20, 2024

Conversation

aubin-tchoi
Copy link
Contributor

@aubin-tchoi aubin-tchoi commented Dec 19, 2024

Description

  • In the context of #9157, the work was split between the PR [KW search] Upsert and backfill GitHub folders #9556 that updated existing activities and this one that performs workflow-level changes.
  • This PR aims at adding the folders upserts for the issues, the discussions, the code root and the repository root in the workflows.
  • To avoid non-deterministic errors we will proceed in the following steps:
    • This PR adds new workflows without changing the existing ones and updates the client to only launch the new workflows.
    • When the long-running tasks on old workflows will all be done, we will safely remove them then.
  • Workflow setup (old/new):
    • githubFullSyncWorkflow/githubSyncAllReposWorkflow fetches the repositories and spawns githubRepoSyncWorkflow/githubSyncRepoWorkflow.
    • githubRepoSyncWorkflow/githubSyncRepoWorkflow spawns githubRepoIssuesSyncWorkflow/githubSyncRepoIssuesWorkflow and githubRepoDiscussionsSyncWorkflow/githubSyncRepoDiscussionsWorkflow and then runs githubCodeSyncActivity.
    • githubCodeSyncActivity upserts the "Code" folder itself (this activity has the same level of responsibility than the 2 workflows for the discussions and issues).
    • githubSyncRepoIssuesWorkflow runs the activity to upsert the "Issues" folder (only diff with githubRepoIssuesSyncWorkflow).
    • githubSyncRepoDiscussionsWorkflow runs the activity to upsert the "Discussions" folder (only diff with githubRepoDiscussionsSyncWorkflow).
    • The other workflows are unaffected, these are the ones that run on webhooks or daily.

Risk

  • Will need to be handled safely to avoid crashing the workflows and possibly missing webhooks.

Deploy Plan

  • Deploy connectors.
  • Run the migration 20241219_backfill_github_folders.
  • Check up on the old workflows (on Monday)

) {
await githubSaveStartSyncActivity(dataSourceConfig);

const queue = new PQueue({ concurrency: MAX_CONCURRENT_REPO_SYNC_WORKFLOWS });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why we don't use the concurrentExecutor here?
If the reason was to not change the workflow code, now is a good time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea why was #3379 reverted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. workflows are executed by temporal servers, we use as little libs as possible since it can break it (but maybe it'd work
  2. I remember somebody (flavien I think) implemented, then went back, then implemented again but couldn't say why

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT concurrentExecutor is only a thin layer on top of p-queue so that's really be the same here, only less code and we'd be using the cool stuff other eng made for the sake of our happiness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but well, as seen IRL let's not bother with this, we'll deal with this separately if we want to (I'm not sure I really really do :p)

@@ -68,7 +68,7 @@ export async function launchGithubFullSyncWorkflow({
return;
}

await client.workflow.start(githubFullSyncWorkflow, {
await client.workflow.start(githubSyncAllReposWorkflow, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should keep the name "full sync" that's commonly used across the eng team to refer to... full syncs

we could go githubFullSync, ghFullSyncWorkflow, fullSyncWorkflow which are used elsewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the record, even if it's a long shot I am in favor of slowly phasing out the terminology FullSync when appropriate as there can be mixups between fullSync === !incrementalSync and fullSync === syncs everything (not true for some connectors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this file since we have some other workflows untouched that start with github and end with workflow I want to keep that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do v2 👍

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great 👍 Minor coms
Thanks for handling this 😱 situation

@@ -159,6 +230,51 @@ export async function githubReposSyncWorkflow(
await githubSaveSuccessSyncActivity(dataSourceConfig);
}

export async function githubSyncReposWorkflow(
dataSourceConfig: DataSourceConfig,
connectorId: ModelId,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we say we need to copy if the workflow does not change, but only the child workflow?
i don't recall (but I'd support if we're unsure 👍 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as seen IRL, let's not risk it, duplicating it is safe with no cost

@aubin-tchoi aubin-tchoi merged commit 0cf6530 into main Dec 20, 2024
3 checks passed
@aubin-tchoi aubin-tchoi deleted the github-new-workflows branch December 20, 2024 09:46
philipperolet pushed a commit that referenced this pull request Dec 20, 2024
* add new workflows that are duplicates of the old ones with the new upsert activities

* fix backfill script

* rename workflows with v2

* 📝
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants