Crustdata enrichment tweaks #2680

Merged
merged 11 commits into from
Nov 8, 2024

Conversation

epipav
Collaborator

@epipav epipav commented Nov 8, 2024

Changes proposed ✍️

What

copilot:summary

copilot:poem

Why

How

copilot:walkthrough

Checklist ✅

  • Label appropriately with Feature, Improvement, or Bug.
  • Add screenshots to the PR description for relevant FE changes
  • New backend functionality has been unit-tested.
  • API documentation has been updated (if necessary) (see docs on API documentation).
  • Quality standards are met.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced several materialized views for monitoring member enrichment and global activity counts.
    • Added a new function to refresh materialized views, enhancing data accuracy.
    • Implemented a scheduled workflow to refresh member enrichment materialized views daily.
  • Improvements

    • Increased the number of members processed for enrichment from 500 to 1000 per run.
    • Enhanced SQL queries to utilize the new membersGlobalActivityCount for activity checks.
    • Improved email handling to support multiple addresses.
  • Bug Fixes

    • Refined normalization logic for LinkedIn handles and email data processing.

@epipav epipav self-assigned this Nov 8, 2024

coderabbitai bot commented Nov 8, 2024

Walkthrough

This pull request extends the member enrichment monitoring system. It adds materialized views that aggregate member activity counts and per-source enrichment monitoring data, together with a function and a daily scheduled workflow to refresh them. The enrichment services' SQL queries now read activity counts from membersGlobalActivityCount. The PR also raises the number of members queried per run from 500 to 1000, processes them in chunks, and tightens the normalization of LinkedIn handles and Crustdata email data.

Changes

File Path Change Summary
backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql Added materialized views: membersGlobalActivityCount, memberEnrichmentMonitoringTotal, memberEnrichmentMonitoringClearbit, memberEnrichmentMonitoringCrustdata, memberEnrichmentMonitoringProgaiGithub, memberEnrichmentMonitoringProgaiLinkedin, memberEnrichmentMonitoringSerp.
services/apps/premium/members_enrichment_worker/src/activities.ts Added method: refreshMemberEnrichmentMaterializedView.
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts Added method: refreshMemberEnrichmentMaterializedView(mvName: string): Promise<void>.
services/apps/premium/members_enrichment_worker/src/main.ts Added import and execution of scheduleRefreshMembersEnrichmentMaterializedViews after svc.init().
services/apps/premium/members_enrichment_worker/src/schedules.ts Added method: scheduleRefreshMembersEnrichmentMaterializedViews.
services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts Added method: scheduleRefreshMembersEnrichmentMaterializedViews, scheduled to run daily at 5 AM. Updated import statement to include refreshMemberEnrichmentMaterializedViews.
services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts Updated enrichableBySql to reference membersGlobalActivityCount. Modified normalizeIdentities for LinkedIn handles.
services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts Updated enrichableBySql to reference membersGlobalActivityCount. Enhanced normalizeIdentities for email handling.
services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts Updated email property type in IMemberEnrichmentDataCrustdata from string to `string | string[]`.
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts Modified findDistinctScrapableLinkedinIdentities to sanitize LinkedIn handles by removing slashes.
services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts Updated SQL condition in EnrichmentServiceSerpApi to reference membersGlobalActivityCount.
services/apps/premium/members_enrichment_worker/src/types.ts Updated comment in IEnrichmentService to reflect changes to enrichableBySql.
services/apps/premium/members_enrichment_worker/src/utils/common.ts Added utility function: chunkArray<T>(array: T[], chunkSize: number): T[][].
services/apps/premium/members_enrichment_worker/src/workflows.ts Added import for refreshMemberEnrichmentMaterializedViews.
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts Modified startToCloseTimeout in proxyActivities call from '20 seconds' to '1 minute'.
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts Increased QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN from 500 to 1000. Introduced PARALLEL_ENRICHMENT_WORKFLOWS set to 5. Updated logic to process members in chunks.
services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts Added new workflow for refreshing materialized views.
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts Updated SQL queries in fetchMembersForEnrichment to use membersGlobalActivityCount instead of activitySummary.
services/libs/data-access-layer/src/utils.ts Added function: refreshMaterializedView(tx: DbConnOrTx, mvName: string, concurrently = false).
services/libs/types/src/enums/enrichment.ts Added new enum: MemberEnrichmentMaterializedView with several members related to member enrichment monitoring.

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • themarolt

Poem

🐇 In the fields where data flows,
Materialized views now brightly glow.
Enrichment tasks scheduled with care,
Members' insights are now laid bare.
With each refresh, our knowledge grows,
In the world of data, a rabbit knows! 🌼

@epipav epipav added the Improvement Created by Linear-GitHub Sync label Nov 8, 2024
@epipav epipav marked this pull request as ready for review November 8, 2024 12:46
@@ -100,3 +100,11 @@ export function normalizeAttributes(

return normalized
}

export function chunkArray<T>(array: T[], chunkSize: number): T[][] {
Contributor

I believe we already have partition in this file, which does the same thing.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (17)
services/libs/types/src/enums/enrichment.ts (1)

9-17: Consider adding documentation for the enum.

The implementation is well-structured with consistent naming. Consider adding JSDoc comments to document:

  • The purpose of the enum
  • The meaning and usage of each materialized view
  • The relationship between these views and MemberEnrichmentSource

Example documentation:

+/**
+ * Names of materialized views used for member enrichment monitoring.
+ * These views aggregate data from various enrichment sources and activities.
+ */
 export enum MemberEnrichmentMaterializedView {
+  /** Global count of member activities across all sources */
   MEMBERS_GLOBAL_ACTIVITY_COUNT = 'membersGlobalActivityCount',
+  /** Aggregated analysis of all enrichment sources */
   TOTAL_ENRICHMENT_ANALYSIS = 'memberEnrichmentMonitoringTotal',
   // ... (continue for other members)
 }
services/apps/premium/members_enrichment_worker/src/main.ts (1)

50-51: Consider adding error handling and logging for the materialized view refresh.

While the scheduling sequence is logical, the critical nature of materialized view refresh operations warrants additional safeguards.

Consider wrapping the operations in try-catch blocks and adding logging:

- await scheduleRefreshMembersEnrichmentMaterializedViews()
- await scheduleMembersEnrichment()
+ try {
+   await scheduleRefreshMembersEnrichmentMaterializedViews()
+   svc.logger.info('Successfully scheduled materialized view refresh')
+   await scheduleMembersEnrichment()
+   svc.logger.info('Successfully scheduled member enrichment')
+ } catch (error) {
+   svc.logger.error('Failed to schedule enrichment tasks:', error)
+   throw error  // Re-throw to prevent service start on critical failure
+ }
services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts (1)

20-20: Consider adding JSDoc comments to document the email field behavior.

Adding documentation about the email field's format would help other developers understand when to expect a single email versus an array of emails.

+  /** 
+   * Email address(es) associated with the member.
+   * Can be a single email string or an array of email strings.
+   */
   email: string | string[]
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (1)

22-23: Consider making workflow parameters configurable

The hardcoded values for QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN and PARALLEL_ENRICHMENT_WORKFLOWS should ideally be configurable through environment variables to allow for runtime tuning based on system capacity and performance requirements.

-  const QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN = 1000
-  const PARALLEL_ENRICHMENT_WORKFLOWS = 5
+  const QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN = process.env.QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN 
+    ? parseInt(process.env.QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN, 10) 
+    : 1000
+  const PARALLEL_ENRICHMENT_WORKFLOWS = process.env.PARALLEL_ENRICHMENT_WORKFLOWS
+    ? parseInt(process.env.PARALLEL_ENRICHMENT_WORKFLOWS, 10)
+    : 5
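One caveat with the parseInt approach above: a malformed value such as "12abc" is silently truncated to 12, and "abc" produces NaN. A minimal sketch of a stricter reader follows, assuming a hypothetical helper named readPositiveInt that is not part of this PR. It takes the raw string as an argument rather than reading process.env directly, since the Temporal workflow sandbox may not expose Node.js globals and the value would likely need to be passed in from the worker.

```typescript
// Hypothetical helper (not part of the PR): parse a positive integer setting,
// falling back to a default when the raw value is missing or malformed.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') {
    return fallback
  }
  const parsed = Number(raw)
  // Number('12abc') is NaN, whereas parseInt('12abc', 10) would return 12.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback
}

// The raw values would come from wherever configuration is available,
// for example process.env in the worker's main.ts.
const membersPerRun = readPositiveInt('2000', 1000)
const parallelWorkflows = readPositiveInt('abc', 5)
```

With this shape, a typo in the environment falls back to the hardcoded default instead of propagating NaN into the query limit.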
services/apps/premium/members_enrichment_worker/src/utils/common.ts (1)

104-110: The implementation looks good but could be enhanced.

The generic chunkArray function is well-typed and implements the basic chunking logic correctly.

Consider these improvements:

  1. Add input validation and edge case handling
  2. Use Array.from() with a mapping function for a more declarative style

 export function chunkArray<T>(array: T[], chunkSize: number): T[][] {
-  const chunks = []
-  for (let i = 0; i < array.length; i += chunkSize) {
-    chunks.push(array.slice(i, i + chunkSize))
-  }
-  return chunks
+  if (!array.length || chunkSize <= 0) {
+    return []
+  }
+
+  return Array.from(
+    { length: Math.ceil(array.length / chunkSize) },
+    (_, index) => array.slice(index * chunkSize, (index + 1) * chunkSize),
+  )
 }

This refactored version:

  • Handles empty arrays and invalid chunk sizes
  • Uses Array.from() for a more functional style; memory use is comparable to the loop, since both build the same chunk arrays
  • Maintains the same functionality while being more robust
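To illustrate how chunking bounds parallelism, here is a minimal sketch. The chunkArray body follows the loop shown in the diff, with a size guard added; processInChunks and its worker callback are hypothetical names used only for illustration, not the workflow code from this PR.

```typescript
// Loop-based chunkArray as in the PR, plus a guard for invalid chunk sizes.
function chunkArray<T>(array: T[], chunkSize: number): T[][] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    return []
  }
  const chunks: T[][] = []
  for (let i = 0; i < array.length; i += chunkSize) {
    chunks.push(array.slice(i, i + chunkSize))
  }
  return chunks
}

// Hypothetical driver: run `worker` for every item, at most `parallel` at a time.
async function processInChunks<T>(
  items: T[],
  parallel: number,
  worker: (item: T) => Promise<void>,
): Promise<void> {
  for (const chunk of chunkArray(items, parallel)) {
    // Wait for the whole chunk to finish before starting the next one.
    await Promise.all(chunk.map((item) => worker(item)))
  }
}
```

Note the trade-off in this design: each chunk waits for its slowest item, so throughput is lower than with a sliding-window limiter, but the logic stays simple and the concurrency never exceeds the chunk size.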
services/apps/premium/members_enrichment_worker/src/types.ts (1)

54-54: Consider enhancing the SQL alias documentation.

While the comment correctly documents the alias change, it could be more specific about the materialized view being referenced.

Consider updating the comment to:

- // activity count is available in "membersGlobalActivityCount" alias, "membersGlobalActivityCount".total_count field
+ // activity count comes from the "membersGlobalActivityCount" materialized view, joined under the same alias and read via "membersGlobalActivityCount".total_count
services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1)

64-68: Consider adjusting retry settings for view refresh

The current retry settings (15s initial, 2x backoff, 3 attempts) might not be optimal for materialized view refreshes which could fail due to temporary database load. Consider increasing maximumAttempts or initialInterval for more resilience.

  retry: {
    initialInterval: '15 seconds',
    backoffCoefficient: 2,
-   maximumAttempts: 3,
+   maximumAttempts: 5,
  },
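To make the effect of the suggested change concrete, the sketch below computes the wait before each retry. It assumes Temporal's usual retry semantics, where maximumAttempts includes the first attempt, and it ignores any maximumInterval cap. The function name retryDelaysSeconds is hypothetical.

```typescript
// Delay before each retry for an exponential backoff policy.
// maximumAttempts counts the first try, so there are maximumAttempts - 1 retries.
function retryDelaysSeconds(
  initialIntervalSeconds: number,
  backoffCoefficient: number,
  maximumAttempts: number,
): number[] {
  const delays: number[] = []
  for (let retry = 0; retry < maximumAttempts - 1; retry++) {
    delays.push(initialIntervalSeconds * Math.pow(backoffCoefficient, retry))
  }
  return delays
}

const currentDelays = retryDelaysSeconds(15, 2, 3) // [15, 30]
const suggestedDelays = retryDelaysSeconds(15, 2, 5) // [15, 30, 60, 120]
```

Under these assumptions the current policy gives up after about 45 seconds of waiting, while five attempts stretch that to 225 seconds, which leaves more room for a short spike in database load to pass.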
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2)

Line range hint 95-108: Implementation needed: Data squashing logic is incomplete.

The code prepares normalized data from different sources but lacks the actual squashing implementation. This TODO represents a critical missing piece that affects the enrichment process. Consider:

  1. Implementing the LLM-based data squashing logic
  2. Adding error handling for LLM processing
  3. Including validation for the squashed data

Would you like me to help create a GitHub issue to track this implementation task? I can include detailed requirements and implementation suggestions for the LLM-based data squashing logic.


Line range hint 32-94: Enhance error handling and logging in the enrichment process.

While the code includes retry logic, it could benefit from more robust error handling and logging:

Consider these improvements:

 export async function enrichMember(
   input: IEnrichableMember,
   sources: MemberEnrichmentSource[],
 ): Promise<void> {
   let changeInEnrichmentSourceData = false
+  const enrichmentResults: Record<string, boolean> = {}
 
   for (const source of sources) {
     try {
       // find if there's already saved enrichment data in source
       const cache = await findMemberEnrichmentCache(source, input.id)
 
       if (await isCacheObsolete(source, cache)) {
         const enrichmentInput: IEnrichmentSourceInput = {
           // ... existing input construction ...
         }
 
         const data = await getEnrichmentData(source, enrichmentInput)
+        enrichmentResults[source] = !!data
 
         if (!cache) {
           await insertMemberEnrichmentCache(source, input.id, data)
           if (data) {
             changeInEnrichmentSourceData = true
           }
         }
         // ... rest of the logic ...
       }
+    } catch (error) {
+      enrichmentResults[source] = false
+      // Re-throw critical errors, log others
+      if (error instanceof CriticalEnrichmentError) {
+        throw error
+      }
+      console.error(`Enrichment failed for source ${source}:`, error)
     }
   }
+  
+  // Log enrichment results for monitoring
+  console.info('Enrichment results:', { memberId: input.id, enrichmentResults })
 }
services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts (1)

Line range hint 24-29: Document the materialized view dependency.

Consider adding a comment above the SQL query to document the dependency on the membersGlobalActivityCount materialized view. This will help future maintainers understand the schema requirements.

+ // Depends on membersGlobalActivityCount materialized view
+ // which provides aggregated activity counts for members
  ("membersGlobalActivityCount".total_count > ${this.enrichMembersWithActivityMoreThan}) AND
services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (1)

175-177: LGTM! Consider additional URL validation.

The LinkedIn handle normalization improvement is robust and handles trailing slashes correctly. However, consider adding a validation check for empty segments.

Consider this safer implementation:

-          handle: data.linkedin.handle.endsWith('/')
-            ? data.linkedin.handle.slice(0, -1).split('/').pop()
-            : data.linkedin.handle.split('/').pop(),
+          handle: (() => {
+            const segments = data.linkedin.handle.split('/').filter(Boolean);
+            return segments.length > 0 ? segments[segments.length - 1] : data.linkedin.handle;
+          })(),

This ensures we never get an empty string from pop() (for example when the handle ends in "//") and handles multiple consecutive slashes.
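The segment-filtering approach can be checked in isolation. The sketch below extracts it into a standalone function; the name lastPathSegment and the sample URLs are hypothetical and not taken from the PR.

```typescript
// Hypothetical helper: return the last non-empty path segment of a LinkedIn
// handle or profile URL, or the input unchanged if no segment is found.
function lastPathSegment(handle: string): string {
  const segments = handle.split('/').filter(Boolean)
  return segments.length > 0 ? segments[segments.length - 1] : handle
}

// Every one of these sample inputs normalizes to 'john-doe'.
const examples = [
  'john-doe',
  'https://www.linkedin.com/in/john-doe',
  'https://www.linkedin.com/in/john-doe/',
  'https://www.linkedin.com/in/john-doe//',
].map(lastPathSegment)
```

Query strings and fragments are deliberately out of scope here; a URL such as ".../john-doe?trk=x" would need separate handling.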

services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1)

Line range hint 1-199: Consider architectural improvements for robustness.

A few suggestions to enhance the service's reliability and maintainability:

  1. Validate environment variables at startup
  2. Improve error handling for API responses
  3. Consider implementing retry logic for failed API calls
  4. Add metrics for monitoring enrichment success rates

Would you like me to provide specific implementation details for any of these improvements?

services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (2)

219-219: LGTM! Consider additional URL sanitization.

The slash removal is a good defensive programming approach. However, consider extending the sanitization to handle other common URL components that might be present in LinkedIn handles.

-          value: input.linkedin.value.replace(/\//g, ''),
+          value: input.linkedin.value.replace(/[\/\?#]/g, '').toLowerCase(),

283-291: LGTM! Consider adding error handling for malformed email data.

The change improves robustness by supporting both string and array email formats. However, consider adding error handling for potential edge cases.

       if (Array.isArray(data.email)) {
-        emails = data.email
+        emails = data.email.filter(e => e && typeof e === 'string').filter(isEmail)
       } else {
-        emails = data.email.split(',').filter(isEmail)
+        emails = (typeof data.email === 'string' ? data.email.split(',') : [])
+          .filter(e => e && typeof e === 'string')
+          .filter(isEmail)
       }
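The same idea can be written as one function that accepts either shape. This is a minimal sketch under stated assumptions: looksLikeEmail is a stand-in for the service's isEmail validator (whose implementation is not shown here), normalizeEmails is a hypothetical name, and the trimming step is an addition that the diff above does not perform.

```typescript
// Stand-in for the isEmail validator used in the service; this simple regex is
// an assumption for illustration, not the project's actual check.
function looksLikeEmail(value: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)
}

// Accept a comma-separated string, an array, or anything else, and return
// only trimmed, plausible email addresses.
function normalizeEmails(email: unknown): string[] {
  const candidates: unknown[] = Array.isArray(email)
    ? email
    : typeof email === 'string'
      ? email.split(',')
      : []
  return candidates
    .filter((e): e is string => typeof e === 'string')
    .map((e) => e.trim())
    .filter(looksLikeEmail)
}
```

Typing the parameter as unknown rather than string | string[] reflects that the value comes from an external API response and may not match the declared type at runtime.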
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1)

Line range hint 22-68: Consider implementing a view refresh strategy.

The migration from temporary tables to materialized views is a good architectural choice for performance optimization. However, to ensure data freshness:

  1. Consider implementing a strategy to refresh the materialized view periodically.
  2. Document the refresh frequency requirements based on business needs.
  3. Monitor query performance to validate the benefits of this change.
backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql (2)

18-194: Consider refactoring to reduce code duplication.

The view effectively monitors total enrichment stats but contains repeated enrichment criteria across CTEs. Consider creating a separate view for common enrichment criteria that can be reused across different monitoring views.

Example approach:

-- Create a base view for common enrichment criteria
create view member_enrichment_criteria as
select distinct mem.id as "memberId"
from members mem
inner join "memberIdentities" mi on mem.id = mi."memberId"
where (
    (mi.verified and ((mi.type = 'username' AND mi.platform = 'github') OR (mi.type = 'email')))
    -- ... other common criteria
);

-- Use in monitoring views
create materialized view "memberEnrichmentMonitoringTotal" as
select ... from member_enrichment_criteria
-- ... rest of the logic

559-619: Improve readability and configuration of SERP enrichment criteria.

Two suggestions for improvement:

  1. The activity threshold of 500 is hardcoded. Consider making it configurable.
  2. The complex boolean conditions for member eligibility could be more readable.

Example approach:

-- Create a function for SERP activity threshold
create or replace function get_serp_activity_threshold() returns integer as $$
begin
    return 500; -- Configure this value
end;
$$ language plpgsql;

-- Create a function to check member eligibility
create or replace function is_serp_enrichable_member(
    activity_count integer,
    display_name text,
    location_default text,
    website_url_default text,
    has_verified_github boolean,
    has_verified_email boolean
) returns boolean as $$
begin
    return activity_count > get_serp_activity_threshold()
        and display_name like '% %'
        and location_default is not null 
        and location_default <> ''
        and (
            (website_url_default is not null and website_url_default <> '')
            or has_verified_github
            or has_verified_email
        );
end;
$$ language plpgsql;
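The same eligibility rule can also be mirrored on the application side, which makes it straightforward to unit test. The sketch below is an assumption-laden translation of the SQL function above: the SerpCandidate interface and its field names are hypothetical, and the default threshold of 500 comes from the review comment.

```typescript
// Field names are hypothetical; they mirror the SQL function's parameters.
interface SerpCandidate {
  activityCount: number
  displayName: string
  location: string | null
  websiteUrl: string | null
  hasVerifiedGithub: boolean
  hasVerifiedEmail: boolean
}

const SERP_ACTIVITY_THRESHOLD = 500

function isSerpEnrichable(m: SerpCandidate, threshold = SERP_ACTIVITY_THRESHOLD): boolean {
  const hasLocation = m.location !== null && m.location !== ''
  const hasWebsite = m.websiteUrl !== null && m.websiteUrl !== ''
  return (
    m.activityCount > threshold &&
    m.displayName.includes(' ') && // SQL: display_name like '% %'
    hasLocation &&
    (hasWebsite || m.hasVerifiedGithub || m.hasVerifiedEmail)
  )
}
```

Keeping a TypeScript copy of the rule does introduce a second source of truth, so it is probably best reserved for tests or monitoring rather than used in place of the SQL condition.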
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 65cb14b and 51a6d46.

📒 Files selected for processing (20)
  • backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/main.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/schedules.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (3 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/types.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/utils/common.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (3 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts (1 hunks)
  • services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (2 hunks)
  • services/libs/data-access-layer/src/utils.ts (2 hunks)
  • services/libs/types/src/enums/enrichment.ts (1 hunks)
🔇 Additional comments (34)
services/apps/premium/members_enrichment_worker/src/schedules.ts (2)

1-5: LGTM! Clean and consistent export structure.

The changes maintain a clean and consistent module structure while adding the new materialized view refresh functionality.

Also applies to: 7-11


4-4: Verify the implementation of the new schedule.

Let's confirm that the implementation exists in the source file.

✅ Verification successful

Let me gather more information about the workflow registration and schedule configuration to ensure everything is properly connected.


Let me try one more search with different patterns to find the workflow registration and complete schedule implementation.


Schedule implementation verified and properly configured

The schedule is correctly implemented with:

  • A daily cron schedule at 5 AM (0 5 * * *)
  • Proper overlap handling with ScheduleOverlapPolicy.SKIP
  • Workflow trigger configuration pointing to refreshMemberEnrichmentMaterializedViews
  • Task queue set to 'members-enrichment'
  • The workflow implementation exists and handles materialized view refresh
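To show what the cron expression 0 5 * * * means in practice, here is a small sketch that computes the next firing time. It assumes the expression is evaluated in UTC; the source does not state the time zone, so treat that as an assumption. The function name nextDailyRunUtc is hypothetical.

```typescript
// Next time a daily "0 <hour> * * *" cron fires, evaluated in UTC.
function nextDailyRunUtc(now: Date, hourUtc: number): Date {
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hourUtc, 0, 0, 0),
  )
  if (next.getTime() <= now.getTime()) {
    // Today's slot has already passed, so move to tomorrow.
    next.setUTCDate(next.getUTCDate() + 1)
  }
  return next
}

const beforeFive = nextDailyRunUtc(new Date('2024-11-08T03:30:00Z'), 5)
const afterFive = nextDailyRunUtc(new Date('2024-11-08T12:46:00Z'), 5)
```

A worker deployed in the afternoon would therefore not refresh the views until the following morning, which matters if the views are expected to be populated right after the migration runs.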
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of scheduleRefreshMembersEnrichmentMaterializedViews

# Test 1: Check if the implementation exists
echo "Checking for the schedule implementation..."
rg -A 10 "scheduleRefreshMembersEnrichmentMaterializedViews" "services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts"

# Test 2: Check if the workflow type is properly defined
echo "Checking for the workflow type..."
rg -A 5 "refreshMemberEnrichmentMaterializedViews" "services/apps/premium/members_enrichment_worker/src/workflows/"

Length of output: 1539


Script:

#!/bin/bash
# Check workflow registration
echo "Checking workflow registration..."
rg -A 5 "registerWorkflow.*refreshMemberEnrichmentMaterializedViews" services/apps/premium/members_enrichment_worker/src/

# Check schedule configuration and workflow trigger
echo "Checking schedule configuration..."
rg -A 10 "schedule\.create.*refresh-members-enrichment-materialized-views" services/apps/premium/members_enrichment_worker/src/

Length of output: 409


Script:

#!/bin/bash
# Check for workflow registration with a broader pattern
echo "Checking workflow registration..."
rg "registerWorkflow.*MaterializedViews" services/apps/premium/members_enrichment_worker/src/

# Check the complete schedule implementation
echo "Checking complete schedule implementation..."
rg -A 15 "scheduleRefreshMembersEnrichmentMaterializedViews" services/apps/premium/members_enrichment_worker/src/

Length of output: 5191

services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts (3)

1-6: LGTM: Clean and focused imports.

The imports are well-organized and properly scoped, bringing in only the necessary dependencies.


1-15: Verify integration with scheduling system.

The workflow is scheduled to run daily at 5 AM according to the AI summary. Ensure:

  1. The schedule is properly configured in all environments
  2. The workflow handles timezone differences appropriately
  3. There are no conflicting schedules that might cause resource contention
✅ Verification successful

Based on the shell script results, I can now provide a final response:

Schedule configuration is properly set with no resource conflicts

The workflow is correctly scheduled to run daily at 5 AM via scheduleRefreshMembersEnrichmentMaterializedViews with:

  • Proper cron expression: 0 5 * * *
  • Overlap policy set to SKIP to prevent concurrent executions
  • No conflicting materialized view refreshes at the same time window
  • Other scheduled tasks in the system run at different times:
    • Members enrichment at 6 AM
    • Organization aggregates at 8 AM
    • Analytics at varying times
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for schedule configurations and potential conflicts

# Look for other scheduled tasks that might interfere
rg -A 5 "schedule" --glob "**/*.ts"

# Check for other materialized view refresh schedules
ast-grep --pattern 'refreshMaterializedView($$$)'

Length of output: 21583


7-9: Consider adjusting the timeout configuration.

The 10-minute timeout might be insufficient for large datasets or when multiple views need refreshing. Consider:

  1. Making the timeout configurable based on environment variables
  2. Implementing a retry mechanism for timeout failures
services/apps/premium/members_enrichment_worker/src/workflows.ts (1)

7-7: LGTM! Please verify error handling.

The addition of the materialized view refresh workflow follows the established module structure. However, since this involves database operations, we should ensure proper error handling is in place.

Let's verify the error handling implementation:

Also applies to: 16-16

✅ Verification successful

Error handling is properly implemented through the stack

The materialized view refresh operation is well-protected:

  1. The workflow uses Temporal's proxyActivities with a 10-minute timeout, providing automatic retry and timeout handling
  2. The activity delegates to the DAL's refreshMaterializedView utility which uses a database transaction
  3. Database errors will be automatically propagated and handled by Temporal's retry mechanism

No additional error handling is needed as the current implementation follows the established patterns in the codebase.
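For context on the refresh utility mentioned in point 2, the sketch below shows one way such a statement could be built safely. It is not the DAL's actual implementation, which is not shown here: buildRefreshStatement and the allow-list are hypothetical, while the view names are the ones listed in the change summary. Note that in PostgreSQL, REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view.

```typescript
// View names taken from the PR's change summary.
const KNOWN_VIEWS = [
  'membersGlobalActivityCount',
  'memberEnrichmentMonitoringTotal',
  'memberEnrichmentMonitoringClearbit',
  'memberEnrichmentMonitoringCrustdata',
  'memberEnrichmentMonitoringProgaiGithub',
  'memberEnrichmentMonitoringProgaiLinkedin',
  'memberEnrichmentMonitoringSerp',
] as const

// Hypothetical helper: build the refresh statement for an allow-listed view.
// The name is interpolated into SQL, so unknown names are rejected outright.
function buildRefreshStatement(mvName: string, concurrently = false): string {
  if (!(KNOWN_VIEWS as readonly string[]).includes(mvName)) {
    throw new Error(`Unknown materialized view: ${mvName}`)
  }
  const mode = concurrently ? ' CONCURRENTLY' : ''
  // Double quotes preserve the camelCase identifier in PostgreSQL.
  return `REFRESH MATERIALIZED VIEW${mode} "${mvName}"`
}
```

Because identifiers cannot be passed as bound parameters, validating the name against a fixed list (or the MemberEnrichmentMaterializedView enum) is the main protection against injection here.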

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check error handling in the new workflow

# Test: Look for try-catch blocks and error handling patterns
rg -A 5 "try\s*{" "./services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews"

# Test: Check for error logging
rg "error|logger" "./services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews"

Length of output: 472


Script:

#!/bin/bash
# Let's first check if the file exists and its location
fd "refreshMemberEnrichmentMaterializedViews" --type f

# If we find the file, let's examine its content for error handling patterns
# Using a more reliable regex pattern for try-catch blocks
rg "try {|catch|throw|error|logger" --type ts

Length of output: 326


Script:

#!/bin/bash
# Now that we found the file, let's examine its content
cat services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts

# Also check for error handling patterns with separate searches
rg "try" services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts
rg "catch" services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts
rg "throw" services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts
rg "logger" services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts

Length of output: 1097


Script:

#!/bin/bash
# Let's check the activity implementation since error handling might be there
cat services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

# Also check for error handling patterns in the activity file
rg "try|catch|throw|logger" services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 5379


Script:

#!/bin/bash
# Let's check the implementation of refreshMaterializedView utility since it's used by our function
cat $(fd "utils.ts" --type f | grep "data-access-layer")

# Also check for error handling in the main service setup
cat services/apps/premium/members_enrichment_worker/src/main.ts

Length of output: 12273

services/libs/types/src/enums/enrichment.ts (1)

9-17: Verify materialized view names match the database.

The enum values should exactly match the materialized view names in the database to prevent runtime errors.

Let's verify the view names:

✅ Verification successful

Materialized view names match the database schema

Based on the search results, all materialized view names in the enum exactly match the database schema. The following views were found in the SQL files:

  • membersGlobalActivityCount
  • memberEnrichmentMonitoringTotal
  • memberEnrichmentMonitoringClearbit
  • memberEnrichmentMonitoringCrustdata
  • memberEnrichmentMonitoringProgaiGithub
  • memberEnrichmentMonitoringProgaiLinkedin
  • memberEnrichmentMonitoringSerp

Each view name in the enum corresponds to an actual materialized view in the database with the exact same name.
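
The same check can be sketched in code. The enum values below are copied from the list above, but the member names and both helper functions are illustrative assumptions rather than the project's actual definitions. The sketch assumes the SQL uses quoted identifiers, so names are compared case-sensitively.

```typescript
// Values come from the list above; member names are hypothetical.
enum MemberEnrichmentMaterializedView {
  GLOBAL_ACTIVITY_COUNT = 'membersGlobalActivityCount',
  TOTAL = 'memberEnrichmentMonitoringTotal',
  CLEARBIT = 'memberEnrichmentMonitoringClearbit',
  CRUSTDATA = 'memberEnrichmentMonitoringCrustdata',
  PROGAI_GITHUB = 'memberEnrichmentMonitoringProgaiGithub',
  PROGAI_LINKEDIN = 'memberEnrichmentMonitoringProgaiLinkedin',
  SERP = 'memberEnrichmentMonitoringSerp',
}

// Collects the names found in CREATE MATERIALIZED VIEW statements (quoted or bare).
function extractViewNames(sql: string): string[] {
  const pattern = /create\s+materialized\s+view\s+(?:if\s+not\s+exists\s+)?"?(\w+)"?/gi
  const names: string[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(sql)) !== null) {
    names.push(match[1])
  }
  return names
}

// Returns every expected view name that has no matching definition in the SQL.
function findMissingViews(expected: string[], sql: string): string[] {
  const defined = extractViewNames(sql)
  return expected.filter((name) => !defined.includes(name))
}
```

Calling `findMissingViews(Object.values(MemberEnrichmentMaterializedView), migrationSql)` with the migration file's contents should return an empty array when every enum value has a definition.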

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the materialized view names exist in SQL files
# Expected: Each view name should be found in migration files

echo "Checking materialized view definitions..."
rg -l "CREATE MATERIALIZED VIEW" | while read -r file; do
  echo "Checking $file for view definitions..."
  for view in "membersGlobalActivityCount" "memberEnrichmentMonitoringTotal" "memberEnrichmentMonitoringProgaiGithub" "memberEnrichmentMonitoringClearbit" "memberEnrichmentMonitoringSerp" "memberEnrichmentMonitoringProgaiLinkedin" "memberEnrichmentMonitoringCrustdata"; do
    echo "Searching for view: $view"
    rg "CREATE MATERIALIZED VIEW.*$view" "$file"
  done
done

Length of output: 7834


Script:

#!/bin/bash
# Let's try a different approach by searching for any materialized views in SQL files
# and then compare with our enum values

echo "Listing all materialized view definitions in SQL files:"
rg -i "CREATE\s+(?:OR\s+REPLACE\s+)?MATERIALIZED\s+VIEW\s+(\w+)" -g "*.sql" --only-matching --replace '$1'

# Also search for these specific view names without the CREATE clause
# in case they're referenced elsewhere in SQL files
echo -e "\nSearching for view names in SQL files:"
rg -i "members_global_activity_count|member_enrichment_monitoring_total|member_enrichment_monitoring_progai_github|member_enrichment_monitoring_clearbit|member_enrichment_monitoring_serp|member_enrichment_monitoring_progai_linkedin|member_enrichment_monitoring_crustdata" -g "*.sql"

Length of output: 857


Script:

#!/bin/bash
# Let's check for these views in a case-insensitive manner and with more flexible matching
# Also search in TypeScript/JavaScript files to see if these names are referenced elsewhere

echo "Searching for materialized views with snake_case naming:"
rg -i "CREATE\s+(?:OR\s+REPLACE\s+)?MATERIALIZED\s+VIEW\s+(?:IF\s+NOT\s+EXISTS\s+)?(?:\"?members_global_activity_count\"?|\"?member_enrichment_monitoring_total\"?|\"?member_enrichment_monitoring_progai_github\"?|\"?member_enrichment_monitoring_clearbit\"?|\"?member_enrichment_monitoring_serp\"?|\"?member_enrichment_monitoring_progai_linkedin\"?|\"?member_enrichment_monitoring_crustdata\"?)" -g "*.sql"

echo -e "\nSearching for view references in TypeScript/JavaScript files:"
rg -i "membersGlobalActivityCount|memberEnrichmentMonitoringTotal|memberEnrichmentMonitoringProgaiGithub|memberEnrichmentMonitoringClearbit|memberEnrichmentMonitoringSerp|memberEnrichmentMonitoringProgaiLinkedin|memberEnrichmentMonitoringCrustdata" -g "*.{ts,js}" -A 2 -B 2

echo -e "\nListing all files that might contain materialized view definitions:"
fd -e sql -x cat {} | grep -i "materialized view" -A 3

Length of output: 14407

services/apps/premium/members_enrichment_worker/src/main.ts (2)

5-9: LGTM! Import changes are well-structured.

The new import for materialized view refresh scheduling is properly organized with related scheduling functions.


Line range hint 11-24: Verify Crustdata configuration and workflow dependencies.

The initialization sequence has been modified in a service that handles multiple enrichment providers. Let's ensure all dependencies are properly configured.

✅ Verification successful

The code shows that Crustdata is properly integrated into the enrichment system:

Crustdata integration is properly configured and aligned with existing workflows

  • Crustdata is correctly registered as an enrichment source in MemberEnrichmentSource enum
  • It has a dedicated materialized view CRUSTDATA_LINKEDIN_SCRAPER_ENRICHMENT_ANALYSIS for monitoring
  • The factory pattern properly instantiates EnrichmentServiceCrustdata when needed
  • The type system accommodates Crustdata's data structure through IMemberEnrichmentDataCrustdata

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify Crustdata configuration and workflow dependencies

# Check for any hardcoded Crustdata configurations that might conflict
rg -l "crustdata" --type ts

# Look for any temporal workflow dependencies
ast-grep --pattern 'workflow.execute($_, "refreshMemberEnrichmentMaterializedView")'

Length of output: 329


Script:

#!/bin/bash
# Check the content of relevant files to understand Crustdata configuration and usage

# Check enrichment type definitions
cat services/libs/types/src/enums/enrichment.ts

# Check factory implementation for Crustdata
cat services/apps/premium/members_enrichment_worker/src/factory.ts

# Check types definition
cat services/apps/premium/members_enrichment_worker/src/types.ts

# Look for any workflow definitions related to enrichment
rg "class.*Workflow" --type ts -A 5

Length of output: 5891

services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts (1)

20-20: LGTM! Verify handling of the new email type.

The change to support both single email and multiple emails is appropriate. However, ensure all consumers of this interface properly handle both cases.

Let's verify the usage of this interface:

✅ Verification successful

LGTM! The email type change is properly handled

The codebase shows proper handling of the new string | string[] type in the Crustdata service:

  • The service correctly handles both array and string cases in services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts:
    if (Array.isArray(data.email)) {
      emails = data.email
    } else {
      emails = data.email.split(',').filter(isEmail)
    }
  • The type is properly normalized into the common format used across the application
  • Other email processing services in the codebase already handle multiple emails through arrays
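
For illustration, a slightly more defensive variant of the same normalization is sketched below. It is not the service's code: `looksLikeEmail` is a simplified stand-in for the `isEmail` validator referenced above, and trimming, validating the array case, and de-duplicating are added assumptions.

```typescript
// Simplified stand-in for a real email validator (assumption, intentionally loose).
function looksLikeEmail(value: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)
}

// Accepts a comma-separated string or an array and returns trimmed,
// validated, de-duplicated addresses in their original order.
function normalizeEmails(email: string | string[] | undefined): string[] {
  if (email === undefined) {
    return []
  }
  const candidates = Array.isArray(email) ? email : email.split(',')
  const result: string[] = []
  for (const candidate of candidates) {
    const trimmed = candidate.trim()
    if (looksLikeEmail(trimmed) && !result.includes(trimmed)) {
      result.push(trimmed)
    }
  }
  return result
}
```

Handling both shapes in one function keeps every consumer of the `string | string[]` field on a single code path.
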
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all usages of IMemberEnrichmentDataCrustdata to ensure proper handling of the email field

# Search for direct usage of the interface
echo "=== Direct interface usage ==="
rg -t typescript "IMemberEnrichmentDataCrustdata" -A 5

# Search for email field access patterns
echo "=== Email field access patterns ==="
rg -t typescript "data\.email" -A 3

Length of output: 314


Script:

#!/bin/bash
# Let's try without typescript file type flag since it's not recognized
# Search for interface usage
echo "=== Direct interface usage ==="
rg "IMemberEnrichmentDataCrustdata" -A 5

# Search for email field access
echo "=== Email field access patterns ==="
rg "data\.email" -A 3

# Also search for potential transformations of email field
echo "=== Email transformations ==="
rg "email.*Array|Array.*email" -A 3

Length of output: 18371

services/apps/premium/members_enrichment_worker/src/activities.ts (2)

56-56: LGTM! Export is consistent with imports.

The new export maintains consistency with the imports and follows the module's existing pattern.


9-9: Verify the imported function exists.

The addition of refreshMemberEnrichmentMaterializedView to imports looks good, but let's verify its implementation.

✅ Verification successful

Let me try a different search strategy to locate this function.


The imported function exists and is properly implemented

The function refreshMemberEnrichmentMaterializedView is correctly implemented in services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts. It's a simple wrapper around the refreshMaterializedView utility function that takes a materialized view name as input and refreshes it using the postgres writer connection.

The import and export in the activities.ts file are valid and consistent with the implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and implementation of refreshMemberEnrichmentMaterializedView

# Test: Search for the function definition in enrichment.ts
ast-grep --pattern 'export function refreshMemberEnrichmentMaterializedView' || \
ast-grep --pattern 'export const refreshMemberEnrichmentMaterializedView'

Length of output: 444


Script:

#!/bin/bash
# Search for the function in the entire codebase, focusing on TypeScript files
rg -t ts "refreshMemberEnrichmentMaterializedView" -A 3

# Also check the specific directory where it's likely to be defined
fd -e ts . services/apps/premium/members_enrichment_worker/src/activities --exec cat {}

Length of output: 23825

services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (4)

9-9: LGTM: Import changes align with new functionality

The new imports support the parallel processing capabilities and type safety.

Also applies to: 13-13


33-37: LGTM: Member query implementation

The query implementation correctly uses the new batch size and includes all enrichment sources.


43-66: LGTM: Robust parallel processing implementation

The implementation includes:

  • Proper chunking for controlled parallelism
  • Well-configured workflow timeouts and retry policies
  • Correct tenant isolation through search attributes

Let's verify the chunkArray utility implementation:

✅ Verification successful

Verified: Chunking implementation is consistent and well-implemented

The chunkArray utility is properly implemented in services/apps/premium/members_enrichment_worker/src/utils/common.ts with correct array slicing logic for controlled parallelism. The implementation matches its usage in the workflow and is consistent with other implementations across the codebase.
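
The slice-based approach described above typically looks like the following sketch. This is a generic version written for illustration, not a copy of the code in utils/common.ts, and the argument check is an added assumption.

```typescript
// Splits an array into consecutive chunks of at most `chunkSize` items.
function chunkArray<T>(items: T[], chunkSize: number): T[][] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer')
  }
  const chunks: T[][] = []
  for (let start = 0; start < items.length; start += chunkSize) {
    // slice() copies references only, so each chunk is a shallow copy.
    chunks.push(items.slice(start, start + chunkSize))
  }
  return chunks
}
```

With a batch of 1000 members and a chunk size of 5, this yields 200 chunks, each of which can be processed as one group of parallel child workflows.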

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify chunkArray utility implementation
rg -A 10 "export function chunkArray" "utils/common.ts"

Length of output: 111


Script:

#!/bin/bash
# Let's try to find the chunkArray utility using fd and ripgrep
fd -e ts -e js "common" | xargs rg -l "chunkArray"

# If not found in common files, let's search across all files
rg -g '*.{ts,js}' -l "function chunkArray"

# Let's also check the actual implementation wherever it exists
rg -g '*.{ts,js}' -A 10 "function chunkArray"

Length of output: 2769


70-71: Verify cursor continuation logic

The cursor logic correctly uses the last member from the last chunk, but we should verify that this maintains proper ordering and doesn't skip members.

Let's verify the member ordering in the query:

✅ Verification successful

The shell script output shows that:

  1. The query ordering is maintained by activityCount DESC, members.id DESC in the data access layer (from earlier search results)
  2. The workflow correctly uses the last member's ID and activityCount from the last chunk for pagination
  3. The workflow processes a fixed batch size of 1000 members per run
  4. The continuation is implemented using Temporal's continueAsNew with proper cursor values

Cursor continuation logic is correctly implemented

The cursor logic maintains proper ordering and ensures no members are skipped because:

  • The query's DESC ordering on both activityCount and id ensures deterministic pagination
  • The cursor includes both the activityCount and memberId, maintaining the exact position in the ordered result set
  • The workflow uses the last processed member's values for continuation, ensuring seamless pagination
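
The pagination argument above can be sketched with an in-memory stand-in for the SQL query. The real query lives in the data access layer and is not reproduced here, so the types and function names below are hypothetical.

```typescript
interface MemberRow {
  id: string
  activityCount: number
}

interface Cursor {
  activityCount: number
  memberId: string
}

// Mirrors ORDER BY activityCount DESC, id DESC.
function compareDesc(a: MemberRow, b: MemberRow): number {
  if (a.activityCount !== b.activityCount) {
    return b.activityCount - a.activityCount
  }
  return a.id < b.id ? 1 : a.id > b.id ? -1 : 0
}

// A row comes after the cursor if it sorts strictly later in that ordering.
function isAfterCursor(row: MemberRow, cursor: Cursor): boolean {
  return (
    row.activityCount < cursor.activityCount ||
    (row.activityCount === cursor.activityCount && row.id < cursor.memberId)
  )
}

// Returns the next page of at most `limit` rows following the cursor.
function fetchPage(rows: MemberRow[], limit: number, cursor?: Cursor): MemberRow[] {
  return rows
    .filter((row) => cursor === undefined || isAfterCursor(row, cursor))
    .sort(compareDesc)
    .slice(0, limit)
}

// Builds the cursor for the following run from the last row of the page.
function nextCursor(page: MemberRow[]): Cursor | undefined {
  const last = page[page.length - 1]
  return last ? { activityCount: last.activityCount, memberId: last.id } : undefined
}
```

Because the cursor carries both sort keys, members that share an activity count are neither skipped nor repeated across runs.
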
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check member ordering in getEnrichableMembers query
rg -A 10 "getEnrichableMembers.*implementation" "activities/getMembers.ts"

Length of output: 139


Script:

#!/bin/bash
# Let's first find the relevant file containing the getEnrichableMembers query
fd -t f "getEnrichableMembers"

# Then search for any SQL or query related to member ordering
rg -i "order by.*member" 
rg -i "select.*from.*member.*order"

Length of output: 1259


Script:

#!/bin/bash
# Let's check the specific query implementation in the members enrichment worker
rg -A 10 "getEnrichableMembers" services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts

Length of output: 119


Script:

#!/bin/bash
# Let's look for the file containing the enrichment query implementation
fd -t f -e ts "enrichment" services

# And check the implementation of the getMembersToEnrich workflow
rg -B 5 -A 15 "getMembersToEnrich" services/apps/premium/members_enrichment_worker/src/workflows/

Length of output: 4734


Script:

#!/bin/bash
# Let's check the implementation of getEnrichableMembers activity
rg -A 20 "getEnrichableMembers.*implementation" services/apps/premium/members_enrichment_worker/src/activities/

# Also check for any related SQL queries in the activities
rg -A 10 "SELECT.*FROM.*members.*WHERE" services/apps/premium/members_enrichment_worker/src/activities/

Length of output: 219

services/apps/premium/members_enrichment_worker/src/utils/common.ts (1)

104-110: Verify the impact on memory usage.

Since this function is used for batch processing in enrichment workflows, it's important to verify its memory efficiency with large datasets.

✅ Verification successful

Memory usage is well-controlled in this implementation

The implementation is memory efficient based on the following evidence:

  • The input array size is limited to 1000 items (QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN = 1000)
  • The chunk size is small, only 5 items (PARALLEL_ENRICHMENT_WORKFLOWS = 5)
  • The chunking is used for controlled parallel processing, not for loading large datasets in memory
  • The function is used in both member and organization enrichment workflows with similar controlled batch sizes

The current implementation creates shallow copies via array.slice(), which is memory efficient for small chunks, and the garbage collector can reclaim each chunk after processing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find usage of chunkArray to understand the typical array sizes being processed

# Search for direct usage of chunkArray
echo "Direct usage of chunkArray:"
rg -A 5 "chunkArray\(" --type typescript

# Search for getMembersToEnrich implementations to understand the context
echo -e "\nContext in getMembersToEnrich workflow:"
fd -e ts "getMembersToEnrich" --exec rg -A 10 "chunkArray"

Length of output: 1224


Script:

#!/bin/bash
# Let's search for the actual implementation and usage patterns with proper file type

# Search for direct usage of chunkArray with proper typescript extension
echo "Direct usage of chunkArray:"
rg -A 5 "chunkArray\(" .

# Look for batch size constants and array size limits
echo -e "\nBatch size constants:"
rg "PARALLEL_ENRICHMENT_WORKFLOWS|QUERY_FOR_ENRICHABLE_MEMBERS_PER_RUN" .

# Search for getEnrichableMembers implementation to understand input size
echo -e "\ngetEnrichableMembers implementation:"
rg -A 10 "getEnrichableMembers" .

Length of output: 8842

services/libs/data-access-layer/src/utils.ts (2)

4-4: LGTM!

The import is correctly placed and necessary for type safety in the new function.


131-137: 🛠️ Refactor suggestion

Add error handling and documentation for the materialized view refresh function.

While the implementation is functional, consider the following improvements:

  1. Add input validation for the view name to prevent SQL injection
  2. Add error handling for database operations
  3. Add JSDoc documentation explaining the function's purpose, parameters, and concurrent refresh requirements

Consider applying these improvements:

+/**
+ * Refreshes a materialized view in the database.
+ * @param tx - Database transaction or connection
+ * @param mvName - Name of the materialized view to refresh
+ * @param concurrently - If true, uses CONCURRENTLY option (requires unique index on the view)
+ * @throws {Error} If the view name is invalid or the refresh operation fails
+ */
 export async function refreshMaterializedView(
   tx: DbConnOrTx,
   mvName: string,
   concurrently = false,
 ) {
+  // Validate view name to prevent SQL injection
+  if (!/^[a-zA-Z0-9_\.]+$/.test(mvName)) {
+    throw new Error('Invalid materialized view name');
+  }
+
+  try {
     await tx.query(`REFRESH MATERIALIZED VIEW ${concurrently ? 'CONCURRENTLY' : ''} "${mvName}"`);
+  } catch (error) {
+    if (error.message?.includes('concurrent refresh') && concurrently) {
+      throw new Error(`Concurrent refresh failed. Ensure the materialized view "${mvName}" has a unique index`);
+    }
+    throw new Error(`Failed to refresh materialized view "${mvName}": ${error.message}`);
+  }
 }

Let's verify if the materialized views have unique indexes:

✅ Verification successful

Let me gather more information about the usage of this function to provide a comprehensive verification.


The suggested improvements for error handling and documentation are valid but need refinement

The verification reveals that:

  1. The function is used in production code for refreshing materialized views like organization_segments_mv and member_segments_mv
  2. The organization_segments_mv has a unique index, making it safe for concurrent refresh
  3. The function is used in both scheduled jobs and worker processes

Consider these refined improvements:

+/**
+ * Refreshes a materialized view in the database.
+ * @param tx - Database transaction or connection
+ * @param mvName - Name of the materialized view to refresh
+ * @param concurrently - If true, uses CONCURRENTLY option. Only use for views with unique indexes
+ *                      (e.g. organization_segments_mv has ix_organization_segments_orgid_segmentid)
+ * @throws {Error} If the refresh operation fails
+ */
 export async function refreshMaterializedView(
   tx: DbConnOrTx,
   mvName: string,
   concurrently = false,
 ) {
-  // Validate view name to prevent SQL injection (not needed as views are internal)
-  if (!/^[a-zA-Z0-9_\.]+$/.test(mvName)) {
-    throw new Error('Invalid materialized view name');
-  }
+  try {
     await tx.query(`REFRESH MATERIALIZED VIEW ${concurrently ? 'CONCURRENTLY' : ''} "${mvName}"`);
+  } catch (error) {
+    throw new Error(`Failed to refresh materialized view "${mvName}": ${error.message}`);
+  }
 }

Changes from original suggestion:

  • Removed view name validation as the function is only used internally with known view names
  • Simplified error handling as the concurrent refresh error is handled by Postgres
  • Added example of view with unique index in JSDoc
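
One possible middle ground between the two suggestions, sketched under the assumption that the statement is built as a string as in the diff above: instead of rejecting names with a regex, quote the identifier by doubling any embedded double quotes, which is how PostgreSQL escapes quoted identifiers. Both function names are hypothetical.

```typescript
// Quotes a PostgreSQL identifier, doubling embedded double quotes.
function quoteIdentifier(name: string): string {
  if (name.length === 0) {
    throw new Error('Identifier must not be empty')
  }
  return `"${name.replace(/"/g, '""')}"`
}

// Builds the REFRESH statement; CONCURRENTLY requires a unique index on the view.
function buildRefreshStatement(mvName: string, concurrently = false): string {
  const keyword = concurrently ? 'CONCURRENTLY ' : ''
  return `REFRESH MATERIALIZED VIEW ${keyword}${quoteIdentifier(mvName)}`
}
```

This also avoids the double space that the template in the diff produces when `concurrently` is false.
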
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if materialized views have unique indexes
# This helps verify if CONCURRENTLY option can be safely used

rg -A 5 'CREATE MATERIALIZED VIEW' | rg -B 5 'CREATE UNIQUE INDEX'

Length of output: 729


Script:

#!/bin/bash
# Search for calls to refreshMaterializedView
rg -B 3 -A 3 "refreshMaterializedView"

# Search for materialized view definitions
fd -e sql -x cat {} \; | rg -B 2 -A 5 "CREATE MATERIALIZED VIEW"

Length of output: 4558

services/apps/premium/members_enrichment_worker/src/types.ts (1)

54-54: Verify SQL alias consistency across the codebase.

The comment update from "activitySummary" to "membersGlobalActivityCount" looks good, but let's ensure this change is consistently applied across all SQL queries.

✅ Verification successful

The initial results show no occurrences of the old "activitySummary" alias, which is good. However, let's verify the actual implementation to ensure the new alias is being used correctly in the SQL and service implementations.


The SQL alias change is correctly implemented and consistent

The verification confirms that:

  1. The materialized view is properly defined as "membersGlobalActivityCount"
  2. The alias is consistently used across all SQL queries in the materialized views
  3. All enrichment services (clearbit, crustdata, serp) correctly reference the new alias in their SQL conditions
  4. No occurrences of the old "activitySummary" alias were found

The comment in types.ts accurately reflects the current implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all SQL queries use the new "membersGlobalActivityCount" alias
# and that there are no remaining references to the old "activitySummary" alias

echo "Checking for old 'activitySummary' alias usage in SQL queries..."
rg -l "activitySummary"

echo "Checking for new 'membersGlobalActivityCount' alias usage in SQL queries..."
rg -l "membersGlobalActivityCount"

Length of output: 873


Script:

#!/bin/bash
# Let's examine the actual SQL queries and code to verify the alias usage
echo "Checking SQL migration file for alias definition:"
cat backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql

echo -e "\nChecking implementation in enrichment services:"
rg -A 5 "membersGlobalActivityCount" services/apps/premium/members_enrichment_worker/src/sources/

Length of output: 33321

services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (2)

6-10: LGTM! Clean import addition

The new import for refreshMemberEnrichmentMaterializedViews follows the existing pattern and integrates well with the module's structure.


49-79: Verify materialized view refresh timing requirements

The schedule is set to run at 5 AM daily, before the members enrichment at 6 AM, which seems logical. However, consider the following points:

  1. The 1-minute catchup window might be too short if the view refresh takes longer.
  2. Unlike scheduleMembersLFIDEnrichment, there's no environment-specific scheduling.

Let's check if there are any environment-specific configurations or timing requirements in the workflow:

✅ Verification successful

Let me check the materialized view refresh implementation to understand the timing requirements better.


Let me check the activity implementation and the materialized view enum to understand the timing requirements better.


Catchup window and timing configuration are appropriate

The current configuration is well-suited for the materialized view refresh requirements:

  1. The 1-minute catchup window is sufficient because:

    • Each materialized view refresh has a 10-minute timeout set in the workflow
    • The refresh is implemented using standard PostgreSQL refresh mechanism
    • Views being refreshed are analysis/monitoring views that can tolerate small delays
  2. Environment-specific scheduling is not needed because:

    • These are database-level materialized views for monitoring enrichment stats
    • Unlike LFID enrichment which might need different timing per environment, view refresh timing is consistent
    • The 5 AM scheduling ensures fresh data before the 6 AM member enrichment process

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for environment-specific configurations and timing info in the workflow
rg -A 5 "refreshMemberEnrichmentMaterializedViews" 

Length of output: 2966


Script:

#!/bin/bash
# Check the materialized view refresh implementation
cat services/apps/premium/members_enrichment_worker/src/workflows/refreshMemberEnrichmentMaterializedViews.ts

# Also check for any environment-specific configurations
rg -l "MemberEnrichmentMaterializedView" 

Length of output: 1113


Script:

#!/bin/bash
# Check the activity implementation
cat services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

# Check the materialized view enum
cat services/libs/types/src/enums/enrichment.ts

Length of output: 6103

services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1)

Line range hint 24-31: Verify if 1-minute timeout is sufficient for all proxy activities.

The timeout increase from 20 seconds to 1 minute suggests previous timeout issues. While this change might resolve immediate timeout problems, we should verify that it's sufficient for all scenarios, especially during high load or when external enrichment services are slow.

✅ Verification successful

The search results give enough context about timeout configurations across the codebase:

1-minute timeout for enrichment activities appears appropriate

The 1-minute timeout for proxy activities in the member enrichment workflow aligns well with timeout patterns across similar enrichment and data processing workflows in the codebase:

  • Other enrichment workflows use similar timeouts:
    • Organization enrichment uses 75 seconds timeout
    • LF Auth0 enrichment uses 2 minutes timeout
    • Member merge/suggestions workflows use 1-5 minute timeouts

The retry configuration with exponential backoff (4 attempts, starting at 5s and maxing at 30s) provides additional resilience against transient issues.
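
To make the retry schedule concrete, the sketch below computes the waits between attempts. The review does not state the backoff coefficient, so a value of 2 (Temporal's default) is assumed, and the function name is hypothetical.

```typescript
// Computes the wait (in seconds) before each retry for an exponential backoff policy.
function retryDelaysSeconds(
  initialInterval: number,
  backoffCoefficient: number,
  maximumInterval: number,
  maximumAttempts: number,
): number[] {
  const delays: number[] = []
  // There is one wait between each pair of consecutive attempts.
  for (let retry = 0; retry < maximumAttempts - 1; retry++) {
    const uncapped = initialInterval * Math.pow(backoffCoefficient, retry)
    delays.push(Math.min(uncapped, maximumInterval))
  }
  return delays
}
```

Under these assumptions, 4 attempts starting at 5s give waits of 5s, 10s, and 20s, so the 30s cap would only take effect if more attempts were allowed.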

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any timeout-related error logs or comments
rg -i "timeout|timed out" --type ts

# Look for any related configuration in other files
rg "startToCloseTimeout" --type ts

Length of output: 42642

services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts (1)

24-24: Verify the materialized view dependency.

The change from "activitySummary" to "membersGlobalActivityCount" looks correct, but we should verify that the new materialized view is properly set up.

✅ Verification successful

The materialized view change is verified and consistent

The verification confirms that:

  • The membersGlobalActivityCount materialized view is properly defined in backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql
  • It has appropriate indexes for performance optimization
  • The view is consistently used across the codebase with the same activity thresholds
  • The change aligns with other enrichment services (Clearbit, Crustdata) that also use this view

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and structure of the membersGlobalActivityCount view

# Check if the view exists and its definition
echo "Checking view definition..."
fd -e sql -e SQL . | xargs rg -l "CREATE.*MATERIALIZED.*VIEW.*membersGlobalActivityCount" -A 10

# Check for any other references to ensure consistent usage
echo "Checking other references to the view..."
rg "membersGlobalActivityCount"

Length of output: 10185

services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (2)

9-9: LGTM: Import statement is correctly placed.

The import of refreshMaterializedView is properly added and aligns with the new functionality being introduced.


141-143: 🛠️ Refactor suggestion

Add error handling, input validation, and documentation.

While the implementation is functional, consider these essential improvements:

  1. Add error handling for database operations
  2. Validate the mvName parameter
  3. Add JSDoc documentation
  4. Consider SQL injection protection

Here's a suggested implementation:

+/**
+ * Refreshes a specified member enrichment materialized view
+ * @param mvName - The name of the materialized view to refresh
+ * @throws {Error} If the refresh operation fails
+ */
 export async function refreshMemberEnrichmentMaterializedView(mvName: string): Promise<void> {
+  if (!mvName || typeof mvName !== 'string') {
+    throw new Error('Invalid materialized view name');
+  }
+
+  try {
     await refreshMaterializedView(svc.postgres.writer.connection(), mvName);
+  } catch (error) {
+    throw new Error(`Failed to refresh materialized view ${mvName}: ${error.message}`);
+  }
 }

Additionally, consider:

  1. Adding logging for monitoring and debugging purposes
  2. Implementing retry logic for transient database failures
  3. Adding metrics for tracking refresh durations
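
The logging and duration tracking mentioned above could be sketched as a generic wrapper. `withTiming` and its `Logger` type are hypothetical names, and the clock is injectable purely so the behavior can be tested deterministically.

```typescript
type Logger = (message: string) => void

// Runs an async operation, logs how long it took, and rethrows any failure unchanged.
async function withTiming<T>(
  label: string,
  operation: () => Promise<T>,
  log: Logger,
  now: () => number = Date.now,
): Promise<T> {
  const startedAt = now()
  try {
    const result = await operation()
    log(`${label} succeeded in ${now() - startedAt} ms`)
    return result
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error)
    log(`${label} failed after ${now() - startedAt} ms: ${reason}`)
    // Rethrow the original error so the caller's retry handling still sees it.
    throw error
  }
}
```

Rethrowing the original error, rather than wrapping it, leaves retry decisions to the surrounding workflow configuration.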

Let's verify the materialized view names that will be used:

services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (1)

32-32: Verify materialized view structure and indexes.

The SQL condition looks correct. However, let's verify the materialized view structure to ensure optimal query performance.

✅ Verification successful

The materialized view exists and is properly structured

The SQL condition is correctly referencing the membersGlobalActivityCount materialized view which exists in the codebase. The view has:

  1. Proper structure with total_count column
  2. Required indexes:
    • ix_member_global_activity_count_member_id (unique index on memberId)
    • ix_member_global_activity_count (index on total_count)
  3. Correct join conditions with member identities table

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the materialized view exists and has proper indexes
# Expected: View should exist with total_count column and appropriate indexes

# Check if the view exists and its definition
rg -g '*.sql' -A 10 'CREATE MATERIALIZED VIEW.*membersGlobalActivityCount'

# Check for indexes on the view
rg -g '*.sql' 'CREATE INDEX.*membersGlobalActivityCount'

Length of output: 135


Script:

#!/bin/bash
# Let's try a broader search to find SQL files and view definitions
fd -e sql | xargs cat

# Alternatively, let's search for any mentions of this view
rg "membersGlobalActivityCount"

# Also search for migration files that might contain the view definition
fd -e ts -e js migration | xargs cat

Length of output: 306588

services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1)

Line range hint 1-199: Verify integration with the new materialized views.

Since this PR introduces new materialized views for member activity aggregation, we should verify that this service's enrichment data is properly reflected in these views.

✅ Verification successful

The first search did not surface any materialized view definitions related to member enrichment. It also shows that EnrichmentServiceProgAILinkedinScraper is referenced only in the factory that instantiates the services and in its own implementation file.

Let me verify if there are any database migrations or schema changes related to this service:


Let me search for any references to member enrichment cache in the code to understand how it's used:


Let me search for the actual implementation of the enrichment service to understand how it uses the materialized views:


The search results show that the materialized views mentioned in the review comment are monitoring views (memberEnrichmentMonitoring*) that track enrichment statistics. The EnrichmentServiceProgAILinkedinScraper service is integrated with these views through the memberEnrichmentCache table.

The service's changes don't impact the materialized views since:

  1. The views are read-only monitoring views that aggregate statistics
  2. The service correctly writes to the memberEnrichmentCache table which is the source for these views
  3. The views pick up new cache rows the next time they are refreshed (this PR adds a daily scheduled refresh)

No changes needed for materialized view integration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for materialized view definitions and their usage
rg -A 10 "CREATE MATERIALIZED VIEW.*members.*enrichment" 

# Search for references to this enrichment service in other files
rg -A 5 "EnrichmentServiceProgAILinkedinScraper"

Length of output: 2310


Script:

#!/bin/bash
# Search for any database migrations or schema files
fd -e sql -e prisma . | xargs cat

# Search for any references to database tables or views in the codebase
rg -A 5 "member.*enrichment.*cache|enrichment.*member.*cache" 

Length of output: 296693


Script:

#!/bin/bash
# Search for references to member enrichment cache in the code
rg -A 5 "memberEnrichmentCache" 

# Search for any enrichment-related database tables or views
rg -A 5 "CREATE TABLE.*enrichment|CREATE VIEW.*enrichment|CREATE MATERIALIZED VIEW.*enrichment"

Length of output: 36222


Script:

#!/bin/bash
# Search for the implementation files
rg -A 5 "class EnrichmentServiceProgAILinkedinScraper" --type ts

# Search for any references to materialized views in the enrichment worker
rg -A 5 "memberEnrichmentMonitoring" --type sql

Length of output: 6273

services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (1)

46-46: Verify the materialized view exists and maintains data consistency.

The change from "activitySummary" to "membersGlobalActivityCount" appears to be part of the materialized views refactoring. Let's verify the view exists and maintains the same data structure.

✅ Verification successful

The materialized view change is verified and maintains data consistency

The migration file confirms that membersGlobalActivityCount is a valid materialized view that keeps the same data structure, including the total_count column. The view aggregates activity counts from the memberSegmentsAgg table and is properly indexed. It is also referenced consistently in the enrichment monitoring views, which confirms its intended use for activity-based filtering.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and structure of the new materialized view

# Test 1: Check if the view exists
rg -A 5 'CREATE MATERIALIZED VIEW.*membersGlobalActivityCount' 

# Test 2: Check if the view maintains the total_count column
rg -A 10 'CREATE MATERIALIZED VIEW.*membersGlobalActivityCount.*total_count'

Length of output: 143


Script:

#!/bin/bash
# Let's try a broader search to understand the context of this view

# Search for any reference to membersGlobalActivityCount in SQL files
fd -e sql -x rg -l "membersGlobalActivityCount"

# Search for any reference to this view in migration files
fd -e sql migrations -x rg -l "membersGlobalActivityCount"

# Search for any TypeScript/JavaScript files using this view
rg "membersGlobalActivityCount" -g "*.{ts,js}"

# Let's also check for the old view name for comparison
rg "activitySummary" -g "*.{ts,sql}"

Length of output: 1835


Script:

#!/bin/bash
# Let's check the materialized view definition in the migration file we found

cat ./backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql

Length of output: 30435

services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (3)

22-22: LGTM: Cursor filter updated correctly for materialized view.

The cursor filter has been properly adapted to use the new membersGlobalActivityCount materialized view while maintaining the same pagination logic and NULL handling.


64-64: LGTM: Activity count aggregation updated correctly.

The MAX aggregation has been properly updated to use the new materialized view while maintaining NULL handling with coalesce.


68-68: LGTM: JOIN clause correctly updated for materialized view.

The LEFT JOIN with membersGlobalActivityCount is properly implemented, ensuring that members without activities are still included in the results.
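
The effect of the LEFT JOIN combined with coalesce can be modelled in memory. This is a hedged illustration, not the query from the PR: `activityCounts` and the row shapes are hypothetical, and the fallback value of 0 is an assumption about what the coalesce defaults to.

```typescript
// Hypothetical in-memory model of the query shape; not code from the PR.
interface Member {
  id: string;
}

interface ActivityCountRow {
  memberId: string;
  total_count: number;
}

function activityCounts(members: Member[], counts: ActivityCountRow[]): Map<string, number> {
  // Mirror MAX(total_count) per member in case several rows share a memberId.
  const maxByMember = new Map<string, number>();
  for (const row of counts) {
    const current = maxByMember.get(row.memberId) ?? 0;
    maxByMember.set(row.memberId, Math.max(current, row.total_count));
  }

  const result = new Map<string, number>();
  for (const member of members) {
    // LEFT JOIN keeps every member; the coalesce-style fallback turns a missing count into 0.
    result.set(member.id, maxByMember.get(member.id) ?? 0);
  }
  return result;
}
```

With an INNER JOIN, the member without a row in the view would drop out of the result instead of appearing with a count of 0.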

Let's verify the materialized view setup:

✅ Verification successful

LGTM: Materialized view is properly configured with correct indexes

The materialized view membersGlobalActivityCount is correctly defined with:

  • Proper LEFT JOIN usage in the query
  • Unique index on memberId column: ix_member_global_activity_count_member_id
  • Index on total_count column: ix_member_global_activity_count
  • View calculates global activity counts from member segment aggregates
  • Used consistently across enrichment monitoring views
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify materialized view exists and has proper indexes

# Check if the materialized view exists and its definition
echo "Checking materialized view definition..."
rg -A 10 'CREATE MATERIALIZED VIEW.*membersGlobalActivityCount'

# Check for indexes on the materialized view
echo "Checking indexes on the materialized view..."
rg -A 5 'CREATE INDEX.*ON.*membersGlobalActivityCount'

Length of output: 312


Script:

#!/bin/bash
# Let's try a broader search for the materialized view definition and related migrations

# Search for any SQL files or migration files
echo "Looking for SQL/migration files..."
fd -e sql -e ts -e js migration

# Search for any mention of membersGlobalActivityCount in SQL-like content
echo "Searching for membersGlobalActivityCount table/view definition..."
rg -l "membersGlobalActivityCount" $(fd -e sql -e ts -e js)

# Search specifically in migration files for the view
echo "Searching in migration files..."
rg -B 5 -A 5 "membersGlobalActivityCount" --type ts --type js --type sql

Length of output: 189766

backend/src/database/migrations/R__memberEnrichmentMaterializedViews.sql (2)

1-16: LGTM! Well-structured view with appropriate indexing.

The view effectively aggregates member activity counts with proper filtering. The indexing strategy is particularly good:

  • Unique index on memberId for fast lookups
  • Regular index on total_count for efficient sorting and filtering

196-238: LGTM! Robust error handling in calculations.

The view guards against division by zero in the progress and hit rate calculations by using CASE expressions.
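
For readers less familiar with the pattern, the same division-by-zero guard looks like this in application code. This is a sketch only: `percentOrZero` is a hypothetical helper, and both the fallback value of 0 and the rounding to two decimals are assumptions rather than details taken from the view.

```typescript
// Hypothetical helper mirroring a SQL guard of the form
//   CASE WHEN total = 0 THEN 0 ELSE part * 100.0 / total END
function percentOrZero(part: number, total: number): number {
  // Without this check, 0 / 0 would produce NaN and n / 0 would produce Infinity.
  if (total === 0) return 0;
  // Round to two decimal places (an assumption, chosen for readable dashboards).
  return Math.round((part * 10000) / total) / 100;
}
```

Returning 0 for an empty denominator keeps downstream sums and charts well defined; returning null instead would be the better choice when "no data" must stay distinguishable from "0%".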

@epipav epipav merged commit e3e0784 into main Nov 8, 2024
7 checks passed
@epipav epipav deleted the bugfix/crustdata-enrichment-tweaks branch November 8, 2024 13:02
Labels
Improvement Created by Linear-GitHub Sync