[connectors] - refactor(google_drive): streamline file.ts (#6210)
* [connectors/google_drive] - refactor: streamline file processing for Google Drive

 - Extract Google document export logic into a separate function to improve modularity
 - Implement generic file download handling based on MIME type with dedicated functions
 - Introduce specialized functions for handling PDF, DOCX, PPTX, and CSV file conversions
 - Ensure proper error handling and logging across different file processing scenarios
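
A minimal sketch of MIME-type dispatch with dedicated handlers, as described in the bullets above. The `Section` shape, the handler bodies, and the name `dispatchByMimeType` are hypothetical and are not taken from file.ts.

```typescript
// Hypothetical section shape and handler names, for illustration only.
type Section = {
  prefix: string | null;
  content: string | null;
  sections: Section[];
};

type Handler = (raw: string) => Section;

// One dedicated function per MIME type, registered in a lookup table.
const handlers = new Map<string, Handler>([
  ["text/plain", (raw) => ({ prefix: null, content: raw.trim(), sections: [] })],
  [
    "text/csv",
    (raw) => ({
      prefix: null,
      content: null,
      // Each non-empty CSV row becomes a child section.
      sections: raw
        .split("\n")
        .filter((row) => row.length > 0)
        .map((row) => ({ prefix: null, content: row, sections: [] })),
    }),
  ],
]);

// Returns null when no handler is registered for the MIME type.
function dispatchByMimeType(mimeType: string, raw: string): Section | null {
  const handler = handlers.get(mimeType);
  return handler ? handler(raw) : null;
}
```

Keeping the handlers in a table means supporting a new format is a one-line registration rather than another branch in a long conditional.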

* [connectors] - refactor: remove PDF.js dependency and switch to pdftotext for PDF handling

 - Removed the @types/pdfjs-dist and pdfjs-dist packages from dependencies, as they are no longer used
 - Extracting PDF text is now handled using the pdftotext command-line utility instead of PDF.js
 - Updated the `handlePdfFile` function to use the new dpdf2text method for PDF text extraction
 - Created a new dpdf2text utility function that spawns a child process to run pdftotext and handle its output
 - Removed the deprecated createPdfTextStream and extractTextFromPDF functions as they were part of the old PDF.js-based implementation
 - Added error handling for the pdftotext child process in dpdf2text, resolving the promise based on the exit code
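
A rough sketch of the child-process approach described above, assuming Node.js and a `pdftotext` binary on the PATH. The names `runTextCommand` and `pdfToTextSketch` are hypothetical, and the actual dpdf2text signature and error handling may differ.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: runs a command and resolves with its stdout on exit
// code 0, or rejects with the exit code and stderr otherwise.
function runTextCommand(command: string, args: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.stderr.on("data", (chunk: string) => {
      stderr += chunk;
    });
    // Emitted when the process could not be spawned (e.g. binary missing).
    child.on("error", reject);
    child.on("close", (code) => {
      if (code === 0) {
        resolve(stdout);
      } else {
        reject(new Error(`${command} exited with code ${code}: ${stderr}`));
      }
    });
  });
}

// Assumed usage: passing "-" as the output file makes pdftotext write to stdout.
function pdfToTextSketch(pdfPath: string): Promise<string> {
  return runTextCommand("pdftotext", [pdfPath, "-"]);
}
```

Listening for both `error` and `close` covers the two failure modes: the binary being absent and the binary exiting with a non-zero code.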

[connectors] - chore: clean up dependencies and mark some as peers

 - Marked multiple dependencies as "peer" to presumably align with package requirements and ensure correct installation in consumer packages
 - Performed cleanup in package-lock.json to reflect the removal of the pdfjs-dist package and its types
 - Removed unused path2d package from package-lock.json as it was likely related to the previous PDF handling method

* [google_drive/temporal] - fix: return empty section on csv file upsert success

 - Updated the handleCsvFile function to return an "empty" CoreAPIDataSourceDocumentSection upon successful table upsert instead of null
 - This change allows distinguishing between failed and successful table upsert operations for CSV files
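
The distinction between the two return values can be sketched as follows. The section shape below is an assumption about CoreAPIDataSourceDocumentSection, and `handleCsvSketch` and its `upsertTable` parameter are hypothetical stand-ins for the real function and table upsert call.

```typescript
// Assumed shape of a document section; the real type may carry other fields.
type SectionSketch = {
  prefix: string | null;
  content: string | null;
  sections: SectionSketch[];
};

// Hypothetical stand-in for handleCsvFile: null means the upsert failed,
// an empty section means it succeeded but there is no text to index.
async function handleCsvSketch(
  csv: string,
  upsertTable: (csv: string) => Promise<void>
): Promise<SectionSketch | null> {
  try {
    await upsertTable(csv);
  } catch (_err) {
    return null;
  }
  return { prefix: null, content: null, sections: [] };
}
```

A caller can then treat `null` as a failure to report while still recording the file as processed when it receives the empty section.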

* [connectors] - refactor: update logger import path in google_drive connector

 - Replace the direct pino logger import with a custom logger module path
 - Ensure consistency in logger usage across the connectors module

[connectors] - fix: add newline at end of package-lock.json

 - Conform to standard JSON formatting by ensuring a newline at the end of package-lock.json

* [connectors/google_drive] - refactor: rename functions to reflect broader export scope

 - Renamed handleGoogleDocExport to handleGoogleDriveExport to generalize export function usage beyond Google Docs
 - Changed handleFileDownload to handleFileExport to match renaming convention and clarify functionality

* [types] - refactor: remove cheerio dependency from package

 - Cheerio and its related dependencies have been removed, reducing the package complexity and potential security issues
 - This change likely indicates a shift away from HTML/XML parsing within this scope

[front] - refactor: synchronize package-lock with types scope changes

 - Alignment of front end package-lock.json with the recent removal of cheerio in the types scope

* fix: lint/format

* [connectors] - feature: improve handling of unsupported mimeTypes for text extraction

 - Added a check to skip text extraction for files with mimeTypes not supported by the text extraction service
 - Enhanced logging for cases when an unexpected mimeType is encountered to aid in debugging issues
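
A small sketch of such a guard. The set of supported types below is an illustrative assumption, not the list the text extraction service actually accepts, and `shouldExtractText` is a hypothetical name.

```typescript
// Illustrative list only; the real service may support a different set.
const SUPPORTED_MIME_TYPES = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
]);

// Returns true when extraction should run; otherwise logs the unexpected
// MIME type and returns false so the caller can skip the file.
function shouldExtractText(
  mimeType: string,
  log: (message: string) => void = console.warn
): boolean {
  if (SUPPORTED_MIME_TYPES.has(mimeType)) {
    return true;
  }
  log(`Skipping text extraction for unexpected mimeType: ${mimeType}`);
  return false;
}
```

Injecting the logger keeps the check easy to test and lets the connector pass in its own logger module.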

* fix: lint/format

---------

Co-authored-by: Jules <[email protected]>
Co-authored-by: Flavien David <[email protected]>
3 people authored Jul 17, 2024
1 parent d19ce10 commit 7306ef5
Showing 5 changed files with 372 additions and 367 deletions.
1 change: 1 addition & 0 deletions connectors/package-lock.json

10 changes: 5 additions & 5 deletions connectors/src/connectors/google_drive/temporal/activities.ts
@@ -43,7 +43,7 @@ import { FILE_ATTRIBUTES_TO_FETCH } from "@connectors/types/google_drive";
 const FILES_SYNC_CONCURRENCY = 10;
 const FILES_GC_CONCURRENCY = 5;

-type LightGoogledrive = {
+type LightGoogleDrive = {
   id: string;
   name: string;
   isSharedDrive: boolean;
@@ -53,7 +53,7 @@ export const statsDClient = new StatsD();

 export async function getDrives(
   connectorId: ModelId
-): Promise<LightGoogledrive[]> {
+): Promise<LightGoogleDrive[]> {
   const connector = await ConnectorResource.fetchById(connectorId);
   if (!connector) {
     throw new Error(`Connector ${connectorId} not found`);
@@ -62,7 +62,7 @@ export async function getDrives(

   let nextPageToken: string | undefined | null = undefined;
   const authCredentials = await getAuthObject(connector.connectionId);
-  const drives: LightGoogledrive[] = [];
+  const drives: LightGoogleDrive[] = [];
   const myDriveId = await getMyDriveIdCached(authCredentials);
   drives.push({ id: myDriveId, name: "My Drive", isSharedDrive: false });
   do {
@@ -94,7 +94,7 @@ export async function getDrives(
 // Get the list of drives that have folders selected for sync.
 export async function getDrivesToSync(
   connectorId: ModelId
-): Promise<LightGoogledrive[]> {
+): Promise<LightGoogleDrive[]> {
   const selectedFolders = await GoogleDriveFolders.findAll({
     where: {
       connectorId: connectorId,
@@ -106,7 +106,7 @@ export async function getDrivesToSync(
   }
   const allSharedDrives = await getDrives(connectorId);
   const authCredentials = await getAuthObject(connector.connectionId);
-  const drives: Record<string, LightGoogledrive> = {};
+  const drives: Record<string, LightGoogleDrive> = {};

   for (const folder of selectedFolders) {
     const remoteFolder = await getGoogleDriveObject(