[chore] - chore: store fragment content type (#5838)
* [types] - feature: extend content types for content fragments

 - Added new content type literals for text and image formats
 - Replaced 'slack_thread_content' and 'file_attachment' with more specific content types
 - Defined supported formats for text and images and applied them to the content fragment schemas

* [connectors] - fix: update contentType for Slack thread content

 - Change the value of contentType from "slack_thread_content" to "dust-application/slack" to match a standardized format

[front] - feature: support display of various content types in ContentFragment

 - Extend switch case to handle additional content types like plain text, CSV, markdown, PDF, and image formats
 - Set the logo type to 'document' for these newly supported types to enhance visual representation in the UI

* [types] - feature: enhance typing for supported format constants

 - Define explicit constant types for text and image formats
 - Implement utility functions to check for supported content formats
 - Improve type safety by narrowing down the types for content fragment formats
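
The narrowing described above can be sketched with a `const` tuple and a type guard. This is a minimal illustration, not the code from `@dust-tt/types`: the guard name matches the one imported later in this commit, but the literal list is an assumption (only the presence of `application/pdf` is implied by the upload handler).

```typescript
// Assumed list of supported types; the real list in @dust-tt/types may differ.
const supportedTextContentFragmentTypes = [
  "text/plain",
  "text/csv",
  "text/markdown",
  "application/pdf",
] as const;

// Union of the literals above, derived from the tuple.
type SupportedTextContentFragmentType =
  (typeof supportedTextContentFragmentTypes)[number];

// Type guard: narrows an arbitrary string to the supported union.
function isSupportedTextContentFragmentType(
  value: string
): value is SupportedTextContentFragmentType {
  return supportedTextContentFragmentTypes.some((t) => t === value);
}

// Hypothetical helper showing the narrowing in use.
function describeContentType(contentType: string): string {
  if (isSupportedTextContentFragmentType(contentType)) {
    // Here contentType has type SupportedTextContentFragmentType.
    return `supported: ${contentType}`;
  }
  return `unsupported: ${contentType}`;
}
```

Deriving the union from the tuple keeps the runtime list and the static type in sync, so adding a format only requires touching one place.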

* [front/components/assistant/conversation/input_bar] - feature: enhance file upload handling in AssistantInputBar

- Add support for additional content formats in file uploads
- Implement an error notification for unsupported file formats
- Introduce separate handling for image uploads alongside text files

* [misc] - refactor: update handling and support for content fragments

 - Removed image formats from supported content fragment types in ContentFragment.tsx
 - Updated file upload handling to remove specific image upload process
 - Added content type validation logic to createConversationWithMessage
 - Changed file upload acceptance to use a defined list of supported extensions
 - Modified postNewContent roundup functions to include content type in their processes
 - Removed dead code related to image format handling as it's no longer necessary
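
As a rough sketch of the "defined list of supported extensions" idea: the list below is the one added to `InputBarContainer` in this commit, while `hasSupportedExtension` is a hypothetical helper that is not part of the change.

```typescript
// Extension list as it appears in this commit's InputBarContainer diff.
const supportedFileExtensions = [".txt", ".csv", ".md", ".pdf"];

// Value passed to the <input accept=...> attribute.
const acceptAttribute = supportedFileExtensions.join(",");

// Hypothetical helper (not in the commit): case-insensitive extension check.
function hasSupportedExtension(fileName: string): boolean {
  const lower = fileName.toLowerCase();
  return supportedFileExtensions.some((ext) => lower.endsWith(ext));
}
```

Note that the `accept` attribute only filters the file picker; it does not validate the selected file, which is why the content type is still checked elsewhere.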

[types] - refactor: remove image content types from supported formats schemas

 - Updated internal and public assistant API schemas to reflect the removal of image content types
 - Adjusted type definitions and utility functions to no longer include image formats as supported
 - Clarified and reinforced text format content handling across various modules

* fix: lint/format

* [front] - feature: support Slack thread and file attachment content types

 - Added 'slack_thread_content' and 'file_attachment' to content type handling, allowing integration with Slack threads and attachment display
 - Marked 'slack_thread_content' and 'file_attachment' for future removal after data migration is complete

[types] - feature: extend content fragment types for Assistant API

 - Included new literals 'file_attachment' and 'slack_thread_content' to the content fragment schema definition to support more content types
 - Added TODO for removal of newly added content types post data backfilling process

* [assistant/conversation/input_bar] - fix: correct file upload validation logic and update supported formats list

 - The condition to check unsupported file formats during upload was inverted, now correctly validating supported content
 - Removed '.tsv' from the list of supported file extensions for uploads to reflect current specifications

[lib/resources/content_fragment_resource] - refactor: remove unused getSignedUrlForContentFragment function

 - Deleted the getSignedUrlForContentFragment function as it is no longer needed in the codebase

* [front] - refactor: streamline ContentFragment logic for logo types

 - Simplified the switch-case to if-else statements for determining logoType based on contentType
 - Removed redundant case strings for text content types, using `startsWith` method for brevity

[types] - refactor: clean up supported content types in schemas

 - Removed specific text types (tsv, comma/tab-separated values) to streamline supported formats
 - Adjusted validation schemas to reflect supported content type changes across internal and public APIs

* [assistant/conversation] - refactor: update types for content fragment handling

 - Change `ContentFragmentInput` to `UploadedContentFragment` to enhance type specificity
 - Include `contentType` in `UploadedContentFragment` for better content type management

[assistant/conversation] - fix: standardize error handling for unsupported content types

 - Implement direct checks for unsupported content types during file uploads
 - Remove deprecated `isSupportedContentFormat` dependency to streamline file handling

[assistant] - refactor: adjust input handling for creating and submitting conversations

 - Align conversation-related functions with updated content fragment types
 - Ensure consistent content fragment processing within conversation creation workflow

[content_fragment] - feature: extend file upload handling to include content type

 - Update `handleFileUploadToText` to return content type, ensuring compatibility with updated fragment structure
 - Add error handling for unsupported file types during upload processing
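
A simplified sketch of the new return shape follows. `Result`, `FileLike`, and `buildExtractedFragment` are stand-ins invented for this example (the real code uses `Ok`/`Err` from `@dust-tt/types` and the browser `File`), and the supported list is an assumption.

```typescript
// Simplified stand-in for the Result type from @dust-tt/types.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Minimal structural stand-in for the browser File object.
interface FileLike {
  name: string;
  type: string;
}

interface ExtractedFragment {
  title: string;
  content: string;
  contentType: string;
}

// Assumed list; the actual supported types live in @dust-tt/types.
const assumedSupportedTypes = [
  "text/plain",
  "text/csv",
  "text/markdown",
  "application/pdf",
];

// Hypothetical helper: reject unsupported types up front, otherwise
// return the extracted text together with its content type.
function buildExtractedFragment(
  file: FileLike,
  content: string
): Result<ExtractedFragment, Error> {
  if (!assumedSupportedTypes.some((t) => t === file.type)) {
    return { ok: false, error: new Error("Unsupported file type.") };
  }
  return {
    ok: true,
    value: { title: file.name, content, contentType: file.type },
  };
}
```

Returning the content type alongside the text lets callers forward it as-is instead of hard-coding `"file_attachment"`.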

* [assistant/conversation/input_bar] - feature: handle markdown file uploads correctly

 - Add a check to correctly identify markdown files during uploads
 - Recreate File object with the correct MIME type if a markdown file is detected

* [front] - refactor: streamline content fragment type checks and naming

 - Replaced multiple instances of isSupportedContentFormat with isSupportedContentFragmentType to ensure consistent type checking across modules
 - Renamed several variables and types to align with the new supportedContentFragment naming, improving readability

[types] - refactor: update content fragment type names and collections

 - Deprecated slack_thread_content in supportedTextFormat, moving it to supportedContentFragment for clarity
 - Renamed SupportedTextFormatType to SupportedTextContentFragmentType to mirror changes in content fragment type checks
 - Renamed isSupportedTextContentFormat to isSupportedTextContentFragmentType to better reflect its purpose

* [front] - refactor: standardize UploadedContentFragment type import

 - Moved the `UploadedContentFragment` import to use the `@dust-tt/types` package across different components
 - Removed redundant declarations of the `UploadedContentFragment` type to clean up the codebase

[types] - feature: add UploadedContentFragment definition to the types package

 - Defined a new `UploadedContentFragment` type to be used across different parts of the application for consistent type checking

* [front/components/assistant] - refactor: centralize MIME type handling for file uploads

 - Replace individual file type checks with a centralized `getMimeTypeFromFile` utility function
 - Remove redundant code for copying and handling markdown files by using the new utility function

[front/lib/client] - fix: ensure correct MIME type used when handling text in file uploads

 - Utilize `getMimeTypeFromFile` to determine the correct MIME type for supporting text extraction from uploads

[front/lib/resources] - refactor: enforce stronger typing for content type in content fragment storage

 - Change contentType parameter to use `SupportedContentFragmentType` for better type safety

* [types] - refactor: centralize content type definitions for consistency

 - Replaced repetitive content type definitions with a centralized codec function
 - Enhanced maintainability by using getSupportedTextContentFragmentCodec in API handler schemas

* [front/lib] - feature: add function to detect markdown file types

 - Introduce a utility to check if a file is of type markdown based on its name or MIME type
 - Extend MIME type recognition to cover markdown files without a type from the file object
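
The detection logic can be tried in isolation with the sketch below. It follows the `front/lib/file.ts` code added in this commit, with two stated differences: `FileLike` is a stand-in for the browser `File` so the snippet runs outside a browser, and `pop()` replaces `at(-1)` for wider runtime compatibility.

```typescript
// Minimal structural stand-in for the browser File object.
interface FileLike {
  name: string;
  type: string;
}

function isMarkdownFile(file: FileLike): boolean {
  if (file.type === "") {
    // Browsers may report an empty type for .md files, so fall back
    // to the file extension.
    const fileExtension = file.name.split(".").pop()?.toLowerCase();
    return fileExtension === "md" || fileExtension === "markdown";
  }
  return file.type === "text/markdown";
}

function getMimeTypeFromFile(file: FileLike): string {
  if (isMarkdownFile(file)) {
    return "text/markdown";
  }
  return file.type;
}
```

For example, `getMimeTypeFromFile({ name: "README.md", type: "" })` yields `"text/markdown"`, while a file with an empty type and a non-markdown extension keeps its empty type and would be rejected by a later supported-type check.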

* [types] - refactor: update content fragment codec functions

 - Replaced `getSupportedTextContentFragmentCodec` with `getSupportedContentFragmentCodec` for broader fragment support
 - Updated usage in internal and public assistant API handlers to reflect codec function change

* [front] - refactor: streamline file upload handling and content fragment creation

- Remove checks for unsupported file types during file uploads to simplify the workflow
- Eliminate unused content type parameters in content fragment storage and API endpoints

[types] - refactor: update codec function for content fragment types

- Rename codec function to better reflect its purpose of getting supported content fragment types
- Expand list of supported text content fragment types to include CSV and TSV formats

* [types] - refactor: move content fragment logic to a dedicated module

 - Extract ContentFragment related types and functions from the conversation module
 - Create a new module content_fragment to manage them separately
 - Update import statements in internal and public API handlers to refer to the new module location
 - Prepare for removal of deprecated content fragment types with a TODO in the new module

* [types] - feature: add textUrl attribute to ContentFragmentType

 - Introduced a new mandatory textUrl field to the content fragment structure to manage text content by URL reference

---------

Co-authored-by: Jules <[email protected]>
JulesBelveze and Jules authored Jun 26, 2024
1 parent 0f068b3 commit ac32eca
Showing 18 changed files with 201 additions and 117 deletions.
2 changes: 1 addition & 1 deletion connectors/src/connectors/slack/bot.ts
@@ -575,7 +575,7 @@ async function makeContentFragment(
     title: `Thread content from #${channel.channel.name}`,
     content: sectionFullText(content),
     url: url,
-    contentType: "slack_thread_content",
+    contentType: "dust-application/slack",
     context: null,
   });
 }
4 changes: 2 additions & 2 deletions front/components/assistant/TryAssistant.tsx
@@ -6,6 +6,7 @@ import type {
   MentionType,
   UserType,
 } from "@dust-tt/types";
+import type { UploadedContentFragment } from "@dust-tt/types";
 import { isEqual } from "lodash";
 import React, {
   useCallback,
@@ -15,7 +16,6 @@ import React, {
   useState,
 } from "react";

-import type { ContentFragmentInput } from "@app/components/assistant/conversation/lib";
 import {
   createConversationWithMessage,
   submitMessage,
@@ -165,7 +165,7 @@ export function useTryAssistantCore({
   const handleSubmit = async (
     input: string,
     mentions: MentionType[],
-    contentFragments: ContentFragmentInput[]
+    contentFragments: UploadedContentFragment[]
   ) => {
     if (!user) {
       return;
23 changes: 13 additions & 10 deletions front/components/assistant/conversation/ContentFragment.tsx
@@ -1,19 +1,22 @@
 import { Citation } from "@dust-tt/sparkle";
 import type { ContentFragmentType } from "@dust-tt/types";
-import { assertNever } from "@dust-tt/types";

 export function ContentFragment({ message }: { message: ContentFragmentType }) {
   let logoType: "document" | "slack" = "document";
-  switch (message.contentType) {
-    case "slack_thread_content":
-      logoType = "slack";
-      break;
-    case "file_attachment":
-      logoType = "document";
-      break;

-    default:
-      assertNever(message.contentType);
+  if (
+    message.contentType === "slack_thread_content" ||
+    message.contentType === "dust-application/slack"
+  ) {
+    logoType = "slack";
+  } else if (
+    message.contentType.startsWith("text/") ||
+    message.contentType === "application/pdf" ||
+    message.contentType === "file_attachment"
+  ) {
+    logoType = "document";
+  } else {
+    throw new Error(`Unsupported ContentFragmentType '${message.contentType}'`);
   }
   return (
     <Citation
@@ -9,6 +9,7 @@ import type {
   UserType,
   WorkspaceType,
 } from "@dust-tt/types";
+import type { UploadedContentFragment } from "@dust-tt/types";
 import { Transition } from "@headlessui/react";
 import { cloneDeep } from "lodash";
 import { useRouter } from "next/router";
@@ -159,7 +160,7 @@ export function ConversationContainer({
     async (
       input: string,
       mentions: MentionType[],
-      contentFragments: ContentFragmentInput[]
+      contentFragments: UploadedContentFragment[]
     ) => {
       const conversationRes = await createConversationWithMessage({
         owner,
16 changes: 10 additions & 6 deletions front/components/assistant/conversation/input_bar/InputBar.tsx
@@ -1,7 +1,10 @@
 import { Button, Citation, StopIcon } from "@dust-tt/sparkle";
-import type { WorkspaceType } from "@dust-tt/types";
-import type { LightAgentConfigurationType } from "@dust-tt/types";
+import type {
+  LightAgentConfigurationType,
+  WorkspaceType,
+} from "@dust-tt/types";
 import type { AgentMention, MentionType } from "@dust-tt/types";
+import type { UploadedContentFragment } from "@dust-tt/types";
 import {
   useCallback,
   useContext,
@@ -18,7 +21,6 @@ import InputBarContainer, {
   INPUT_BAR_ACTIONS,
 } from "@app/components/assistant/conversation/input_bar/InputBarContainer";
 import { InputBarContext } from "@app/components/assistant/conversation/input_bar/InputBarContext";
-import type { ContentFragmentInput } from "@app/components/assistant/conversation/lib";
 import { SendNotificationsContext } from "@app/components/sparkle/Notification";
 import { compareAgentsForSort } from "@app/lib/assistant";
 import { handleFileUploadToText } from "@app/lib/client/handle_file_upload";
@@ -68,7 +70,7 @@ export function AssistantInputBar({
   onSubmit: (
     input: string,
     mentions: MentionType[],
-    contentFragments: ContentFragmentInput[]
+    contentFragments: UploadedContentFragment[]
   ) => void;
   conversationId: string | null;
   stickyMentions?: AgentMention[];
@@ -81,7 +83,7 @@ export function AssistantInputBar({
   const { mutate } = useSWRConfig();

   const [contentFragmentData, setContentFragmentData] = useState<
-    { title: string; content: string; file: File }[]
+    UploadedContentFragment[]
   >([]);

   const { agentConfigurations: baseAgentConfigurations } =
@@ -167,6 +169,7 @@ export function AssistantInputBar({
             title: cf.title,
             content: cf.content,
             file: cf.file,
+            contentType: cf.contentType,
           };
         })
       );
@@ -222,6 +225,7 @@ export function AssistantInputBar({
             title: res.value.title,
             content: res.value.content,
             file,
+            contentType: res.value.contentType,
           },
         ]);
       });
@@ -375,7 +379,7 @@ export function FixedAssistantInputBar({
   onSubmit: (
     input: string,
     mentions: MentionType[],
-    contentFragments: ContentFragmentInput[]
+    contentFragments: UploadedContentFragment[]
   ) => void;
   stickyMentions?: AgentMention[];
   conversationId: string | null;
@@ -26,6 +26,8 @@ export const INPUT_BAR_ACTIONS = ["attachment", "quick-actions"] as const;

 export type InputBarAction = (typeof INPUT_BAR_ACTIONS)[number];

+const supportedFileExtensions = [".txt", ".csv", ".md", ".pdf"];
+
 export interface InputBarContainerProps {
   allAssistants: LightAgentConfigurationType[];
   agentConfigurations: LightAgentConfigurationType[];
@@ -126,7 +128,7 @@ const InputBarContainer = ({
           {actions.includes("attachment") && (
             <>
               <input
-                accept=".txt,.pdf,.md,.csv"
+                accept={supportedFileExtensions.join(",")}
                 onChange={async (e) => {
                   await onInputFileChange(e);
                   editorService.focusEnd();
8 changes: 5 additions & 3 deletions front/components/assistant/conversation/lib.ts
@@ -9,10 +9,12 @@ import type {
   UserType,
   WorkspaceType,
 } from "@dust-tt/types";
+import type { UploadedContentFragment } from "@dust-tt/types";
 import { Err, Ok } from "@dust-tt/types";
 import type * as t from "io-ts";

 import type { NotificationType } from "@app/components/sparkle/Notification";
+import { getMimeTypeFromFile } from "@app/lib/file";
 import type { PostConversationsResponseBody } from "@app/pages/api/w/[wId]/assistant/conversations";

 /**
@@ -109,7 +111,7 @@ export async function submitMessage({
           title: contentFragment.title,
           content: contentFragment.content,
           url: null,
-          contentType: "file_attachment",
+          contentType: getMimeTypeFromFile(contentFragment.file),
           context: {
             timezone:
               Intl.DateTimeFormat().resolvedOptions().timeZone || "UTC",
@@ -222,7 +224,7 @@ export async function createConversationWithMessage({
   messageData: {
     input: string;
     mentions: MentionType[];
-    contentFragments: ContentFragmentInput[];
+    contentFragments: UploadedContentFragment[];
   };
   visibility?: ConversationVisibility;
   title?: string;
@@ -244,7 +246,7 @@ export async function createConversationWithMessage({
         content: cf.content,
         title: cf.title,
         url: null, // sourceUrl will be set on raw content upload success
-        contentType: "file_attachment",
+        contentType: cf.contentType,
         context: {
           profilePictureUrl: user.image,
         },
22 changes: 11 additions & 11 deletions front/lib/api/assistant/conversation.ts
@@ -12,7 +12,6 @@ import type {
   AgentMessageSuccessEvent,
   AgentMessageType,
   AgentMessageWithRankType,
-  ContentFragmentContentType,
   ContentFragmentContextType,
   ContentFragmentType,
   ConversationTitleEvent,
@@ -23,13 +22,15 @@ import type {
   MentionType,
   PlanType,
   Result,
+  SupportedContentFragmentType,
   UserMessageContext,
   UserMessageErrorEvent,
   UserMessageNewEvent,
   UserMessageType,
   UserMessageWithRankType,
   WorkspaceType,
 } from "@dust-tt/types";
+import { isSupportedTextContentFragmentType } from "@dust-tt/types";
 import {
   assertNever,
   getSmallWhitelistedModel,
@@ -1618,7 +1619,7 @@ export async function postNewContentFragment(
     title: string;
     content: string;
     url: string | null;
-    contentType: ContentFragmentContentType;
+    contentType: SupportedContentFragmentType;
     context: ContentFragmentContextType;
   }
 ): Promise<ContentFragmentType> {
@@ -1630,15 +1631,14 @@ export async function postNewContentFragment(

   const messageId = generateModelSId();

-  const sourceUrl =
-    contentType === "file_attachment"
-      ? fileAttachmentLocation({
-          workspaceId: owner.sId,
-          conversationId: conversation.sId,
-          messageId,
-          contentFormat: "raw",
-        }).downloadUrl
-      : url;
+  const sourceUrl = isSupportedTextContentFragmentType(contentType)
+    ? fileAttachmentLocation({
+        workspaceId: owner.sId,
+        conversationId: conversation.sId,
+        messageId,
+        contentFormat: "raw",
+      }).downloadUrl
+    : url;

   const textBytes = await storeContentFragmentText({
     workspaceId: owner.sId,
69 changes: 34 additions & 35 deletions front/lib/client/handle_file_upload.ts
@@ -1,19 +1,37 @@
-import type { Result } from "@dust-tt/types";
+import type { Result, SupportedContentFragmentType } from "@dust-tt/types";
+import { isSupportedTextContentFragmentType } from "@dust-tt/types";
 import { Err, Ok } from "@dust-tt/types";
 // @ts-expect-error: type package doesn't load properly because of how we are loading pdfjs
 import * as PDFJS from "pdfjs-dist/build/pdf";
-PDFJS.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${PDFJS.version}/pdf.worker.mjs`;

-const supportedFileExtensions = [".txt", ".pdf", ".md", ".csv", ".tsv"];
+import { getMimeTypeFromFile } from "@app/lib/file";
+PDFJS.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${PDFJS.version}/pdf.worker.mjs`;

-export async function handleFileUploadToText(
-  file: File
-): Promise<Result<{ title: string; content: string }, Error>> {
+export async function handleFileUploadToText(file: File): Promise<
+  Result<
+    {
+      title: string;
+      content: string;
+      contentType: SupportedContentFragmentType;
+    },
+    Error
+  >
+> {
+  const contentFragmentMimeType = getMimeTypeFromFile(file);
+  if (!isSupportedTextContentFragmentType(contentFragmentMimeType)) {
+    return new Err(new Error("Unsupported file type."));
+  }
   return new Promise((resolve) => {
     const handleFileLoadedText = (e: ProgressEvent<FileReader>) => {
       const content = e.target?.result;
       if (content && typeof content === "string") {
-        return resolve(new Ok({ title: file.name, content }));
+        return resolve(
+          new Ok({
+            title: file.name,
+            content,
+            contentType: contentFragmentMimeType,
+          })
+        );
       } else {
         return resolve(
           new Err(
@@ -52,7 +70,13 @@ export async function handleFileUploadToText(
           });
           text += `Page: ${pageNum}/${pdf.numPages}\n${strings.join(" ")}\n\n`;
         }
-        return resolve(new Ok({ title: file.name, content: text }));
+        return resolve(
+          new Ok({
+            title: file.name,
+            content: text,
+            contentType: contentFragmentMimeType,
+          })
+        );
       } catch (e) {
         console.error("Failed extracting text from PDF", e);
         const errorMessage =
@@ -64,28 +88,14 @@ export async function handleFileUploadToText(
     };

     try {
-      if (file.type === "application/pdf") {
+      if (contentFragmentMimeType === "application/pdf") {
         const fileReader = new FileReader();
         fileReader.onloadend = handleFileLoadedPDF;
         fileReader.readAsArrayBuffer(file);
-      } else if (
-        isTextualFile(file) ||
-        supportedFileExtensions
-          .map((ext) => file.name.endsWith(ext))
-          .includes(true)
-      ) {
+      } else {
         const fileData = new FileReader();
         fileData.onloadend = handleFileLoadedText;
         fileData.readAsText(file);
-      } else {
-        return resolve(
-          new Err(
-            new Error(
-              "File type not supported. Supported file types: " +
-                supportedFileExtensions.join(", ")
-            )
-          )
-        );
       }
     } catch (e) {
       console.error("Error handling file", e);
@@ -96,14 +106,3 @@ export async function handleFileUploadToText(
     }
   });
 }
-
-export function isTextualFile(file: File): boolean {
-  return [
-    "text/plain",
-    "text/csv",
-    "text/markdown",
-    "text/tsv",
-    "text/comma-separated-values",
-    "text/tab-separated-values",
-  ].includes(file.type);
-}
15 changes: 15 additions & 0 deletions front/lib/file.ts
@@ -0,0 +1,15 @@
+function isMarkdownFile(file: File): boolean {
+  if (file.type === "") {
+    const fileExtension = file.name.split(".").at(-1)?.toLowerCase();
+    // Check if the file extension corresponds to a markdown file
+    return fileExtension === "md" || fileExtension === "markdown";
+  }
+  return file.type === "text/markdown";
+}
+
+export function getMimeTypeFromFile(file: File): string {
+  if (isMarkdownFile(file)) {
+    return "text/markdown";
+  }
+  return file.type;
+}
4 changes: 2 additions & 2 deletions front/lib/resources/storage/models/content_fragment.ts
@@ -1,4 +1,4 @@
-import type { ContentFragmentContentType } from "@dust-tt/types";
+import type { SupportedContentFragmentType } from "@dust-tt/types";
 import type {
   CreationOptional,
   ForeignKey,
@@ -19,7 +19,7 @@ export class ContentFragmentModel extends Model<
   declare updatedAt: CreationOptional<Date>;

   declare title: string;
-  declare contentType: ContentFragmentContentType;
+  declare contentType: SupportedContentFragmentType;
   declare sourceUrl: string | null; // GCS (upload) or Slack or ...

   // The field below should be set for all fragments that are converted to text