[Caloris 2] Move Notion to chunking V2 with structure (#3137)
* [Caloris 2] Move Notion to chunking V2 with structure

Related [task](#2458)

Notes
1. Differs slightly from the framing: rather than one section per block,
this uses the already-rendered markdown with the pre-existing lib function `renderMarkdownSection`.

2. As a consequence, there is no extra line return between blocks (one
line return only).

3. Should last editor / last edited be removed to lighten the
prefix size?

* removed last_editor / last_edit_date from prefix
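
The structure described in the notes can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Section` shape, `sketchPrefixSection`, `sketchMarkdownSection`, `buildPropsPrefix` and `buildDocument` are hypothetical stand-ins written for this example, not the actual `renderPrefixSection` / `renderMarkdownSection` from `@connectors/lib/data_sources`, whose signatures and behavior may differ (in particular, the stand-in wraps the markdown in a single section and does no parsing).

```typescript
// Hypothetical section shape, assumed for this sketch only.
type Section = {
  prefix: string | null;
  content: string | null;
  sections: Section[];
};

// Stand-in for renderPrefixSection: a top-level section that carries only a prefix.
function sketchPrefixSection(prefix: string | null): Section {
  return { prefix, content: null, sections: [] };
}

// Stand-in for renderMarkdownSection: wraps the already-rendered markdown
// in one section under the given prefix (no markdown parsing here).
function sketchMarkdownSection(prefix: string, markdown: string): Section {
  return { prefix, content: markdown, sections: [] };
}

type Property = { key: string; text: string | null };

// Builds the properties prefix: author and creation date first, then every
// non-empty property except the title, which lives in its own prefix.
function buildPropsPrefix(
  title: string | null,
  author: string,
  created: string,
  props: Property[]
): string {
  let prefix = `${title ? "\n" : ""}$author: ${author}\n$created: ${created}\n`;
  for (const p of props) {
    if (!p.text) continue;
    if (p.key === "title") continue;
    prefix += `$${p.key}: ${p.text}\n`;
  }
  return prefix;
}

// Title prefix on the top-level section, properties prefix on the child
// section that holds the rendered page.
function buildDocument(
  title: string | null,
  author: string,
  created: string,
  props: Property[],
  renderedPage: string
): Section {
  const titlePrefix = title ? `$title: ${title}\n` : null;
  const doc = sketchPrefixSection(titlePrefix);
  const propsPrefix = buildPropsPrefix(title, author, created, props);
  doc.sections.push(sketchMarkdownSection(`${propsPrefix}\n`, renderedPage));
  return doc;
}

// Sample values are made up for the example.
const doc = buildDocument(
  "Roadmap",
  "Ada",
  "Jan 10, 2024",
  [
    { key: "title", text: "Roadmap" },
    { key: "status", text: "Draft" },
    { key: "empty", text: null },
  ],
  "# Goals\nShip it.\n"
);
console.log(JSON.stringify(doc, null, 2));
```

Keeping the title and the properties in two separate prefixes means that a long title and a long property list are not competing for the same prefix budget.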
philipperolet authored Jan 10, 2024
1 parent a6cd159 commit 017eb4e
Showing 1 changed file with 32 additions and 15 deletions.
47 changes: 32 additions & 15 deletions connectors/src/connectors/notion/temporal/activities.ts
@@ -41,7 +41,8 @@ import {
 import {
   deleteFromDataSource,
   MAX_DOCUMENT_TXT_LEN,
-  renderSectionForTitleAndContent,
+  renderMarkdownSection,
+  renderPrefixSection,
   upsertToDatasource,
 } from "@connectors/lib/data_sources";
 import { Connector } from "@connectors/lib/models";
@@ -1555,12 +1556,6 @@ export async function renderAndUpsertPageFromCache({
   const parsedProperties = parsePageProperties(
     JSON.parse(pageCacheEntry.pagePropertiesText) as PageObjectProperties
   );
-  for (const p of parsedProperties) {
-    if (!p.text) continue;
-    // We skip the title as it is added separately as prefix to the top-level document section.
-    if (p.key === "title") continue;
-    renderedPage += `$${p.key}: ${p.text}\n`;
-  }
   renderedPage += "\n";

   let pageHasBody = false;
@@ -1730,8 +1725,8 @@ export async function renderAndUpsertPageFromCache({
     skipReason = "body_too_large";
   }

-  const createdTime = new Date(pageCacheEntry.createdTime).getTime();
-  const updatedTime = new Date(pageCacheEntry.lastEditedTime).getTime();
+  const createdTime = new Date(pageCacheEntry.createdTime);
+  const updatedTime = new Date(pageCacheEntry.lastEditedTime);

   if (!pageHasBody) {
     localLogger.info(
@@ -1748,10 +1743,32 @@ export async function renderAndUpsertPageFromCache({
       pageId,
       runTimestamp.toString()
     );
+    const titlePrefix = title ? `$title: ${title}\n` : null;
+
+    const propsCreatedTime = createdTime.toLocaleString("en-US", {
+      month: "short",
+      day: "numeric",
+      year: "numeric",
+      hour: "2-digit",
+      minute: "2-digit",
+    });
+
+    // Properties and tags are added as prefix to the document top-level
+    // section. Title is in a separate prefix (both title and properties can be
+    // lengthy; this avoids cutting one entirely).
+    const content = renderPrefixSection(titlePrefix);
+    let propsPrefix = `${
+      title ? "\n" : ""
+    }$author: ${author}\n$created: ${propsCreatedTime}\n`;
+    for (const p of parsedProperties) {
+      if (!p.text) continue;
+      // We skip the title as it is added separately as prefix to the top-level document section.
+      if (p.key === "title") continue;
+      propsPrefix += `$${p.key}: ${p.text}\n`;
+    }

-    const content = renderSectionForTitleAndContent(
-      title || null,
-      renderedPage
+    content.sections.push(
+      renderMarkdownSection(`${propsPrefix}\n`, renderedPage)
     );

     localLogger.info(
@@ -1766,13 +1783,13 @@ export async function renderAndUpsertPageFromCache({
       documentId,
       documentContent: content,
       documentUrl: pageCacheEntry.url,
-      timestampMs: updatedTime,
+      timestampMs: updatedTime.getTime(),
       tags: getTagsForPage({
         title,
         author,
         lastEditor,
-        createdTime,
-        updatedTime,
+        createdTime: createdTime.getTime(),
+        updatedTime: updatedTime.getTime(),
       }),
       parents,
       retries: 3,
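
The change from millisecond numbers to `Date` objects lets one value serve both the human-readable `$created` line and the numeric timestamps. A minimal sketch of that pattern, using only the standard `Date` API; the sample date and the `timeZone: "UTC"` option are assumptions added here for reproducibility and are not part of the commit:

```typescript
// Hypothetical sample value standing in for pageCacheEntry.createdTime.
const createdTime = new Date("2024-01-10T15:30:00Z");

// Human-readable form for the prefix, with the same options as the commit.
// Assumption: timeZone is pinned to UTC so the result does not depend on the host.
const propsCreatedTime = createdTime.toLocaleString("en-US", {
  month: "short",
  day: "numeric",
  year: "numeric",
  hour: "2-digit",
  minute: "2-digit",
  timeZone: "UTC",
});

// Numeric form for fields that expect milliseconds since the epoch.
const createdTimeMs = createdTime.getTime();

console.log(propsCreatedTime);
console.log(createdTimeMs);
```

Without an explicit `timeZone`, `toLocaleString` formats in the host's local time zone, so the same page could get a different `$created` string depending on where the worker runs.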