Commit

data cleaning required before pdf ingestion
codebanesr committed Dec 6, 2023
1 parent c457664 commit a64974b
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions llm-server/workers/tasks/process_pdfs.py
@@ -18,6 +18,8 @@ def process_pdf(file_name: str, bot_id: str):
     insert_pdf_data_source(chatbot_id=bot_id, file_name=file_name, status="PENDING")
     loader = PyPDFium2Loader(get_file_path(file_name))
     raw_docs = loader.load()
+
+    # clean the data received from pdf document before passing it
     text_splitter = RecursiveCharacterTextSplitter(
         chunk_size=1000, chunk_overlap=200, length_function=len
     )
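
The commit only adds a comment marking where cleaning should happen; it does not implement it. As a rough illustration of what such a step might look like, here is a minimal sketch using only the standard library. The helper name `clean_pdf_text` and the specific rules (Unicode normalization, control-character removal, de-hyphenation, whitespace collapsing) are assumptions for illustration, not part of this commit or the repository.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    """Normalize text extracted from a PDF page (hypothetical helper)."""
    # Fold ligatures and other compatibility forms into plain characters.
    text = unicodedata.normalize("NFKC", text)
    # Drop NUL bytes and other control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Rejoin words hyphenated across a line break: "inges-\ntion" -> "ingestion".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces that sit next to a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Cap consecutive blank lines at one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "data inges-\ntion\x00  of   \ufb01les\n\n\n\nnext"
    print(repr(clean_pdf_text(sample)))
```

In `process_pdf`, a helper like this could plausibly be applied to each loaded document's `page_content` before the documents reach the text splitter. Note that the de-hyphenation rule also joins genuine compounds that happen to break across lines (for example "well-" followed by "known"), so it may need tuning for a given corpus.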