documentloaders: implement simple client for apache tika #1002

Open
wants to merge 1 commit into
base: main

@ricardomaraschini ricardomaraschini commented Sep 4, 2024

Implements a very simple client to use Apache Tika for document parsing. An example usage has been added:

package main

import (
        "context"
        "net/http"

        "github.com/tmc/langchaingo/documentloaders"
        "github.com/tmc/langchaingo/textsplitter"
)

// To run this example you need to run a Tika server and then set the address
// on the TikaURL constant. The easiest way of running a Tika server is by
// using Docker:
//
// $ docker run -d -p 9998:9998 apache/tika
//
// Tika will be listening on http://localhost:9998; you then just need to adjust
// the TikaURL constant.

const TikaURL = "http://localhost:9998"

func main() {
        resp, err := http.Get("https://www.golang-book.com/public/pdf/gobook.pdf")
        if err != nil {
                panic(err)
        }
        defer resp.Body.Close()

        splitter := textsplitter.NewRecursiveCharacter()
        tika := documentloaders.NewTika(TikaURL, resp.Body)
        docs, err := tika.LoadAndSplit(context.Background(), splitter)
        if err != nil {
                panic(err)
        }

        _ = docs
}

Feel free to just close this PR if this isn't necessary here. I am using it, so I decided to contribute it back; no harm done.

PR Checklist

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
  • Describes the source of new concepts.
  • References existing implementations as appropriate.
  • Contains test coverage for new functions.
  • Passes all golangci-lint checks.

@ricardomaraschini
Author

Yeah, lint is complaining. I will get back to this only if there is interest in this code.

implements a very simple client to use apache tika for document parsing.
@ricardomaraschini
Author

> Yeah, lint is complaining. I will get back to this only if there is interest in this code.

Oh well, once in hell I may as well give Satan a hug. That was an easy fix.
