Add Dotnetcore support #152

Open
wants to merge 13 commits into
base: master

KevM
Owner

@KevM KevM commented Sep 1, 2022

.Net Core Support

We have long wanted to add support for .NET Core, and earlier this year IKVM was finally "revived" with .NET Core support. At first I gave up because ikvmc.exe didn't seem to work at all (and it still does not for our use case). But @dylanlangston created a proof of concept using IkvmReference and MSBuild to extract .NET assemblies from the Tika .jar file.

Nugets

  • TikaOnDotNet nuget is now multi-targeting .NET Framework 4.6.2 and .NET Core 3.1.
  • TikaOnDotNet.TextExtraction is now multi-targeting .NET Framework 4.6.2 and .NET Core 3.1.

Do we need to target .Net 6?

Tests

All tests but one are passing. For some reason, parsing our test .rtf file throws a Java UnsatisfiedLinkError exception:

    TikaOnDotNet.TextExtraction.TextExtractionException : Extraction of text from the file 'files/Tika.rtf' failed.
      ----> TikaOnDotNet.TextExtraction.TextExtractionException : Extraction failed.
      ----> java.lang.UnsatisfiedLinkError : sun/java2d/Disposer.initIDs()V

If anyone has an idea what this might be related to please help!😖

Build / Deployment Automation

We are going to move away from Paket and the F# build automation and use GitHub Actions to build, test, and deploy nugets. I'd like updating the version of Tika to be a simple update of a version file. We are close with what @dylanlangston started for us.

Tests are "mostly" passing with plain msbuild and me hammering this out at the command line to produce a Tika nuget.

dotnet pack ./src/TikaOnDotnet/TikaOnDotnet.csproj -p:NuspecFile=package.nuspec -p:NuSpecBasePath=. --configuration=Release

Nuget Packaging

The nuget has been updated to better represent the license, readme location, project url, and finally I've added an icon.

Icon

I spent 30 seconds creating an icon to prettify the Nuget presentation. If anyone would like to improve on what I started, please do. I am not a design person.

@KevM
Owner Author

KevM commented Sep 1, 2022

TBD:

  • TikaOnDotnet.TextExtraction should use nuspec or have csproj properties to make the listing as nice as TikaOnDotnet
  • Move deployment automation to GitHub Actions.

@nathanatstariongroup

[Image: Tika NetCore]

Hey I'm not a designer, but if you like it I can add in a commit to this branch.

@KevM
Owner Author

KevM commented Sep 13, 2022

Thank you!

@nathanatstariongroup

Hey KevM,

> Do we need to target .Net 6?

Yes, we do need Tika on .NET to target .NET 6 at my organization for a couple of projects.

When do you think we could expect a new release?

@KevM
Owner Author

KevM commented Sep 14, 2022 via email

There is one failing test for RTF files. No idea why it is not working. I was going to work on getting a pre-release out and then let people try it out for a bit before committing to a release. Note: I'd be willing to take a short contract to get this release out quicker. I am self-employed.

@rharbaugh

I was able to pretty easily target net6.0 throughout and pass tests (except the RTF one) without changing any other dependency versions by incrementing version numbers and adding net6.0 to the .csproj targets.
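
As a rough sketch of the change described above, adding net6.0 to the multi-targeting list in a .csproj might look like the following. The target list is an assumption based on the frameworks named in this PR (.NET Framework 4.6.2 and .NET Core 3.1); the actual project files may differ.

```xml
<PropertyGroup>
  <!-- Assumed target list: the two existing targets from this PR plus net6.0 -->
  <TargetFrameworks>net462;netcoreapp3.1;net6.0</TargetFrameworks>
</PropertyGroup>
```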

Regarding the RTF test - it seems to pass when the RTF file doesn't contain an image. Without digging into the Java side of things I can't provide much feedback beyond that.

If you'd like, I can submit a PR for the net6 support but that'll take a bit of approval on my end as I'm using this for an internal project.

@wasabii

wasabii commented Sep 26, 2022

Hey ya'll. I fell into this thread while following links blindly. I revived the IKVM project.

To get Core out, and because nobody really wanted to fix it, we didn't pay any attention to AWT. So, no AWT in IKVM. My guess is this is killing your attempted usage of Java2D. I don't really know though, since I didn't do any more investigation yet besides read this thread.

The previous AWT default toolkit was IKVM.AWT.WinForms. An attempt to map the AWT stuff to WinForms. As ya'll know, WinForms is quite different in Core. And it's not cross platform anyways. So we just didn't get it building, and probably aren't going to spend any time on it.

Instead though, you can probably configure IKVM to run in headless mode, just as you would configure OpenJDK to do so. Some System property you can set.

8.2.2 will end up with headless mode enabled by default.

Somebody try that.
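
For anyone trying this, a minimal sketch: OpenJDK's headless switch is the `java.awt.headless` system property, and it should be settable from C# through the IKVM-exposed `java.lang.System` class. The class and method names below (`HeadlessSetup`, `Enable`) are hypothetical, and the sketch assumes it runs before any AWT or Java2D class is loaded. Whether it resolves the RTF failure is untested here.

```csharp
// Assumption: the project references the IKVM runtime, so java.lang.System
// is callable from C#. HeadlessSetup is a hypothetical helper name.
public static class HeadlessSetup
{
    public static void Enable()
    {
        // Standard Java system property that tells AWT to run without a display.
        // Call this once at startup, before any Tika parsing happens.
        java.lang.System.setProperty("java.awt.headless", "true");
    }
}
```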

@wasabii

wasabii commented Sep 26, 2022

Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk.

This is a new strategy for Java-Libraries-on-DotNet.

Instead of actually distributing cross compiled JAR -> DLL files on NuGet, which is error prone, as you don't own the assembly name or code, we allow users to directly add references to Maven packages in their C# projects. That way only one authoritative source for Java libraries exist: Maven. And the author of the actual product owns those artifacts. And the end developer is the one downloading and doing the conversion, and no middle man is redistributing licensed code.

We also embed some fancy information inside produced NuGet packages describing these references to Maven. So, say you write some .NET code that uses a library from Maven. And then you Pack that. Your produced NuGet file actually has a .pom file embedded into it. When somebody installs your NuGet package, and builds their own library, IKVM.Maven.Sdk downloads the Maven artifacts and cross compiles them on the fly.

It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.

<IkvmReference Include="..\tika-app-*.jar">
<AssemblyName>tika.core</AssemblyName>
</IkvmReference>
<ContentWithTargetPath Include="$(OutDir)tika.core.dll">

@wasabii wasabii Sep 26, 2022

IkvmReference functions exactly like Reference. As such, it does not copy the assemblies to the output directory by default. This is denoted by the Private metadata item (which appears as "Copy Local" in Visual Studio).

So set <Private>true</Private> on the IkvmReference and it'll probably work.

Any metadata set on IkvmReference flows through to Reference.
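
Applied to the snippet under review, the suggestion might look like this. It is a sketch that reuses the Include path and AssemblyName from the diff above; only the `Private` element is new.

```xml
<ItemGroup>
  <IkvmReference Include="..\tika-app-*.jar">
    <AssemblyName>tika.core</AssemblyName>
    <!-- Private=true is "Copy Local": copy the generated assembly to the output directory -->
    <Private>true</Private>
  </IkvmReference>
</ItemGroup>
```

With the assembly copied by the reference itself, a separate item that copies tika.core.dll from `$(OutDir)` may no longer be needed.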

@treasoner77

treasoner77 commented Oct 24, 2022

> Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk.
>
> This is a new strategy for Java-Libraries-on-DotNet.
>
> Instead of actually distributing cross compiled JAR -> DLL files on NuGet, which is error prone, as you don't own the assembly name or code, we allow users to directly add references to Maven packages in their C# projects. That way only one authoritative source for Java libraries exist: Maven. And the author of the actual product owns those artifacts. And the end developer is the one downloading and doing the conversion, and no middle man is redistributing licensed code.
>
> We also embed some fancy information inside produced NuGet packages describing these references to Maven. So, say you write some .NET code that uses a library from Maven. And then you Pack that. Your produced NuGet file actually has a .pom file embedded into it. When somebody installs your NuGet package, and builds their own library, IKVM.Maven.Sdk downloads the Maven artifacts and cross compiles them on the fly.
>
> It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.

This is great! However, it's building extremely slowly for me. Is that expected for IKVM.Maven.Sdk? Any recommendations for how the Maven build process can be sped up?

@wasabii

wasabii commented Oct 24, 2022

Depends. The first build is definitely going to be a thing. Likely it has to download two dozen jars and convert them all. But that information is cached for subsequent builds.

Can you describe what it looks like it's doing?

@rharbaugh

> Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk.
> It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.

Thanks for the information and suggestion of IKVM.Maven.SDK.
Unfortunately upon digging deeper I realized the IKVM Nuget package is licensed under GPL and as such won't work for me.
In addition this project likely needs to have a change of license to accommodate the requirements of GPL.

@wasabii

wasabii commented Jun 1, 2023

So, advice here:

Yes, you can build your own JAR files. This works fine with IKVM as it stands today. Probably won't work when we eventually get JDK9 support (since JAR files can be named modules in JDK9+).

You should prefer to look over the dependencies for the library you want to use, and understand the parts, and add a MavenReference to only those parts you actually need. Same as any Java user would do. Same as you'd do for NuGet packages.

You should NOT publish resulting IKVM assemblies to NuGet. Please. Ever. There's a big notice at the end of the IKVM README.md file for why not. If you are using MavenReference, it is safe to publish your packages to NuGet.org.
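
A minimal sketch of that advice in a .csproj: reference IKVM.Maven.Sdk, then add a MavenReference only for the Tika pieces you need. The artifact names and versions here are the ones posted elsewhere in this thread and are illustrative, not a recommendation.

```xml
<ItemGroup>
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.5.5" />
  <!-- Core API only -->
  <MavenReference Include="org.apache.tika:tika-core" Version="2.9.0" />
  <!-- Add a parser module only if you actually need that format -->
  <MavenReference Include="org.apache.tika:tika-parser-pdf-module" Version="2.9.0" />
</ItemGroup>
```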

@KevM
Owner Author

KevM commented Jun 2, 2023

I'd love it if someone wrote up a Readme with all of this knowledge so that others can leverage it. It sounds like this project needs to go on hiatus if y'all get things figured out. This is fine with me but I'd like to update the docs to educate people on the correct way to IKVM in the modern era.

@wasabii

wasabii commented Jun 5, 2023

In response to an email sent specifically by Erik Gavriluk, I have uncovered two issues in IKVM.Maven.SDK which could impact ya'll. Both of which have patches incoming.

  1. Transitive dependencies between projects were broken in 1.5.0. This means if you have a MavenReference in ProjectB, referenced by ProjectA, that dependency wasn't making it over. This was trivial to work around by just specifying the MavenReference in both projects. But a hotfix is being published shortly that solves it.

  2. Incorrectly culling all but one unified artifact version. This showed up in a few places with slf4j-api. If LibraryA referenced slf4j-api:2.6.0, and LibraryB referenced slf4j-api:2.7.0, only the first reference to slf4j-api was preserved and unified (to 2.7.0). So LibraryB would compile without the reference added. This will be fixed in the same hotfix.
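
The workaround for the first issue could be sketched like this. ProjectA and ProjectB are the names used above; the relative path, the artifact, and the version are assumptions for illustration.

```xml
<!-- ProjectB.csproj: the library that actually uses the Java code -->
<ItemGroup>
  <MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
</ItemGroup>

<!-- ProjectA.csproj: references ProjectB and repeats the same MavenReference -->
<ItemGroup>
  <ProjectReference Include="..\ProjectB\ProjectB.csproj" />
  <MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
</ItemGroup>
```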

There are other things ya'll need to be fully aware of when using IKVM.Maven.Sdk: we are just a wrapper for Maven. If the author of a library published something broken in Maven, it's going to break IKVM.Maven.Sdk. And it might break it slightly differently than it would in Java.

One common example is underspecified dependencies. IKVM.Maven.Sdk relies on dependencies specified in Maven being correct, so that we can generate assemblies which properly reference each other. But, this might not break Java users, as Java JAR files have no actual dependencies: only the classes might depend on other classes, and this is only discoverable at runtime if the code is accessed. For instance, if an author forgets to depend on, say, commons-logging, but nobody ever runs the code path that needs commons-logging, it'll work just fine in Java. It will also work if the users end up having commons-logging for some other reason, like they added it explicitly, or are using some other library which depends on it. In those cases, on Java, the .jar files will be added to the CLASSPATH, and it'll work fine. But, IKVM needs that information to generate assemblies.

In this case, the only true fix is to report the problem to the upstream authors of the library you are using, and have them properly fix their dependencies.

Second, assembly name generation. IKVM.Maven.Sdk relies on the "automatic module name" specification of JDK9+ in order to choose assembly names. But, the situation with modules in Java is a bit weird. First, they don't exist in JDK8. Second, even in JDK9+ they are "sort of optional". That is, you can load a JAR file by specifying it on the MODULEPATH, or by adding the JAR to the CLASSPATH. This opens the situation where an upstream author may provide invalid module information: but nobody notices because all the users are using the CLASSPATH.

I have discovered at least one issue of this in Tika: tika-parser-crypto-module-2.8.0.jar.

Notice the file name is tika-parser-crypto-module-2.8.0.jar. However, if you open the JAR, and look in META-INF/MANIFEST.MF at the Automatic-Module-Name line, you'll notice the value is org.apache.tika.parser.code. This value is incorrect. The crypto-module JAR has a module name for the parser code JAR. That's wrong. Each JAR file should have its own unique module name.

As a consequence, IKVM attempts to name the assembly for tika-parser-crypto-module as org.apache.tika.parser.code.dll. But, it also attempts to name the assembly for org.apache.tika.parser.code to org.apache.tika.parser.code.dll. Resulting in two assemblies with the same name. IKVM then adds a reference to both to your .NET project. Except they have the same name. So they get copied into bin/ and clobber each other.

This is a bug in Tika upstream. It needs to be reported to Tika upstream and fixed there. There's not much I can do about it.
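
To check other JARs for the same kind of clash, the manifest can be read directly, since a JAR is a ZIP archive. The following is a small sketch using only the .NET base class library. `JarManifest` and `ReadAutomaticModuleName` are hypothetical names, and the code assumes nullable reference types are enabled (the default in .NET 6 project templates).

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class JarManifest
{
    // Returns the Automatic-Module-Name declared in a JAR's manifest, or null if absent.
    public static string? ReadAutomaticModuleName(string jarPath)
    {
        using ZipArchive jar = ZipFile.OpenRead(jarPath);
        ZipArchiveEntry? manifest = jar.GetEntry("META-INF/MANIFEST.MF");
        if (manifest is null) return null;

        using var reader = new StreamReader(manifest.Open());
        const string header = "Automatic-Module-Name:";
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            // Long manifest values can wrap onto continuation lines; this sketch ignores that.
            if (line.StartsWith(header, StringComparison.OrdinalIgnoreCase))
                return line.Substring(header.Length).Trim();
        }
        return null;
    }
}
```

Running this over every JAR that Maven resolves and grouping by the returned value would surface any two JARs that claim the same module name.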

@Arextion

Arextion commented Jun 5, 2023

@wasabii thanks for the writeup and time spent on this!

@wasabii

wasabii commented Jun 5, 2023

And this is how you do that:

https://issues.apache.org/jira/browse/TIKA-4061

@KevM
Owner Author

KevM commented Jun 5, 2023

Thank you @wasabii that info dump is sure to be helpful to many people moving forward. 🌬️ ⛵

That said. How do people want to proceed? This project doesn't have a future it sounds like. How do we best encapsulate the learnings we've made here to help the .Net community of Tika users?

@wasabii

wasabii commented Jun 5, 2023

@KevM Are the custom .net classes something you want to continue to provide?

@wasabii

wasabii commented Jun 6, 2023

If they are, you can continue to publish them on NuGet. You can publish a .NET project that uses MavenReference to NuGet.

You'd be the first one to do it, too.

@KevM
Owner Author

KevM commented Jun 6, 2023

Custom Tika Wrapper

We currently do this to provide a good out of box experience for new users. My guess is that most people use the wrapper and if it works that is all they'll do. Others roll their own.

Question for the community what do you think we should do?

It sounds like IKVM + Maven support gets you 90% of the way there. We could still provide a Tika text extraction wrapper but I'm honestly not sure what people would like or how best to support the way they use Tika.

@wasabii

wasabii commented Jun 7, 2023

Disclaimer: I am not a Tika user. I barely knew what it was before an hour ago. But I just spent the better part of a day working through getting a sample project submitted to me by Erik Gavriluk. FYI, it worked.

If the code in the text extractor package is still useful for users, it can continue to be published as a NuGet package. That NuGet package can reference the Maven packages it requires. I quite enjoy doing this: having a NuGet package that depends on a Maven artifact, and having the NuGet package provide extension methods and stuff for working with Java classes in Maven in a more .NET-ish way.

Tika has a large library of parsers. That large library of parsers depends on an even larger set of other Maven artifacts. For instance, org.apache.tika:tika-parser-pdf-module implements its PDF parsing using PDFBox. So it depends on PDFBox. PDFBox depends on a bunch of graphics libraries. Those graphics libraries depend on other libraries. For parsing SVGs, and BMPs, and JPGs. The text parser depends on CSV libraries. The HTML parser depends on many things. Adding half of the parsers I found available in Tika to a project produced about 100 different assemblies.

Users shouldn't be forced to include all of those unless they opt in. So, I would not have the TikaOnDotNet.TextExtraction package depend on the parsers. It would only depend on the core. And users can add in whatever parser packages they happen to need.

Now, this isn't EXACTLY Tika related, but it is a place where value can be added: all of those 100s of packages ultimately have support for logging. These libraries log to java.util.logging. Some log to log4j. Some log to SLF4J. Some log to commons-logging. There are others. As described at https://cwiki.apache.org/confluence/display/TIKA/Logging:

Apache Tika include a lot of Apache and thirdparty libraries that have different approach to logging. Tika use slf4j-api as logging API and Apache Log4j 2.x as an implementation for modules that require it.

Basically, to get logging out of each aspect of Tika, you need to configure a dozen different logging libraries, independently.

Each of these logging libraries could use .NET helpers. For instance, a package that configures log4j to forward messages to Microsoft.Extensions.Logging. A package that configures slf4j to log to Microsoft.Extensions.Logging. One for commons. Lots of little packages, useful beyond Tika (for anybody using slf4j for instance), to hook it up to Microsoft logging.

In anticipation of this, I started https://github.com/ikvmnet/ikvm-logging, whose goal will be to write little bridges forwarding Java logger implementations into .NET.

There are probably other similar utilities like this. Simple bits of code that can make integrating Java code with .NET easier.

@wasabii

wasabii commented Jun 7, 2023

Tika upstream fixed the name bug.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-4061

@KevM KevM mentioned this pull request Jul 29, 2023
@MattBayliss

Very new to Tika, and trying to use it in .Net Core, and, following many links, ended up here!

Trying to use Tika to extract text on large quantities of files, so ideally would like to get AutoDetectParser working. Following @wasabii 's advice (thanks for your efforts!) and copying and pasting the MavenReferences you posted here, I'm able to Parse files, but only if I try something like @souramoo 's if docx then OOXMLParser type stuff. I don't want to have to do that for hundreds of file extensions / mime-types.

Has anyone got the AutoDetectParser working with .Net and IKVM.Maven.Sdk?

I suspect it's a problem with the DefaultDetector? (from Tika troubleshooting guide)

Hoping someone's already solved this problem and can share their solution! 🤞

@Arextion

Arextion commented Sep 9, 2023

I'm getting this error when using TikaOnDotNet from .NET 6:

TypeLoadException: Could not load type 'System.Reflection.Emit.MethodToken' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.

I suspect the error occurs because the System.Reflection.Emit.MethodToken type does not exist in .NET 6 or .NET Core, only in .NET Framework.

@KevM
Owner Author

KevM commented Sep 9, 2023 via email

Are you doing this on iOS? I know they do not allow emitting code.

@Arextion

Arextion commented Sep 9, 2023

No, from Windows 11.

@wasabii

wasabii commented Sep 9, 2023

This is as simple as him trying to run TikaOnDotNet on .NET Core.

This very issue is about TikaOnDotNet not supporting .NET Core.

@Arextion

Arextion commented Sep 9, 2023

> Very new to Tika, and trying to use it in .Net Core, and, following many links, ended up here!
>
> Trying to use Tika to extract text on large quantities of files, so ideally would like to get AutoDetectParser working. Following @wasabii 's advice (thanks for your efforts!) and copying and pasting the MavenReferences you posted here, I'm able to Parse files, but only if I try something like @souramoo 's if docx then OOXMLParser type stuff. I don't want to have to do that for hundreds of file extensions / mime-types.
>
> Has anyone got the AutoDetectParser working with .Net and IKVM.Maven.Sdk?
>
> I suspect it's a problem with the DefaultDetector? (from Tika troubleshooting guide)
>
> Hoping someone's already solved this problem and can share their solution! 🤞

Please let us know if you find a solution!

Btw, could you maybe share a simple sample of a working parser?

@Arextion

Arextion commented Sep 10, 2023

I finally got a somewhat working project in .NET 6 with AutoDetectParser. But it doesn't work on all files. It seems to work on .doc files and others, but not on .docx files, for example.

I guess this is because of missing parsers for those.

I've tried slamming in a bunch of references, but that doesn't help.

<PackageReference Include="IKVM.Maven.Sdk" Version="1.5.5" />
<MavenReference Include="org.apache.tika:tika-core" Version="2.9.0" />

<MavenReference Include="org.apache.tika:tika-async-cli" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-sqlite3-module" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-scientific-module" Version="2.9.0" />

<MavenReference Include="org.apache.tika:tika-serialization" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.9.0" />

<MavenReference Include="org.apache.tika:tika-parser-zip-commons" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-text-module" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-pdf-module" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-image-module" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-xml-module" Version="2.9.0" />
<MavenReference Include="org.apache.tika:tika-parser-microsoft-module" Version="2.9.0" />

TikaOnDotNet does parse the files. So something must be missing.

@KevM what parsers or modules are included in TikaOnDotNet?

@Arextion

Arextion commented Sep 10, 2023

Hmm, interesting find.

AutoDetectParser works, if you create an instance of specific parsers, even if you're not using them:

new OOXMLParser();
new OfficeParser();
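
Put together, the workaround might look roughly like the sketch below. It assumes the Tika Maven artifacts listed earlier are referenced through IKVM.Maven.Sdk, so the Java packages show up as C# namespaces. `TikaText` and `Extract` are hypothetical names, and this has not been verified against every file type.

```csharp
using org.apache.tika.metadata;
using org.apache.tika.parser;
using org.apache.tika.parser.microsoft;
using org.apache.tika.parser.microsoft.ooxml;
using org.apache.tika.sax;

public static class TikaText
{
    public static string Extract(string path)
    {
        // Instantiating the parsers forces their assemblies to load, so that
        // AutoDetectParser can find them (the workaround described above).
        _ = new OOXMLParser();
        _ = new OfficeParser();

        var parser = new AutoDetectParser();
        var handler = new BodyContentHandler(-1); // -1 disables the write limit
        var stream = new java.io.FileInputStream(path);
        try
        {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        }
        finally
        {
            stream.close();
        }
        return handler.ToString();
    }
}
```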

@MattBayliss

> Hmm, interesting find.
>
> AutoDetectParser works, if you create an instance of specific parsers, even if you're not using them:
>
> new OOXMLParser();
> new OfficeParser();

Nice find! That helps heaps! Here's my experiment so far - with your find included:
https://github.com/MattBayliss/TikaTest-IKVM.Maven.Sdk

I assume IKVM.Maven.Sdk has smarts not to include libraries that aren't directly referenced - I think that's what the IKVM.Maven.Sdk README is talking about in the Transitive Dependencies section... although I can't figure out / successfully duck-duck-go what a TFM is.

In the meantime I thought I needed to get logging working - because I assumed that would tell me where the problem was. Did you have the issue with SLF4J?

SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.

The Tika documentation had pages on logging config, which I tried to emulate with MavenReferences in that repo of mine.

I abandoned Tika research to just get SLF4J working with MavenReferences, assuming I was having the same underlying issue - it not finding a library (SLF4J Simple Provider) that I had included as a MavenReference. I asked about that on StackOverflow to no success so far.

I'll keep working on my repo, and let you know if I have any further luck.

@Arextion

Arextion commented Sep 11, 2023

I ended up just preloading all parser assemblies like this:

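using System.IO;
using System.Linq;
using System.Reflection;
using System.Text.RegularExpressions;

// Find every DLL under the application's directory whose path contains ".parser."
string assemblyPath = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
string[] assemblies = Directory.GetFiles(assemblyPath, "*.dll", SearchOption.AllDirectories);
var parserAssemblies = assemblies.Where(p => Regex.IsMatch(p, ".*\\.parser\\..*\\.dll$"));

// Load each one up front so the parsers are already present when Tika looks for them
foreach (string file in parserAssemblies)
{
    Assembly.LoadFile(file);
}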

But, from my initial test, the parsing is very slow, and much slower than using TikaOnDotNet, for some reason. So I actually tried to find another way and came across Tika-Server.

I've already created a custom Dockerfile and have it running in Docker Desktop using both the /tika (synchronous) and /async (asynchronous) parsing.

This article was very helpful.

It seems much faster and very easy to setup. Just create a simple http client to PUT files to the /tika endpoint and you get the extracted text in response.
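
A minimal sketch of such a client, assuming Tika Server is listening on its default port 9998 on localhost. `TikaServerClient` and `ExtractTextAsync` are hypothetical names; the `Accept: text/plain` header asks the /tika endpoint for plain text.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class TikaServerClient
{
    private static readonly HttpClient Http = new HttpClient();

    // PUTs the file to {baseUrl}/tika and returns the extracted text.
    public static async Task<string> ExtractTextAsync(
        string filePath, string baseUrl = "http://localhost:9998")
    {
        // Disposing the StreamContent also disposes the underlying file stream.
        using var body = new StreamContent(File.OpenRead(filePath));
        using var request = new HttpRequestMessage(
            HttpMethod.Put, new Uri(new Uri(baseUrl), "/tika"))
        {
            Content = body
        };
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("text/plain"));

        using HttpResponseMessage response = await Http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

Usage would be along the lines of `string text = await TikaServerClient.ExtractTextAsync("files/Tika.rtf");`.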

Here's my Dockerfile and tika-config:
Tika-Server Docker.zip

@Arextion

Yes, SLF4J did not work for me either.

@wasabii

wasabii commented Sep 11, 2023

It's the same situation. Have to preload things that are dynamic.

@isaaclisg

Is this still ongoing? I'd love to have it :(

@dylanlangston

dylanlangston commented Jun 21, 2024

Hi @isaaclisg! I submitted the initial pull request for .NET core support (nearly two years ago now 😳). I hope I'm not speaking incorrectly here (@KevM or anyone else for that matter), but I think this effort is more or less abandoned. Somebody please do correct me if I'm wrong!

My original use case for this was primarily motivated by simply not wanting to use the Tika library in Java or mess around with Tika Server. I really wanted to use C#, but that use case ultimately ended up using the Tika library in Java, due primarily to performance issues with the .NET Core version, which others have mentioned. Without knowing your use case, using Tika directly via the Java API or Tika Server would be my general recommendation.

FYI, the .NET Core support was only made possible thanks to the AMAZING efforts of https://github.com/ikvmnet/ikvm and primarily @wasabii. I haven't messed around with any of this in a long time now, but when I did, the documentation on that project was vital to getting things even kind of working. Per that project, IKVM explicitly recommends against redistributing recompiled Java libraries, as TikaOnDotNet does today, and instead says to use IKVM.Maven.Sdk.

@wasabii

wasabii commented Jun 21, 2024

I would love to know more about that performance issue.

@dylanlangston

The use case I targeted was pretty niche and needed to run as Microservices in AWS. We had AWS Lambda functions written in both C# and Java at one point to evaluate the performance of the two on the same documents. The long and short of it is the Lambdas which used Tika via IKVM ended up being slower and using more cpu/memory. This was a .NET shop but we went with the Java version in the end because it would be cheaper for us to run. 🤷‍♂️ Unfortunately all that work is protected under an NDA and I no longer have access to the code to share anyways. It was long enough ago now too that things may have changed so please take all this with a grain of salt.

I think IKVM is fantastic btw, thank you for your work!

@KevM
Owner Author

KevM commented Jun 21, 2024 via email
