Select | Performance issue with large file #97
Thank you for looking into the location of the performance bottleneck. We already have an open issue #87 about this, so I am going to close this one.
@JLRishe FYI, it isn't related to version 27. Parsing took the same time in versions 23 and 24 as well.
@markb-trustifi Thank you for clarifying. I will reopen this issue for now.
Hi @markb-trustifi, I've tested your use case with the file you attached. It didn't take 3 minutes on my machine, but it did take 30 seconds, which I still think is far too long. With the changes I'm proposing in #108, it takes 1.5 seconds on my machine. You may want to give the modified code a try.
I am seeing a similar severe performance issue with a large file. My file is relatively simple: a few levels deep there is a very large number of self-closing child nodes with a few attributes, and I'm selecting them quite directly. Both of these queries:
seemingly hang the process. I don't know whether it's stuck in an endless busy loop or will eventually exit, but as I write this the process has been running for 12 minutes on a fairly fast machine and hasn't finished. Meanwhile,
I'm also experiencing some serious performance issues. My application used to take 30 seconds to load XML files on startup, and now that the source files have grown, it's taking about 30 minutes. Almost all of that time is spent in XPath queries. My XML files are relatively flat, and my two largest files are 6 MB and 32 MB. I could be doing something wrong, but I've been seeing worrying and inconsistent performance in my benchmarks. I tested with the flatter 6 MB file, and
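Inconsistent timings like the ones described above can be a measurement problem as much as a library problem. As a generic aid (not part of this library, and with hypothetical names like `benchmark`), here is a minimal Node sketch that warms up before timing and reports a median, which tends to be more stable than a single run:

```javascript
// Hypothetical micro-benchmark helper: warm up first to reduce JIT-related
// variance, then report the median of several timed runs in milliseconds.
function benchmark(fn, { warmup = 3, runs = 7 } = {}) {
  for (let i = 0; i < warmup; i++) fn();
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn();
    times.push(Number(process.hrtime.bigint() - start) / 1e6); // ns -> ms
  }
  times.sort((x, y) => x - y);
  return times[Math.floor(times.length / 2)]; // median
}

// Stand-in workload; in practice you would wrap your actual xpath select here.
const medianMs = benchmark(() => {
  let s = 0;
  for (let i = 0; i < 1e6; i++) s += i;
  return s;
});
console.log(typeof medianMs === "number"); // true
```

Comparing medians before and after a change (or between library versions) gives a fairer picture than one-off wall-clock timings.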
We also have an app that is experiencing performance issues with XPath selects using this library. We're dealing with ~10-60 MB files with a fairly complex node structure. The most complicated files we process (~25 MB, with tens of thousands of repeated nodes) are not the largest we handle, but they take the longest; over repeated runs, our app takes ~800 seconds on average to process an 18 MB file using the current version of this library. We are currently using a modified version of this library that incorporates the changes in the unmerged PR #107 (PR #107 has merge conflicts, but the same change has been redone as PR #120, which has none), and using that fork reduces the processing time to 200-250 seconds. So PR #120 gives a ~75% performance improvement in at least one real-world use case. For us, 200-250 seconds is still far too long, and we're hitting timeout issues, so we're considering our options. Are there any updates on whether the unmerged but mergeable performance fix that drops `unshift` will be merged? It would not be ideal to modify even further one of the forks of this library that already incorporates the performance gain from dropping `unshift`.
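For context on why dropping `unshift` helps so much: `Array.prototype.unshift` moves every existing element on each call, so accumulating n results that way costs O(n²), while `push` followed by a single `reverse` is O(n). This is a minimal standalone sketch of that difference, not the library's actual code:

```javascript
// Accumulate 0..n-1 in reverse order via unshift: every call shifts all
// existing elements, so the total work is quadratic in n.
function buildWithUnshift(n) {
  const out = [];
  for (let i = 0; i < n; i++) {
    out.unshift(i); // O(i) element moves on iteration i
  }
  return out;
}

// Same result via push + one reverse: amortized O(1) per append,
// plus a single O(n) pass at the end.
function buildWithPush(n) {
  const out = [];
  for (let i = 0; i < n; i++) {
    out.push(i);
  }
  return out.reverse();
}

const n = 20000;
const a = buildWithUnshift(n);
const b = buildWithPush(n);
console.log(a.length === b.length && a[0] === b[0] && a[n - 1] === b[n - 1]); // true
```

With tens of thousands of matched nodes, the quadratic variant easily dominates the runtime, which is consistent with the large speedups reported above.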
@nick-hunter @simon-20 Sorry to hear that you are both experiencing performance issues. I can try to get the `unshift` change merged and published in the next week or so. One question: what are you using for your XML DOM? If it's @xmldom/xmldom, please note that a change has been made to that package that should offer significant performance benefits when querying it from this package, but it looks like those changes are still only in a beta release. So if you are using @xmldom/xmldom, I would suggest trying the latest beta version of that package to see if it makes a difference.
@JLRishe thanks for the info! I am using
@nick-hunter Thank you for checking on that. I guess I had assumed that the newly added implementation of
Thanks @JLRishe, we are using
XPath versions: 23, 24, 27.
Selecting from a large file (~70,000 records) takes about 3 minutes, and the thread is blocked the whole time.
bigxmlfile.xml.zip
Most of the time is spent in the loop in this function: