Faster way to read input files #852

mre · 2022-02-16T14:55:15Z

mre
Feb 16, 2022
Maintainer

After reading this comment, I was wondering what the downsides of trying bstr for reading inputs would be.

The way I see it:

- we need a read-only view anyway
- inputs don't necessarily have to be valid UTF-8

This would probably be the fastest way to read inputs if there was a way to stream the input to the extractor (which would be a bigger change).
Another alternative: memory maps.

This is just a thought for now. Would love to get people's opinions.
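To make the two points above concrete, here is a minimal sketch using only the standard library rather than bstr. The helper names `read_input_bytes` and `as_text` are made up for illustration and are not part of lychee. It reads the file as raw bytes, so invalid UTF-8 does not cause an error, and only produces a text view on demand.

```rust
use std::borrow::Cow;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: reads a file as raw bytes.
/// Unlike `fs::read_to_string`, this does not fail on invalid UTF-8.
fn read_input_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Hypothetical helper: returns a text view of the bytes.
/// Borrows when the bytes are valid UTF-8; allocates only when
/// invalid sequences have to be replaced with U+FFFD.
fn as_text(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("lychee_bytes_example.txt");
    // The byte 0xFF never occurs in valid UTF-8.
    fs::write(&path, b"see https://example.com \xFF end")?;

    // The string-based read rejects the file outright...
    assert!(fs::read_to_string(&path).is_err());

    // ...while the byte-based read succeeds.
    let bytes = read_input_bytes(&path)?;
    println!("{}", as_text(&bytes));

    fs::remove_file(&path)?;
    Ok(())
}
```

bstr would offer a richer byte-string API on top of the same idea; whether that is worth the extra dependency is exactly the open question here.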

lebensterben · 2022-02-16T16:06:05Z

lebensterben
Feb 16, 2022
Collaborator

Where do you intend to use bstr?

We already use crates to extract links and traverse HTML nodes. What else would we gain from using bstr?

mre · 2022-02-16T16:40:00Z

mre
Feb 16, 2022
Maintainer Author

We read the entire file into memory here and we pass that object around:

lychee/lychee-lib/src/types/input.rs

Line 51 in 6d56c6b

let input = fs::read_to_string(&path)?;

lychee/lychee-lib/src/types/input.rs

Lines 279 to 281 in 6d56c6b

let content = tokio::fs::read_to_string(&path)
    .await
    .map_err(|e| (path.clone().into(), e))?;

I was thinking of ways to speed that up.
Maybe bstr is not the right abstraction for that.
Maybe an mmap would be nicer for faster file I/O and fewer allocations?
I would love to have a "view" into the file instead of allocating.
Passing Cows around is tedious or at least it wasn't very elegant last time I tried.
Just brainstorming ideas for now.
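As a rough illustration of the "view" idea: if the consumer accepts `&[u8]`, it can hand back sub-slices that borrow from the one buffer instead of allocating a `String` per result. The scanner below, `find_url_slices`, is a deliberately naive, hypothetical stand-in and not how lychee's extractors work.

```rust
/// Hypothetical scanner: returns sub-slices of `content` that start with
/// `http://` or `https://` and run until the next ASCII whitespace,
/// `"`, `<`, or `>`. The returned slices borrow from `content`.
fn find_url_slices(content: &[u8]) -> Vec<&[u8]> {
    let mut found = Vec::new();
    let mut i = 0;
    while i < content.len() {
        let rest = &content[i..];
        if rest.starts_with(b"http://") || rest.starts_with(b"https://") {
            // Length of the candidate URL, up to the first delimiter.
            let len = rest
                .iter()
                .position(|&b| b.is_ascii_whitespace() || matches!(b, b'"' | b'<' | b'>'))
                .unwrap_or(rest.len());
            found.push(&rest[..len]);
            i += len;
        } else {
            i += 1;
        }
    }
    found
}

fn main() {
    // The trailing 0xFF shows that the input need not be valid UTF-8.
    let content: &[u8] =
        b"<a href=\"https://example.com/a\">x</a> and http://example.org/b \xFF";
    for url in find_url_slices(content) {
        println!("{}", String::from_utf8_lossy(url));
    }
}
```

The same signature would work whether the bytes come from `fs::read` or from a memory map, since both can be borrowed as `&[u8]`.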

lebensterben · 2022-02-16T17:17:59Z

lebensterben
Feb 16, 2022
Collaborator

The consumer of InputContent is in the lychee-lib::extract module.

Most of the time it's passed directly to html5gum or pulldown_cmark.

A performance improvement is possible, but I doubt it would be large.

A bigger gain could be had if html5gum started allowing unsafe code.

mre · 2022-02-17T12:08:52Z

mre
Feb 17, 2022
Maintainer Author

We could test it and run a benchmark. It probably also depends on the platform.
Ripgrep uses mmap, but for some reason not on macOS.

https://github.com/BurntSushi/ripgrep/blob/master/crates/searcher/src/searcher/mmap.rs

There are some general caveats mentioned in this article, but none we should be worried about for lychee. The article mentions that mmap was around 30% faster than pread. Not sure if that would be worth it. I still think it would be nice to stream the contents as read-only into the extractor without doing any allocations along the way.
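A first benchmark could look roughly like the sketch below. It uses only the standard library and compares `fs::read_to_string`, `fs::read`, and a chunked read through a reusable buffer. The helpers `time_it` and `read_chunked`, the file size, and the run count are all assumptions for illustration. mmap is left out because it needs a third-party crate (for example memmap2) or platform-specific code. Results will be dominated by the page cache, so treat the numbers as a rough comparison only.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` `runs` times and returns the mean duration.
/// `runs` must be greater than zero.
fn time_it<T>(mut f: impl FnMut() -> io::Result<T>, runs: u32) -> io::Result<Duration> {
    let start = Instant::now();
    for _ in 0..runs {
        // black_box keeps the compiler from discarding the result.
        std::hint::black_box(f()?);
    }
    Ok(start.elapsed() / runs)
}

/// Hypothetical helper: streams the file through a caller-provided buffer
/// and returns the total number of bytes seen. No per-file allocation.
fn read_chunked(path: &Path, buf: &mut [u8]) -> io::Result<usize> {
    let mut file = File::open(path)?;
    let mut total = 0;
    loop {
        match file.read(buf) {
            Ok(0) => return Ok(total),
            // A streaming extractor would consume `&buf[..n]` here.
            Ok(n) => total += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("lychee_read_bench.txt");
    fs::write(&path, "https://example.com\n".repeat(50_000))?;

    let runs = 20;
    let mut buf = vec![0u8; 64 * 1024];

    let as_string = time_it(|| fs::read_to_string(&path), runs)?;
    let as_bytes = time_it(|| fs::read(&path), runs)?;
    let chunked = time_it(|| read_chunked(&path, &mut buf), runs)?;

    println!("read_to_string: {as_string:?}");
    println!("read (bytes):   {as_bytes:?}");
    println!("chunked read:   {chunked:?}");

    fs::remove_file(&path)?;
    Ok(())
}
```

For numbers worth acting on, a proper harness with warm-up and statistical analysis would be a better fit than hand-rolled timing.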

mre · 2022-11-28T23:20:34Z

mre
Nov 28, 2022
Maintainer Author

Converting this issue to a discussion, since it doesn't track any kind of planned work. The benchmark would still be valuable, but this is not fleshed out enough to be tackled as an actionable item.
