Is your feature request related to a problem? Please describe.
The extractor takes the maximum file name length into consideration and creates sub-directories based on the URL.
http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)
This is fine for storage purposes, but the output does not behave like a database.
Issues:
Describe the solution you'd like
A linear (flat) layout where each folder contains files named with the SHA-1 hash of their respective URLs.
$ cat output/github.com/extracted/
00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d
The links.txt file is renamed to links.json with the following content:
{
  "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
  "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
  "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}
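A minimal Python sketch of how the proposed layout could work, not the project's actual implementation. The function names `url_to_name`, `store_page`, and `load_page` are hypothetical. The sketch also assumes that URLs are hashed as UTF-8 bytes without normalization, and that `links.json` sits next to the `extracted` folder; the issue does not specify either detail.

```python
import hashlib
import json
from pathlib import Path


def url_to_name(url: str) -> str:
    """Return the 40-character hex SHA-1 digest of the URL.

    Assumption: the URL is hashed as-is (UTF-8 bytes, no normalization).
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def store_page(site_dir: Path, url: str, content: str) -> str:
    """Save content as site_dir/extracted/<sha1(url)> and update links.json."""
    extracted = site_dir / "extracted"
    extracted.mkdir(parents=True, exist_ok=True)

    name = url_to_name(url)
    (extracted / name).write_text(content, encoding="utf-8")

    # Assumption: links.json lives in site_dir and maps hash -> URL.
    index_path = site_dir / "links.json"
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    index[name] = url
    index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return name


def load_page(site_dir: Path, url: str) -> str:
    """Read a stored page back directly from its URL, without scanning links.json."""
    return (site_dir / "extracted" / url_to_name(url)).read_text(encoding="utf-8")
```

Because the file name is derived from the URL, a lookup only needs to recompute the hash; `links.json` is needed only for the reverse direction (hash to URL). Fixed-length names also sidestep the maximum file name length concern.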
Describe alternatives you've considered
Storing all URLs in one big flat directory also carries a performance overhead (O(N) lookups).
Possible options:
PROxZIMA