Extractor should use proper mechanism to extract and store URLs #32

Open
PROxZIMA opened this issue Feb 19, 2023 · 0 comments · May be fixed by #35
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

PROxZIMA (Owner) commented Feb 19, 2023

Is your feature request related to a problem? Please describe.

The extractor takes the maximum file name length into account and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)

This is fine for storage, but it does not behave like a database.

Issues:

  • File retrieval and merging of data for URL classification is complex.
  • A URL can be very long, but file names have length constraints.

Describe the solution you'd like
A flat layout in which a folder contains files whose names are the SHA-1 hashes of their respective URLs.

$ ls output/github.com/extracted/

00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d

The links.txt file is renamed to links.json with the following content:

{
    "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
    "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
    "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}
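
A minimal sketch of how this layout could be produced, assuming the URL is hashed as its UTF-8 bytes exactly as given (no normalisation). The helper names `url_to_name`, `store_page` and `write_index` are hypothetical and not part of the existing extractor:

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable


def url_to_name(url: str) -> str:
    """Return the SHA-1 hex digest of the URL: always 40 characters,
    regardless of how long the URL is."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def store_page(out_dir: Path, url: str, content: str) -> Path:
    """Write the extracted content to <out_dir>/<sha1(url)>."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / url_to_name(url)
    path.write_text(content, encoding="utf-8")
    return path


def write_index(index_path: Path, urls: Iterable[str]) -> Dict[str, str]:
    """Write a links.json style mapping of hash -> original URL."""
    index = {url_to_name(url): url for url in urls}
    index_path.write_text(json.dumps(index, indent=4), encoding="utf-8")
    return index
```

Because the hash is computed from the URL alone, retrieving a page needs only the URL: recompute the digest and open the file of that name, with `links.json` serving as the reverse mapping from file name back to URL.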

Describe alternatives you've considered

Storing URLs in one big flat directory also carries a performance overhead (O(N) lookups).

Possible options:

  • SQL DB
  • Neo4j
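
For the SQL option, a rough sketch using Python's built-in `sqlite3` module might look like the following. The table name `pages`, its columns, and the helper functions are assumptions made for illustration only; the issue does not specify a schema or a particular database engine:

```python
import hashlib
import sqlite3
from typing import Optional


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store. The PRIMARY KEY on url_hash gives
    indexed lookups instead of a directory scan."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url_hash TEXT PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    return conn


def _hash(url: str) -> str:
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def save_page(conn: sqlite3.Connection, url: str, content: str) -> None:
    """Insert the page, replacing any earlier content stored for the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url_hash, url, content) VALUES (?, ?, ?)",
        (_hash(url), url, content),
    )
    conn.commit()


def load_page(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Return the stored content for the URL, or None if it was never saved."""
    row = conn.execute(
        "SELECT content FROM pages WHERE url_hash = ?", (_hash(url),)
    ).fetchone()
    return row[0] if row else None
```

With this shape, the URL-to-content mapping and the content itself live in one place, so no separate index file is needed.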

PROxZIMA self-assigned this Feb 19, 2023
PROxZIMA added the enhancement and question labels Feb 19, 2023
PROxZIMA linked a pull request Feb 26, 2023 that will close this issue
Projects
Status: 🏗 In progress