HacktoberFest #392

Closed
wvanderp opened this issue Sep 25, 2020 · 5 comments

@wvanderp
Contributor

wvanderp commented Sep 25, 2020

Hello,

It is nearly October, so Hacktoberfest is just around the corner.

I'd like to spend my Hacktoberfest here!

So, my questions are: what are some tickets you really want solved? And are there any issues that are not yet captured by tickets?

My strengths are vanilla JS and TypeScript, I'm willing to write tests, and I have done some infrastructure work.

I'm looking forward to hearing from you and contributing to this repo.

-wvanderp

@spencermountain
Owner

spencermountain commented Sep 25, 2020

hey cool! Welcome wouter, I really appreciate that.
Feel-free to jump in anywhere you'd like. There's a lot to improve, and nothing is off-limits.

Big problems:

  • Ordering - I made this library for getting data out of wikipedia, but people are very-often using it to render wikipedia articles back to users. The biggest problem is that the ordering of parts of a page is lost - we parse-out parts from the text, but then lose the sequence. There's a lot of guess-work re-building the page from the data.

  • Missing template whackamole - There's hundreds-of-thousands of en-wikipedia templates, many with ad-hoc behaviours, and then many-more in other languages, and other wikis. We do default to a 'generic parse' if we don't know the template, but there are any-number of easy-to-create templates (like the missing EuronextParis template, #391) that would reduce missing information. It would be neat to do this better.

  • infobox html - I think this would be a really fun one to do - we're good at parsing infobox data, but bad at rendering them nicely. Wikipedia has a bunch of rules for rendering things, and there's a lot of junk-data that it doesn't show. Generating nice-looking infoboxes would be a really fun project.

Less-scary

  • in-sentence duplicates - (duplicate matches in one sentence, #300) I think this is a solvable problem, or at least one that can be mitigated. Any ideas are welcome.

  • nested lists - We're gonna need to refactor the List class to support list indentation (V9: support list indentation, #360). I don't know what the best api would look like. Right now the List class is very simple.

  • transcluded data - more and more, we see data on a wikipedia page that is not included in the wikitext (Template transclusion, #223). Maybe we should have a Doc.getTranscluded() method to make a 2nd fetch, and get it. That would be cool. Same for wikidata: we have the wikidata id, so we could make wikidata integration easier.

  • nicer table getters - if you're parsing a table, one pain-point is that some row-headers are titlecased and some are not, so looking values up by name is a really brittle process. Maybe we should have a smarter Table.getData(['name', 'birth date']) or something similar.
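
A minimal sketch of what such a getter might do, assuming each table row is available as a plain object keyed by its header. The names normaliseKey and getData are hypothetical here and are not part of the library's current API; the idea is only that headers are compared after trimming, lower-casing, and collapsing separators.

```javascript
// Hypothetical helper, for illustration only (not the library's API).
// Normalise a header so 'Birth Date', 'birth_date' and ' birth date ' all match.
function normaliseKey(key) {
  return String(key).trim().toLowerCase().replace(/[\s_-]+/g, ' ')
}

// rows:   an array of plain objects, one per table row, keyed by header (assumed shape)
// wanted: the column names the caller cares about, in any casing
function getData(rows, wanted) {
  return rows.map((row) => {
    // index this row's values by the normalised form of each header
    const lookup = {}
    Object.keys(row).forEach((key) => {
      lookup[normaliseKey(key)] = row[key]
    })
    // build the result using the caller's own spelling of each column name
    const out = {}
    wanted.forEach((name) => {
      const value = lookup[normaliseKey(name)]
      // a column that is missing from this row comes back as null
      out[name] = value === undefined ? null : value
    })
    return out
  })
}
```

With this, getData(rows, ['name', 'birth date']) would find a column whether the table spelled it 'Birth Date' or 'birth_date', and the result keeps the spelling the caller asked for.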

Lastly, feel-free just to poke-around and see what you find. I'm sure there are big wins to make with performance, better regexes here and there, better unicode support, oh, and testing.

Any of the plugins are up for grabs. Some are really rough, and need some help - the wikitext one, for example, barely works and may be a silly idea, or maybe it isn't.

cheers

@wvanderp
Contributor Author

wvanderp commented Sep 28, 2020

Hello,

I looked at the repo and I have a few plans and questions.

First, to get to know the project a bit better, I implemented a fix for #391. I will be sending a PR for the fix shortly.

Then the next part is maybe too big of a change to drop at once but first some questions:

  • Is there a reason that eslint is missing from the dev dependencies?
  • Is there any interest in getting code coverage working again in the repo? I already have a small implementation running for my own purposes. It uses Istanbul to get the coverage from all tests, even the plugins.
  • I suspect that big parts of the code were written before ES6 and ES2020. Are there any rules about newer language constructs in this project?
  • I'm also curious why you chose to make installing this package from source more difficult for Windows users by using a bash for loop. Not an attack, just curious.

So, my big plan has two phases:

First, capturing the behavior of some big parts of the application, like Document.js and maybe section.js.

Then I want to use modern language features, like classes, the nullish coalescing operator, and optional chaining, to simplify and rewrite some complicated parts of the application.
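
As a rough illustration of the kind of simplification meant here, the sketch below reads a nested value both ways. The doc.infobox.image.file shape and the function names are made up for the example and do not describe the library's real data structures. Note that both operators need Node 14 or later, which matters for the question about supported Node versions.

```javascript
// Hypothetical data shape, for illustration only.

// Older style: guard every level by hand.
function imageFileOld(doc) {
  if (
    doc &&
    doc.infobox &&
    doc.infobox.image &&
    doc.infobox.image.file !== undefined &&
    doc.infobox.image.file !== null
  ) {
    return doc.infobox.image.file
  }
  return 'none'
}

// Optional chaining (?.) stops at the first null/undefined link in the chain,
// and nullish coalescing (??) falls back only for null/undefined, so an
// empty string survives (with || it would be replaced by 'none').
function imageFileNew(doc) {
  return doc?.infobox?.image?.file ?? 'none'
}
```

The two functions return the same result for missing objects, missing fields, and empty-string values; the second just says it in one line.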

I do not know if this is too big a step to take at once. But I'm still interested in bettering the test coverage and looking forward to the answers to the questions and your thoughts.

-wvanderp

edit: added a question

@spencermountain
Owner

hi @wvanderp - great questions.

  • eslint - yeah, I was trying to make the node_modules dir smaller, I think. I added npm i --no-save eslint to the github actions. Feel free to replace it. Maybe it's gotten smaller.

  • coverage - YES this would be great. I would love this. The readme badge is stale, as you've noticed.

  • ES6 - yeah, I am committed to having uncompiled code for green node versions. I'm confident this is best-practice still, right now. Have tried many alternatives, and not excited about changing it right now.

  • bash for-loop, do you mean this one? shelljs is cross-platform, supports windows.

great questions, keep em coming!

@wvanderp
Contributor Author

Thank you for the quick answer.

I was talking about the for loop here

"prepare": "for i in plugins/*/; do (cd \"$i\" && npm install) || exit 1; done",

Maybe it can also be replaced with a shelljs script.
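
One way that could look, as a sketch only: it uses Node's built-in fs and child_process modules rather than shelljs, so it needs no extra dependency, and the function names are hypothetical. It mirrors the loop above: visit each sub-directory of plugins/, run npm install there, and stop on the first failure.

```javascript
const fs = require('fs')
const path = require('path')
const { execSync } = require('child_process')

// List the immediate sub-directories of `root`, roughly what `plugins/*/` matches.
// Dot-directories are skipped, as a shell glob would skip them.
function pluginDirs(root) {
  return fs
    .readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && !entry.name.startsWith('.'))
    .map((entry) => path.join(root, entry.name))
    .sort()
}

// Default runner: `npm install` inside one plugin directory.
// execSync throws on a non-zero exit code, which ends the loop early,
// like the `|| exit 1` in the shell version.
function npmInstall(dir) {
  execSync('npm install', { cwd: dir, stdio: 'inherit' })
}

// `run` is injectable so the loop can be exercised without calling npm.
function installPlugins(root, run = npmInstall) {
  const dirs = pluginDirs(root)
  dirs.forEach((dir) => run(dir))
  return dirs
}
```

A small script (say scripts/prepare.js, a made-up path) could then call installPlugins('plugins'), with "prepare" in package.json pointing at it via node. An uncaught error from a failed install makes Node exit with a non-zero code.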

I have to see what is currently green in node versions to see about the second part of my master plan. But I will be looking at the test coverage infrastructure as well as the test coverage itself.

Have a nice evening (I'm from Europe so it's quite late for me)

@spencermountain
Owner

oh, yeah. that sucks. we should fix that! Good catch.

I don't really know what's in node LTS right now either. I just try to avoid being the dependency that ruins someone's day. I used to 'babel-down' for node/main, and there were weird problems from webpack. We can drop node 8, if it helps with your vision, that's cool. I could be convinced further, too.
cheers
