Allow exclusion of specific articles in WikiProject Selection to avoid projects failing due to a few missing articles #737
Comments
Thanks for the report/write-up! A repro case would be great for this, if you have the list of WikiProjects and a few of the articles that were missing. That way I can check if we have "deleted/missing" information in our own database for them and can filter them out at that step. I honestly would rather implement this as #728, where you specify a Simple Selection to "set difference" (subtract), rather than a bespoke implementation just for WikiProject Selection.
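To make the "set difference" idea concrete, here is a minimal sketch of subtracting one list of article titles from another. It is only an illustration, not WP1's implementation: the function name `subtract_selection` and the sample titles are hypothetical, and it assumes both selections are available as plain lists of titles.

```python
def subtract_selection(selected, excluded):
    """Return the titles in `selected` that do not appear in `excluded`.

    Order of `selected` is preserved. Both arguments are assumed to be
    plain lists of article titles (a hypothetical representation).
    """
    def normalize(title):
        # MediaWiki treats spaces and underscores in titles as equivalent,
        # so compare on a single canonical form.
        return title.strip().replace(" ", "_")

    excluded_keys = {normalize(t) for t in excluded if t.strip()}
    return [
        t for t in selected
        if t.strip() and normalize(t) not in excluded_keys
    ]


# Hypothetical titles, for illustration only.
project_articles = ["Poetry", "Lost Poem", "Literature", "Old_Novel"]
missing_articles = ["Lost_Poem", "Old Novel"]
remaining = subtract_selection(project_articles, missing_articles)
```

Because the subtraction works on titles alone, the same approach would apply to any selection type, which is the appeal of doing it generically rather than only for WikiProject Selection.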
@audiodude Info below for easy repro (using WikiProject tool). I also attach the

I suppose I could manually subtract the failing articles from the

In sum, I think there's a lot of value in being able to select a set of related WikiProjects rather than forcing the user to list all the articles in the projects. The underlying problem seems to be that a lot of projects are not well maintained, so it's quite likely other users will run into this issue unless their needs are pretty basic (like a handful of articles they want to archive).

List of WikiProjects selected:
List of not-found articles (or assets) I extracted from the task failure's JSON with a simple search-replace regex on the contents of the error key:
Thanks for the detailed repro information! I think before we build new features, we should focus on the first part of your bug report, where "the more this process could be automated, the better". The theory that the WikiProjects are somehow out of date makes sense, but it's not true. WP1 builds its knowledge of what articles are in what project based on the categories the article is placed in on its talk page. So it's WP1 itself that's out of date. This is actually a legitimate bug, #738. We don't actually query the wiki replica database to see whether an article has been deleted; we just check if it's been moved and mark it as quality=NotAClass/importance=NotAClass, then clean it up later. The bug describes the rest. What's happening is that many of those articles which are showing up in the Selection are
I agree absolutely. The only reason I could have for excluding specific articles or assets from a Project compilation would be because they are causing mwOffliner to fail. I don't know why they cause it to fail, as mwOffliner should ignore 404s if I've understood correctly: maybe there's a limit to the number of 404s it will ignore before bailing. In any case, if the logic for detecting entries for exclusion can be made more robust, it would be a good solution.
Summary: WikiProjects are often not up-to-date and may have several articles that are no longer in Wikipedia (or whose title has changed). This appears to cause mwOffliner to abort (I know that's an old chestnut, but it is what it is). So, we need a way to provide a list of articles to exclude when editing the WikiProject Selection. Currently there is only a way to exclude whole (sub) Projects (unless the exclusion list is more flexible than it appears to be).
Detail: So, I tested this great tool by trying to build a Wikipedia ZIM of literary topics, with a few projects like Literature, Poetry, etc. This all went smoothly, except that the ZIM failed to build due to a bunch of missing articles that are no doubt referenced in out-of-date projects. You can see the rather large list of failures I got here. I was able to use that JSON output to make a simple list of the articles I need to exclude, only to find that there is no way to exclude individual articles (as opposed to individual projects) in the WikiProject Selection / editing tool. While it can probably be done in SPARQL, realistically users are not going to spend the time learning that syntax, which looks positively evil 😈...
The more this process could be automated, the better (I know, easier said than done).