How to download only pages of a category? #2066

Open
LAfricain opened this issue Jul 21, 2024 · 3 comments

@LAfricain

I would like to create a file containing all the pages in the Ancien_Testament category. I ran this:

mwoffliner --mwUrl=https://fr.wikipedia.org/ --getCategories=Ancien_Testament --outputDirectory=./Bible [email protected] --verbose

But it seems to download much more! How can I get only the pages in this category, and how can I add more than one category? For instance, I would like to have the Ancien_Testament and Nouveau_Testament categories together.

@LAfricain LAfricain changed the title How to donwload only page of a category? How to download only pages of a category? Jul 21, 2024
@kelson42 kelson42 self-assigned this Jul 21, 2024
@kelson42 kelson42 added this to the 1.14.0 milestone Jul 21, 2024
@audiodude
Member

Where are you getting the param --getCategories from? I don't see it in https://github.com/openzim/mwoffliner/blob/main/src/parameterList.ts

In general, mwoffliner has no concept of wiki "categories"; it only operates on "article lists".

However you could use WP1 to do this.

  1. Login to WP1: https://wp1.openzim.org/
  2. Go to https://wp1.openzim.org/#/selections/petscan to create a "Petscan collection".
  3. Select fr.wikipedia.org and use this Petscan URL in the URL field: https://petscan.wmcloud.org/?psid=28962290
  4. Wait for your selection and ZIM file to be created.
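If you would rather run mwoffliner locally than go through WP1, the "article list" idea above could be sketched roughly as follows. This is only an illustration: it assumes mwoffliner's `--articleList` option accepts a path to a UTF-8 file with one title per line, the helper names `write_article_list` and `build_mwoffliner_command` are hypothetical, and the email address and titles you pass in are your own placeholders. Nothing is executed; the second helper only assembles the command.

```python
from pathlib import Path


def write_article_list(titles, path):
    """Write one article title per line (UTF-8), skipping blanks and duplicates.

    Hypothetical helper: the titles would come from wherever you collected
    them, e.g. a plain-text export of a Petscan query.
    """
    seen = set()
    kept = []
    for title in titles:
        title = title.strip()
        if title and title not in seen:
            seen.add(title)
            kept.append(title)
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept


def build_mwoffliner_command(mw_url, admin_email, article_list, output_dir):
    """Return the mwoffliner argument list without running it.

    Assumption: --articleList takes the path of the file written above.
    """
    return [
        "mwoffliner",
        f"--mwUrl={mw_url}",
        f"--adminEmail={admin_email}",
        f"--articleList={article_list}",
        f"--outputDirectory={output_dir}",
    ]
```

The returned list can be handed to `subprocess.run` once you have checked the options against `mwoffliner --help` for your installed version.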

@LAfricain
Author

@audiodude thank you for the link to wp1.openzim.org; someone sent me there yesterday, and it can help me.
Thank you for the Petscan link, it's exactly what I wanted. But how do I add categories to the Petscan query? I would like to have both the Old and New Testament.

As for --getCategories, I found it in the output of:

mwoffliner --help
...
  --getCategories             [WIP] Download category pages

@audiodude
Member

Petscan takes a list of categories, each written exactly as the category name appears on the wiki. So for instance:

https://en.wikipedia.org/wiki/Thekla_(daughter_of_Theophilos)

Has the categories:

9th-century births | 9th-century deaths | (and others....)

You can put either or both of these on https://petscan.wmcloud.org/ in the "Categories" box. If you want everything from all of the categories, use the "Union" button under Combination. Also be sure to set the "Depth" to the appropriate value in order to get subcategories.
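For the two categories from the original question, the same form settings can also be written as a query URL. The sketch below is an assumption-heavy illustration, not a documented API: the parameter names (`language`, `project`, `categories`, `depth`, `combination`, `format`, `doit`) are what I recall the Petscan form submitting, so compare them with the URL your browser shows after you click "Do it!". The function name `build_petscan_url` and the default depth of 1 are hypothetical choices.

```python
from urllib.parse import urlencode

PETSCAN_BASE = "https://petscan.wmcloud.org/"


def build_petscan_url(categories, language="fr", project="wikipedia",
                      depth=1, combination="union", output_format="plain"):
    """Build a Petscan query URL for several categories.

    Assumed parameter names, to be checked against the real form:
    categories are newline-separated, and combination="union" returns
    pages that are in any of the listed categories.
    """
    if not categories:
        raise ValueError("at least one category is required")
    params = {
        "language": language,
        "project": project,
        "categories": "\n".join(categories),  # one category per line
        "depth": depth,                        # how many subcategory levels to follow
        "combination": combination,
        "format": output_format,
        "doit": "1",                           # assumed flag that runs the query
    }
    return PETSCAN_BASE + "?" + urlencode(params)


url = build_petscan_url(["Ancien_Testament", "Nouveau_Testament"])
print(url)
```

If the parameters match, opening the printed URL should run the query directly; the URL shown in the browser afterwards is the one to paste into the WP1 Petscan selection form.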

As for:

And for the --getCategories I got it in the

mwoffliner --help
...
  --getCategories             [WIP] Download category pages

This is an experimental feature in an older version of mwoffliner that was never fully developed. I also believe, from looking at the code, that the intent was to fetch the "Category pages" of articles, not to download articles based on a given category.

Hope this helps.

@audiodude audiodude removed this from the 1.14.0 milestone Jul 22, 2024