Obtaining Wikipedia as a Python dictionary

We adapt the Wikipedia extractor from https://github.com/attardi/wikiextractor (all code is available in the arwiki folder). Starting from a Wikipedia dump, we build a Python dictionary keyed by article title, so that an article can be accessed as:

wikipedia['لبنان'] = ["لبنان أو (رسمياً: الجمهوريّة اللبنانيّة)، هي دولة عربية واقعة في الشرق الأوسط في غرب القارة الآسيوية.", ... ]

Steps: All scripts here are located in the arwiki folder.

Note: The WikiExtractor command below creates a set of folders named AA, AB, ... in your TEMP_DIRECTORY and takes up to 10 minutes (there are 660k articles in total).

python WikiExtractor.py ^
arwiki-20190201-pages-articles-multistream.xml ^
--processes 16 ^
--o . ^
--no-templates ^
--json
  • Using the output of WikiExtractor, build a Python dictionary of Arabic Wikipedia and save it in pickle form (if you are not familiar with pickle, see https://wiki.python.org/moin/UsingPickle; we use it extensively here). Pick an OUTPUT_DIRECTORY:
python arwiki_to_dict.py ^
-i TEMP_DIRECTORY ^
-o OUTPUT_DIRECTORY

This command creates a 1.2GB file called arwiki.p in your output directory; this is your pickled Wikipedia.
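For readers who want to see what this conversion step amounts to, here is a minimal sketch. It is not the actual arwiki_to_dict.py. It assumes WikiExtractor's --json output, where each file under AA, AB, ... holds one JSON object per line with "title" and "text" fields, and it assumes each article is stored as a list of its non-empty text lines. The function names build_wiki_dict and save_wiki_dict are hypothetical.

```python
import json
import os
import pickle


def build_wiki_dict(extracted_dir):
    """Map each article title to a list of its non-empty text lines.

    Assumption: every file under extracted_dir is WikiExtractor --json
    output, i.e. one JSON object per line with "title" and "text" keys.
    """
    wikipedia = {}
    for root, _dirs, files in os.walk(extracted_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    article = json.loads(line)
                    # Splitting on newlines is an assumption; the real script
                    # may segment the text differently.
                    paragraphs = [
                        p.strip() for p in article["text"].split("\n") if p.strip()
                    ]
                    wikipedia[article["title"]] = paragraphs
    return wikipedia


def save_wiki_dict(wikipedia, output_dir, filename="arwiki.p"):
    """Pickle the dictionary into output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, filename)
    with open(out_path, "wb") as handle:
        pickle.dump(wikipedia, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return out_path
```

The whole dictionary is held in memory before it is written, which seems consistent with a single 1.2GB pickle but may be tight on machines with little RAM.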

  • You can now safely delete your TEMP_DIRECTORY.
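Once arwiki.p exists, loading it back is a standard pickle read. The sketch below assumes the file holds a dictionary mapping titles to lists of strings, as in the example at the top. The helper names load_wikipedia and first_paragraphs are illustrative and not part of the arwiki scripts.

```python
import pickle


def load_wikipedia(pickle_path):
    """Load the pickled title -> list-of-strings dictionary."""
    # Only unpickle files you created or trust: pickle can run arbitrary code.
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)


def first_paragraphs(wikipedia, title, n=1):
    """Return up to n leading text entries for a title, or [] if it is absent."""
    return wikipedia.get(title, [])[:n]


# Example usage (the path is a placeholder for your own OUTPUT_DIRECTORY):
# wikipedia = load_wikipedia("OUTPUT_DIRECTORY/arwiki.p")
# print(first_paragraphs(wikipedia, "لبنان"))
```

Loading reads the full file into memory, so expect it to need at least as much RAM as the file size on disk.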