Fetch Wikipedia articles by category tree. Give it any category and it walks the subcategory tree recursively, then outputs clean JSONL.
Two modes:
- fetch — full article wikitext
- map — article metadata + every category each article belongs to
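As a rough illustration of what walking the subcategory tree with a depth limit involves, here is a minimal sketch. It is not the tool's actual code: `walk_category`, `list_members`, and the in-memory `FAKE_TREE` are hypothetical names made up for this example. The real script presumably gets category members from the MediaWiki API (`list=categorymembers`), and the exact meaning of "depth" here (root is level 0) is an assumption.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interface: given a category title, return
# (article titles, subcategory titles) that are direct members of it.
Lister = Callable[[str], Tuple[List[str], List[str]]]


def walk_category(root: str, list_members: Lister, max_depth: int = 3) -> List[str]:
    """Collect article titles under `root`, descending at most `max_depth` levels."""
    seen_categories: Set[str] = {root}   # category graphs can contain cycles
    seen_articles: Set[str] = set()      # an article may sit in several categories
    ordered: List[str] = []
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        articles, subcategories = list_members(category)
        for title in articles:
            if title not in seen_articles:
                seen_articles.add(title)
                ordered.append(title)
        if depth < max_depth:
            for sub in subcategories:
                if sub not in seen_categories:
                    seen_categories.add(sub)
                    queue.append((sub, depth + 1))
    return ordered


# Stand-in for the network call, so the sketch runs offline.
FAKE_TREE: Dict[str, Tuple[List[str], List[str]]] = {
    "Category:A": (["Article 1"], ["Category:B"]),
    "Category:B": (["Article 2", "Article 1"], ["Category:C", "Category:A"]),
    "Category:C": (["Article 3"], []),
}


def fake_lister(category: str) -> Tuple[List[str], List[str]]:
    return FAKE_TREE.get(category, ([], []))


print(walk_category("Category:A", fake_lister, max_depth=1))
```

The `seen_categories` set matters because a subcategory can link back to one of its ancestors; without it a cyclic category graph would be revisited until the depth limit is hit.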

    pip install requests

    # Fetch full article text for a category tree
    python wikifetch.py --root "Category:Balkan Wars" --mode fetch --output balkan_wars.jsonl

    # Category membership map (lightweight — no text download)
    python wikifetch.py --root "Category:World War I" --mode map --output wwi_map.jsonl

    # Control recursion depth (default 3)
    python wikifetch.py --root "Category:Young Bosnia" --mode fetch --depth 2 --output young_bosnia.jsonl

    # Append to an existing file — skips already-fetched articles
    python wikifetch.py --root "Category:July Crisis" --mode fetch --output corpus.jsonl --append

    # Slow down for polite crawling
    python wikifetch.py --root "Category:Gallipoli campaign" --mode fetch --output gallipoli.jsonl --sleep 1.0

fetch mode:

    {"pageid": 12345, "title": "Battle of Cer", "wikitext": "...raw wikitext..."}

map mode:

    {
      "pageid": 12345,
      "title": "Gavrilo Princip",
      "categories": ["Category:Assassins", "Category:Young Bosnia", "..."],
      "category_count": 14
    }

| Option | Default | Description |
|---|---|---|
| `--root` | required | Wikipedia category to start from |
| `--mode` | `fetch` | `fetch` for wikitext, `map` for category membership |
| `--output` | required | Output JSONL file path |
| `--depth` | `3` | Subcategory recursion depth |
| `--append` | off | Skip already-fetched articles in output file |
| `--sleep` | `0.5` | Seconds between API calls |
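
To make the `--append` behaviour concrete, the sketch below shows one plausible way to skip already-fetched articles: read the existing JSONL file, collect the `pageid` values, and write only records whose `pageid` is new. The helper names `load_seen_pageids` and `append_new` are hypothetical, and keying on `pageid` is an assumption; the script may deduplicate differently.

```python
import json
import os


def load_seen_pageids(path):
    """Return the set of `pageid` values already present in a JSONL file."""
    seen = set()
    if not os.path.exists(path):
        return seen
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than abort
            if isinstance(record, dict) and "pageid" in record:
                seen.add(record["pageid"])
    return seen


def append_new(path, records):
    """Append records whose pageid is not yet in the file; return how many were written."""
    seen = load_seen_pageids(path)
    written = 0
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            if record["pageid"] in seen:
                continue
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            seen.add(record["pageid"])
            written += 1
    return written
```

Because each record is a single line, the same `load_seen_pageids` pass works on both fetch-mode and map-mode output.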

wookiedigs identifies itself via the User-Agent header and sleeps between requests. The default `--sleep 0.5` is generally fine; for large crawls, consider `--sleep 1.0`. Wikipedia's API is free and open, so please don't hammer it.
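
If you script your own requests alongside the tool, a small throttle keeps them equally polite. The sketch below is an assumption-laden illustration, not the tool's code: `Throttle` and the `HEADERS` value are hypothetical, and unlike a fixed sleep it waits only for whatever part of the interval has not already elapsed.

```python
import time

# Hypothetical User-Agent string; replace with your own project and contact details.
HEADERS = {"User-Agent": "example-crawler/0.1 (contact: you@example.org)"}


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use (network call left as a comment so the sketch runs offline):
#   throttle = Throttle(min_interval=1.0)
#   for title in titles:
#       throttle.wait()
#       requests.get(api_url, params=..., headers=HEADERS)
```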