darthcoder/wookiedigs

wookiedigs

Fetch Wikipedia articles by category tree. Give it any category and it walks the subcategory tree recursively, writing the results as clean JSONL.

Two modes:

  • fetch — full article wikitext
  • map — article metadata + every category each article belongs to
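The recursive walk can be sketched as a depth-limited, breadth-first traversal. This is a minimal illustration, not the code in wikifetch.py: the function name `walk_category_tree`, the injected `list_members` callable, and the reading of depth (root at depth 0, subcategories followed while the current depth is below the limit) are all assumptions. In a real run `list_members` would wrap the MediaWiki `list=categorymembers` query; here a small in-memory tree stands in for it so the sketch runs offline.

```python
from collections import deque

CATEGORY_NS = 14  # MediaWiki namespace number for Category: pages
ARTICLE_NS = 0    # main (article) namespace


def walk_category_tree(root, max_depth, list_members):
    """Return {pageid: title} for every article reachable from root.

    list_members(category_title) is a hypothetical callable returning
    dicts with "pageid", "ns" and "title" keys.
    """
    articles = {}
    seen = {root}                      # guards against category cycles
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in list_members(category):
            if member["ns"] == CATEGORY_NS:
                # Descend only while below the depth limit.
                if depth < max_depth and member["title"] not in seen:
                    seen.add(member["title"])
                    queue.append((member["title"], depth + 1))
            elif member["ns"] == ARTICLE_NS:
                # An article listed in several categories is kept once.
                articles.setdefault(member["pageid"], member["title"])
    return articles


# Illustrative in-memory tree (made-up data, including a cycle back to A).
FAKE_TREE = {
    "Category:A": [
        {"pageid": 1, "ns": 0, "title": "Article one"},
        {"pageid": 100, "ns": 14, "title": "Category:B"},
    ],
    "Category:B": [
        {"pageid": 2, "ns": 0, "title": "Article two"},
        {"pageid": 1, "ns": 0, "title": "Article one"},
        {"pageid": 101, "ns": 14, "title": "Category:A"},
        {"pageid": 102, "ns": 14, "title": "Category:C"},
    ],
    "Category:C": [{"pageid": 3, "ns": 0, "title": "Article three"}],
}


def fake_members(category):
    return FAKE_TREE.get(category, [])


if __name__ == "__main__":
    print(walk_category_tree("Category:A", 2, fake_members))
```

Tracking visited categories matters because Wikipedia's category graph is not a strict tree; without the `seen` set a cycle would be revisited until the depth limit ran out.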

Install

pip install requests

Usage

# Fetch full article text for a category tree
python wikifetch.py --root "Category:Balkan Wars" --mode fetch --output balkan_wars.jsonl

# Category membership map (lightweight — no text download)
python wikifetch.py --root "Category:World War I" --mode map --output wwi_map.jsonl

# Control recursion depth (default 3)
python wikifetch.py --root "Category:Young Bosnia" --mode fetch --depth 2 --output young_bosnia.jsonl

# Append to an existing file — skips already-fetched articles
python wikifetch.py --root "Category:July Crisis" --mode fetch --output corpus.jsonl --append

# Slow down for polite crawling
python wikifetch.py --root "Category:Gallipoli campaign" --mode fetch --output gallipoli.jsonl --sleep 1.0
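One way the --append behaviour could work is to read the page IDs already present in the output file and skip any record whose ID is in that set. The sketch below assumes deduplication is keyed on `pageid`, which the README does not state; the helper names `load_seen_pageids` and `append_new` are hypothetical.

```python
import json
import os


def load_seen_pageids(path):
    """Collect the pageid of every valid record already in a JSONL file."""
    seen = set()
    if not os.path.exists(path):
        return seen
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line)["pageid"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip lines that are not valid records
    return seen


def append_new(path, records):
    """Append only records whose pageid is not yet in the file.

    Returns the number of records written.
    """
    seen = load_seen_pageids(path)
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            if record["pageid"] in seen:
                continue
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            seen.add(record["pageid"])
            written += 1
    return written
```

Because each record is a single line, an interrupted run can be resumed by running the same command again with --append.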

Output schemas

fetch mode:

{"pageid": 12345, "title": "Battle of Cer", "wikitext": "...raw wikitext..."}

map mode:

{
  "pageid": 12345,
  "title": "Gavrilo Princip",
  "categories": ["Category:Assassins", "Category:Young Bosnia", "..."],
  "category_count": 14
}
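Since both modes write one JSON object per line, the output can be processed with the standard library alone. As a small example, assuming the map-mode schema shown above, the following counts how often each category appears across a file; `category_frequencies` is a hypothetical helper, not part of wookiedigs.

```python
import json
from collections import Counter


def category_frequencies(lines):
    """Count category occurrences across map-mode JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            counts.update(json.loads(line)["categories"])
    return counts


# Typical use with a file produced by --mode map:
#
#     with open("wwi_map.jsonl", encoding="utf-8") as fh:
#         for name, n in category_frequencies(fh).most_common(10):
#             print(n, name)
```

The function accepts any iterable of lines, so an open file handle can be passed directly and the file is streamed rather than loaded whole.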

Options

| Option | Default | Description |
|---|---|---|
| --root | required | Wikipedia category to start from |
| --mode | fetch | fetch for wikitext, map for category membership |
| --output | required | Output JSONL file path |
| --depth | 3 | Subcategory recursion depth |
| --append | off | Skip already-fetched articles in output file |
| --sleep | 0.5 | Seconds between API calls |

Wikipedia API etiquette

wookiedigs identifies itself via the User-Agent header and sleeps between requests. The default --sleep 0.5 is generally fine. For large crawls, consider --sleep 1.0. Wikipedia's API is free and open — please don't hammer it.
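The two habits described above, a descriptive User-Agent and a pause between calls, might look roughly like this. It is a sketch only: it uses the standard library's `urllib.request` so it runs without dependencies (wookiedigs itself installs `requests`), and the `USER_AGENT` string, `Throttle` class, and `build_request` helper are illustrative names, not taken from the project.

```python
import time
import urllib.request

# Hypothetical identifier; use a name and contact address of your own.
USER_AGENT = "example-crawler/0.1 (contact: you@example.org)"


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url):
    """Build a request object carrying the identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


# Typical use: call throttle.wait() before each API request.
#
#     throttle = Throttle(0.5)
#     throttle.wait()
#     with urllib.request.urlopen(build_request(url)) as resp:
#         body = resp.read()
```

Measuring from the previous call, rather than always sleeping a fixed amount, means time already spent processing a response counts toward the delay.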
