TODO: MEF Module Roadmap

Planned features and improvements for the perustats.MEF module.


🔜 In Progress / Next Up

Progress Bars

Add rich visual feedback during long-running scrapes using the rich library's progress utilities.

  • [ ] Step-level progress bar showing Step N / Total as each navigation level is entered
  • [ ] Row-level progress bar within each step showing Row N / M (current_value)
  • [ ] Display the current metadata context (year, filters applied) in the progress description
  • [ ] Configurable verbosity: verbose=False to silence all output, verbose=True (default) for progress bars
  • [ ] Nested progress bars for multi-level workflows using rich.progress.Progress with TaskID
# Planned API (subject to change)
MEFScraper(steps, verbose=True).run(2023)
# ↓
# Scraping MEF gasto 2023 ━━━━━━━━━━━━━━━━━━━━━━ 100% Step 3/4
#   TipoGobierno → Locales  ━━━━━━━━━━━━━━━━━━━━━ 100% Row 2/2
#   Departamento            ━━━━━━━━━━━━━━━━━╸━━━  72% Row 18/25

Partial Saves (SavePartial)

SavePartial is already parsed and accepted in the step list but is not yet active. Implementation will:

  • [ ] Write a Parquet or CSV checkpoint after completing all row iterations at the marked step
  • [ ] Detect existing checkpoints on startup and skip already-scraped rows
  • [ ] Allow resuming an interrupted scrape without restarting from zero
  • [ ] Expose a resume=True flag on MEFScraper.run() to opt into checkpoint recovery
# Planned behavior
steps = [
    ...
    Rows(),
    ClickBtn(BTN.DEPARTAMENTO),
    SavePartial(filename_prefix="departamento"),  # will checkpoint here
    ...
]
MEFScraper(steps, master_dir_save="./data/mef/").run(2023)
# If interrupted and re-run, already-saved departments are skipped
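
The resume behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the helper names `load_done_keys` and `scrape_with_checkpoint`, the `key`/`value` column layout, and the per-row CSV append are all assumptions made for the example (the roadmap itself plans a Parquet or CSV checkpoint per completed step).

```python
import csv
import tempfile
from pathlib import Path


def load_done_keys(checkpoint: Path) -> set[str]:
    """Return the row keys already stored in the checkpoint file (hypothetical helper)."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(newline="", encoding="utf-8") as fh:
        return {row["key"] for row in csv.DictReader(fh)}


def scrape_with_checkpoint(keys, fetch, checkpoint: Path) -> list[str]:
    """Fetch every key not yet checkpointed and return the keys fetched in this run."""
    done = load_done_keys(checkpoint)
    needs_header = not checkpoint.exists() or checkpoint.stat().st_size == 0
    fetched = []
    with checkpoint.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["key", "value"])
        if needs_header:
            writer.writeheader()
        for key in keys:
            if key in done:
                continue  # saved by an earlier run, so skip it
            writer.writerow({"key": key, "value": fetch(key)})
            fh.flush()  # persist each row so an interruption loses little work
            fetched.append(key)
    return fetched


# Simulate an interrupted run followed by a re-run over a longer row list.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "departamento.csv"
    first_run = scrape_with_checkpoint(["Lima", "Cusco"], len, path)
    second_run = scrape_with_checkpoint(["Lima", "Cusco", "Puno"], len, path)
```

In this sketch the second call only fetches `Puno`, because the other two keys are already in the checkpoint. A real implementation would also need to decide how to treat a checkpoint written with different filters or a different year.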

📋 Planned Features

Async / Concurrent Requests

  • [ ] asyncio-based HTTP requests to parallelize row iterations at each step
  • [ ] Configurable concurrency limit (max_workers) to avoid overwhelming the MEF server
  • [ ] Estimated time remaining based on rows completed
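
One common way to cap concurrency in asyncio is a semaphore around each request. The sketch below assumes that approach; `fetch_row` is a hypothetical stand-in that sleeps instead of calling the MEF portal, and `gather_rows` is an illustrative name, not part of the module. Only `max_workers` comes from the roadmap item above.

```python
import asyncio


async def fetch_row(row: str) -> str:
    """Stand-in for one HTTP request (hypothetical; no network access here)."""
    await asyncio.sleep(0.01)
    return row.upper()


async def gather_rows(rows: list[str], max_workers: int = 4) -> list[str]:
    """Fetch all rows concurrently with at most max_workers requests in flight."""
    limiter = asyncio.Semaphore(max_workers)

    async def bounded(row: str) -> str:
        async with limiter:  # waits here while max_workers requests are running
            return await fetch_row(row)

    # gather returns results in input order, regardless of completion order
    return await asyncio.gather(*(bounded(row) for row in rows))


results = asyncio.run(gather_rows(["lima", "cusco", "puno"], max_workers=2))
```

Because `asyncio.gather` preserves input order, the rows can be reassembled into a table without extra bookkeeping.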

OnMissing Full Implementation

  • [ ] OnMissing.SKIP: silently skip steps where no row matches the filter
  • [ ] OnMissing.RAISE: raise ValueError on missing rows (useful for strict pipelines)
  • [ ] OnMissing.RECORD: log misses to a scraper.missing attribute for post-run inspection
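
The three policies could be dispatched roughly as follows. The member names come from the list above; the enum values, the `select_row` helper, and passing the miss log as a plain list are assumptions made for illustration, not the module's actual code.

```python
from enum import Enum


class OnMissing(Enum):
    # Member names follow the roadmap; the string values are assumed.
    SKIP = "skip"
    RAISE = "raise"
    RECORD = "record"


def select_row(rows, wanted, on_missing, missing):
    """Return the matching row, or None after applying the on_missing policy."""
    for row in rows:
        if row == wanted:
            return row
    if on_missing is OnMissing.RAISE:
        raise ValueError(f"no row matches filter {wanted!r}")
    if on_missing is OnMissing.RECORD:
        missing.append(wanted)  # kept for post-run inspection
    return None  # SKIP and RECORD both continue without a match
```

Here `missing` plays the role of the planned `scraper.missing` attribute.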

Retry Logic

  • [ ] Automatic retry with exponential backoff for network errors and MEF 500 responses
  • [ ] Configurable max_retries and retry_delay parameters on MEFScraper
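
A minimal sketch of exponential backoff, assuming the delay doubles after each failed attempt and that network failures surface as `ConnectionError`. The `with_retries` wrapper and its injectable `sleep` argument are hypothetical; only `max_retries` and `retry_delay` are named in the roadmap.

```python
import time


def with_retries(func, max_retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call func(); on ConnectionError wait retry_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, so surface the last error
            sleep(retry_delay * 2 ** attempt)
```

With `retry_delay=0.5`, the waits are 0.5 s, 1 s, 2 s, and so on. Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.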

Output Formats

  • [ ] .save() method to write results directly to CSV, Parquet, or Excel
  • [ ] Optional column renaming / localization to Spanish descriptive names

Extended Year Support

  • [ ] Monitor MEF portal for v8 migration and update URL/column configs accordingly

🐛 Known Limitations

  • The MEF portal is occasionally down or slow, and there is no retry logic yet.
  • SavePartial steps are parsed but do nothing at runtime.
  • OnMissing enum is defined but not yet enforced in _proces_step.
  • Very large result sets (all municipalities, all years) can be slow due to sequential HTTP requests.

💡 Ideas Under Consideration

  • A MEFScraper.preview() method that simulates the first step offline and returns the initial table without making further requests.
  • CLI entrypoint (perustats mef run ...) for one-off queries without writing a script.
  • Integration with perustats caching layer (shared with BCRP module) to avoid redundant requests.