Directory Structure¶

INEIFetcher creates a predictable, self-contained directory tree under master_directory. Understanding the layout helps when inspecting downloads or integrating with downstream pipelines.

Full Layout¶

<master_directory>/
│
├── referrer.db                          # SQLite cache & progress database
│
└── microdatos_inei/                     # inei_directory (configurable)
    └── <survey>/                        # e.g. "enaho", "endes"
        │
        ├── 0_zips/                      # raw downloaded ZIP files
        │   ├── 2020_mod_0001.zip
        │   ├── 2020_mod_0002.zip
        │   └── ...
        │
        ├── 1_unzipped/                  # extracted ZIP contents
        │   ├── 2020_mod_0001/
        │   │   ├── enaho01-2020.dta
        │   │   └── ficha_tecnica.pdf
        │   └── ...
        │
        └── 2_organized/
            │
            ├── by_year/                 # organize_by="year"
            │   ├── 2020/
            │   │   ├── 0001_enaho01-2020.dta
            │   │   └── 0002_enaho02-2020.dta
            │   └── 2021/
            │       └── ...
            │
            ├── by_module/               # organize_by="module"
            │   ├── 0001_vivienda_hogar/
            │   │   ├── 2020_enaho01.dta
            │   │   └── 2021_enaho01.dta
            │   └── ...
            │
            └── documentation/           # deduplicated PDFs and docs
                ├── ficha_tecnica.pdf
                └── ...

Stage Descriptions¶

`0_zips/`¶

Holds the raw ZIP files downloaded from the INEI portal. Filenames follow the pattern <year>_mod_<code>.zip. ZIPs are validated after download; corrupted files are automatically re-fetched.

Set remove_zip_after_extract=True in download() to delete these after extraction and save disk space.

`1_unzipped/`¶

Each ZIP is extracted into its own subdirectory named <year>_mod_<code>/. The original filenames inside the ZIP are preserved.

`2_organized/`¶

This is the analysis-ready output. Files are placed in one or both of the two sub-schemes:

by_year/ — groups every module for the same year together. Useful when you process one year at a time.

by_module/ — groups all years for the same module together. Useful for longitudinal analysis.

documentation/ — aggregates all documentation files (PDFs, DOC, PPT, XLS …) across years. When deduplicate_docs_by_hash=True, identical files are stored only once regardless of filename, eliminating the redundant copies that INEI typically ships with each year's ZIP.

SQLite Database (`referrer.db`)¶

The database is created at <master_directory>/referrer.db (or a custom path via sql_file). It stores:

Module cache — scraped module listings per survey and year, so fetch_modules() does not repeat network requests.
Progress flags — downloaded, unzipped, organized, removed_zip booleans per module row, allowing interrupted workflows to resume cleanly.

You can inspect the database directly with any SQLite browser.

Directory Structure¶

Full Layout¶

Stage Descriptions¶

0_zips/¶

1_unzipped/¶

2_organized/¶

SQLite Database (referrer.db)¶

`0_zips/`¶

`1_unzipped/`¶

`2_organized/`¶

SQLite Database (`referrer.db`)¶