Minimal guide to preprocess metadata and run k-fold experiments combining images with one-hot-encoded or sentence-embedded metadata.
- Python 3.10+
- Install dependencies (PyTorch per your CUDA/CPU setup):
```bash
# Initialize the raug submodule
git submodule update --init
python -m pip install -r requirements.txt
```

Update `config.py` paths to match your environment:
- `PAD_20_PATH` points to the PAD-UFES-20 root (expects `images/` and `metadata.csv`).
- `PAD_20_IMAGES_FOLDER` points to the images folder under that root.
- `DATA_PATH` is where generated CSVs will be written (default: `data/pad-ufes-20`).
Example expected layout:
```
/datasets/PAD-UFES-20/
  images/
  metadata.csv
```
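For that layout, `config.py` might look like the sketch below (variable names are taken from this guide; adjust the paths to your environment):

```python
# config.py (sketch) -- paths matching the example layout above
PAD_20_PATH = "/datasets/PAD-UFES-20"                       # dataset root
PAD_20_IMAGES_FOLDER = "/datasets/PAD-UFES-20/images"       # images folder under the root
PAD_20_RAW_METADATA = "/datasets/PAD-UFES-20/metadata.csv"  # raw metadata read by the preprocessors
DATA_PATH = "data/pad-ufes-20"                              # where generated CSVs are written
```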
Run both preprocessors to generate one-hot-encoded and sentence-based CSVs with 5-fold patient-grouped splits. These read `config.PAD_20_RAW_METADATA` and write CSVs into `config.DATA_PATH`.
- One-hot-encoded CSVs:

```bash
python -m benchmarks.pad20.preprocess.onehot
```

- Sentence CSVs (anamnese strings for sentence-transformers):

```bash
python -m benchmarks.pad20.preprocess.sentence
```

Outputs (filenames):

- `pad-ufes-20-one-hot-missing-{0|5|10|20|30|50|70}.csv`
- `pad-ufes-20-sentence-missing-{0|5|10|20|30|50|70}.csv`
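"Patient-grouped" means all rows from one patient land in the same fold. A minimal sketch of how such splits can be produced, assuming `metadata.csv` has a `patient_id` column (the actual preprocessors may assign folds differently):

```python
# Sketch: assign a patient-grouped fold (1..5) to each row of metadata.csv.
# Assumes a `patient_id` column; the real preprocessing scripts may differ.
import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_csv("metadata.csv")
meta["folder"] = 0
gkf = GroupKFold(n_splits=5)
for fold, (_, test_idx) in enumerate(gkf.split(meta, groups=meta["patient_id"]), start=1):
    meta.loc[meta.index[test_idx], "folder"] = fold  # every patient appears in exactly one fold
```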
Launch the experiment driver. It iterates over the models, missingness levels, and folds configured inside `benchmarks/kfoldexperiment.py` and logs runs with Sacred into `benchmarks/pad20/results/...`.

```bash
python -m benchmarks.kfoldexperiment
```

Notes:
- In `benchmarks/kfoldexperiment.py`, adjust `feature_fusion_methods` (`metablock` for one-hot, `metablock-se` for sentence embeddings, `None` for image-only) plus `models` and `missing_percentages` as desired (a sketch of these knobs follows this list).
- The `folder_experiment` automatically picks the correct metadata file based on `_preprocessing` and encodes sentences with `_llm_type` (default: `sentence-transformers/paraphrase-albert-small-v2`).
- Ensure the dataset paths in `config.py` are valid before running.
- If you use GPUs, install a matching PyTorch build from pytorch.org.
- Results, configs, and metrics are stored per-fold in `benchmarks/pad20/results/...` via Sacred observers.
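A sketch of those knobs, using the names from this guide (the exact structure inside `benchmarks/kfoldexperiment.py` may differ, and the model names below are placeholders):

```python
# benchmarks/kfoldexperiment.py (sketch) -- values iterated by the driver
feature_fusion_methods = ["metablock", "metablock-se", None]    # None = image-only baseline
models = ["resnet-50", "efficientnet-b4"]                       # placeholder model names
missing_percentages = [0, 5, 10, 20, 30, 50, 70]                # % of metadata masked out
_llm_type = "sentence-transformers/paraphrase-albert-small-v2"  # default sentence encoder
```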
After the k-fold runs finish, aggregate per-fold metrics into a single CSV and then plot performance vs. missing metadata rate.
Under `benchmarks/pad20/results/opt_<optimizer>_early_stop_<metric>/<TIMESTAMP>/` the structure is:

```
benchmarks/pad20/results/
  opt_adam_early_stop_loss/
    <TIMESTAMP>/                     # e.g., 17551255817303946
      no_metadata/
        <model>/
          missing_0/
            folder_1/predictions_best_test.csv
            folder_2/predictions_best_test.csv
            ...
            folder_5/predictions_best_test.csv
      metablock/
        <model>/
          missing_{0|5|10|20|30|50|70}/
            folder_{1..5}/predictions_best_test.csv
      metablock-se/
        <model>/
          missing_{0|5|10|20|30|50|70}/
            folder_{1..5}/predictions_best_test.csv
```
Tip: ensure the image-only baseline (`no_metadata/missing_0`) exists for each model; the plotting script uses it as a reference band.
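Before aggregating, a quick sanity check (sketch) can confirm that every fold produced a predictions file; `<TIMESTAMP>` is a placeholder for your run:

```python
# Sketch: count predictions_best_test.csv files under one timestamped run.
from pathlib import Path

root = Path("benchmarks/pad20/results/opt_adam_early_stop_loss/<TIMESTAMP>")
files = sorted(root.rglob("predictions_best_test.csv"))
print(f"found {len(files)} prediction files")  # expect 5 per (fusion, model, missing) combination
```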
This scans the timestamped results folder, collects `predictions_best_test.csv` across folds, and writes `agg_metrics.csv` at the timestamp root.

```bash
python -m utils.aggpredictions --timestamp <TIMESTAMP>
```

Output: `benchmarks/pad20/results/opt_adam_early_stop_loss/<TIMESTAMP>/agg_metrics.csv` with a multi-index:
- index: fusion in {`no_metadata`, `metablock`, `metablock-se`}, model, missing in {0, 5, 10, 20, 30, 50, 70}, metric in {`balanced_accuracy`, `auc`, `f1_score`}
- columns: `AVG`, `STD`, `FOLDER-1`..`FOLDER-5`
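Because the CSV carries that four-level index, it can be loaded and sliced with pandas, for example (a sketch; the exact header names may differ):

```python
# Sketch: load agg_metrics.csv and pull the fold-averaged balanced accuracy.
import pandas as pd

agg = pd.read_csv(
    "benchmarks/pad20/results/opt_adam_early_stop_loss/<TIMESTAMP>/agg_metrics.csv",
    index_col=[0, 1, 2, 3],  # fusion, model, missing, metric
)
print(agg.xs("balanced_accuracy", level=3)["AVG"])  # AVG per fusion/model/missing combination
```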
This reads `agg_metrics.csv` and produces a figure at the repo root. It expects all models and missing levels to be present in the CSV.

```bash
python -m utils.plotmissing --timestamp <TIMESTAMP> --metric balanced_accuracy
```

`--metric` is one of `balanced_accuracy`, `auc`, or `f1_score`.

Output: `./<metric>_vs_missing_data.png` (e.g., `balanced_accuracy_vs_missing_data.png`).