I'm trying to adapt the bitdistiller code for encoder-decoder models.
Are there any plans to add support for this? If not, could you provide some guidance on which parts need adaptation? A rough sketch of what I have in mind is below.
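To make the question concrete, this is roughly what I'm attempting for the encoder-decoder case. The checkpoints (`t5-small`/`t5-base`), tokenizer, and the plain KL objective are just placeholders of mine, not BitDistiller's actual implementation; the point is only that both the teacher and student forward passes now need encoder inputs plus decoder labels, and the KD loss is taken over the decoder logits.

```python
# Minimal sketch of a seq2seq KD step (placeholder models and loss, not BitDistiller's code).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
teacher = AutoModelForSeq2SeqLM.from_pretrained("t5-base").eval()

src = tokenizer("translate English to German: The house is small.",
                return_tensors="pt")
labels = tokenizer("Das Haus ist klein.", return_tensors="pt").input_ids

# Unlike the decoder-only path, both forward passes take encoder inputs *and*
# decoder labels; logits are produced per decoder position.
student_out = student(**src, labels=labels)
with torch.no_grad():
    teacher_out = teacher(**src, labels=labels)

# Plain token-level KL between teacher and student decoder distributions
# (placeholder for whatever divergence the released code uses).
kd_loss = F.kl_div(
    F.log_softmax(student_out.logits, dim=-1),
    F.softmax(teacher_out.logits, dim=-1),
    reduction="batchmean",
)
loss = student_out.loss + kd_loss
loss.backward()
```

My main uncertainty is which other parts of the training and quantization pipeline assume a decoder-only architecture.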
We're running a project to test the finding in Table 5, where Llama 7B performed better as a teacher than 13B. We're testing the hypothesis you put forward across OPT models and are now expanding the experiment to encoder-decoder models. We're also running an experiment that introduces progressively larger teachers, i.e., self-distillation followed by a bigger model serving as the teacher for the self-distilled student, roughly as sketched below.
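For clarity, the sequential-teacher schedule we mean is the following two-stage loop. The `distill` helper and the specific OPT checkpoints are hypothetical stand-ins for whatever distillation run we end up using, not part of your code.

```python
# Sketch of the sequential-teacher schedule (hypothetical helper and checkpoints).
from transformers import AutoModelForCausalLM

def distill(student, teacher):
    """Placeholder for one distillation run; the actual training loop is elided."""
    ...
    return student

student = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Stage 1: self-distillation, with a frozen copy of the student as teacher.
self_teacher = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").eval()
stage1 = distill(student, self_teacher)

# Stage 2: the self-distilled student continues training under a bigger teacher.
big_teacher = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b").eval()
stage2 = distill(stage1, big_teacher)
```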