Open
Description
Hi,
We found that the video-text joint loss in pretraining is calculated from the masked video and masked text. Why not use the original (unmasked) video and text, as in retrieval fine-tuning?
Line 258 in 0a7c07f
sim_matrix_text_visual = self.get_similarity_logits(sequence_output_alm, visual_output_alm,
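To make the question concrete, here is a minimal sketch of the two options being compared. It is not the repository's code: it assumes the joint loss is a symmetric contrastive (cross-entropy on the diagonal) loss over a text-video similarity matrix, and every name and toy embedding below is hypothetical. The "masked" embeddings are just stand-ins for encoder outputs on masked inputs.

```python
import math

def log_softmax_rows(matrix):
    """Row-wise log-softmax of a list-of-lists matrix (numerically stable)."""
    out = []
    for row in matrix:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        out.append([x - lse for x in row])
    return out

def symmetric_contrastive_loss(sim):
    """Text-to-video plus video-to-text cross-entropy, with matching pairs on the diagonal."""
    n = len(sim)
    t2v = log_softmax_rows(sim)                        # rows index text
    v2t = log_softmax_rows([list(c) for c in zip(*sim)])  # rows index video
    loss_t2v = -sum(t2v[i][i] for i in range(n)) / n
    loss_v2t = -sum(v2t[i][i] for i in range(n)) / n
    return loss_t2v + loss_v2t

def dot_similarity(text_emb, video_emb):
    """Similarity matrix: entry [i][j] is the dot product of text i and video j."""
    return [[sum(a * b for a, b in zip(t, v)) for v in video_emb] for t in text_emb]

# Hypothetical toy embeddings (assumption, for illustration only).
text_orig = [[1.0, 0.0], [0.0, 1.0]]
video_orig = [[1.0, 0.0], [0.0, 1.0]]
# Stand-ins for embeddings computed from masked inputs: less separable.
text_masked = [[0.6, 0.4], [0.4, 0.6]]
video_masked = [[0.7, 0.3], [0.3, 0.7]]

# Option A: joint loss from masked inputs (what the pretraining code appears to do).
loss_masked = symmetric_contrastive_loss(dot_similarity(text_masked, video_masked))
# Option B: joint loss from the original inputs (as in retrieval fine-tuning).
loss_orig = symmetric_contrastive_loss(dot_similarity(text_orig, video_orig))

print(loss_masked, loss_orig)
```

In this toy setup the two options give different loss values simply because the similarity matrices differ; whether one is preferable for pretraining is exactly what we would like to understand.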