A multi-modal transformer-based model for generative visual dialog system

Ghada ELSHAMY

ghada.magdy@cis.asu.edu.eg
Ain Shams University (Egypt)
https://orcid.org/0000-0002-1866-5321

Marco ALFONSE


Ain Shams University (Egypt)
https://orcid.org/0000-0003-0722-3218

Islam HEGAZY


Ain Shams University (Egypt)
https://orcid.org/0000-0002-1572-463X

Mostafa AREF


Ain Shams University (Egypt)
https://orcid.org/0000-0002-1278-0070

Abstract

Recent advancements in generative artificial intelligence have spurred significant interest in conversational agents. The visual dialog task, a synthesis of visual question answering and dialog systems, requires agents capable of both seeing and chatting in natural language: they must understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite progress, previous approaches have often required complex architectures and substantial resources. This paper introduces a generative dialog agent that addresses these challenges while maintaining a relatively simple architecture and modest dataset and resource requirements. The proposed model employs an encoder-decoder architecture, incorporating ViLBERT for cross-modal information grounding and GPT-2 for autoregressive answer generation; it is the first visual dialog agent to rely solely on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results: 64.05 on normalized discounted cumulative gain (NDCG), 62.67 on recall@5, 70.17 on recall@10, and a mean rank of 15.37. These outcomes underscore the effectiveness of the approach, particularly given its efficiency in terms of dataset size, architectural complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git, complete with usage instructions to facilitate replication of these experiments.
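
To make the described pipeline concrete, here is a minimal, hedged sketch of an encoder-decoder agent of this kind in PyTorch: a stand-in for the ViLBERT encoder (a simple linear projection of precomputed Faster R-CNN region features) conditions an off-the-shelf GPT-2 decoder from the HuggingFace transformers library via prefix embeddings. The class and parameter names (VisualDialogAgent, visual_proj, region_feat_dim) are illustrative assumptions, not taken from the paper's released code.

```python
# Illustrative sketch only: a frozen linear projection stands in for ViLBERT's
# grounded image representation; the paper's actual cross-modal fusion differs.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class VisualDialogAgent(nn.Module):
    def __init__(self, region_feat_dim=2048):
        super().__init__()
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        hidden = self.decoder.config.n_embd  # 768 for base GPT-2
        # Hypothetical visual encoder stand-in: map region features
        # into the decoder's embedding space.
        self.visual_proj = nn.Linear(region_feat_dim, hidden)

    def forward(self, region_feats, input_ids):
        # region_feats: (batch, num_regions, region_feat_dim)
        # input_ids:    (batch, seq_len) — dialog history + current question
        vis = self.visual_proj(region_feats)           # (B, R, H)
        txt = self.decoder.transformer.wte(input_ids)  # (B, T, H)
        inputs_embeds = torch.cat([vis, txt], dim=1)   # visual prefix conditioning
        return self.decoder(inputs_embeds=inputs_embeds)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = VisualDialogAgent()
feats = torch.randn(1, 36, 2048)  # e.g., 36 Faster R-CNN region features
ids = tokenizer("Q: what color is the cat? A:", return_tensors="pt").input_ids
logits = model(feats, ids).logits  # last position scores the next answer token
```

Prefix conditioning keeps the decoder purely autoregressive, which is the property the abstract emphasizes: the answer is generated token by token rather than ranked from a candidate pool.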


Keywords:

visual dialog, transformers, ViLBERT, GPT, answer generation

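The NDCG score quoted in the abstract follows the standard VisDial protocol: candidates are ranked by model score, gains from the dense relevance annotations are discounted logarithmically, and the sum is normalized by the ideal ranking over the top k positions, where k is the number of relevant candidates. A minimal sketch of that computation (variable names are illustrative):

```python
import math

def ndcg(scores, relevance):
    """NDCG over answer candidates: model scores vs. ground-truth relevance."""
    k = sum(r > 0 for r in relevance)  # VisDial: k = number of relevant candidates
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum(relevance[i] / math.log2(rank + 2)
              for rank, i in enumerate(order[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(r / math.log2(rank + 2) for rank, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg([0.9, 0.2, 0.7], [1.0, 0.0, 0.5]))  # 1.0: ranking matches the ideal order
```
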
References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. ArXiv, abs/2005.14165. https://doi.org/10.48550/ARXIV.2005.14165

Cadène, R., Dancette, C., Ben-younes, H., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases in visual question answering. ArXiv, abs/1906.10169. https://doi.org/10.48550/arXiv.1906.10169

Chen, C., Tan, Z., Cheng, Q., Jiang, X., Liu, Q., Zhu, Y., & Gu, X. (2022). UTC: A unified transformer with inter-task contrastive learning for visual dialog. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 18082–18091). IEEE. https://doi.org/10.1109/CVPR52688.2022.01757

Chen, F., Meng, F., Chen, X., Li, P., & Zhou, J. (2021). Multimodal incremental transformer with visual grounding for visual dialogue generation. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 436–446). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.38

Chen, F., Meng, F., Xu, J., Li, P., Xu, B., & Zhou, J. (2020). DMRM: A dual-channel multi-hop reasoning model for visual dialog. AAAI Conference on Artificial Intelligence (pp. 7504–7511). https://doi.org/10.1609/aaai.v34i05.6248

Chen, Z., Qiu, G., Li, P., Zhu, L., Yang, X., & Sheng, B. (2023). MNGNAS: Distilling adaptive combination of multiple searched networks for one-shot neural architecture search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 13489–13508. https://doi.org/10.1109/TPAMI.2023.3293885

Cui, X., Khan, D., He, Z., & Cheng, Z. (2023). Fusing surveillance videos and three‐dimensional scene: A mixed reality system. Computer Animation and Virtual Worlds, 34(1), e2129. https://doi.org/10.1002/cav.2129

Dai, J., & Zhang, X. (2022). Automatic image caption generation using deep learning and multimodal attention. Computer Animation and Virtual Worlds, 33(3–4), e2072. https://doi.org/10.1002/cav.2072

Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M. F., Parikh, D., & Batra, D. (2016). Visual dialog. ArXiv, abs/1611.08669. https://doi.org/10.48550/ARXIV.1611.08669

Das, A., Kottur, S., Moura, J. M. F., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2970–2979). IEEE. https://doi.org/10.1109/ICCV.2017.321

de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. (2016). GuessWhat?! Visual object discovery through multi-modal dialogue. ArXiv, abs/1611.08481. https://doi.org/10.48550/ARXIV.1611.08481

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Fan, H., Zhu, L., Yang, Y., & Wu, F. (2020). Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3), 78. https://doi.org/10.1145/3390891

Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., & Gao, J. (2019). Multi-step reasoning via recurrent dual attention for visual dialog. 57th Annual Meeting of the Association for Computational Linguistics (pp. 6463–6474). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1648

Guo, D., Xu, C., & Tao, D. (2019). Image-question-answer synergistic network for visual dialog. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10426–10435). IEEE. https://doi.org/10.1109/CVPR.2019.01068

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE. https://doi.org/10.1109/CVPR.2016.90

Jiang, X., Du, S., Qin, Z., Sun, Y., & Yu, J. (2020a). KBGN: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue. 28th ACM International Conference on Multimedia (pp. 1265–1273). Association for Computing Machinery. https://doi.org/10.1145/3394171.3413826

Jiang, X., Yu, J., Qin, Z., Zhuang, Y., Zhang, X., Hu, Y., & Wu, Q. (2020b). DualVD: An adaptive dual encoding model for deep visual understanding in visual dialogue. AAAI Conference on Artificial Intelligence (pp. 11125–11132). https://doi.org/10.1609/aaai.v34i07.6769

Jiang, X., Yu, J., Sun, Y., Qin, Z., Zhu, Z., Hu, Y., & Wu, Q. (2020c). DAM: Deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue. Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 687–693). https://doi.org/10.24963/ijcai.2020/96

Kang, G.-C., Kim, S., Kim, J.-H., Kwak, D., & Zhang, B.-T. (2023). The dialog must go on: Improving visual dialog via generative self-training. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6746–6756). IEEE. https://doi.org/10.1109/CVPR52729.2023.00652

Kottur, S., Moura, J. M. F., Parikh, D., Batra, D., & Rohrbach, M. (2018). Visual coreference resolution in visual dialog using neural module networks. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (Vol. 11219, pp. 160–178). Springer International Publishing. https://doi.org/10.1007/978-3-030-01267-0_10

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M. S., & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7

Li, L., Huang, T., Li, Y., & Li, P. (2023). Trajectory‐BERT: Pre‐training and fine‐tuning bidirectional transformers for crowd trajectory enhancement. Computer Animation and Virtual Worlds, 34(3–4), e2190. https://doi.org/10.1002/cav.2190

Lin, T.-Y., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision – ECCV 2014 (pp. 740–755). Springer. https://doi.org/10.1007/978-3-319-10602-1_48

Lin, X., Sun, S., Huang, W., Sheng, B., Li, P., & Feng, D. D. (2023). EAPT: Efficient attention pyramid transformer for image processing. IEEE Transactions on Multimedia, 25, 50–61. https://doi.org/10.1109/TMM.2021.3120873

Liu, A.-A., Zhang, G., Xu, N., Guo, J., Jin, G., & Li, X. (2022). Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems, 28, 1823–1832. https://doi.org/10.1007/s00530-022-00947-1

Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.

Lu, J., Kannan, A., Yang, J., Parikh, D., & Batra, D. (2017). Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30). Curran Associates, Inc.

Murahari, V., Batra, D., Parikh, D., & Das, A. (2020). Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12363, pp. 336–352). Springer International Publishing. https://doi.org/10.1007/978-3-030-58523-5_20

Nguyen, V.-Q., Suganuma, M., & Okatani, T. (2020). Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12369, pp. 223–240). Springer International Publishing. https://doi.org/10.1007/978-3-030-58586-0_14

OpenAI. (2023). GPT-4 technical report. ArXiv, abs/2303.08774. https://doi.org/10.48550/arXiv.2303.08774

Radford, A., & Narasimhan, K. (2018). Improving language understanding by generative pre-training. OpenAI.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108. https://doi.org/10.48550/ARXIV.1910.01108

Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214

Seo, P. H., Lehrmann, A. M., Han, B., & Sigal, L. (2017). Visual reference resolution using attention memory for visual dialog. Advances in Neural Information Processing Systems, 30, 3719–3729.

Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. 56th Annual Meeting of the Association for Computational Linguistics (pp. 2556–2565). Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1238

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Wang, A., & Cho, K. (2019). BERT has a mouth, and it must speak: BERT as a Markov random field language model. Workshop on Methods for Optimizing and Evaluating Neural Language Generation (pp. 30–36). https://doi.org/10.18653/v1/W19-2304

Wang, Y., Joty, S. R., Lyu, M. R., King, I., Xiong, C., & Hoi, S. C. H. (2020). VD-BERT: A unified vision and dialog transformer with BERT. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3325–3338). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.269

Wu, Q., Wang, P., Shen, C., Reid, I., & Hengel, A. V. D. (2018). Are you talking to me? Reasoned visual dialog generation through adversarial learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6106–6115). IEEE. https://doi.org/10.1109/CVPR.2018.00639

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144. https://doi.org/10.48550/ARXIV.1609.08144

Xin, B., Xu, N., Zhai, Y., Zhang, T., Lu, Z., Liu, J., Nie, W., Li, X., & Liu, A.-A. (2023). A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29, 3781–3804. https://doi.org/10.1007/s00530-023-01175-x

Yang, T., Zha, Z.-J., & Zhang, H. (2019). Making history matter: History-advantage sequence training for visual dialog. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 2561–2569). IEEE. https://doi.org/10.1109/ICCV.2019.00265

Yu, Y., Yang, Y., & Xing, J. (2024). PMGAN: Pretrained model-based generative adversarial network for text-to-image generation. The Visual Computer, 44, 303–314. https://doi.org/10.1007/s00371-024-03326-1

Yu, Z., Yu, J., Cui, Y., Tao, D., & Tian, Q. (2019). Deep modular co-attention networks for visual question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6274–6283). IEEE. https://doi.org/10.1109/CVPR.2019.00644

Zhang, B., Ma, R., Cao, Y., & An, P. (2024). Swin-VEC: Video swin transformer-based GAN for video error concealment of VVC. The Visual Computer, 40, 7335–7347. https://doi.org/10.1007/s00371-024-03518-9

Zhang, J., Zhao, T., & Yu, Z. (2018). Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. 19th Annual SIGdial Meeting on Discourse and Dialogue (pp. 140–150). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5015

Zhao, L., Lyu, X., Song, J., & Gao, L. (2021). GuessWhich? Visual dialog with attentive memory network. Pattern Recognition, 114, 107823. https://doi.org/10.1016/j.patcog.2021.107823

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 19–27). IEEE. https://doi.org/10.1109/ICCV.2015.11

Published
2025-03-31

How to cite

ELSHAMY, G., ALFONSE, M., HEGAZY, I., & AREF, M. (2025). A multi-modal transformer-based model for generative visual dialog system. Applied Computer Science, 21(1), 1–17. https://doi.org/10.35784/acs_6856

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

All articles published in Applied Computer Science are open access and distributed under the terms of the Creative Commons Attribution 4.0 International License.

