A multi-modal transformer-based model for generative visual dialog system
Ghada ELSHAMY
ghada.magdy@cis.asu.edu.eg
Ain Shams University (Egypt)
https://orcid.org/0000-0002-1866-5321
Marco ALFONSE
Ain Shams University (Egypt)
https://orcid.org/0000-0003-0722-3218
Islam HEGAZY
Ain Shams University (Egypt)
https://orcid.org/0000-0002-1572-463X
Mostafa AREF
Ain Shams University (Egypt)
https://orcid.org/0000-0002-1278-0070
Abstract
Recent advances in generative artificial intelligence have sparked significant interest in conversational agents. The visual dialog task, a synthesis of visual question answering and dialog systems, requires agents capable of both seeing and conversing in natural language. Such agents must understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite recent progress, previous approaches have often required complex architectures and substantial resources. This paper introduces a generative dialog agent that addresses these challenges while keeping the architecture, training data, and resource requirements comparatively modest. The proposed model follows an encoder-decoder design, using ViLBERT to ground cross-modal information and GPT-2 to generate answers autoregressively; it is the first visual dialog agent to rely solely on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results: 64.05 on normalized discounted cumulative gain (NDCG), 62.67 on recall@5, 70.17 on recall@10, and 15.37 on mean rank. These outcomes underscore the effectiveness of the approach, particularly given its efficiency in terms of dataset size, architectural complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git, together with usage instructions to facilitate replication of the experiments.
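To make the encoder-decoder coupling concrete, the sketch below shows one plausible way to feed a cross-modal encoder's output into a GPT-2 decoder through cross-attention using the Hugging Face transformers library. It is a minimal illustration under stated assumptions (a random tensor stands in for the ViLBERT features, the cross-attention weights are untrained, and the prompt format is invented), not the authors' released implementation; the repository linked above contains the actual code.

```python
# Minimal sketch (not the authors' implementation): couple a cross-modal encoder's
# output with a GPT-2 decoder that attends to it via cross-attention, mirroring the
# ViLBERT -> GPT-2 design described in the abstract.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Enable cross-attention so the decoder can attend to encoder hidden states.
# The newly added cross-attention weights are randomly initialized until fine-tuned.
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
decoder.eval()

# Stand-in for the ViLBERT encoder output: one 768-d embedding per image region /
# dialog-history token. In the real model these would come from the pretrained
# cross-modal encoder after grounding the question in the image, caption, and history.
encoder_states = torch.randn(1, 36, 768)

prompt = "Question: what color is the dog? Answer:"
generated = tokenizer.encode(prompt, return_tensors="pt")

# Plain greedy decoding conditioned on the encoder states via cross-attention
# (no KV cache, kept simple for readability).
with torch.no_grad():
    for _ in range(20):
        logits = decoder(input_ids=generated,
                         encoder_hidden_states=encoder_states).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

With an autoregressive decoder of this kind, the retrieval metrics reported above (NDCG, recall@k, mean rank) would typically be computed by scoring the log-likelihood of each of the 100 VisDial candidate answers under the decoder rather than by free-running generation.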
Keywords:
visual dialog, transformers, ViLBERT, GPT, answer generation
References
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. ArXiv, abs/2005.14165. https://doi.org/10.48550/ARXIV.2005.14165
Cadène, R., Dancette, C., Ben-younes, H., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases in visual question answering. ArXiv, abs/1906.10169. https://doi.org/10.48550/arXiv.1906.10169
Chen, C., Tan, Z., Cheng, Q., Jiang, X., Liu, Q., Zhu, Y., & Gu, X. (2022). UTC: A unified transformer with inter-task contrastive learning for visual dialog. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 18082–18091). IEEE. https://doi.org/10.1109/CVPR52688.2022.01757
Chen, F., Meng, F., Chen, X., Li, P., & Zhou, J. (2021). Multimodal incremental transformer with visual grounding for visual dialogue generation. Findings of the Association for Computational Linguistics, 436–446. https://doi.org/10.18653/v1/2021.findings-acl.38
Chen, F., Meng, F., Xu, J., Li, P., Xu, B., & Zhou, J. (2020). DMRM: A dual-channel multi-hop reasoning model for visual dialog. AAAI Conference on Artificial Intelligence (pp. 7504–7511). https://doi.org/10.1609/aaai.v34i05.6248
Chen, Z., Qiu, G., Li, P., Zhu, L., Yang, X., & Sheng, B. (2023). MNGNAS: Distilling adaptive combination of multiple searched networks for one-shot neural architecture search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 13489–13508. https://doi.org/10.1109/TPAMI.2023.3293885
Cui, X., Khan, D., He, Z., & Cheng, Z. (2023). Fusing surveillance videos and three‐dimensional scene: A mixed reality system. Computer Animation and Virtual Worlds, 34(1), e2129. https://doi.org/10.1002/cav.2129
Dai, J., & Zhang, X. (2022). Automatic image caption generation using deep learning and multimodal attention. Computer Animation and Virtual Worlds, 33(3–4), e2072. https://doi.org/10.1002/cav.2072
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M. F., Parikh, D., & Batra, D. (2016). Visual dialog. ArXiv, abs/1611.08669. https://doi.org/10.48550/ARXIV.1611.08669
Das, A., Kottur, S., Moura, J. M. F., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2970–2979). IEEE. https://doi.org/10.1109/ICCV.2017.321
de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. (2016). GuessWhat?! Visual object discovery through multi-modal dialogue. ArXiv, abs/1611.08481. https://doi.org/10.48550/ARXIV.1611.08481
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Fan, H., Zhu, L., Yang, Y., & Wu, F. (2020). Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3), 78. https://doi.org/10.1145/3390891
Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., & Gao, J. (2019). Multi-step reasoning via recurrent dual attention for visual dialog. 57th Annual Meeting of the Association for Computational Linguistics (pp. 6463–6474). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1648
Guo, D., Xu, C., & Tao, D. (2019). Image-question-answer synergistic network for visual dialog. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10426–10435). IEEE. https://doi.org/10.1109/CVPR.2019.01068
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE. https://doi.org/10.1109/CVPR.2016.90
Jiang, X., Du, S., Qin, Z., Sun, Y., & Yu, J. (2020a). KBGN: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue. 28th ACM International Conference on Multimedia (pp. 1265–1273). Association for Computing Machinery. https://doi.org/10.1145/3394171.3413826
Jiang, X., Yu, J., Qin, Z., Zhuang, Y., Zhang, X., Hu, Y., & Wu, Q. (2020b). DualVD: An adaptive dual encoding model for deep visual understanding in visual dialogue. AAAI Conference on Artificial Intelligence (pp. 11125–11132). AAAI Technical Track: Vision. https://doi.org/10.1609/aaai.v34i07.6769
Jiang, X., Yu, J., Sun, Y., Qin, Z., Zhu, Z., Hu, Y., & Wu, Q. (2020c). DAM: Deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue. Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 687–693). https://doi.org/10.24963/ijcai.2020/96
Kang, G.-C., Kim, S., Kim, J.-H., Kwak, D., & Zhang, B.-T. (2023). The dialog must go on: Improving visual dialog via generative self-training. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6746–6756). IEEE. https://doi.org/10.1109/CVPR52729.2023.00652
Kottur, S., Moura, J. M. F., Parikh, D., Batra, D., & Rohrbach, M. (2018). Visual coreference resolution in visual dialog using neural module networks. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (Vol. 11219, pp. 160–178). Springer International Publishing. https://doi.org/10.1007/978-3-030-01267-0_10
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M. S., & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7
Li, L., Huang, T., Li, Y., & Li, P. (2023). Trajectory‐BERT: Pre‐training and fine‐tuning bidirectional transformers for crowd trajectory enhancement. Computer Animation and Virtual Worlds, 34(3–4), e2190. https://doi.org/10.1002/cav.2190
Lin, T.-Y., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision - ECCV 2014 (pp. 740–755). Springer. https://doi.org/10.1007/978-3-319-10602-1_48
Lin, X., Sun, S., Huang, W., Sheng, B., Li, P., & Feng, D. D. (2023). EAPT: Efficient attention pyramid transformer for image processing. IEEE Transactions on Multimedia, 25, 50–61. https://doi.org/10.1109/TMM.2021.3120873
Liu, A.-A., Zhang, G., Xu, N., Guo, J., Jin, G., & Li, X. (2022). Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems, 28, 1823–1832. https://doi.org/10.1007/s00530-022-00947-1
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.
Lu, J., Kannan, A., Yang, J., Parikh, D., & Batra, D. (2017). Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30). Curran Associates, Inc.
Murahari, V., Batra, D., Parikh, D., & Das, A. (2020). Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12363, pp. 336–352). Springer International Publishing. https://doi.org/10.1007/978-3-030-58523-5_20
Nguyen, V.-Q., Suganuma, M., & Okatani, T. (2020). Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12369, pp. 223–240). Springer International Publishing. https://doi.org/10.1007/978-3-030-58586-0_14
OpenAI. (2023). GPT-4 technical report. ArXiv, abs/2303.08774. https://doi.org/10.48550/arXiv.2303.08774
Radford, A., & Narasimhan, K. (2018). Improving language understanding by generative pre-training. OpenAI.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108. https://doi.org/10.48550/ARXIV.1910.01108
Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214
Seo, P. H., Lehrmann, A. M., Han, B., & Sigal, L. (2017). Visual reference resolution using attention memory for visual dialog. Advances in Neural Information Processing Systems, 30, 3719–3729.
Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. 56th Annual Meeting of the Association for Computational Linguistics (pp. 2556–2565). Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1238
Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, A., & Cho, K. (2019). BERT has a mouth, and it must speak: BERT as a Markov random field language model. Workshop on Methods for Optimizing and Evaluating Neural Language Generation, 30–36. https://doi.org/10.18653/v1/W19-2304
Wang, Y., Joty, S. R., Lyu, M. R., King, I., Xiong, C., & Hoi, S. C. H. (2020). VD-BERT: A unified vision and dialog transformer with BERT. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3325–3338). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.269
Wu, Q., Wang, P., Shen, C., Reid, I., & Hengel, A. V. D. (2018). Are you talking to me? Reasoned visual dialog generation through adversarial learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6106–6115). IEEE. https://doi.org/10.1109/CVPR.2018.00639
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144. https://doi.org/10.48550/ARXIV.1609.08144
Xin, B., Xu, N., Zhai, Y., Zhang, T., Lu, Z., Liu, J., Nie, W., Li, X., & Liu, A.-A. (2023). A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29, 3781–3804. https://doi.org/10.1007/s00530-023-01175-x
Yang, T., Zha, Z.-J., & Zhang, H. (2019). Making history matter: History-advantage sequence training for visual dialog. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 2561–2569). IEEE. https://doi.org/10.1109/ICCV.2019.00265
Yu, Y., Yang, Y., & Xing, J. (2024). PMGAN: Pretrained model-based generative adversarial network for text-to-image generation. The Visual Computer, 44, 303–314. https://doi.org/10.1007/s00371-024-03326-1
Yu, Z., Yu, J., Cui, Y., Tao, D., & Tian, Q. (2019). Deep modular co-attention networks for visual question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6274–6283). IEEE. https://doi.org/10.1109/CVPR.2019.00644
Zhang, B., Ma, R., Cao, Y., & An, P. (2024). Swin-VEC: Video swin transformer-based GAN for video error concealment of VVC. The Visual Computer, 40, 7335–7347. https://doi.org/10.1007/s00371-024-03518-9
Zhang, J., Zhao, T., & Yu, Z. (2018). Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. 19th Annual SIGdial Meeting on Discourse and Dialogue (pp. 140–150). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5015
Zhao, L., Lyu, X., Song, J., & Gao, L. (2021). GuessWhich? Visual dialog with attentive memory network. Pattern Recognition, 114, 107823. https://doi.org/10.1016/j.patcog.2021.107823
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 19–27). IEEE. https://doi.org/10.1109/ICCV.2015.11
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
Similar Articles
- Pascal Krutz, Matthias Rehm, Holger Schlegel, Martin Dix, RECOGNITION OF SPORTS EXERCISES USING INERTIAL SENSOR TECHNOLOGY , Applied Computer Science: Vol. 19 No. 1 (2023)
- Kamil ŻYŁA, SIMPLIFIED GRAPHICAL DOMAIN-SPECIFIC LANGUAGES FOR THE MOBILE DOMAIN – PERSPECTIVES OF LEARNABILITY BY NONTECHNICAL USERS , Applied Computer Science: Vol. 13 No. 3 (2017)
- Baldemar ZURITA, Luís LUNA, José HERNÁNDEZ, Federico RAMÍREZ, BOVW FOR CLASSIFICATION IN GEOMETRICS SHAPES , Applied Computer Science: Vol. 14 No. 4 (2018)
- Rafał KWOKA, Janusz KOZAK, Michał MAJKA, TESTS OF HTS 2G SUPERCONDUCTING TAPES USING THE LABVIEW ENVIRONMENT , Applied Computer Science: Vol. 14 No. 1 (2018)
- Raphael Olufemi AKINYEDE, Sulaiman Omolade ADEGBENRO, Babatola Moses OMILODI, A SECURITY MODEL FOR PREVENTING E-COMMERCE RELATED CRIMES , Applied Computer Science: Vol. 16 No. 3 (2020)
- Tomasz NOWICKI, Adam GREGOSIEWICZ, Zbigniew ŁAGODOWSKI, PRODUCTIVITY OF A LOW-BUDGET COMPUTER CLUSTER APPLIED TO OVERCOME THE N-BODY PROBLEM , Applied Computer Science: Vol. 17 No. 4 (2021)
- Lukas BAUER, Leon STÜTZ, Markus KLEY, BLACK BOX EFFICIENCY MODELLING OF AN ELECTRIC DRIVE UNIT UTILIZING METHODS OF MACHINE LEARNING , Applied Computer Science: Vol. 17 No. 4 (2021)
- Marcin BADUROWICZ, Stanisław SKULIMOWSKI, Maciej LASKOWSKI, FEASIBILITY OF USING LOW-PARAMETER LOCAL LLMS IN ANSWERING QUESTIONS FROM ENTERPRISE KNOWLEDGE BASE , Applied Computer Science: Vol. 20 No. 4 (2024)