A multi-modal transformer-based model for generative visual dialog system
Abstract
Recent advances in generative artificial intelligence have spurred significant interest in conversational agents. The visual dialog task, a synthesis of visual question answering and dialog systems, requires agents that can both see and converse in natural language. Such agents must understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite steady progress, previous approaches have often relied on complex architectures and substantial training resources. This paper introduces a generative dialog agent that addresses these challenges while keeping the architecture, dataset, and resource requirements relatively modest. The proposed model follows an encoder-decoder design, using ViLBERT to ground cross-modal information and GPT-2 to generate answers autoregressively. It is the first visual dialog agent to rely solely on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results: 64.05 normalized discounted cumulative gain (NDCG), 62.67 rank@5, 70.17 rank@10, and 15.37 mean rank. These results underscore the effectiveness of the approach, particularly given its efficiency in dataset size, architectural complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git, with usage instructions to facilitate replication of the experiments.
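The abstract describes the architecture only at a high level: a ViLBERT encoder grounds the image and dialog history, and a GPT-2 decoder generates the answer autoregressively. The snippet below is a minimal PyTorch sketch of that encoder-decoder idea, not the authors' implementation; the ViLBERT encoder is stubbed out with random features, and the prefix-projection fusion (projecting encoder states into GPT-2's embedding space and prepending them to the token embeddings) is an assumption made purely for illustration.

```python
# Minimal sketch of a ViLBERT-style encoder feeding a GPT-2 decoder.
# Assumptions: HuggingFace GPT-2, 1024-d fused encoder features, prefix fusion.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class VisualDialogGenerator(nn.Module):
    def __init__(self, encoder_dim: int = 1024):
        super().__init__()
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        # Project cross-modal features into GPT-2's embedding space.
        self.project = nn.Linear(encoder_dim, self.decoder.config.n_embd)

    def forward(self, fused_features, input_ids):
        # fused_features: (batch, regions+tokens, encoder_dim) from a
        # ViLBERT-style cross-modal encoder (stubbed out in this sketch).
        prefix = self.project(fused_features)
        token_embeds = self.decoder.transformer.wte(input_ids)
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds)


if __name__ == "__main__":
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = VisualDialogGenerator()
    # Stand-in for ViLBERT output: 36 region/token features of width 1024.
    fused = torch.randn(1, 36, 1024)
    question = tokenizer("Q: what color is the dog? A:", return_tensors="pt")
    out = model(fused, question["input_ids"])
    print(out.logits.shape)  # (1, prefix_len + question_len, vocab_size)
```

In the actual system, the fused features would presumably come from ViLBERT applied to detected image regions and the tokenized dialog history, and answers would be produced with GPT-2's autoregressive decoding rather than the single forward pass shown here.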
References
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. ArXiv, abs/2005.14165. https://doi.org/10.48550/ARXIV.2005.14165
Cadène, R., Dancette, C., Ben-younes, H., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases in visual question answering. ArXiv, abs/1906.10169. https://doi.org/10.48550/arXiv.1906.10169
Chen, C., Tan, Z., Cheng, Q., Jiang, X., Liu, Q., Zhu, Y., & Gu, X. (2022). UTC: A unified transformer with inter-task contrastive learning for visual dialog. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 18082–18091). IEEE. https://doi.org/10.1109/CVPR52688.2022.01757
Chen, F., Meng, F., Chen, X., Li, P., & Zhou, J. (2021). Multimodal incremental transformer with visual grounding for visual dialogue generation. Findings of the Association for Computational Linguistics, 436–446. https://doi.org/10.18653/v1/2021.findings-acl.38
Chen, F., Meng, F., Xu, J., Li, P., Xu, B., & Zhou, J. (2020). DMRM: A dual-channel multi-hop reasoning model for visual dialog. AAAI Conference on Artificial Intelligence (pp. 7504–7511). https://doi.org/10.1609/aaai.v34i05.6248
Chen, Z., Qiu, G., Li, P., Zhu, L., Yang, X., & Sheng, B. (2023). MNGNAS: Distilling adaptive combination of multiple searched networks for one-shot neural architecture search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 13489–13508. https://doi.org/10.1109/TPAMI.2023.3293885
Cui, X., Khan, D., He, Z., & Cheng, Z. (2023). Fusing surveillance videos and three‐dimensional scene: A mixed reality system. Computer Animation and Virtual Worlds, 34(1), e2129. https://doi.org/10.1002/cav.2129
Dai, J., & Zhang, X. (2022). Automatic image caption generation using deep learning and multimodal attention. Computer Animation and Virtual Worlds, 33(3–4), e2072. https://doi.org/10.1002/cav.2072
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M. F., Parikh, D., & Batra, D. (2016). Visual dialog. ArXiv, abs/1611.08669. https://doi.org/10.48550/ARXIV.1611.08669
Das, A., Kottur, S., Moura, J. M. F., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2970–2979). IEEE. https://doi.org/10.1109/ICCV.2017.321
de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. (2016). GuessWhat?! Visual object discovery through multi-modal dialogue. ArXiv, abs/1611.08481. https://doi.org/10.48550/ARXIV.1611.08481
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Fan, H., Zhu, L., Yang, Y., & Wu, F. (2020). Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3), 78. https://doi.org/10.1145/3390891
Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., & Gao, J. (2019). Multi-step reasoning via recurrent dual attention for visual dialog. 57th Annual Meeting of the Association for Computational Linguistics (pp. 6463–6474). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1648
Guo, D., Xu, C., & Tao, D. (2019). Image-question-answer synergistic network for visual dialog. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10426–10435). IEEE. https://doi.org/10.1109/CVPR.2019.01068
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE. https://doi.org/10.1109/CVPR.2016.90
Jiang, X., Du, S., Qin, Z., Sun, Y., & Yu, J. (2020a). KBGN: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue. 28th ACM International Conference on Multimedia (pp. 1265–1273). Association for Computing Machinery. https://doi.org/10.1145/3394171.3413826
Jiang, X., Yu, J., Qin, Z., Zhuang, Y., Zhang, X., Hu, Y., & Wu, Q. (2020b). DualVD: An adaptive dual encoding model for deep visual understanding in visual dialogue. AAAI Conference on Artificial Intelligence (pp. 11125–11132). AAAI Technical Track: Vision. https://doi.org/10.1609/aaai.v34i07.6769
Jiang, X., Yu, J., Sun, Y., Qin, Z., Zhu, Z., Hu, Y., & Wu, Q. (2020c). DAM: Deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue. Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 687–693). https://doi.org/10.24963/ijcai.2020/96
Kang, G.-C., Kim, S., Kim, J.-H., Kwak, D., & Zhang, B.-T. (2023). The dialog must go on: Improving visual dialog via generative self-training. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6746–6756). IEEE. https://doi.org/10.1109/CVPR52729.2023.00652
Kottur, S., Moura, J. M. F., Parikh, D., Batra, D., & Rohrbach, M. (2018). Visual coreference resolution in visual dialog using neural module networks. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (Vol. 11219, pp. 160–178). Springer International Publishing. https://doi.org/10.1007/978-3-030-01267-0_10
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M. S., & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7
Li, L., Huang, T., Li, Y., & Li, P. (2023). Trajectory‐BERT: Pre‐training and fine‐tuning bidirectional transformers for crowd trajectory enhancement. Computer Animation and Virtual Worlds, 34(3–4), e2190. https://doi.org/10.1002/cav.2190
Lin, T.-Y., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision – ECCV 2014 (pp. 740–755). Springer. https://doi.org/10.1007/978-3-319-10602-1_48
Lin, X., Sun, S., Huang, W., Sheng, B., Li, P., & Feng, D. D. (2023). EAPT: Efficient attention pyramid transformer for image processing. IEEE Transactions on Multimedia, 25, 50–61. https://doi.org/10.1109/TMM.2021.3120873
Liu, A.-A., Zhang, G., Xu, N., Guo, J., Jin, G., & Li, X. (2022). Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems, 28, 1823–1832. https://doi.org/10.1007/s00530-022-00947-1
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.
Lu, J., Kannan, A., Yang, J., Parikh, D., & Batra, D. (2017). Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30). Curran Associates, Inc.
Murahari, V., Batra, D., Parikh, D., & Das, A. (2020). Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12363, pp. 336–352). Springer International Publishing. https://doi.org/10.1007/978-3-030-58523-5_20
Nguyen, V.-Q., Suganuma, M., & Okatani, T. (2020). Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12369, pp. 223–240). Springer International Publishing. https://doi.org/10.1007/978-3-030-58586-0_14
OpenAI. (2023). GPT-4 technical report. ArXiv, abs/2303.08774. https://doi.org/10.48550/arXiv.2303.08774
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108. https://doi.org/10.48550/ARXIV.1910.01108
Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214
Seo, P. H., Lehrmann, A. M., Han, B., & Sigal, L. (2017). Visual reference resolution using attention memory for visual dialog. Advances in Neural Information Processing Systems, 30, 3719–3729.
Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. 56th Annual Meeting of the Association for Computational Linguistics (pp. 2556–2565). Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1238
Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, A., & Cho, K. (2019). BERT has a mouth, and it must speak: BERT as a Markov random field language model. Workshop on Methods for Optimizing and Evaluating Neural Language Generation (pp. 30–36). https://doi.org/10.18653/v1/W19-2304
Wang, Y., Joty, S. R., Lyu, M. R., King, I., Xiong, C., & Hoi, S. C. H. (2020). VD-BERT: A unified vision and dialog transformer with BERT. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3325–3338). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.269
Wu, Q., Wang, P., Shen, C., Reid, I., & Hengel, A. V. D. (2018). Are you talking to me? Reasoned visual dialog generation through adversarial learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6106–6115). IEEE. https://doi.org/10.1109/CVPR.2018.00639
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144. https://doi.org/10.48550/ARXIV.1609.08144
Xin, B., Xu, N., Zhai, Y., Zhang, T., Lu, Z., Liu, J., Nie, W., Li, X., & Liu, A.-A. (2023). A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29, 3781–3804. https://doi.org/10.1007/s00530-023-01175-x
Yang, T., Zha, Z.-J., & Zhang, H. (2019). Making history matter: History-advantage sequence training for visual dialog. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 2561–2569). IEEE. https://doi.org/10.1109/ICCV.2019.00265
Yu, Y., Yang, Y., & Xing, J. (2024). PMGAN: Pretrained model-based generative adversarial network for text-to-image generation. The Visual Computer, 44, 303–314. https://doi.org/10.1007/s00371-024-03326-1
Yu, Z., Yu, J., Cui, Y., Tao, D., & Tian, Q. (2019). Deep modular co-attention networks for visual question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6274–6283). IEEE. https://doi.org/10.1109/CVPR.2019.00644
Zhang, B., Ma, R., Cao, Y., & An, P. (2024). Swin-VEC: Video Swin Transformer-based GAN for video error concealment of VVC. The Visual Computer, 40, 7335–7347. https://doi.org/10.1007/s00371-024-03518-9
Zhang, J., Zhao, T., & Yu, Z. (2018). Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. 19th Annual SIGdial Meeting on Discourse and Dialogue (pp. 140–150). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5015
Zhao, L., Lyu, X., Song, J., & Gao, L. (2021). GuessWhich? Visual dialog with attentive memory network. Pattern Recognition, 114, 107823. https://doi.org/10.1016/j.patcog.2021.107823
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 19–27). IEEE. https://doi.org/10.1109/ICCV.2015.11
License

This work, like all articles published in Applied Computer Science, is open access and distributed under the terms of the Creative Commons Attribution 4.0 International License.