Section 3 details the experiments performed on benchmark datasets and discusses the results along with various aspects of the method.

Image captioning is a popular research area of artificial intelligence that deals with image understanding and producing a natural language description of an image. Caption generation is a challenging problem that draws on both computer vision and natural language processing, and neural networks and deep learning have seen an upsurge of research in the past decade due to their improved results: with a suitable dataset, the problem can now be addressed effectively. Early systems such as Midge relied on maximum likelihood estimation, directly learning a visual detector and a language model from an image description dataset. Current deep models follow an encoder-decoder design motivated by the 'Show and Tell: A Neural Image Caption Generator' paper: the image-based model (the encoder) is a convolutional neural network such as VGG16 or ResNet, pre-trained on a larger dataset, while the language-based model (the decoder) is a recurrent neural network that emits the caption. When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the network either by directly incorporating it in the RNN (conditioning the language model by 'injecting' image features) or in a layer following the RNN (conditioning the language model by 'merging' image features).

In this paper, we exploit the features learned by caption generation models, in particular the Show and Tell system [1] and the dense region description model DenseCap [2], and learn task-specific image representations for retrieval. Knowledge acquired by CNNs for one vision task (e.g., recognition on ImageNet) is routinely transferred to other vision tasks; to the best of our knowledge, however, this is the first attempt to explore the knowledge captured by captioning systems by fine-tuning the representations they have learned for a retrieval task. We follow the evaluation procedure presented in [17].

For each image we extract the 512D FIC features to encode its contents; these are the features input to the text-generating part of the captioning model, and they are fed only once. The underlying CNN features are 2048D, extracted from the last fully connected layer of the Inception v3 model [18].
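The excerpt does not include extraction code; the following is a minimal sketch of pulling the 2048-D activations that feed Inception v3's last fully connected layer, assuming a recent torchvision (the function name extract_cnn_features is ours):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load Inception v3 pre-trained on ImageNet and drop the classifier head,
# so the forward pass returns the 2048-D activations feeding the final FC layer.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(342), T.CenterCrop(299),   # Inception v3 expects 299x299 inputs
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_cnn_features(image_path: str) -> torch.Tensor:
    """Return a 2048-D feature vector for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)    # shape: (2048,)
```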
Experiments show that the proposed task-specific representations achieve state-of-the-art performance on benchmark retrieval datasets. Once a captioning model has been trained, it has learned from many image-caption pairs and can generate captions for new images; conversely, for an image query, descriptions that lie close to the image in the learned embedding space can be retrieved. Image captioning is thus a central topic of image understanding, composed of two natural parts ("looking" and "language expression") that correspond to two major fields of artificial intelligence (machine vision and natural language processing). The problem of image caption generation involves outputting a readable and concise description of the contents of a photograph. Most existing work leverages deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in an encoding-decoding scheme [6, 4, 1, 5, 3], and most of these works aim at generating a single caption, which may be incomprehensive, especially for complex images. Tasks such as scene retrieval, moreover, suffer when features are learned from the weak label-level supervision used to train recognition CNNs; they require stronger supervision to better understand the contents of the images.

DenseCap [2] addresses the single-caption limitation by generalizing the tasks of object detection and image captioning. Figure 2 (right panel) shows an example image and the region descriptions predicted by the DenseCap model. For evaluating the performance of the DenseCap method, we mean-pool the encodings corresponding to the top-5 image regions, resulting in a 512D feature.

To fuse the FIC and DenseCap features we train a siamese network: a pair of images is presented to the network along with their relevance score (high for similar images, low for dissimilar ones), and the number of units in each wing is 1024−2048−1024−512−512. Figure 5 shows sample images from the two evaluation datasets. Figures 6 and 7 show the plots of nDCG evaluated at different ranks (K) on the two datasets, where we also compare the performance of the FIC features against the state-of-the-art Attribute graph approach [17], and Figures 8 and 9 show the performance of the task-specific image representations learned via the proposed fusion.
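As an illustration of the evaluation metric, here is a small numpy sketch of nDCG at rank K for graded relevances; note that some formulations use a 2^rel − 1 gain in the numerator, and the exact variant used in the paper follows [17]:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top-k retrieved items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1), ranks from 1
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the retrieved order divided by the DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example with graded relevances (0 = irrelevant ... 3 = excellent match)
print(ndcg_at_k([3, 2, 0, 1, 3], k=5))
```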
We refer to the pooled region encodings as Densecap features. Similarly, we refer to the image encodings of the Show and Tell model as Full Image Caption (FIC) features, since the generated caption gives a visual summary of the whole image; that model fine-tunes the later (fifth onward) layers of its CNN module (the VGG [6] architecture) while training the image encodings and the RNN parameters. The DenseCap model is trained end-to-end over the Visual Genome [19] dataset, which provides object-level annotations and corresponding descriptions. Recurrent neural networks offer a way to deal with sequences, such as time series, video sequences, or text, which is what makes these language models possible.

Transfer learning followed by task-specific fine-tuning is a well-known technique in deep learning: when the target dataset is small, it is common practice to fine-tune a model pre-trained on a larger dataset. Accordingly, we train a siamese network to fuse both sets of features. The network accepts the complementary information provided by the two features and learns a metric, via representations suitable for image retrieval. Retrieval is then performed by computing the distance between the query and the reference images' features and arranging the references in increasing order of distance.

The siamese network is trained with a pair-wise loss. The standard contrastive loss is

    E = (1/2N) Σᵢ [ yᵢ dᵢ² + (1 − yᵢ) max(0, Δ − dᵢ)² ]

where E is the prediction error, N is the mini-batch size, y is the relevance score (0 or 1), d is the distance between the projections of the pair of images, and Δ is the margin separating the projections corresponding to dissimilar pairs of images. The representations learned at the last layer are normalized, and the Euclidean distance between them is minimized according to Equation (2), our modification of this loss to non-binary scores; the modified loss favours the nDCG measure by strongly punishing (due to the square term) the distances between images with higher relevance scores. More details about the training are presented in Section 3.4.
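A sketch of both losses in PyTorch follows. The binary version is the standard contrastive loss of [23]; the graded version is only a plausible reading of the modification described above, so Equation (2) of the paper may differ in detail:

```python
import torch

def contrastive_loss(d, y, margin=1.0):
    """Standard contrastive loss [23] for a batch of pairs.
    d: (N,) Euclidean distances between the two projections of each pair
    y: (N,) binary relevance (1 = similar, 0 = dissimilar)
    """
    pull = y * d.pow(2)                                      # attract similar pairs
    push = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # repel dissimilar pairs
    return (pull + push).mean() / 2

def graded_contrastive_loss(d, rel, margin=1.0, max_rel=3.0):
    """Illustrative non-binary variant: the squared-distance term is scaled by
    the graded relevance (0..3), so pairs with higher relevance are punished
    more strongly for being far apart, favouring the nDCG measure."""
    r = rel / max_rel                                        # normalise grades to [0, 1]
    pull = r * d.pow(2)
    push = (1 - r) * torch.clamp(margin - d, min=0).pow(2)
    return (pull + push).mean() / 2
```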
The strength of this supervision is what we seek to transfer. Recent research [3, 4] has proposed solutions that automatically generate human-like descriptions of any image, and [2] proposed an approach to densely describe the regions in the image, called the dense captioning task; its model contains a fully convolutional CNN for object localization followed by an RNN that provides the descriptions. In implementation terms, RNN-based captioning architectures can be distilled into inject- and merge-based models, which make different assumptions about the role of the image features in the language model. We attempt to exploit the strong supervision observed during the training of such models via transfer learning. While transfer learning followed by task-specific fine-tuning is commonly observed in CNN-based vision systems, similar practice is relatively unexplored in the case of caption generators.

The major contributions of our work can be listed as follows. First, we show via image retrieval experiments that the features learned by image captioning systems represent image contents better than those of recognition CNNs. Second, we train a siamese network using a modified pair-wise loss suitable for non-binary relevance scores to fuse the complementary features learned by [1] and [2]; we take advantage of the complementary nature of these two features and learn task-specific features for image retrieval. This enables us to utilize the large volumes of data available in computer vision with convolutional neural networks (CNNs), and our approach can potentially open new directions for exploring other sources of stronger supervision and better learning.

Within the siamese network, the features at the image encoding layer WI (green arrow in Figure 3) are learned from scratch, and the layers on both wings have tied weights (identical transformations on the two paths). The relevance annotations have 4 grades, ranging from 0 (irrelevant) to 3 (excellent match); to handle these fine-grained relevances, we modified the contrastive loss function to include non-binary scores, as shown in Equation (2).
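A minimal tied-weight wing in PyTorch might look as follows; the ReLU activations are our assumption, since the excerpt does not state the nonlinearity:

```python
import torch
import torch.nn as nn

class SiameseWing(nn.Module):
    """One wing of the siamese network. Tied weights between the two wings are
    obtained by running both images through this same module. Layer sizes
    follow the 1024-2048-1024-512-512 configuration given in the text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 512),
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)  # L2-normalized embedding

wing = SiameseWing()
# A pair of fused 1024-D features (512-D FIC + 512-D Densecap, concatenated)
x1, x2 = torch.randn(8, 1024), torch.randn(8, 1024)
d = (wing(x1) - wing(x2)).norm(dim=-1)  # Euclidean distance per pair, fed to the loss
```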
The generation of captions from images has various practical benefits, ranging from aiding the visually impaired to enabling automatic, cost-saving labelling of the millions of images uploaded to the Internet every day, and expertise in computer vision paired with natural language processing is crucial for this purpose. For example, Figure 1 shows a pair of images from the MSCOCO [11] dataset along with their captions. Deep learning exploits large volumes of labeled data to learn powerful models; in this paper, we exploit the features learned via strong supervision by the captioning models and learn task-specific image representations for retrieval via pairwise constraints. As we demonstrate, image understanding tasks such as retrieval benefit from this strong supervision compared to weak label-level supervision. The DenseCap model provides encodings for each of the described image regions along with associated priorities; similar to the FIC features, we consider these image encodings in order to transfer the model's ability to describe regions in the image.

Two ranking datasets are used for evaluation. rPascal: the queries comprise 18 indoor and 32 outdoor scenes, and the dataset consists of a total of 1835 images with an average of 180 reference images per query. rImagenet: the queries contain 14 indoor and 36 outdoor scenes, and the dataset consists of a total of 3354 images with an average of 305 reference images per query. Query images containing at least 4 objects were chosen. The training pairs are divided into folds; on average, each fold contains 11300 training pairs for rPascal and 14600 pairs for rImagenet. The network updates its weights after each training batch, where the batch size is the number of pairs sent through the network during a single training step, and for quantitative evaluation of the performance we compute the normalized Discounted Cumulative Gain (nDCG) of the retrieved list. For each image, the 512D FIC feature and the 512D Densecap feature are concatenated, forming a 1024D input to the network.
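A small numpy sketch of this feature preparation, assuming DenseCap's region priorities are available for ranking the regions (the names densecap_feature and fused_input are ours):

```python
import numpy as np

def densecap_feature(region_encodings, priorities, top_k=5):
    """Mean-pool the encodings of the top-k regions (ranked by DenseCap
    priority) into a single 512-D Densecap feature for the image."""
    order = np.argsort(priorities)[::-1][:top_k]
    return np.asarray(region_encodings)[order].mean(axis=0)

def fused_input(fic_512, densecap_512):
    """Concatenate the 512-D FIC and 512-D Densecap features into the
    1024-D vector fed to each wing of the siamese network."""
    return np.concatenate([fic_512, densecap_512], axis=-1)
```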
Show and Tell (SnT-pami-2016) provides a single caption summarizing the whole image, while DenseCap (densecap-cvpr-2016) describes individual regions; note that the detected regions and the corresponding descriptions are dense and reliable. In recent years, automated image captioning using deep learning has received noticeable attention, resulting in various models capable of generating captions for images. These are typically two-part systems: an image-based model extracts the features and nuances of the image (in early systems, each image was encoded by a deep convolutional neural network into a 4,096-dimensional vector representation), and a language-based model turns them into a sentence. Recognition models, in contrast, are provided with nothing but the category label during training.

One application that benefits from our learned representations is reverse image search, a content-based image retrieval (CBIR) technique characterized by a lack of search terms: a sample image is taken as the input, and the search is performed based on it. Using reverse image search, one can find the original source of images, find plagiarized photos, detect fake accounts on social media, and so on.

Evaluating such retrieval requires datasets with graded relevance annotations; to match these requirements, we consider two datasets, rPascal (ranking Pascal) and rImagenet (ranking Imagenet), composed by Prabhu et al. [17]. In Figure 5, the first image in each row is the query and the following images are reference images, with their relevance scores displayed at the top right corner. On both benchmark datasets, the FIC features clearly outperform the Attribute Graph approach.

The proposed siamese architecture has two wings, and the prediction error is back-propagated to update the network parameters. Training proceeds in two phases: the Inception V3 layers (prior to the image encoding) are frozen (not updated) during the first phase, and they are updated during the later phase.
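The two-phase schedule can be sketched as below; the stand-in model structure and the learning rates are placeholders for illustration, not the paper's values:

```python
import torch
import torch.nn as nn

class CaptionRetrievalNet(nn.Module):
    """Stand-in for the full model: a backbone (placeholder for Inception v3)
    followed by the image encoding layer W_I, which is learned from scratch."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU())
        self.encode = nn.Linear(2048, 512)  # image encoding layer W_I

    def forward(self, x):
        return self.encode(self.backbone(x))

model = CaptionRetrievalNet()

def set_backbone_trainable(trainable: bool):
    for p in model.backbone.parameters():
        p.requires_grad = trainable

# Phase 1: freeze the backbone; train W_I and the siamese layers from scratch.
set_backbone_trainable(False)
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)

# Phase 2: unfreeze and update the backbone layers as well.
set_backbone_trainable(True)
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
```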
CNNs are the workhorse of image classification, identifying, say, whether an image shows a bird or a plane: images are easily represented as 2D matrices, and CNNs are very effective at working with them. Captions offer a complementary route to retrieval, since every image can first be converted into a caption and the search can then be performed based on that description; annotating images with proper descriptions automatically has long been an interesting and challenging problem, and reliable automatic description generation has been achieved only recently by applying deep neural networks.

rPascal and rImagenet are composed from the aPascal [20] dataset and from the validation set of the ILSVRC 2013 detection challenge, respectively. Each of the datasets contains 50 query images together with a pool of reference images whose graded relevances were assigned by human annotators. nDCG is typically used for evaluating ranking algorithms, as it rewards placing references with higher relevance near the top of the retrieved list. On both datasets, the fused FIC and Densecap features outperform the individual features by a large margin, emphasizing the effectiveness of the proposed architecture.
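A numpy sketch of the retrieval step, which plugs directly into the ndcg_at_k function given earlier:

```python
import numpy as np

def retrieve(query_emb, ref_embs, ref_relevances, k=10):
    """Rank reference images by Euclidean distance to the query embedding
    (increasing order) and return their relevances in retrieved order,
    ready to be scored with ndcg_at_k."""
    dists = np.linalg.norm(ref_embs - query_emb, axis=1)
    order = np.argsort(dists)                    # closest first
    return [ref_relevances[i] for i in order[:k]]

# Example with random embeddings for 5 reference images
rng = np.random.default_rng(0)
q, refs = rng.normal(size=512), rng.normal(size=(5, 512))
print(retrieve(q, refs, [3, 0, 2, 1, 0], k=5))
```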
For learning the retrieval metric, note that the contrastive loss [23] typically used to train siamese networks handles only binary relevance scores; the graded annotations of rPascal and rImagenet motivate our non-binary formulation. The FIC and Densecap features are late fused (concatenated) before being presented to the wings of the network, which are trained using gradient descent.

The Show and Tell model [1] itself yields state-of-the-art captioning. The CNN extracts features from the input image and feeds them, via a transformation WI, to an LSTM; at each step, the output of the LSTM is the soft-max probability distribution over the dictionary words, and the caption is generated word by word, conditioned on the image and the words emitted so far (typically with beam search). For a test image, such a generator can output an English sentence like "the black cat is walking on grass".
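A toy PyTorch decoder illustrating this word-by-word generation (all sizes and token ids are hypothetical, and real systems typically replace the greedy argmax with beam search):

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Toy LSTM decoder: the image embedding is fed only once, as the first
    input, and each step emits a soft-max distribution over the dictionary."""
    def __init__(self, vocab=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def greedy_decode(self, image_feat, start=1, end=2, max_len=20):
        h = torch.zeros(1, image_feat.size(-1))
        c = torch.zeros(1, image_feat.size(-1))
        h, c = self.lstm(image_feat.unsqueeze(0), (h, c))   # image fed once
        word, words = torch.tensor([start]), []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            probs = self.out(h).softmax(dim=-1)             # distribution over words
            word = probs.argmax(dim=-1)                     # greedy choice
            if word.item() == end:
                break
            words.append(word.item())
        return words

dec = ToyDecoder()
print(dec.greedy_decode(torch.randn(512)))
```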
In summary, pre-trained CNNs for image recognition are provided with limited information about the scene during training, which is just the category label. Caption generating models, built upon recurrent neural networks, are instead trained against full human-given descriptions such as "a boy is standing next to a dog", which summarize all the important visual content in the image. The representations learned from such caption generating models therefore capture image contents better than those of typical CNNs trained with weak label-level supervision, and fusing the FIC and Densecap features with task-specific fine-tuning achieves state-of-the-art retrieval performance on the benchmark datasets.

This work was supported by the Defence Research and Development Organization (DRDO), Government of India.