State-of-the-art models for Similarity Learning are all based on deep learning architectures using Siamese Networks [Gregory et al., 2015]. They define a feature-extraction pipeline that creates a latent representation of the input data. This embedding vector is semantically highly descriptive and can be used to compute distances between data records as a measure of similarity. While similarity learning is a popular topic, the combination of multiple modalities has not yet attracted much of the field's attention. In the context of Duplicate Product Identification, both the textual descriptions of products and their pictures can be used to make the similarity decision. Using data descriptors of different modalities, e.g. images and text, requires rethinking the concept of the Siamese Network to perform multimodal similarity learning. In this work, multiple approaches have been explored: unimodal and multimodal Similarity Learning algorithms. The latter, which combines embeddings across multiple modalities through a gradient-sharing method, was shown to outperform every combination of unimodal approaches, as measured by N-Way and F-Beta scoring. Furthermore, our analysis of how combining multiple modalities affects the learnt features gives insight into how modalities can collaborate to optimize the training objective by detecting the most informative features of each modality. Comparing the weights of a multimodal Siamese network to those of unimodal networks helped to better evaluate the cross-modality data profiles captured within the embeddings.
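The core mechanism described above, a shared encoder producing embeddings whose distances measure similarity, with modality embeddings combined for the multimodal case, can be illustrated with a minimal sketch. All names, weights, and feature vectors below are hypothetical stand-ins (a linear map replaces a deep network, and concatenation stands in for the paper's gradient-sharing combination); this is an illustration of the idea, not the authors' implementation.

```python
import math

def encode(features, weights):
    # Shared ("Siamese") encoder: both products pass through the SAME
    # parameters, so comparable inputs land near each other in the
    # embedding space. A linear map stands in for a deep network here.
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def multimodal_embedding(text_feats, image_feats, w_text, w_image):
    # A simple combination of modalities: encode each modality with its
    # own shared encoder, then concatenate the resulting embeddings.
    return encode(text_feats, w_text) + encode(image_feats, w_image)

def euclidean(a, b):
    # Distance between embeddings: small distance = likely duplicates.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical fixed weights; in practice these are learnt jointly,
# with gradients flowing through both modality branches.
w_text = [[0.5, -0.2], [0.3, 0.8]]
w_image = [[0.1, 0.4], [-0.6, 0.2]]

# Toy feature vectors for three products: b is a near-duplicate of a.
a = multimodal_embedding([1.0, 0.0], [0.5, 0.5], w_text, w_image)
b = multimodal_embedding([1.1, 0.1], [0.5, 0.4], w_text, w_image)
c = multimodal_embedding([-2.0, 3.0], [4.0, -1.0], w_text, w_image)

assert euclidean(a, b) < euclidean(a, c)  # duplicates lie closer
```

Because both inputs go through identical parameters, the distance depends only on the data, which is what lets a threshold on the embedding distance serve as the duplicate/non-duplicate decision.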