Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank

Introduction

Learning to rank search results has received considerable attention over the past decade, and it is at the core of modern information retrieval.
One important consideration is that large modern CNNs require large amounts of training data, while the number of examples available to an individual query model can be in the low hundreds, particularly for queries in the long tail. This makes training one deep CNN per query from scratch prone to overfitting. Transfer learning is a popular method for dealing with this problem, with many successful applications in computer vision.
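To make the transfer step concrete, below is a minimal sketch of the feature-extraction style of transfer learning: a CNN pretrained on a large source task is reused as a fixed visual feature extractor, so no per-query CNN training is needed. The choice of an ImageNet-pretrained ResNet-50 via PyTorch/torchvision is an illustrative assumption, not the specific model or framework of this work.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a CNN pretrained on ImageNet; its convolutional layers act as a
# generic visual feature extractor for the (much smaller) target task.
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()  # drop the classifier head, keep 2048-d features
cnn.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path: str) -> torch.Tensor:
    """Return a 2048-d deep visual feature vector for one listing image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(img).squeeze(0)
```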

Our Work
(1) Multimodal Listing Embedding

Each listing contains text information, such as a descriptive title and tags, as well as an image of the item for sale. To measure the value of including image information, we embed listings in a multimodal space (consisting of high-level text …)
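One simple way to form such a multimodal embedding, sketched below under assumed dimensions, is to L2-normalize each modality's vector and concatenate them, so that neither the text nor the image features dominate by scale. The helper names and dimensions (a 300-d text vector, a 2048-d image vector) are illustrative assumptions.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so neither modality dominates."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def multimodal_embedding(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Concatenate per-modality embeddings into one multimodal vector."""
    return np.concatenate([l2_normalize(text_vec), l2_normalize(image_vec)])

# Hypothetical example: a 300-d text embedding and a 2048-d image
# embedding yield a 2348-d multimodal listing representation.
text_vec = np.random.rand(300)
image_vec = np.random.rand(2048)
listing_vec = multimodal_embedding(text_vec, image_vec)
assert listing_vec.shape == (2348,)
```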

(2) Training Models

We extend our existing learning to rank models with image information. First, we embed listings for the learning to rank task in both single-modality and multimodal settings. Then, the listings embedded in each modality are incorporated into a learning to rank framework.
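As a sketch of how embedded listings can feed a learning to rank framework, the example below uses a pairwise approach: relevance-graded listings for one query are converted into difference vectors, and a logistic classifier learns a linear scoring function over the embeddings. The data, labels, and pairwise logistic formulation here are illustrative assumptions, not the exact training setup of this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_transform(X: np.ndarray, y: np.ndarray):
    """Turn graded relevance labels into pairwise difference examples:
    for each pair with y[i] != y[j], the feature is x_i - x_j and the
    target is 1 if listing i should rank above listing j, else 0."""
    diffs, targets = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] != y[j]:
                diffs.append(X[i] - X[j])
                targets.append(1 if y[i] > y[j] else 0)
                # Mirror the pair to keep the two classes balanced.
                diffs.append(X[j] - X[i])
                targets.append(0 if y[i] > y[j] else 1)
    return np.array(diffs), np.array(targets)

# Hypothetical data: multimodal embeddings for listings returned by one
# query, with graded relevance labels (e.g., purchase > click > skip).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2348))   # 50 listings, 2348-d multimodal vectors
y = rng.integers(0, 3, size=50)   # relevance grades 0..2

X_pairs, y_pairs = pairwise_transform(X, y)
ranker = LogisticRegression(max_iter=1000).fit(X_pairs, y_pairs)

# Rank listings by the learned linear score w.x (higher is better).
scores = X @ ranker.coef_.ravel()
ranking = np.argsort(-scores)
```

The same pipeline applies unchanged in the single-modality settings: only the input embedding (text-only or image-only) differs, which makes the contribution of each modality directly comparable.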