Metric Learning Tutorial

Parag Jain

1 Metric Learning methods

Most popular machine learning algorithms such as k-nearest neighbour, k-means and SVM use a metric to measure the distance (or similarity) between data instances. The performance of these algorithms depends heavily on the metric being used. In the absence of prior knowledge about the data we can only use general purpose metrics such as Euclidean distance, cosine similarity or Manhattan distance, but these metrics often fail to capture the true behaviour of the data, which directly affects the performance of the learning algorithm. The solution to this problem is to tune the metric according to the data and the problem at hand; however, manually deriving a metric for high dimensional data, which is often difficult even to visualize, is not only tedious but extremely hard. This motivates the effort to learn a metric that respects the data geometry.
The goal of a metric learning algorithm is to learn a metric which assigns a small distance to similar points and a relatively large distance to dissimilar points.

Definition 1. A metric on a set X is a function (called the distance function or simply distance)

d : X × X → ℝ,

where ℝ is the set of real numbers, and for all x, y, z in X the following conditions are satisfied:
1. d(x, y) ≥ 0 (non-negativity)
2. d(x, y) = 0 if and only if x = y (coincidence axiom)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

If a function does not satisfy the second property but satisfies the other three, it is called a pseudometric. Since most metric learning methods learn a pseudometric rather than a metric, for the rest of the discussion we will refer to a pseudometric simply as a metric. Most metric learning methods in the literature learn a metric of the form

d_M(x, x′) = √((x − x′)^T M (x − x′))
(1)

which is the Mahalanobis distance, where M = (A^{1/2})^T (A^{1/2}) is a positive semi-definite matrix.
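To make equation (1) concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original text) that evaluates the Mahalanobis distance for a PSD matrix M built as A^T A from an arbitrary matrix A; all variable names are illustrative.

import numpy as np

def mahalanobis(x, x_prime, M):
    """Mahalanobis distance d_M(x, x') = sqrt((x - x')^T M (x - x'))."""
    diff = x - x_prime
    return np.sqrt(diff @ M @ diff)

# Build a PSD matrix M = A^T A from a random A; with M = I the distance
# reduces to the ordinary Euclidean distance.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
M = A.T @ A
x, x_prime = rng.normal(size=3), rng.normal(size=3)
print(mahalanobis(x, x_prime, M))          # distance under the (here random) metric
print(mahalanobis(x, x_prime, np.eye(3)))  # Euclidean distance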

1.1 Supervised Metric Learning

Given a set of k-dimensional data points X ∈ ℝ^{N×k}, supervised metric learning methods learn a metric using similarity/dissimilarity information provided as constraints. Different formulations have been proposed for supervised metric learning, accommodating different kinds of constraints. In a general supervised setting, the most popular forms of constraints used in the literature [kulis2012metric] are:

1. Similarity/dissimilarity constraints
   d_A(x_i, x_j) ≤ u   (i, j) ∈ S
   d_A(x_i, x_j) ≥ l   (i, j) ∈ D
   where (i, j) ∈ S for objects that are similar and (i, j) ∈ D for objects that are dissimilar.
2. Relative constraints
   R = {(x_i, x_j, x_k) : x_i should be more similar to x_j than to x_k}:
   d_A(x_i, x_j) < d_A(x_i, x_k) − m
   where m is a margin, generally chosen to be 1.
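When only class labels are available, such constraint sets are commonly derived from the labels themselves. The sketch below (illustrative only; the function names are made up here) builds similarity/dissimilarity pairs and relative triplets in that way.

import numpy as np
from itertools import combinations

def pairwise_constraints(y):
    """Index pairs (i, j): S for same-label points, D for different-label points."""
    S, D = [], []
    for i, j in combinations(range(len(y)), 2):
        (S if y[i] == y[j] else D).append((i, j))
    return S, D

def relative_constraints(y):
    """Triplets (i, j, k): x_i and x_j share a label, x_k has a different one."""
    triplets = []
    for i, j in combinations(range(len(y)), 2):
        if y[i] != y[j]:
            continue
        triplets.extend((i, j, k) for k in range(len(y)) if y[k] != y[i])
    return triplets

y = np.array([0, 0, 1, 1])
S, D = pairwise_constraints(y)
print(S)                        # [(0, 1), (2, 3)]
print(D)                        # [(0, 2), (0, 3), (1, 2), (1, 3)]
print(relative_constraints(y))  # [(0, 1, 2), (0, 1, 3), (2, 3, 0), (2, 3, 1)]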

The following sections summarize some of the widely used methods.

1.1.1 Large Margin Nearest Neighbor

Large Margin Nearest Neighbour (LMNN) [weinberger2009distance] learns a metric of the form (1), parameterized by a matrix A, for the kNN classification setting. The intuition behind this method is to learn a metric such that the k nearest neighbours of each point belong to the same class, while instances with different class labels are separated by a large margin.


[Figure 1: Schematic illustration of the LMNN approach [weinberger2009distance]]


Let X_{n×d} be a set of data points in d-dimensional space with class labels y_i, i = 1...n. For each point x_i ∈ X we define its target neighbours as those points which are among the k nearest neighbours of x_i and share the same label y_i; points which do not share the label of x_i are called impostors. The formulation consists of two terms which compete with each other: the first term penalizes large distances between each point x_i and its target neighbours, while the second term penalizes small distances between x_i and its impostors. The cost function is defined as:

ε(L) = Σ_{ij} η_{ij} ||L(x_i − x_j)||² + c Σ_{ijl} η_{ij} (1 − Y_{il}) [1 + ||L(x_i − x_j)||² − ||L(x_i − x_l)||²]_+
(2)

where Y_{ij} and η_{ij} are binary matrices such that Y_{ij} is 1 when the labels y_i and y_j match, and η_{ij} is 1 when x_j is a target neighbour of x_i; in the second term, [z]_+ = max(0, z) is the standard hinge loss and c is a positive constant. Using the cost function defined in (2), a convex optimization problem can be formulated as:

min_M   Σ_{ij} η_{ij} (x_i − x_j)^T M (x_i − x_j) + c Σ_{ijl} η_{ij} (1 − Y_{il}) ξ_{ijl}
subject to   (x_i − x_l)^T M (x_i − x_l) − (x_i − x_j)^T M (x_i − x_j) ≥ 1 − ξ_{ijl}
             ξ_{ijl} ≥ 0
             M ⪰ 0
(3)
where the matrix M = L^T L and the ξ_{ijl} are slack variables.
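As an illustration of equation (2), the sketch below evaluates the LMNN cost for a fixed linear map L, assuming the target neighbours have already been computed; it is a toy evaluation of the objective, not the semidefinite solver of [weinberger2009distance], and the triple loop favours clarity over efficiency.

import numpy as np

def lmnn_cost(L, X, y, target_neighbors, c=1.0):
    """LMNN cost of eq. (2): pull term over target neighbours plus hinge push term."""
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(target_neighbors):
        for j in neighbors:
            d_ij = np.sum((L @ (X[i] - X[j])) ** 2)
            pull += d_ij                                # penalize distant target neighbours
            for l in range(len(X)):
                if y[l] == y[i]:                        # (1 - Y_il) = 0 for same-label points
                    continue
                d_il = np.sum((L @ (X[i] - X[l])) ** 2)
                push += max(0.0, 1.0 + d_ij - d_il)     # impostor inside the margin
    return pull + c * push

# Example: identity map, two classes in 2-D, one target neighbour per point.
X = np.array([[0., 0.], [0., 1.], [3., 0.], [3., 1.]])
y = np.array([0, 0, 1, 1])
print(lmnn_cost(np.eye(2), X, y, target_neighbors=[[1], [0], [3], [2]]))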
1.1.2 Pseudo-Metric Online Learning Algorithm

The Pseudo-Metric Online Learning Algorithm (POLA), proposed by Shalev-Shwartz et al. [shalev2004online], updates/learns the metric in an online manner. In this method, given that 𝒳 denotes the feature space, POLA learns a metric of the form:

d_A(x, x′) = √((x − x′)^T A (x − x′))

where A is a positive semi-definite (PSD) matrix that defines the metric. The algorithm receives new samples as similarity and dissimilarity pairs of the form z = (x, x′, y) ∈ (𝒳 × 𝒳 × {+1, −1}), where y = +1 if the pair (x, x′) is similar and y = −1 otherwise. The loss function is defined as:

l_τ(A, b) = max{0, y_τ(d_A(x, x′)² − b) + 1}
(4)

where b is a threshold. If d_A(x, x′)² > b, we predict the pair to be dissimilar, otherwise similar. The goal is to learn a matrix-threshold pair (A_τ, b_τ) which minimizes the cumulative loss. At each step, the algorithm receives a pair (x, x′, y) and updates the matrix-threshold pair (A_τ, b_τ) in two steps:

1. Projecting the current solution (A_τ, b_τ) onto the set C_τ, where

C_τ = {(A, b) ∈ ℝ^(n²+1) : l_τ(A, b) = 0}

i.e. C_τ is the set of all matrix-threshold pairs which give zero loss on (x, x′, y).

2. Projecting the new matrix-threshold pair onto the set of all admissible matrix-threshold pairs C_a,

C_a = {(A, b) ∈ ℝ^(n²+1) : A ⪰ 0, b ≥ 1}

Projecting onto C_τ: We denote the matrix-threshold pair as a vector w ∈ ℝ^(n²+1), and 𝒳_τ ∈ ℝ^(n²+1) as the vector form of the matrix-scalar pair (−y_τ v_τ v_τ^T, y_τ), where v_τ = x_τ − x′_τ. Using this, we can rewrite the set C_τ as,

C_τ = {w ∈ ℝ^(n²+1) : w · 𝒳_τ ≥ 1}

Now, we can write the projection of wτ onto Cτ as:

𝒫_{C_τ}(w_τ) = w_τ + α_τ 𝒳_τ
(5)

where α_τ = 0 if w_τ · 𝒳_τ ≥ 1, otherwise α_τ = (1 − w_τ · 𝒳_τ)/||𝒳_τ||₂², i.e.

α_τ = l_τ(A_τ, b_τ)/||𝒳_τ||₂² = l_τ(A_τ, b_τ)/(||v_τ||₂⁴ + 1)

Based on this, we can update the matrix-threshold pair as:

Â_τ = A_τ − y_τ α_τ v_τ v_τ^T,   b̂_τ = b_τ + α_τ y_τ
(6)

Projecting onto C_a: Projecting b̂_τ onto the set {b : b ≥ 1} is straightforward and can be achieved as b_{τ+1} = max{1, b̂_τ}. However, projecting Â_τ has two cases, depending on the sign of y_τ: for y_τ = −1 the update adds a PSD rank-one term and Â_τ remains PSD, while for y_τ = +1 a rank-one term is subtracted and Â_τ may need to be projected back onto the PSD cone.

For every new sample pair, the update is done by successively projecting (A_τ, b_τ) onto C_τ and C_a.
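Below is a minimal NumPy sketch of one POLA update, following equations (4)-(6). It is an illustration under the notation above, not the authors' reference code; in particular, the PSD projection is done generically via an eigen-decomposition.

import numpy as np

def pola_update(A, b, x, x_prime, y):
    """One POLA step: project (A, b) onto C_tau (eqs. 5-6), then onto C_a."""
    v = x - x_prime
    d2 = v @ A @ v                                  # squared metric d_A(x, x')^2
    loss = max(0.0, y * (d2 - b) + 1.0)             # hinge loss of eq. (4)
    if loss > 0.0:
        alpha = loss / (np.sum(v ** 2) ** 2 + 1.0)  # alpha = loss / (||v||^4 + 1)
        A = A - y * alpha * np.outer(v, v)          # eq. (6)
        b = b + alpha * y
    # Projection onto C_a: clip b and project A onto the PSD cone.
    b = max(1.0, b)
    eigvals, eigvecs = np.linalg.eigh(A)
    A = (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
    return A, b

# Usage: start from the identity metric and threshold b = 1.
A, b = np.eye(3), 1.0
A, b = pola_update(A, b, np.array([1., 0., 0.]), np.array([0., 1., 0.]), y=+1)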

1.1.3 Information Theoretic Metric Learning

Information Theoretic Metric Learning (ITML) [davis2007information] learns a Mahalanobis distance metric that satisfies given similarity and dissimilarity constraints on the input data. The goal of the ITML algorithm is to learn a metric of the form d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j), under which similar data points are close relative to dissimilar points.
ITML starts with an initial metric d_{A_0}, where A_0 can be set to the identity matrix (I) or the inverse covariance of the data, and learns a metric d_A which is close to the starting metric d_{A_0} while satisfying the defined constraints. To measure the distance between metrics, it exploits the bijection between Gaussian distributions with fixed mean μ and Mahalanobis distances,

𝒩(x | μ, A) = (1/Z) exp(−(1/2) d_A(x, μ))

Using the above connection, the problem is formulated as:

min_A   ∫ 𝒩(x, μ, A_0) log(𝒩(x, μ, A_0)/𝒩(x, μ, A)) dx
subject to   d_A(x_i, x_j) ≤ u   ∀(x_i, x_j) ∈ S,
             d_A(x_i, x_j) ≥ l   ∀(x_i, x_j) ∈ D,
             A ⪰ 0
(7)

The above formulation can be simplified by utilizing the connection between the KL-divergence and the LogDet divergence, which is given as,

∫ 𝒩(x, μ, A_0) log(𝒩(x, μ, A_0)/𝒩(x, μ, A)) dx = (1/2) D_ld(A, A_0)
where   D_ld(A, A_0) = tr(A A_0⁻¹) − log det(A A_0⁻¹) − d
(8)

Using (8), problem (7) can be reformulated as:

min_A   D_ld(A, A_0)
subject to   d_A(x_i, x_j) ≤ u   ∀(x_i, x_j) ∈ S
             d_A(x_i, x_j) ≥ l   ∀(x_i, x_j) ∈ D
             A ⪰ 0
(9)

The above formulation can be solved efficiently using the Bregman projection method as described in [davis2007information].
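As a small illustration of the objective in (9), the following sketch computes the LogDet divergence D_ld(A, A_0); it is an assumption-laden toy, not the Bregman-projection solver of [davis2007information].

import numpy as np

def logdet_divergence(A, A0):
    """LogDet divergence D_ld(A, A0) = tr(A A0^-1) - log det(A A0^-1) - d."""
    d = A.shape[0]
    M = A @ np.linalg.inv(A0)
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - d

A0 = np.eye(3)
A = np.diag([1.0, 2.0, 0.5])
print(logdet_divergence(A, A0))   # positive when A differs from A0
print(logdet_divergence(A0, A0))  # exactly 0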

1.1.4 Mirror Descent for Metric Learning

Mirror Descent for Metric Learning (MDML), by [kunapuli2012mirror], is an online metric learning approach which learns a pseudo-metric of the form,

d_M(x, z)² = (x − z)^T M (x − z)

given labeled pairs (x_t, z_t, y_t), t = 1, ..., T, where y_t denotes similarity/dissimilarity.
Taking μ as a margin, the constraint can be written as,

y(μ − d_M(x, z)²) ≥ 1
l(M, μ) = max{0, 1 − y(μ − d_M(x, z)²)}
where l(M, μ) is the hinge loss. To learn the pseudo-metric incrementally from these triplets, the updates are computed as,
M_{t+1} = argmin_{M ≻ 0}  B_ψ(M, M_t) + η⟨∇_M l_t(M_t, μ_t), M − M_t⟩ + ηρ|||M|||
μ_{t+1} = argmin_{μ ≥ 1}  B_ψ(μ, μ_t) + η ∇_μ l_t(M_t, μ_t)(μ − μ_t)
where B_ψ(M, M_t) is a Bregman divergence, with ψ taken as either the squared-Frobenius distance or the von Neumann divergence.
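For concreteness, here is a small sketch of the pairwise hinge loss l(M, μ) above (an illustration only; the full mirror-descent updates with Bregman divergences are not reproduced here).

import numpy as np

def mdml_hinge_loss(M, mu, x, z, y):
    """Hinge loss l(M, mu) = max(0, 1 - y * (mu - d_M(x, z)^2))."""
    diff = x - z
    d2 = diff @ M @ diff
    return max(0.0, 1.0 - y * (mu - d2))

M = np.eye(2)
x, z = np.array([0., 0.]), np.array([1., 0.])
print(mdml_hinge_loss(M, mu=2.0, x=x, z=z, y=+1))  # 0.0: similar pair well inside the margin
print(mdml_hinge_loss(M, mu=2.0, x=x, z=z, y=-1))  # 2.0: dissimilar pair too close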

1.2 Unsupervised Metric Learning

Unsupervised metric learning is generally seen as a byproduct of manifold learning or dimensionality reduction algorithms. Metric learning has a direct connection to linear manifold learning techniques, since these ultimately learn a projective mapping; for non-linear techniques, which are more useful, the connection is not exact and can only be established under some approximations. Because of these limitations of manifold techniques, unsupervised metric learning has its own importance. Unsupervised metric learning aims to learn a metric without any supervision; most of the methods proposed in this area either solve the problem in a domain specific way, such as clustering [cpcm], or by understanding the geometric properties of the data.

1.2.1 Diffusion Maps

Diffusion maps [coifman2006diffusion] is a non-linear dimensionality reduction technique. Consider a graph G = (Ω, W) where Ω = {x_i}_{i=1}^N are the data samples and W is a similarity matrix with W(i, j) ∈ [0, 1]. W is obtained by applying a Gaussian kernel to the distances,

W(i, j) = exp(−d²(i, j)/σ²)
(10)

Using W we can obtain a transition matrix by row wise normalizing the similarity matrix:

P(i, j) = W(i, j)/d_i,   where d_i = Σ_{j=1}^N W(i, j)
(11)

Diffusion maps introduce a diffusion distance based on the transition probabilities P of the data, given as:

d_t²(i, j) = ||P_t(i, :) − P_t(j, :)||²_{1/ϕ}
(12)

where P_t = P^t is the t-th power of the transition matrix P.
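A minimal NumPy sketch of equations (10)-(12) follows (illustrative only; σ and t are hypothetical parameter choices, and the weights ϕ are taken here as the degree-based stationary distribution of P, which is an assumption).

import numpy as np
from scipy.spatial.distance import cdist

def diffusion_distances(X, sigma=1.0, t=2):
    """Pairwise diffusion distances after t steps, following eqs. (10)-(12)."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / sigma ** 2)   # eq. (10)
    P = W / W.sum(axis=1, keepdims=True)                   # eq. (11), row-stochastic
    Pt = np.linalg.matrix_power(P, t)                      # P_t = P^t
    phi = W.sum(axis=1) / W.sum()                          # stationary weights (assumption)
    # eq. (12): squared distance between rows of P^t, weighted by 1/phi
    diff = Pt[:, None, :] - Pt[None, :, :]
    return np.sqrt(np.sum(diff ** 2 / phi, axis=-1))

X = np.random.default_rng(0).normal(size=(5, 2))
D = diffusion_distances(X)
print(D.shape)   # (5, 5) symmetric matrix of diffusion distances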

1.2.2 Unsupervised metric learning using self-smoothing operator

Unsupervised metric learning using the self-smoothing operator [jiang2011unsupervised] proposes a diffusion based approach to improve the input similarity between data points. It uses a framework similar to diffusion maps, but instead of using the notion of diffusion distance it applies a Self-Smoothing Operator (SSO) which preserves the structure of the weight matrix W described in equation (10). The main steps of the SSO algorithm are summarized below:

1. Compute the smoothing kernel: P = D⁻¹W, where D is a diagonal matrix such that D(i, i) = Σ_{k=1}^n W(i, k)
2. Perform smoothing for t steps: W_t = W P^t
3. Self-normalization: W ← Γ⁻¹ W_t, where Γ is a diagonal matrix such that Γ(i, i) = W_t(i, i)
4. Project W onto the PSD cone: Ŵ = psd(W)
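A minimal NumPy sketch of these four steps (illustrative only; symmetrizing before the eigen-decomposition and projecting via eigenvalue clipping are assumptions for the psd(·) step, not prescribed by the original text):

import numpy as np

def self_smoothing_operator(W, t=2):
    """Self-Smoothing Operator: smooth a similarity matrix W and return a PSD version."""
    D_inv = 1.0 / W.sum(axis=1)
    P = D_inv[:, None] * W                     # step 1: P = D^-1 W
    Wt = W @ np.linalg.matrix_power(P, t)      # step 2: W_t = W P^t
    W_bar = Wt / np.diag(Wt)[:, None]          # step 3: Gamma^-1 W_t (self-normalization)
    sym = (W_bar + W_bar.T) / 2                # symmetrize (assumption) before eigen-decomposition
    eigvals, eigvecs = np.linalg.eigh(sym)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T   # step 4: psd(.)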

1.2.3 Unsupervised Distance Metric Learning using Predictability

Unsupervised distance metric learning using predictability [cpcm] learns a transformation of the data which gives well separated clusters by minimizing the blur ratio. This work proposes a two step algorithm to achieve this task, which alternates between predicting cluster membership using a linear regression model and re-clustering these predictions. Given an input data matrix X_{N×p}, with N points in p dimensional space, the goal is to learn a Mahalanobis distance metric d(x, y) = √((x − y)^T A (x − y)) which minimizes the blur ratio defined as:

min_{A,c} BR(A, c) ≡ SSC/SST

where SSC and SST are the within-cluster variance and the total variance, respectively.
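A small sketch of the blur ratio for a given clustering (illustrative; the alternating regression/clustering procedure of [cpcm] is not reproduced here):

import numpy as np

def blur_ratio(X, labels):
    """Blur ratio SSC / SST: within-cluster variance over total variance."""
    overall_mean = X.mean(axis=0)
    sst = np.sum((X - overall_mean) ** 2)
    ssc = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in np.unique(labels))
    return ssc / sst

# Example: two well separated clusters give a blur ratio close to 0.
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.]])
print(blur_ratio(X, np.array([0, 0, 1, 1])))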

1.3 Laplacian eigenmaps

Laplacian eigenmaps learn a low dimensional representation of the data such that the local geometry is optimally preserved; this low-dimensional manifold approximates the geometric structure of the data. The steps below describe the method in detail.
Consider a set of data points X ∈ ℝ^N. The goal of Laplacian eigenmaps is to find an embedding in an m dimensional space, where m < N, preserving the local properties of the data.

1. Construct a graph G(V, E), where E is the set of edges and V is the set of vertices. Each node in the graph G corresponds to a point in X; we connect two vertices v_i and v_j by an edge if they are close, where closeness can be defined in two ways:
(a) ||x_i − x_j||² < ε, where ||·|| is the Euclidean norm in ℝ^N, or
(b) x_i is among the k nearest neighbours of x_j.

Here ε and k are user defined parameters.

2. Construct a weight matrix W(i, j) which assigns a weight to each edge in the graph G; weights can be assigned in two ways:
(a) The simple-minded approach is to assign W(i, j) = 1 if vertices v_i and v_j are connected and 0 otherwise.
(b) Heat kernel based: we assign the weight W(i, j) such that
W(i, j) = exp(−||x_i − x_j||²/t)   if v_i and v_j are connected,
W(i, j) = 0                        otherwise.
3. Construct the Laplacian matrix L = D − W of the graph G, where D is a diagonal matrix with D_ii = Σ_j W(i, j). The final low dimensional embedding can be computed by solving the generalized eigenvalue problem
L v = λ D v

Let 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_m be the m + 1 smallest eigenvalues. Choose the corresponding eigenvectors v_1, v_2, ..., v_m, ignoring the eigenvector corresponding to λ_0 = 0. The embedding coordinates are given by the mapping:

x_i ∈ ℝ^N ↦ y_i ∈ ℝ^m,   where y_i^T = [v_1(i), v_2(i), ..., v_m(i)]

y_i, i = 1, 2, ..., n are the coordinates in the m dimensional embedded space.
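The steps above can be summarized in the short sketch below (illustrative only; it assumes a k-nearest-neighbour graph with heat-kernel weights and uses dense eigen-solvers, which is fine for small data sets; parameter values are arbitrary).

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, m=2, k=5, t=1.0):
    """Embed X into m dimensions by solving L v = lambda D v on a kNN heat-kernel graph."""
    sq_dists = cdist(X, X, "sqeuclidean")
    # Step 1: connect each point to its k nearest neighbours (symmetrized).
    idx = np.argsort(sq_dists, axis=1)[:, 1:k + 1]
    adj = np.zeros_like(sq_dists, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    adj[rows, idx.ravel()] = True
    adj = adj | adj.T
    # Step 2: heat-kernel weights on connected pairs.
    W = np.where(adj, np.exp(-sq_dists / t), 0.0)
    # Step 3: graph Laplacian and generalized eigenvalue problem L v = lambda D v.
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)
    # Skip the trivial eigenvector for lambda_0 = 0; keep the next m.
    return eigvecs[:, 1:m + 1]

Y = laplacian_eigenmaps(np.random.default_rng(0).normal(size=(30, 5)), m=2)
print(Y.shape)   # (30, 2)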

1.4 Active Metric Learning

Active learning is a form of semi-supervised learning; the difference is that in an active learning setup the algorithm itself chooses which data it wants to learn from. The aim is to select the data instances that are most effective for training the model, which saves significant cost for the end user by asking fewer queries.

1.4.1 Active Metric Learning for Object Recognition

Active metric learning for object recognition [ebert2012active] proposes to combine metric learning with an active sample selection strategy for classification. This work explores two exploitation-based (entropy based and margin based) and two exploration-based (kernel farthest-first and graph density) strategies for active sample selection. Information Theoretic Metric Learning is used to learn the metric, which is combined with active sample selection in two different modes,

1. Batch active metric learning: In this mode the metric is learned only once; the algorithm starts by querying the desired number of labeled data points according to the chosen sample selection strategy and then learns a metric based on this labeled data.
2. Interleaved active metric learning: This approach alternates between active sample selection and metric learning.

1.4.2 Metric+Active Learning and Its Applications for IT Service Classification

Metric+Active learning [wang2009two] learns a metric for the classification of tickets used by IT service providers. This work proposes two methods to solve the problem:

1. Discriminative Neighborhood Metric Learning (DNML): DNML aims to minimize the local discriminability of the data, which amounts to simultaneously maximizing the local scatterness and minimizing the local compactness.

J = [Σ_{j: x_j ∈ 𝒩_i^o} (x_i − x_j)^T C (x_i − x_j)] / [Σ_{k: x_k ∈ 𝒩_i^e} (x_i − x_k)^T C (x_i − x_k)]

where 𝒩_i^o is the set of nearest points to x_i with the same label as x_i, and 𝒩_i^e is the set of nearest points to x_i with a different label from x_i.

2. Active Learning with Median Selection (ALMS): ALMS improves Transductive Experimental Design (TED) by using the available labelled information.