Hyperbolic coordinates (Poincaré embeddings) as the output of a neural network
I'm trying to build a deep learning predictor that takes a set of word vectors (in Euclidean space) as input and outputs Poincaré embeddings. So far I am not having much luck, because the model predicts arbitrary points in n-dimensional real space rather than points inside the unit ball of the Poincaré model. This causes the distance, and thus the loss function, to be undefined, so I need to restrict the output of the model somehow. I have tried several things.
First, I defined a loss function that minimizes the hyperbolic distance (on the Poincaré ball):
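import tensorflow as tf
from keras import backend as K

def distance_loss(u, v):
    # Clip the squared norms just below 1 so the denominator stays positive.
    max_norm = 1 - K.epsilon()
    sq_u_norm = K.clip(K.sum(K.pow(u, 2), axis=-1), 0, max_norm)
    sq_v_norm = K.clip(K.sum(K.pow(v, 2), axis=-1), 0, max_norm)
    sq_dist = K.sum(K.pow(u - v, 2), axis=-1)
    # Poincaré distance: acosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    poincare_dist = tf.acosh(1 + (sq_dist / ((1 - sq_u_norm) * (1 - sq_v_norm))) * 2)
    # Mathematically, -log(exp(-d)) is just d, so the loss is the distance itself.
    neg_exp_dist = K.exp(-poincare_dist)
    return -K.log(neg_exp_dist)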
Which I somewhat dumbly lifted from here and here.
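As a sanity check on the loss above, the same distance can be computed in plain NumPy and compared against a value known in closed form. This is a minimal sketch, not the code from either linked source; the function name `poincare_distance` and the `eps` guard are assumptions made for illustration. For a point at Euclidean radius r from the origin, the distance should equal ln((1 + r) / (1 - r)).

```python
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Reference Poincaré-ball distance between u and v (hypothetical helper)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    sq_u = np.sum(u * u, axis=-1)
    sq_v = np.sum(v * v, axis=-1)
    sq_dist = np.sum((u - v) ** 2, axis=-1)
    # Lower-bound the denominator so points on or outside the boundary
    # give a large finite value instead of inf/NaN.
    denom = np.maximum((1.0 - sq_u) * (1.0 - sq_v), eps)
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

origin = np.zeros(2)
point = np.array([0.5, 0.0])
d = poincare_distance(origin, point)
# Closed form for a point at radius r = 0.5: ln((1 + r) / (1 - r)) = ln(3)
expected = np.log(3.0)
```

If the Keras loss disagrees with such a reference on a handful of points inside the ball, the bug is probably in the loss; if it agrees, the problem is more likely that the model outputs leave the ball.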
However, that doesn't seem to work properly on its own. Next, I changed the optimizer to something I lifted from a notebook on the topic and some slides (PDF). Note that I am using Keras 2.1.6 with TensorFlow, so I had to make some changes.
from keras.legacy import interfaces
from keras.optimizers import Adam

def get_normalization(p):
    p_norm = K.sum(K.square(p), -1, keepdims=True)
    mp = K.square(1 - p_norm)/4.0
    return mp, K.sqrt(p_norm)
def project(p, p_norm):
    p_norm_clip = K.maximum(p_norm, 1.0)
    p_norm_cond = K.cast(p_norm > 1.0, dtype=K.floatx()) * K.epsilon()
    return p/p_norm_clip - p_norm_cond
class AdamPoincare(Adam):
    @interfaces.legacy_get_updates_support
    def get_updates(self,loss,params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]
        lr = self.lr
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))
        t = K.cast(self.iterations, K.floatx()) + 1
        lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                     (1. - K.pow(self.beta_1, t)))
        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        self.weights = [self.iterations] + ms + vs
        for p, g, m, v in zip(params, grads, ms, vs):
            normalization, p_norm = get_normalization(p)
            g = normalization * g
            m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
            p_t = p - lr_t * m_t / (K.sqrt(v_t) + self.epsilon)
            self.updates.append(K.update(m, m_t))
            self.updates.append(K.update(v, v_t))
            new_p = project(p_t, p_norm)
            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)
            self.updates.append(K.update(p, new_p))
        return self.updates
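For comparison, here is a minimal NumPy sketch of a single plain (non-Adam) Riemannian SGD step on the Poincaré ball: rescale the Euclidean gradient by (1 - |p|^2)^2 / 4, take the step, then project back inside the ball. The function name, the `eps` margin, and the example values are assumptions for illustration, not the notebook's code. One detail that may be worth checking in the optimizer above is that `project(p_t, p_norm)` receives the norm of `p` from before the update; the sketch uses the norm of the updated point instead. Note also that an optimizer of this kind constrains the trainable weights, not the activations the network outputs.

```python
import numpy as np

def riemannian_sgd_step(p, euclid_grad, lr=0.1, eps=1e-5):
    """One Riemannian SGD step on the Poincaré ball (hypothetical helper)."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(euclid_grad, dtype=np.float64)
    sq_norm = np.sum(p * p, axis=-1, keepdims=True)
    # Inverse metric factor that rescales the Euclidean gradient.
    scale = (1.0 - sq_norm) ** 2 / 4.0
    p_new = p - lr * scale * g
    # Project using the norm of the UPDATED point, not the old one.
    new_norm = np.linalg.norm(p_new, axis=-1, keepdims=True)
    max_norm = 1.0 - eps
    factor = np.where(new_norm > max_norm,
                      max_norm / np.maximum(new_norm, eps),
                      1.0)
    return p_new * factor

# A deliberately large gradient that pushes the point out of the ball.
p = np.array([[0.9, 0.0]])
g = np.array([[-1000.0, 0.0]])
p_next = riemannian_sgd_step(p, g)
```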
That still didn't do what I wanted, so lastly I tried adding a Lambda layer that projects the points on the forward pass (although I have no idea if this is proper). The target outputs are already coordinates in hyperbolic space (so on the backward pass this should be a no-op).
from keras.layers import Dense, Lambda

def poincare_project(x, axis=-1):
    square_sum = K.tf.reduce_sum(
        K.tf.square(x), axis, keepdims=True)
    x_inv_norm = K.tf.rsqrt(square_sum)
    x_inv_norm = K.tf.minimum((1. - K.epsilon()) * x_inv_norm, 1.)
    outputs = K.tf.multiply(x, x_inv_norm)
    return outputs
x_dense = Dense(int(params["semantic_dense"]))(x_activation)
x_activation = activation(x_dense)
x_output = Dense(params["semantic_dim"], activation="tanh")(x_activation)
x_project = Lambda(poincare_project)(x_output)
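To test the projection logic outside the model, it can be mirrored in NumPy. This is a sketch under assumptions: `poincare_project_np` is a hypothetical name, `eps=1e-7` stands in for the default `K.epsilon()`, and the lower bound on the squared norm is an addition that the function above does not have (there, `rsqrt` of a zero vector is infinite, which may matter for gradients).

```python
import numpy as np

def poincare_project_np(x, eps=1e-7, axis=-1):
    """NumPy mirror of the projection: rescale only points with norm above 1 - eps."""
    x = np.asarray(x, dtype=np.float64)
    square_sum = np.sum(np.square(x), axis=axis, keepdims=True)
    # Lower-bound the squared norm so the zero vector does not divide by zero.
    inv_norm = 1.0 / np.sqrt(np.maximum(square_sum, eps ** 2))
    scale = np.minimum((1.0 - eps) * inv_norm, 1.0)
    return x * scale

inside = poincare_project_np([0.3, 0.4])    # norm 0.5, left unchanged
outside = poincare_project_np([3.0, 4.0])   # norm 5.0, pulled inside the ball
zero = poincare_project_np([0.0, 0.0])      # stays finite
```

In float32, a margin of 1e-7 leaves very little room below 1, so a larger margin may be worth trying if distances near the boundary blow up.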
But it still produces garbage results (it doesn't minimize the distance, or it causes NaN/Inf on subsequent evaluation). There might be a bug in any of these implementations, or the whole idea might just be invalid; I can't really tell right now. The concrete goal is a form of supervised entity linking, where the input is a target word in a context (using pretrained fastText vectors or even BERT embeddings), and the output is a point in the Poincaré embedding representing a structured ontology (which was pretrained using the gensim implementation).
I did find a paper (PDF) that tried to do this by reparameterizing the model, but I wasn't able to gauge from their description how to implement it. It does describe the problem neatly, though.
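For concreteness, one common way to reparameterize an output so that it always lies inside the unit ball is to predict a direction vector and a separate scalar, then combine them as sigmoid(scalar) times the unit direction. The NumPy sketch below illustrates that idea only; it is not necessarily the scheme used in the paper, and the names `reparameterize`, `direction`, and `norm_logit` are hypothetical. In a Keras model, the two inputs would presumably come from two Dense heads.

```python
import numpy as np

def reparameterize(direction, norm_logit, eps=1e-5):
    """Map an unconstrained (direction, scalar) pair to a point inside the unit ball."""
    direction = np.asarray(direction, dtype=np.float64)
    norm_logit = np.asarray(norm_logit, dtype=np.float64)
    dir_norm = np.linalg.norm(direction, axis=-1, keepdims=True)
    # Normalize to a unit direction; the guard avoids division by zero.
    unit = direction / np.maximum(dir_norm, eps)
    # Sigmoid squashes the scalar into (0, 1); (1 - eps) keeps a margin
    # from the boundary.
    radius = (1.0 - eps) / (1.0 + np.exp(-norm_logit))
    return unit * radius

# Arbitrary unconstrained values, as a network head might emit.
point = reparameterize(direction=[[30.0, -40.0]], norm_logit=[[2.0]])
```

Because the radius is always below 1 by construction, no clipping or projection is needed after this step, and the hyperbolic distance stays defined for any network output.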
