Generating Bags of Words from the Sums of their Word Embeddings

This page contains supplementary material, code and data used in White et. al. 2016, Generating Bags of Words from the Sums of their Word Embeddings.

I’m very happy for anyone with any interest in this (or any of my other work) to get in touch with me.

Use of this data or code on this page of these should cite the paper:

	@Inproceedings{White2016BOWgen,
  Title                    = {Generating Bags of Words from the Sums of their Word Embeddings},
  Author                   = {White, Lyndon and Togneri,Roberto and Liu,Wei and Bennamoun, Mohammed},
  Booktitle                = {17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing)},
  Year                     = {2016},

  Abstract                 = {Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However,  very few methods demonstrate the ability to reverse the process -- recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from the SOWE representations back to the bag of words (BOW) for the original sentences. This is a part way step towards recovering the whole sentence and has useful theoretical and practical applications of its own. This is done using a greedy algorithm to convert the vector to a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words from a large corpus based on its sentence embeddings.

As well as practical applications for allowing classical information retrieval methods to be combined with more recent methods using the sums of word embeddings, the success of this method has theoretical implications on the degree of information maintained by the sum of embeddings representation. This lends some credence to the consideration of the SOWE as a dimensionality reduced, and meaning enhanced, data manifold for the bag of words.  },
}

Proof of NP-Hardness of Vector Selection Problem

The vector selection problem is NP-Hard. It is possible to reduce from any given instance of a subset sum problem to a vector selection problem. The subset sum problem is NP-complete (Karp, 1972). It is defined: for some set of integers ($\mathcal{S}\subset\mathbb{Z}$), does there exist a subset ($\mathcal{L}\subseteq\mathcal{S}$) which sums to zero ($0=\sum_{l_i\in \mathcal{L}} l_i$). A suitable metric, target vector and vocabulary of vectors corresponding to the elements $\mathcal{S}$ can be defined by a bijection; such that solving the vector selection problem will give the subset of vectors corresponding to a subset of $\mathcal{S}$ with the smallest sum; which if zero indicates that the subset sum does exists, and if nonzero indicates that no such subset ($\mathcal{L}$) exists. A fully detailed proof of the reduction from subset sum to the vector selection problem can be found by following the link below

Full Proof

The proof is a little fiddly, and complex, but does not use any higher maths beyond basic linear algebra.

Core Code Only

For those who which to understand or implement the BOW generation algorithm. This is written in the Julia programming language. Nominally v0.5.

core code

Data and Code

The below is complete package to reproduce the results. It includes all inputs, all part stage outputs, full code for preprocessing, processing and analysing the results and even the (Julia programming language)[http://julialang.org/]. Note that this includes input data, and various libraries, all of which are under their own licenses (included). Our own contributions are licensed under the MIT license.

Data and Code

Generating Bags of Words from the Sums of their Word Embeddings

Supplementary Material

Proof of NP-Hardness of Vector Selection Problem

Core Code Only

Data and Code