This word embedding framework is tailor-made for feature extraction from textual data related to finance, banking, and business.
A word embedding is a dense vector representation of a word in a low-dimensional (e.g. 100- or 200-dimensional) vector space. In
this space, words that are similar in meaning or usage are positioned closer to each other than dissimilar words. The measure of
closeness is cosine similarity.
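Cosine similarity can be sketched in a few lines of NumPy. The word vectors below are toy 3-dimensional examples (real embeddings from this framework would be 100- or 200-dimensional), chosen only to illustrate that related terms score closer to 1:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; 1.0 = identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only; not taken from the trained model.
bank = np.array([0.9, 0.1, 0.3])
loan = np.array([0.8, 0.2, 0.4])
apple = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(bank, loan))   # high: related financial terms
print(cosine_similarity(bank, apple))  # lower: unrelated terms
```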
Some of the prominent textbooks in the field of finance were used to form the corpus. Some of them are mentioned below:
- The Intelligent Investor
- Investment Banking
- Principles of Corporate Finance
- Introduction to Structured Finance
- Investment Leadership and Portfolio Management
The following preprocessing steps were applied to the corpus:
- Removal of named entities.
- Removal of non-English words.
- Lower-casing all words.
- Removal of stop-words.
- Removal of numbers and punctuation.
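A minimal sketch of the lower-casing, number/punctuation removal, and stop-word filtering steps is below. The stop-word list is a tiny illustrative stand-in (a real pipeline would use a fuller set, e.g. NLTK's), and the named-entity and non-English-word removal steps are omitted because they require an NER model and a dictionary lookup:

```python
import re
import string

# Illustrative stop-word list only; not the full set used by the framework.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

def preprocess(text):
    """Lower-case, strip numbers and punctuation, drop stop-words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess("The value of the bond fell 12% in 2008."))
```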
GloVe (Global Vectors) is a word vector technique. Its advantage is that, unlike Word2vec, GloVe does not rely only on local statistics (the local context of words) but also incorporates global statistics (a word co-occurrence matrix) to obtain word vectors. It efficiently leverages statistical information by training only on the non-zero elements of the co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
- Co-Occurrence Matrix: A co-occurrence matrix is a term-term matrix with the unique words of the corpus as both its rows and columns. Each entry records the number of times the row word appears in the same context as the column word.
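Building such a matrix can be sketched as a dictionary of symmetric pair counts (note that the reference GloVe implementation additionally weights each count by the inverse distance between the two words; this sketch uses plain counts for simplicity):

```python
from collections import defaultdict

def cooccurrence(tokens, window=3):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            context = tokens[j]
            counts[(word, context)] += 1.0  # symmetric: count both directions
            counts[(context, word)] += 1.0
    return counts

tokens = ["interest", "rate", "risk", "interest", "rate"]
counts = cooccurrence(tokens, window=2)
print(counts[("interest", "rate")])
```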
- Working: The word embeddings are obtained by minimising the GloVe loss function through gradient descent. Since the embeddings are a solution obtained through gradient descent, they will be slightly different each time we train with the same set of hyper-parameters, because each run converges to a slightly different sub-optimal solution.
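The quantity being minimised is the weighted least-squares GloVe loss, f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)² summed over non-zero co-occurrences. The sketch below applies one SGD step to a single pair; it is illustrative only and not the framework's actual training loop:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij): damps very frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_step(w_i, w_j, b_i, b_j, x_ij, lr=0.01):
    """One gradient-descent step on a single co-occurrence pair."""
    inner = w_i @ w_j + b_i + b_j - np.log(x_ij)
    f = weight(x_ij)
    loss = f * inner ** 2
    grad = 2.0 * f * inner
    # Update word vector, context vector, and both biases.
    w_i_new = w_i - lr * grad * w_j
    w_j_new = w_j - lr * grad * w_i
    return w_i_new, w_j_new, b_i - lr * grad, b_j - lr * grad, loss
```

Because the vectors are initialised randomly and updated stochastically, repeated runs land on different nearby minima, which is why the final embeddings vary slightly between trainings.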
For our use case, the optimal set of hyper-parameters turned out to be:
- Window size = 3
- Number of dimensions = 100
- Learning rate = 0.01
- Number of epochs = 10