Finance-Word-Embedding-Framework

This word-embedding framework is tailor-made for feature extraction from textual data related to finance, banking, and business.


Word Embedding:

A word embedding is a dense vector representation of a word in a low-dimensional vector space (e.g., 100 or 200 dimensions). In this space, words that are similar in meaning or usage are positioned closer to each other than dissimilar words; closeness is measured by cosine similarity.
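As a small illustration of how closeness is measured, the sketch below computes the cosine similarity between embedding vectors with NumPy. The vectors and word names are made up purely to show the calculation; they are not taken from this framework.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings, for illustration only.
bond   = np.array([0.9, 0.1, 0.3, 0.0])
loan   = np.array([0.8, 0.2, 0.4, 0.1])
guitar = np.array([0.0, 0.9, 0.0, 0.7])

print(cosine_similarity(bond, loan))    # close to 1: similar usage
print(cosine_similarity(bond, guitar))  # close to 0: dissimilar
```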

Corpus:

The corpus was formed from prominent textbooks in the field of finance, including the following:

  • The Intelligent Investor
  • Investment Banking
  • Principles of Corporate Finance
  • Introduction to Structured Finance
  • Investment Leadership and Portfolio Management


Pre-Processing of the Text Corpus:

  • Removal of Named Entities.
  • Removal of non-English words.
  • Lower-casing all the words.
  • Removing stop-words.
  • Removing numbers and punctuation.
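A minimal sketch of a pipeline implementing the steps above is given below. It uses spaCy for named-entity removal and NLTK's English word list and stop-word list for the remaining filters; these libraries and the example sentence are assumptions, not necessarily the tools used in this repository.

```python
import nltk
import spacy
from nltk.corpus import stopwords, words

nltk.download("words", quiet=True)
nltk.download("stopwords", quiet=True)

nlp = spacy.load("en_core_web_sm")          # small English pipeline with NER
english_vocab = {w.lower() for w in words.words()}
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    doc = nlp(text)
    tokens = []
    for tok in doc:
        if tok.ent_type_:                   # drop named entities
            continue
        if not tok.is_alpha:                # drop numbers and punctuation
            continue
        word = tok.text.lower()             # lower-case
        if word in stop_words:              # drop stop-words
            continue
        if word not in english_vocab:       # drop non-English words
            continue
        tokens.append(word)
    return tokens

print(preprocess("Goldman Sachs reported $2.3bn in structured finance revenue."))
```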


GloVe Model:

GloVe (short for Global Vectors) is a word-vector technique. Its advantage is that, unlike Word2vec, it does not rely only on local statistics (the local context information of words) but also incorporates global statistics (a word co-occurrence matrix) to obtain word vectors. It leverages this statistical information efficiently by training only on the non-zero elements of the word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

  • Co-Occurrence Matrix: a term-term matrix whose rows and columns are indexed by the unique words of the vocabulary. Each entry records the number of times the row word appears in the same context as the corresponding column word.

  • Loss Function:  J = \sum_{i,j=1}^{V} f(x_{i,j})\left[w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(x_{i,j})\right]^2

    • w_i = word embedding of the i-th word (row of the co-occurrence matrix, the focus word)

    • \tilde{w}_j = word embedding of the j-th word (column of the co-occurrence matrix, the context word)

    • b_i = bias term for the i-th focus word

    • \tilde{b}_j = bias term for the j-th context word

    • x_{i,j} = (i,j)-th element of the co-occurrence matrix

    • f(x_{i,j}) = \min\left(1, \left(x_{i,j} / \max_{i,j} x_{i,j}\right)^{3/4}\right), a.k.a. the weighting function

    • V = size of the vocabulary, i.e., the number of unique words in the corpus

  • Working: The word embeddings are obtained by minimising the loss function through gradient descent. Since the embeddings are a solution reached by gradient descent, they will be slightly different each time they are trained, even for the same set of hyper-parameters, because each run converges to a slightly different, sub-optimal solution. A toy sketch of this procedure is given after this list.
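The following is a toy NumPy sketch of the ideas above (co-occurrence counting, the weighting function, and gradient descent on the loss) run on a made-up two-sentence corpus. It is illustrative only and not the implementation used in this repository; the corpus, dimensionality, and variable names are assumptions.

```python
import numpy as np

# Hypothetical, already pre-processed corpus (not the repository's data).
corpus = [["debt", "equity", "leverage", "debt"],
          ["equity", "valuation", "debt", "leverage"]]
window = 3

# Build the co-occurrence matrix X: X[i, j] counts how often word j
# appears within `window` tokens of word i.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
X = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1.0

def f(x, x_max):
    """Weighting function f(x) = min(1, (x / x_max)**(3/4))."""
    return min(1.0, (x / x_max) ** 0.75)

# Focus vectors W, context vectors W_tilde, and the two bias terms.
dim, lr, epochs = 10, 0.01, 10
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, dim))
W_tilde = rng.normal(scale=0.1, size=(V, dim))
b = np.zeros(V)
b_tilde = np.zeros(V)

x_max = X.max()
for epoch in range(epochs):
    loss = 0.0
    # Train only on the non-zero elements of the co-occurrence matrix.
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        weight = f(X[i, j], x_max)
        loss += weight * diff ** 2
        grad = 2.0 * weight * diff
        wi, wj = W[i].copy(), W_tilde[j].copy()
        W[i] -= lr * grad * wj          # gradient descent updates
        W_tilde[j] -= lr * grad * wi
        b[i] -= lr * grad
        b_tilde[j] -= lr * grad
    print(f"epoch {epoch}: loss = {loss:.4f}")

# A common choice is to use W + W_tilde as the final embeddings.
embeddings = W + W_tilde
```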



Optimal set of Hyper-parameters for the GloVe model:

For our use-case, the optimal set of hyper-parameters turned out to be as follows:

  • Window = 3
  • Number of dimensions = 100
  • Learning rate = 0.01
  • Number of epochs = 10
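One possible way to train a model with these hyper-parameters is sketched below using the third-party glove-python package. This is an assumed workflow for illustration, not necessarily the exact code or library used in this repository, and the tiny corpus is made up.

```python
from glove import Corpus, Glove   # pip install glove_python (assumed library)

# `sentences` is the pre-processed corpus as a list of token lists
# (see the pre-processing sketch above).
sentences = [["debt", "equity", "leverage"],
             ["equity", "valuation", "debt"]]

corpus = Corpus()
corpus.fit(sentences, window=3)                       # Window = 3

glove = Glove(no_components=100, learning_rate=0.01)  # 100 dimensions, lr = 0.01
glove.fit(corpus.matrix, epochs=10, no_threads=2)     # 10 epochs
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("equity", number=3))
```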
