This word embedding framework is tailor-made for feature extraction from textual data related to finance, banking, and business.
A word embedding is a dense vector representation of a word in a low-dimensional (e.g. 100- or 200-dimensional) vector space. In
this space, words that are similar in meaning or usage are positioned closer to each other than dissimilar words. The measure of
closeness is cosine similarity.
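Cosine similarity can be sketched in a few lines of NumPy. The word vectors below are toy 3-dimensional examples (real embeddings from this framework would be 100- or 200-dimensional), chosen only to illustrate that related terms score closer to 1:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; 1.0 = identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only; not taken from the trained model.
bank = np.array([0.9, 0.1, 0.3])
loan = np.array([0.8, 0.2, 0.4])
apple = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(bank, loan))   # high: related financial terms
print(cosine_similarity(bank, apple))  # lower: unrelated terms
```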
Some of the prominent textbooks in the field of finance were used to form the corpus. Some of them are mentioned below:
- The Intelligent Investor
- Investment Banking
- Principles of Corporate Finance
- Introduction to Structured Finance
- Investment Leadership and Portfolio Management
The following preprocessing steps were applied to the corpus:
- Removal of named entities.
- Removal of non-English words.
- Lower-casing all words.
- Removal of stop-words.
- Removal of numbers and punctuation.
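A minimal sketch of the lower-casing, number/punctuation removal, and stop-word filtering steps is below. The stop-word list is a tiny illustrative stand-in (a real pipeline would use a fuller set, e.g. NLTK's), and the named-entity and non-English-word removal steps are omitted because they require an NER model and a dictionary lookup:

```python
import re
import string

# Illustrative stop-word list only; not the full set used by the framework.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

def preprocess(text):
    """Lower-case, strip numbers and punctuation, drop stop-words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess("The value of the bond fell 12% in 2008."))
```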
GloVe (Global Vectors) is a word vector technique. Its advantage is that, unlike Word2vec, GloVe does not rely only on local statistics (the local context of words) but also incorporates global statistics (a word co-occurrence matrix) to obtain word vectors. It efficiently leverages statistical information by training only on the non-zero elements of the co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
- Co-Occurrence Matrix: A co-occurrence matrix is a term-term matrix with the unique words of the corpus as both its rows and columns. Each entry records the number of times the row word appears in the same context as the column word.
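Building such a matrix can be sketched as a dictionary of symmetric pair counts (note that the reference GloVe implementation additionally weights each count by the inverse distance between the two words; this sketch uses plain counts for simplicity):

```python
from collections import defaultdict

def cooccurrence(tokens, window=3):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            context = tokens[j]
            counts[(word, context)] += 1.0  # symmetric: count both directions
            counts[(context, word)] += 1.0
    return counts

tokens = ["interest", "rate", "risk", "interest", "rate"]
counts = cooccurrence(tokens, window=2)
print(counts[("interest", "rate")])
```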
- Working: The word embeddings are obtained by minimising the GloVe loss function through gradient descent. Since the embeddings are a solution obtained through gradient descent, they will be slightly different each time we train with the same set of hyper-parameters, because each run converges to a slightly different sub-optimal solution.
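The quantity being minimised is the weighted least-squares GloVe loss, f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)² summed over non-zero co-occurrences. The sketch below applies one SGD step to a single pair; it is illustrative only and not the framework's actual training loop:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij): damps very frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_step(w_i, w_j, b_i, b_j, x_ij, lr=0.01):
    """One gradient-descent step on a single co-occurrence pair."""
    inner = w_i @ w_j + b_i + b_j - np.log(x_ij)
    f = weight(x_ij)
    loss = f * inner ** 2
    grad = 2.0 * f * inner
    # Update word vector, context vector, and both biases.
    w_i_new = w_i - lr * grad * w_j
    w_j_new = w_j - lr * grad * w_i
    return w_i_new, w_j_new, b_i - lr * grad, b_j - lr * grad, loss
```

Because the vectors are initialised randomly and updated stochastically, repeated runs land on different nearby minima, which is why the final embeddings vary slightly between trainings.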
For our use case, the optimal set of hyper-parameters turned out to be:
- Window size = 3
- Number of dimensions = 100
- Learning rate = 0.01
- Number of epochs = 10