Word vectors #132

Merged
merged 18 commits into from
Mar 18, 2024
18 commits
6db69fb
feat(wink-nlp): add word vectors parameter in winkNLP
sanjayaksaxena Sep 27, 2023
f9da534
test(*): update test model to web model
sanjayaksaxena Oct 1, 2023
c511921
feat(*): add as.vector and accordingly update the rest
sanjayaksaxena Oct 1, 2023
fd6f740
feat(*): add test-vectors.json in test model and update as & wink-nlp…
sanjayaksaxena Oct 4, 2023
cbc902a
refactor(as): extract vectors right at the beginning
sanjayaksaxena Oct 4, 2023
12bbfb9
feat(*): add vectorOf method under winkNLP using as.vector
sanjayaksaxena Oct 4, 2023
b50ef29
test(wink-nlp-specs): add test cases for vectorOf method
sanjayaksaxena Oct 4, 2023
661a09d
feat(*): add l2 norm during as.vector computation
sanjayaksaxena Oct 9, 2023
929d683
refactor: use gloVe vectors instead on enwiki
sanjayaksaxena Oct 10, 2023
475887b
feat(test-vectors): update enhanced format
sanjayaksaxena Feb 16, 2024
f36d87c
build(*): migrate from travis to github actions
sanjayaksaxena Feb 16, 2024
08bd979
docs(README): change build badge to point to github actions
sanjayaksaxena Feb 16, 2024
d5dfc21
test: use test-vectors that include unk vector definition
sanjayaksaxena Feb 18, 2024
29a35b1
feat(*): drop usage of as helper from vectorOf and test
sanjayaksaxena Feb 19, 2024
e9817f9
feat(*): add method to extract contextual word vectors from doc
sanjayaksaxena Mar 17, 2024
754ec35
feat(doc-v2): add error handling
sanjayaksaxena Mar 17, 2024
97c2766
test: add test cases for contextual vectors
sanjayaksaxena Mar 17, 2024
eed410c
test(contextual-vectors-specs): complete test cases
sanjayaksaxena Mar 18, 2024
30 changes: 30 additions & 0 deletions .github/workflows/coveralls.yml
@@ -0,0 +1,30 @@
on: ["push", "pull_request"]

name: Coveralls

jobs:

build:
name: Build
runs-on: ubuntu-latest
steps:

- uses: actions/checkout@v1

- name: Use Node.js 18.x
uses: actions/setup-node@v3
with:
node-version: 18.x

- name: npm install
run: |
npm install
npm run pretest
npm run test

- name: Coveralls
uses: coverallsapp/github-action@v2
with:
format: lcov
debug: false
allow-empty: false
32 changes: 32 additions & 0 deletions .github/workflows/node.js.yml
@@ -0,0 +1,32 @@
# This workflow will do a clean installation of node dependencies, cache/restore them, build the source code and run tests across different versions of node
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-nodejs

name: Node.js CI

on:
push:
branches: [ "word-vectors" ]
pull_request:
branches: [ "word-vectors" ]

jobs:
build:

runs-on: ubuntu-latest

strategy:
matrix:
node-version: [18.x]
# See supported Node.js release schedule at https://nodejs.org/en/about/releases/

steps:
- uses: actions/checkout@v3
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
- run: npm ci
- run: npm run build --if-present
- run: npm run pretest
- run: npm run test
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# winkNLP

### [![Build Status](https://travis-ci.com/winkjs/wink-nlp.svg?branch=master)](https://travis-ci.com/github/winkjs/wink-nlp) [![Coverage Status](https://coveralls.io/repos/github/winkjs/wink-nlp/badge.svg?branch=master)](https://coveralls.io/github/winkjs/wink-nlp?branch=master) [![Known Vulnerabilities](https://snyk.io/test/github/winkjs/wink-nlp/badge.svg)](https://snyk.io/test/github/winkjs/wink-nlp) [![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6035/badge)](https://bestpractices.coreinfrastructure.org/projects/6035) [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/winkjs/Lobby) [![Follow on Twitter](https://img.shields.io/twitter/follow/winkjs_org?style=social)](https://twitter.com/winkjs_org)
### [![Build Status](https://github.com/winkjs/wink-nlp/actions/workflows/node.js.yml/badge.svg)](https://github.com/winkjs/wink-nlp/actions/workflows/node.js.yml/) [![Coverage Status](https://coveralls.io/repos/github/winkjs/wink-nlp/badge.svg?branch=master)](https://coveralls.io/github/winkjs/wink-nlp?branch=master) [![Known Vulnerabilities](https://snyk.io/test/github/winkjs/wink-nlp/badge.svg)](https://snyk.io/test/github/winkjs/wink-nlp) [![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6035/badge)](https://bestpractices.coreinfrastructure.org/projects/6035) [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/winkjs/Lobby) [![Follow on Twitter](https://img.shields.io/twitter/follow/winkjs_org?style=social)](https://twitter.com/winkjs_org)

## Developer friendly Natural Language Processing ✨
[<img align="right" src="https://decisively.github.io/wink-logos/logo-title.png" width="100px" >](https://winkjs.org/)
2 changes: 1 addition & 1 deletion package.json
@@ -27,7 +27,7 @@
"main": "src/wink-nlp.js",
"scripts": {
"pretest": "npm run lint",
"test": "nyc --reporter=html --reporter=text mocha ./test/",
"test": "nyc --reporter=html --reporter=lcov --reporter=text mocha ./test/",
"coverage": "nyc report --reporter=text-lcov | coveralls",
"sourcedocs": "docker -i src -o ./sourcedocs --sidebar yes",
"lint": "eslint ./src/*.js ./test/*.js",
6 changes: 4 additions & 2 deletions src/allowed.js
@@ -65,7 +65,8 @@ allowed.as4tokens = new Set( [
as.freqTable,
as.bigrams,
as.unique,
as.markedUpText
as.markedUpText,
as.vector
] );

// NOTE: it should exclude `as.markedUpText`, whenever this is included in the above.
@@ -76,7 +77,8 @@ allowed.as4selTokens = new Set( [
as.bow,
as.freqTable,
as.bigrams,
as.unique
as.unique,
as.vector
] );

allowed.its4entity = new Set( [
12 changes: 6 additions & 6 deletions src/api/col-tokens-out.js
@@ -53,15 +53,15 @@ var psMask = constants.psMask;
* @private
*/
var colTokensOut = function ( start, end, rdd, itsf, asf, addons ) {
// Vectors require completely different handling.
if ( itsf === its.vector ) {
return its.vector( start, end, rdd.tokens, addons );
}

// Not a vector request, perform map-reduce.
var mappedTkns = [];
var itsfn = ( itsf && allowed.its4tokens.has( itsf ) ) ? itsf : its.value;
var asfn = ( asf && allowed.as4tokens.has( asf ) ) ? asf : as.array;

if ( itsfn !== its.value && itsfn !== its.normal && itsfn !== its.lemma && asfn === as.vector ) {
throw Error( 'winkNLP: as.vector is allowed only with its value or normal or lemma.' );
}

// Note, `as.text/markedUpText` needs special attention to include preceding spaces.
if ( asfn === as.text || asfn === as.markedUpText ) {
for ( let i = start; i <= end; i += 1 ) {
@@ -73,7 +73,7 @@ var colTokensOut = function ( start, end, rdd, itsf, asf, addons ) {
}
}

return asfn( mappedTkns, rdd.markings, start, end );
return asfn( mappedTkns, rdd, start, end );
}; // colTokensOut()

module.exports = colTokensOut;
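The guard added in `colTokensOut` above rejects `as.vector` whenever the `its` helper is anything other than `its.value`, `its.normal`, or `its.lemma` (only those yield the strings needed for vector lookup). A minimal standalone sketch of that check, using stand-in `its`/`as` objects for illustration (not wink-nlp's real helpers):

```javascript
// Stand-ins for wink-nlp's its/as helpers — illustrative only.
const its = { value: () => {}, normal: () => {}, lemma: () => {}, pos: () => {} };
const as = { array: () => {}, vector: () => {} };

// Mirrors the combination check added in colTokensOut/selTokensOut.
function validate( itsfn, asfn ) {
  if ( itsfn !== its.value && itsfn !== its.normal && itsfn !== its.lemma && asfn === as.vector ) {
    throw Error( 'winkNLP: as.vector is allowed only with its value or normal or lemma.' );
  }
}

validate( its.normal, as.vector );  // allowed: normal produces strings
validate( its.pos, as.array );      // allowed: as.array accepts anything
// validate( its.pos, as.vector );  // would throw
```

The check runs after the `itsf`/`asf` arguments have been defaulted, so an unknown `its` helper silently falls back to `its.value` rather than triggering the error.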
4 changes: 0 additions & 4 deletions src/api/itm-document-out.js
@@ -47,10 +47,6 @@ var colTokensOut = require( './col-tokens-out.js' );
*/
var itmDocumentOut = function ( rdd, itsf, addons ) {
var document = rdd.document;
// Vectors require completely different handling.
if ( itsf === its.vector ) {
return its.vector( document, rdd, addons );
}

var itsfn = ( itsf && allowed.its4document.has( itsf ) ) ? itsf : its.value;

4 changes: 0 additions & 4 deletions src/api/itm-sentence-out.js
@@ -48,10 +48,6 @@ var colTokensOut = require( './col-tokens-out.js' );
*/
var itmSentenceOut = function ( index, rdd, itsf, addons ) {
var sentence = rdd.sentences[ index ];
// Vectors require completely different handling.
if ( itsf === its.vector ) {
return its.vector( sentence, rdd, addons );
}

var itsfn = ( itsf && allowed.its4sentence.has( itsf ) ) ? itsf : its.value;

4 changes: 0 additions & 4 deletions src/api/itm-token-out.js
@@ -45,10 +45,6 @@ var allowed = require( '../allowed.js' );
* @private
*/
var itmTokenOut = function ( index, rdd, itsf, addons ) {
// Vectors require completely different handling.
if ( itsf === its.vector ) {
return its.vector( index, rdd, addons );
}
// Not a vector request, map using `itsf`.
var f = ( allowed.its4token.has( itsf ) ) ? itsf : its.value;
return f( index, rdd.tokens, rdd.cache, addons );
11 changes: 5 additions & 6 deletions src/api/sel-tokens-out.js
@@ -52,16 +52,15 @@ var psMask = constants.psMask;
* @private
*/
var selTokensOut = function ( selTokens, rdd, itsf, asf, addons ) {
// Vectors require completely different handling.
if ( itsf === its.vector ) {
return its.vector( selTokens, rdd.tokens, addons );
}

// Not a vector request, perform map-reduce.
var mappedTkns = [];
var itsfn = ( itsf && allowed.its4selTokens.has( itsf ) ) ? itsf : its.value;
var asfn = ( asf && allowed.as4selTokens.has( asf ) ) ? asf : as.array;

if ( itsfn !== its.value && itsfn !== its.normal && itsfn !== its.lemma && asfn === as.vector ) {
throw Error( 'winkNLP: as.vector is allowed only with its value or normal or lemma.' );
}

// Note, `as.text` needs special attention to include preceding spaces.
// No `markedUpText` allowed here.
if ( asfn === as.text ) {
@@ -74,7 +73,7 @@ var selTokensOut = function ( selTokens, rdd, itsf, asf, addons ) {
}
}

return asfn( mappedTkns );
return asfn( mappedTkns, rdd );
}; // selTokensOut()

module.exports = selTokensOut;
48 changes: 46 additions & 2 deletions src/as.js
@@ -152,13 +152,15 @@ as.text = function ( twps ) {
* `twps` and `markings`.
*
* @param {array} twps Array containing tokens with preceding spaces.
* @param {array} markings Array containing span of markings & marking specs.
* @param {object} rdd Raw Document Data structure.
* @param {number} start The start index of the tokens.
* @param {number} end The end index of the tokens.
* @return {string} the markedup text.
* @private
*/
as.markedUpText = function ( twps, markings, start, end ) {
as.markedUpText = function ( twps, rdd, start, end ) {
// Extract markings.
const markings = rdd.markings;
// Offset to be added while computing `first` and `last` indexes of `twps`.
var offset = start * 2;
// Compute the `range` of `markings` to consider on the basis `start` and `end`.
@@ -183,4 +185,46 @@
return twps.join( '' ).trim();
}; // markedUpText()

as.vector = function ( tokens, rdd ) {
// Get size of a vector from word vectors
const size = rdd.wordVectors.dimensions;
const precision = rdd.wordVectors.precision;
const vectors = rdd.wordVectors.vectors;
// Set up a new initialized vector of `size`
const v = new Array( size );
v.fill( 0 );
// Compute average.
// We will count the number of tokens as some of them may not have a vector.
let numOfTokens = 0;
for ( let i = 0; i < tokens.length; i += 1 ) {
// Extract token vector for the current token.
const tv = vectors[ tokens[ i ].toLowerCase() ];
// Increment `numOfTokens` if the above operation was successful.
if ( tv !== undefined ) numOfTokens += 1;
for ( let j = 0; j < size; j += 1 ) {
// Keep summing; eventually it will be divided by `numOfTokens` to obtain the average.
v[ j ] += ( tv === undefined ) ? 0 : tv[ j ];
}
}

// If no token's vector is found, return a 0-vector!
if ( numOfTokens === 0 ) {
// Push l2Norm, which is 0 in this case.
v.push( 0 );
return v;
}

// Non-0 vector, find average by dividing the sum by numOfTokens
// also compute l2Norm.
let l2Norm = 0;
for ( let i = 0; i < size; i += 1 ) {
v[ i ] = +( v[ i ] / numOfTokens ).toFixed( precision );
l2Norm += v[ i ] * v[ i ];
}
// `l2Norm` is appended as the last element for faster cosine similarity/normalization.
v.push( +( Math.sqrt( l2Norm ).toFixed( precision ) ) );

return v;
}; // vector()

module.exports = as;
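The averaging scheme that `as.vector` implements above can be sketched standalone: token vectors are summed, averaged over the tokens actually found in the table, rounded to `precision`, and the l2 norm is appended as the final element so that cosine similarity can reuse it without recomputation. The tiny 2-dimensional `wordVectors` table below is made up for illustration; the real model ships pre-trained GloVe vectors.

```javascript
// Made-up 2-dimensional word-vector table, for illustration only.
const wordVectors = {
  dimensions: 2,
  precision: 2,
  vectors: {
    cat: [ 1, 0 ],
    dog: [ 0, 1 ]
  }
};

// Average the vectors of `tokens`; append the l2 norm as the last element.
function avgVector( tokens, { dimensions, precision, vectors } ) {
  const v = new Array( dimensions ).fill( 0 );
  let numOfTokens = 0;
  for ( const t of tokens ) {
    const tv = vectors[ t.toLowerCase() ];
    if ( tv === undefined ) continue;
    numOfTokens += 1;
    for ( let j = 0; j < dimensions; j += 1 ) v[ j ] += tv[ j ];
  }
  // No token found: return a 0-vector with a 0 norm appended.
  if ( numOfTokens === 0 ) { v.push( 0 ); return v; }
  let l2Norm = 0;
  for ( let i = 0; i < dimensions; i += 1 ) {
    v[ i ] = +( v[ i ] / numOfTokens ).toFixed( precision );
    l2Norm += v[ i ] * v[ i ];
  }
  v.push( +Math.sqrt( l2Norm ).toFixed( precision ) );
  return v;
}

// Cosine similarity that reuses the appended norms (last element of each vector).
function cosine( a, b ) {
  const size = a.length - 1;
  let dot = 0;
  for ( let i = 0; i < size; i += 1 ) dot += a[ i ] * b[ i ];
  const denom = a[ size ] * b[ size ];
  return denom === 0 ? 0 : dot / denom;
}

console.log( avgVector( [ 'cat', 'dog' ], wordVectors ) ); // [ 0.5, 0.5, 0.71 ]
```

Storing the norm alongside the vector is the design choice the final comment in the diff alludes to: callers comparing many vectors pay the `Math.sqrt` cost once per vector rather than once per comparison.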