Giant TypeScript file being called Julia #38

TylerLeonhardt · 2021-07-28T01:08:40Z

So I was investigating microsoft/vscode#129597

And I noticed that the issue was happening because the file is absolutely massive. That might be a tfjs issue (cc @pyu10055)

What's interesting is that I was able to grab the first 125000 characters (any more and it throws that ^) and run them through the model, and it thought with 98% confidence that the file is Julia and not TypeScript:

[
  { languageId: 'jl', confidence: 0.9822742342948914 },
  { languageId: 'scala', confidence: 0.016035545617341995 },
  { languageId: 'hs', confidence: 0.0016901468625292182 },
  { languageId: 'pas', confidence: 1.3825314226778573e-7 },
  { languageId: 'cpp', confidence: 3.57413348917035e-10 },
  { languageId: 'ml', confidence: 1.8221162426113047e-12 },
  { languageId: 'js', confidence: 8.594037618049957e-14 },
  { languageId: 'ts', confidence: 4.65358850900658e-14 },
  { languageId: 'vba', confidence: 7.618220256835808e-15 },
  { languageId: 'go', confidence: 5.723856057773509e-15 },
  { languageId: 'groovy', confidence: 3.010156677073889e-15 },
  { languageId: 'dart', confidence: 1.0165829367361526e-16 },
  { languageId: 'c', confidence: 3.543757880335173e-17 },
  { languageId: 'cs', confidence: 8.10666447931774e-18 },
  { languageId: 'swift', confidence: 3.1402044595342857e-18 },
  { languageId: 'mm', confidence: 7.613760166801732e-19 },
  { languageId: 'ps1', confidence: 5.315139498838977e-19 },
  { languageId: 'pm', confidence: 7.787649396626034e-21 },
  { languageId: 'md', confidence: 1.0253120787115895e-21 },
  { languageId: 'html', confidence: 3.355766328067147e-23 },
  { languageId: 'py', confidence: 1.167890584151069e-23 },
  { languageId: 'xml', confidence: 5.1753314350333106e-24 },
  { languageId: 'v', confidence: 2.9903643691862024e-25 },
  { languageId: 'ini', confidence: 5.667593976570901e-26 },
  { languageId: 'dm', confidence: 1.954242193090497e-26 },
  { languageId: 'sql', confidence: 5.364698274948056e-27 },
  { languageId: 'f90', confidence: 5.033385168846986e-27 },
  { languageId: 'php', confidence: 3.280049747541946e-27 },
  { languageId: 'lua', confidence: 3.0276543172140653e-27 },
  { languageId: 'coffee', confidence: 5.953645533418734e-28 },
  { languageId: 'java', confidence: 4.882338335168584e-28 },
  { languageId: 'r', confidence: 1.2599295269966544e-28 },
  { languageId: 'rb', confidence: 5.683628108915077e-29 },
  { languageId: 'erl', confidence: 2.1686813768150945e-29 },
  { languageId: 'tex', confidence: 2.5856190667805688e-30 },
  { languageId: 'prolog', confidence: 1.3216966443768018e-33 },
  { languageId: 'rs', confidence: 1.0031414405593044e-33 },
  { languageId: 'asm', confidence: 8.360548368235013e-34 },
  { languageId: 'matlab', confidence: 1.6285131136511483e-34 },
  { languageId: 'csv', confidence: 5.227884277503055e-35 },
  { languageId: 'sh', confidence: 8.97545819514535e-39 },
  { languageId: 'yaml', confidence: 1.8427074805871344e-42 },
  { languageId: 'ex', confidence: 7.833258415575727e-43 },
  { languageId: 'bat', confidence: 6.445972935894159e-44 },
  { languageId: 'kt', confidence: 2.6624670822171524e-44 },
  { languageId: 'clj', confidence: 0 },
  { languageId: 'cmake', confidence: 0 },
  { languageId: 'cbl', confidence: 0 },
  { languageId: 'css', confidence: 0 },
  { languageId: 'dockerfile', confidence: 0 },
  { languageId: 'json', confidence: 0 },
  { languageId: 'lisp', confidence: 0 },
  { languageId: 'makefile', confidence: 0 },
  { languageId: 'toml', confidence: 0 }
]

That seems very odd to me... Surely TypeScript would be in the top couple languages... this makes me think that it could be a bug in the model, but I'll leave that up to @yoeo to decide.

TylerLeonhardt · 2021-07-28T01:59:08Z

Ahh... I guess that with the first 100000 characters together the guess is Julia... if I take a random chunk of 100000 it does yield TS... maybe I should chunk the file, run each chunk through the model, and then average out the results?

The start of the string seems to impact the result quite a bit, which might be interesting for you.
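
The chunk-and-average idea could be sketched roughly as follows. This is only an illustration under assumptions: `classify` is a hypothetical stand-in for whatever function runs the model, and `chunkText` and `averageOverChunks` are made-up names, not part of any library's API. The chunk size is left as a parameter because the thread does not settle on one.

```typescript
// Shape of one model result, matching the output printed earlier in the thread.
interface ModelResult {
  languageId: string;
  confidence: number;
}

// Hypothetical classifier signature (an assumption, not a real API).
type Classifier = (text: string) => ModelResult[];

// Split text into consecutive chunks of at most chunkSize characters.
function chunkText(text: string, chunkSize: number): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Run the classifier on every chunk and average the confidence per language.
// Languages missing from a chunk's results count as 0 for that chunk.
function averageOverChunks(
  text: string,
  classify: Classifier,
  chunkSize: number
): ModelResult[] {
  const chunks = chunkText(text, chunkSize);
  if (chunks.length === 0) {
    return [];
  }
  const totals: { [languageId: string]: number } = {};
  for (const chunk of chunks) {
    for (const result of classify(chunk)) {
      totals[result.languageId] = (totals[result.languageId] || 0) + result.confidence;
    }
  }
  return Object.keys(totals)
    .map((languageId) => ({
      languageId,
      confidence: totals[languageId] / chunks.length,
    }))
    .sort((a, b) => b.confidence - a.confidence);
}
```

A plain mean treats every chunk equally, so one misleading chunk at the start of the file has less influence on the final guess. Whether this actually improves accuracy would have to be measured.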

yoeo · 2021-07-29T19:47:34Z

The model actually only reads the first 10k characters, for performance reasons:

guesslang/guesslang/model.py, line 28 in f4ceb1d:

NB_TOKENS = 10000

I guess that the repeated use of bitwise operators and the relative lack of semicolons at the start of the file confused the model.
In fact, less than 1% of the 27k TypeScript files that I randomly picked to train the model use the << operator.

However, I don't quite understand why the model picked Julia with such high confidence.

if I do a random chunk of 100000 it does yield TS.

That's a great idea. I'll try that on the Python version too and see if it can improve the model's overall accuracy.

TylerLeonhardt · 2021-08-09T16:55:54Z

The model actually only reads the first 10k characters, for performance reasons:

Is this during training or inference? If only the first 10k is getting used, then I should probably only make strings with 10k characters :)

yoeo · 2021-08-10T16:24:50Z

Is this on training or inference?

It reads the first 10k chars for both training & inference.

then I should probably only make strings with 10k characters

Absolutely.
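
On the caller's side, trimming the input before it reaches the model might look like the sketch below. The names `MAX_MODEL_INPUT` and `prepareModelInput` are hypothetical, and the limit simply mirrors the NB_TOKENS value quoted above. Note that a JavaScript string's length counts UTF-16 code units, which may not match exactly how the Python side counts characters.

```typescript
// Assumed limit, mirroring NB_TOKENS = 10000 from guesslang's model.py.
const MAX_MODEL_INPUT = 10000;

// Hypothetical helper: keep only the leading part of the content that the
// model would read anyway, so very large files are never passed in whole.
function prepareModelInput(content: string, limit: number = MAX_MODEL_INPUT): string {
  return content.length > limit ? content.slice(0, limit) : content;
}
```

Content shorter than the limit is returned unchanged, so small files are unaffected.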

yoeo added the prediction improvements label Aug 5, 2021
