Giant TypeScript file being called Julia #38

Open
TylerLeonhardt opened this issue Jul 28, 2021 · 4 comments

Comments

@TylerLeonhardt

So I was investigating microsoft/vscode#129597

I noticed that the issue was happening because the file is absolutely massive. That might be a tfjs issue (cc @pyu10055).

What's interesting is that I was able to grab the first 125000 characters (any more and it throws the error from that issue), run them through the model, and it concluded with 98% confidence that the file is Julia and not TypeScript:

[
  { languageId: 'jl', confidence: 0.9822742342948914 },
  { languageId: 'scala', confidence: 0.016035545617341995 },
  { languageId: 'hs', confidence: 0.0016901468625292182 },
  { languageId: 'pas', confidence: 1.3825314226778573e-7 },
  { languageId: 'cpp', confidence: 3.57413348917035e-10 },
  { languageId: 'ml', confidence: 1.8221162426113047e-12 },
  { languageId: 'js', confidence: 8.594037618049957e-14 },
  { languageId: 'ts', confidence: 4.65358850900658e-14 },
  { languageId: 'vba', confidence: 7.618220256835808e-15 },
  { languageId: 'go', confidence: 5.723856057773509e-15 },
  { languageId: 'groovy', confidence: 3.010156677073889e-15 },
  { languageId: 'dart', confidence: 1.0165829367361526e-16 },
  { languageId: 'c', confidence: 3.543757880335173e-17 },
  { languageId: 'cs', confidence: 8.10666447931774e-18 },
  { languageId: 'swift', confidence: 3.1402044595342857e-18 },
  { languageId: 'mm', confidence: 7.613760166801732e-19 },
  { languageId: 'ps1', confidence: 5.315139498838977e-19 },
  { languageId: 'pm', confidence: 7.787649396626034e-21 },
  { languageId: 'md', confidence: 1.0253120787115895e-21 },
  { languageId: 'html', confidence: 3.355766328067147e-23 },
  { languageId: 'py', confidence: 1.167890584151069e-23 },
  { languageId: 'xml', confidence: 5.1753314350333106e-24 },
  { languageId: 'v', confidence: 2.9903643691862024e-25 },
  { languageId: 'ini', confidence: 5.667593976570901e-26 },
  { languageId: 'dm', confidence: 1.954242193090497e-26 },
  { languageId: 'sql', confidence: 5.364698274948056e-27 },
  { languageId: 'f90', confidence: 5.033385168846986e-27 },
  { languageId: 'php', confidence: 3.280049747541946e-27 },
  { languageId: 'lua', confidence: 3.0276543172140653e-27 },
  { languageId: 'coffee', confidence: 5.953645533418734e-28 },
  { languageId: 'java', confidence: 4.882338335168584e-28 },
  { languageId: 'r', confidence: 1.2599295269966544e-28 },
  { languageId: 'rb', confidence: 5.683628108915077e-29 },
  { languageId: 'erl', confidence: 2.1686813768150945e-29 },
  { languageId: 'tex', confidence: 2.5856190667805688e-30 },
  { languageId: 'prolog', confidence: 1.3216966443768018e-33 },
  { languageId: 'rs', confidence: 1.0031414405593044e-33 },
  { languageId: 'asm', confidence: 8.360548368235013e-34 },
  { languageId: 'matlab', confidence: 1.6285131136511483e-34 },
  { languageId: 'csv', confidence: 5.227884277503055e-35 },
  { languageId: 'sh', confidence: 8.97545819514535e-39 },
  { languageId: 'yaml', confidence: 1.8427074805871344e-42 },
  { languageId: 'ex', confidence: 7.833258415575727e-43 },
  { languageId: 'bat', confidence: 6.445972935894159e-44 },
  { languageId: 'kt', confidence: 2.6624670822171524e-44 },
  { languageId: 'clj', confidence: 0 },
  { languageId: 'cmake', confidence: 0 },
  { languageId: 'cbl', confidence: 0 },
  { languageId: 'css', confidence: 0 },
  { languageId: 'dockerfile', confidence: 0 },
  { languageId: 'json', confidence: 0 },
  { languageId: 'lisp', confidence: 0 },
  { languageId: 'makefile', confidence: 0 },
  { languageId: 'toml', confidence: 0 }
]

That seems very odd to me. Surely TypeScript should be among the top few languages. This makes me think that it could be a bug in the model, but I'll leave that up to @yoeo to decide.

@TylerLeonhardt
Author

Ahh... I guess that for the first 100000 characters taken together, the guess is Julia, but if I do a random chunk of 100000 it does yield TS. Maybe I should split the file into chunks, run the model on each one, and then average the results?

The start of the string seems to impact the result quite a bit which might be interesting for you.
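
A minimal sketch of the chunk-and-average idea, assuming a hypothetical `runModel` callback that returns results in the same `{ languageId, confidence }` shape as the output above. The names `ModelResult`, `RunModel`, `chunkString`, and `averageOverChunks` are made up for illustration and are not part of the library.

```typescript
// Shape of one entry in the model output shown above.
interface ModelResult {
  languageId: string;
  confidence: number;
}

// Hypothetical stand-in for whatever function actually runs the model on a string.
type RunModel = (content: string) => ModelResult[];

// Split the content into consecutive chunks of at most `chunkSize` characters.
function chunkString(content: string, chunkSize: number): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  const chunks: string[] = [];
  for (let start = 0; start < content.length; start += chunkSize) {
    chunks.push(content.slice(start, start + chunkSize));
  }
  return chunks;
}

// Run the model on every chunk and average the confidence per language.
// A language missing from a chunk's results counts as 0 for that chunk.
function averageOverChunks(
  content: string,
  chunkSize: number,
  runModel: RunModel
): ModelResult[] {
  const chunks = chunkString(content, chunkSize);
  const totals: Record<string, number> = {};
  for (const chunk of chunks) {
    for (const result of runModel(chunk)) {
      totals[result.languageId] = (totals[result.languageId] ?? 0) + result.confidence;
    }
  }
  const averaged: ModelResult[] = Object.keys(totals).map((languageId) => ({
    languageId,
    confidence: (totals[languageId] ?? 0) / chunks.length,
  }));
  // Highest average confidence first.
  return averaged.sort((a, b) => b.confidence - a.confidence);
}
```

Averaging gives every chunk the same weight, so a misleading start of the file would only count as one vote among many. Whether that actually improves the guess for this file is untested here.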

@yoeo
Owner

yoeo commented Jul 29, 2021

The model actually only reads the first 10k characters, for performance reasons:

NB_TOKENS = 10000

I guess that the repeated use of bitwise operators and the relative lack of semicolons at the start of the file confused the model.
In fact, fewer than 1% of the 27k TypeScript files that I randomly picked to train the model use the << operator.
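
One rough way to check that guess against a given file is to count those features in the part of the file the model reads. The sketch below is only an inspection aid, not how the model works internally; `PREFIX_LENGTH`, `PrefixStats`, and `inspectPrefix` are hypothetical names, and the 10000 default simply mirrors the NB_TOKENS value above.

```typescript
// Number of leading characters to inspect, mirroring the NB_TOKENS value above.
const PREFIX_LENGTH = 10000;

interface PrefixStats {
  shiftOperators: number; // non-overlapping occurrences of "<<" (also matches "<<=")
  nonEmptyLines: number;
  linesEndingWithSemicolon: number;
}

// Count "<<" occurrences and semicolon-terminated lines in the start of a file.
function inspectPrefix(content: string, prefixLength: number = PREFIX_LENGTH): PrefixStats {
  const prefix = content.slice(0, prefixLength);
  const lines = prefix
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return {
    shiftOperators: prefix.split("<<").length - 1,
    nonEmptyLines: lines.length,
    linesEndingWithSemicolon: lines.filter(
      (line) => line.charAt(line.length - 1) === ";"
    ).length,
  };
}
```

A prefix with many `<<` occurrences and few lines ending in a semicolon would at least be consistent with the explanation above, although it says nothing about why Julia in particular came out on top.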

However, I don't quite understand why the model picked Julia with such a high confidence.

if I do a random chunk of 100000 it does yield TS.

That's a great idea. I'll try that on the Python version too and see if it can improve the model's overall accuracy.

@TylerLeonhardt
Author

The model actually only reads the first 10k characters, for performance reasons:

Is this on training or inference? If only the first 10k characters are used, then I should probably only make strings with 10k characters :)

@yoeo
Owner

yoeo commented Aug 10, 2021

Is this on training or inference?

It reads the first 10k chars for both training & inference.

then I should probably only make strings with 10k characters

Absolutely.
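
On the caller's side, this could be as simple as cutting the string before handing it to the model. A minimal sketch, assuming the limit is the NB_TOKENS value quoted earlier in the thread; `MODEL_INPUT_LIMIT` and `truncateForModel` are hypothetical names, not part of the library.

```typescript
// Assumed limit, taken from the NB_TOKENS = 10000 value quoted earlier in the thread.
const MODEL_INPUT_LIMIT = 10000;

// Return only the leading part of the content that the model would read anyway.
function truncateForModel(content: string, limit: number = MODEL_INPUT_LIMIT): string {
  return content.length > limit ? content.slice(0, limit) : content;
}
```

Note that `slice` counts UTF-16 code units, which may not match how the Python side counts characters for files containing non-ASCII text.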


2 participants