Giant TypeScript file being called Julia #38
Comments
Ahh... I guess when the first 100000 characters are taken together, the guess is Julia... if I take a random chunk of 100000 it does yield TS... maybe I should chunk it, run each chunk, and then average out the results? The start of the string seems to impact the result quite a bit, which might be interesting for you.
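The chunk-and-average idea above could be sketched roughly as follows. This is only an illustration, not the library's method: the `Guess` shape and the `Classifier` callback are hypothetical stand-ins for whatever call actually runs the model (which is likely asynchronous in practice), and the chunk size is left to the caller.

```typescript
// Hypothetical result shape: one score per candidate language.
interface Guess {
  languageId: string;
  confidence: number;
}

// Hypothetical classifier callback; stands in for the real model call.
type Classifier = (text: string) => Guess[];

// Split text into consecutive chunks of at most chunkSize characters.
function splitIntoChunks(text: string, chunkSize: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Classify every chunk, then average the confidence per language.
// A language missing from a chunk's results counts as 0 for that chunk.
function averageGuesses(text: string, chunkSize: number, classify: Classifier): Guess[] {
  const chunks = splitIntoChunks(text, chunkSize);
  const totals = new Map<string, number>();
  for (const chunk of chunks) {
    for (const { languageId, confidence } of classify(chunk)) {
      totals.set(languageId, (totals.get(languageId) ?? 0) + confidence);
    }
  }
  return Array.from(totals.entries())
    .map(([languageId, total]) => ({ languageId, confidence: total / chunks.length }))
    .sort((a, b) => b.confidence - a.confidence);
}
```

With this kind of averaging, a misleading start of the file only contributes one vote among many, instead of deciding the result on its own.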
The model actually only reads the first 10k characters, for performance reasons: Line 28 in f4ceb1d
I guess that the repeated use of bitwise operators and the relative lack of semicolons at the start of the file confused the model. However, I don't quite understand why the model picked Julia with such high confidence.
That's a great idea, I'll try that on the Python version too and see if it can improve the model's overall accuracy.
Is this during training or inference? If only the first 10k characters are used, then I should probably only pass in strings of 10k characters :)
It reads the first 10k chars for both training & inference.
Absolutely.
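On the caller side, clipping the input before handing it to the model might look like the sketch below. The helper name and the constant are hypothetical; the 10000 limit is simply the "first 10k characters" figure quoted in this thread, not a value read from the library.

```typescript
// Assumed limit, taken from the "first 10k characters" figure in this thread.
const MAX_MODEL_INPUT_CHARS = 10000;

// Return only the prefix that the model would read anyway.
// Note: slice counts UTF-16 code units, which is what string length reports.
function clipToModelInput(content: string, limit: number = MAX_MODEL_INPUT_CHARS): string {
  return content.length > limit ? content.slice(0, limit) : content;
}
```

Since anything past the limit is ignored, clipping first avoids building and passing very large strings without changing the prediction.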
So I was investigating microsoft/vscode#129597
And I noticed that the issue was happening because the file is absolutely massive. That might be a tfjs issue (cc @pyu10055)
What's interesting is that I was able to grab the first 125000 characters (any more and it throws that ^) and run them through the model, and it thought with 98% confidence that it is Julia and not TypeScript:
That seems very odd to me... Surely TypeScript would be in the top couple of languages... this makes me think that it could be a bug in the model, but I'll leave that up to @yoeo to decide.