SMN document causes check to die or time out #91
Comments
I tested the API server for the smn endpoint as follows:
curl -X POST -H 'Content-Type: application/json' -i 'https://api-giellalt.uit.no/grammar/smn' --data '{"text": "Danne lea."}' | grep text | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   451  100   429  100    22   1097     56 --:--:-- --:--:-- --:--:--  1156
which returned the following:
{
"text": "Danne lea.",
"errs": [
{
"error_text": "Danne",
"start_index": 0,
"end_index": 5,
"error_code": "typo",
"description": "Sääni \"Danne\" váilu tivvoomohjelm sänilistoost.",
"suggestions": [
"Janne",
"Sanne",
"Lanne"
],
"title": "Časkemfeilâ"
},
{
"error_text": "lea",
"start_index": 6,
"end_index": 9,
"error_code": "typo",
"description": "Sääni \"lea\" váilu tivvoomohjelm sänilistoost.",
"suggestions": [
"lii",
"lâi"
],
"title": "Časkemfeilâ"
}
]
}
That is, the endpoint works. I then tried the same using a JSON-ified version of the document as the request text, but that was killed immediately with the following message:
|
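For anyone reading the smn response above programmatically, here is a minimal Python sketch of how the `errs` entries map back onto the submitted text. It assumes that `start_index` and `end_index` are character offsets with an exclusive end, which matches the two entries in the example but is not confirmed anywhere in this thread. The function name `summarize_errors` is made up for illustration, and the response body is trimmed to the fields used here.

```python
import json

# Response body copied from the smn example above, trimmed to the fields
# used below (the "description" and "title" fields are left out).
response_body = """
{
  "text": "Danne lea.",
  "errs": [
    {"error_text": "Danne", "start_index": 0, "end_index": 5,
     "error_code": "typo", "suggestions": ["Janne", "Sanne", "Lanne"]},
    {"error_text": "lea", "start_index": 6, "end_index": 9,
     "error_code": "typo", "suggestions": ["lii", "lâi"]}
  ]
}
"""


def summarize_errors(body: str) -> list[tuple[str, str, list[str]]]:
    """Return (flagged span, error code, suggestions) for each reported error.

    Assumption: the indices are character offsets into "text" and the end
    index is exclusive, as the example response suggests.
    """
    data = json.loads(body)
    text = data["text"]
    summary = []
    for err in data["errs"]:
        span = text[err["start_index"]:err["end_index"]]
        summary.append((span, err["error_code"], err["suggestions"]))
    return summary


for span, code, suggestions in summarize_errors(response_body):
    print(f"{span!r} ({code}): {', '.join(suggestions)}")
```

Slicing the original text with the reported indices gives back the same strings as `error_text`, which is a cheap sanity check on the offsets.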
I had a look, and basically we'd need to boost the heck out of the performance of the grammar checker. It currently uses one thread, which doesn't scale well.
Each document is split into chunks and each chunk is grammar-checked individually (*). A chunk is basically a paragraph. A 21-page document is going to have a lot of paragraphs, and each non-empty paragraph takes some time to grammar check. (**)
On the client side, from both Google Docs' and MS Word's perspective, you initiate one call, "doGrammarCheck()". That one call spawns all the other calls, so the time you wait is the grand total of all calls. That sucks. And even if it didn't, 21 pages' worth of grammar results would be a nightmare to deal with.
I don't think it makes sense to rewrite the grammar checker on the backend to be multithreaded, because the frontend is never going to be. If we could make it 10 times quicker on a single thread, that would be great.
We COULD rewrite the frontend to use an iterative approach. That is, you start a grammar check and it checks one paragraph at a time and stops when it finds an error. You can then choose to "ignore" the error and continue with the next paragraph, or you can fix it yourself. That way you'd only wait for the time it takes to check one paragraph, which is way quicker. And you don't have to deal with an infinite scroll of grammar errors.
Improving grammar checker performance still leaves us with a very long list of grammar errors. Rewriting the frontend to be iterative is the way to go, really. But it isn't exactly what you would call a quick fix.
(*) We can't send too big chunks of text to the grammar checker because we can't deal with the response.
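To make the iterative idea concrete, here is a rough Python sketch. It is not the plugin code: `split_paragraphs`, `check_iteratively` and `fake_check` are hypothetical names, and the real frontend would call the grammar API where `fake_check` stands in. The only point is that a generator checks paragraphs lazily, so the caller waits for one paragraph at a time and decides when to move on.

```python
from typing import Callable, Iterator

# Hypothetical signature: takes one paragraph, returns its list of errors
# (shaped like the entries in the "errs" array of the API response).
Checker = Callable[[str], list[dict]]


def split_paragraphs(document: str) -> list[str]:
    """Split on line breaks and drop empty paragraphs."""
    return [p.strip() for p in document.splitlines() if p.strip()]


def check_iteratively(
    document: str, check: Checker
) -> Iterator[tuple[int, str, list[dict]]]:
    """Yield (paragraph index, paragraph, errors) for paragraphs with errors.

    This is a generator, so a paragraph is only sent to the checker when
    the caller asks for the next result.
    """
    for index, paragraph in enumerate(split_paragraphs(document)):
        errors = check(paragraph)
        if errors:
            yield index, paragraph, errors


def fake_check(paragraph: str) -> list[dict]:
    """Stand-in for the real checker call: flags every occurrence of 'lea'."""
    errors = []
    start = paragraph.find("lea")
    while start != -1:
        errors.append(
            {"error_text": "lea", "start_index": start, "end_index": start + 3}
        )
        start = paragraph.find("lea", start + 3)
    return errors


document = "First paragraph.\n\nDanne lea.\nThird paragraph."
results = check_iteratively(document, fake_check)
index, paragraph, errors = next(results)  # stops at the first paragraph with errors
print(index, paragraph, errors)
```

Calling `next(results)` again would resume from the following paragraph, which corresponds to the user choosing "ignore" or fixing the error and continuing.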
|
Ok. Thanks for the analysis. The simple solution for the users right now is then to check smaller sections of a document at a time, basically by copying portions of the whole text to another document, checking and correcting them there, and copying them back to the original document. I will tell this to the person reporting the bug.
The grammar checker code is being rewritten in a separate project by Brendan et co. The goal of that project is to make a version that can run stand-alone on PCs and Macs (and possibly iPhones/iPads and Android systems). I expect the outcome of that will be a significantly faster grammar checker, at least due to using a much faster speller engine (divvunspell instead of hfst-ospell; divvunspell is roughly 10x faster). We already know that the speller part of the pipeline is the slowest one, mainly due to generating suggestions.
This is to say that: a) we have a stop-gap solution right now that we can inform users about (not ideal, but it works); and b) we won't make any changes to the grammar checker front-end or back-end until we have the new codebase running on the server. Release of the new grammar checker is planned for the last week of June.
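The stop-gap is a manual copy-and-paste workflow, but the splitting step can be sketched in Python for anyone who wants to prepare the sections from a plain-text export. This is only an illustration: `split_into_sections` is a hypothetical helper, and the default of 5000 characters is an arbitrary assumption, not a limit taken from the plugins or the server.

```python
def split_into_sections(document: str, max_chars: int = 5000) -> list[str]:
    """Group consecutive paragraphs into sections of at most max_chars.

    Sections are only cut at line breaks, so no paragraph is split. A single
    paragraph longer than max_chars ends up as its own oversized section.
    Joining the sections with a newline gives back the original document.
    """
    sections: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in document.split("\n"):
        # Count one extra character for the newline that joins paragraphs.
        extra = len(paragraph) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            sections.append("\n".join(current))
            current, current_len = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        current_len += extra
    if current:
        sections.append("\n".join(current))
    return sections


sections = split_into_sections("aaaa\nbbbb\ncccc", max_chars=9)
print(sections)
```

Because the cut points are paragraph boundaries, each section can be checked and corrected on its own and then pasted back in order.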
A document spanning 21 pages (sent off-line for privacy reasons) causes the Google Docs plugin to die with the message "ScriptError: Oversteg maksmal kjøretid" (Norwegian for "exceeded maximum execution time", essentially a time-out) after about 4-5 minutes, and the Word plugin to simply give up with no errors found (after a much shorter time).
Running the document (as plain text) through the command-line checker locally takes about 1.5 minutes and returns several hundred error messages (some empty). That is, it works on the command line; it just takes some time.