This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Better template parsing (previously: _Cleanup removes submission templates) #196

Open
wikipedia-mabdul opened this issue Oct 8, 2013 · 18 comments

Comments

@wikipedia-mabdul
Member

https://en.wikipedia.org/w/index.php?title=Wikipedia_talk%3AArticles_for_creation%2FReabrook_Valley&diff=cur&oldid=prev

dunno what happened here. Will have to check. (develop script used)

@theopolisme
Contributor

It's due to a messed up submission template:

{{AFC submission|ts=20130801124623|d|nn|declinets=20130815023926|decliner=Howicus|ts=20130801131437|u=Pippa.lewis|ns=5}}

Notice how |d| comes after a named argument (ts=...) instead of directly after the template name. Basically this means that the regex thinks that the status of said template is "|", so it's being removed (as a "duplicate pending template").

@wikipedia-mabdul
Member Author

OK, I think we should rework that again and store everything in an array object

so

[ts]= 2013...
[0]=d
[1]=nn
etc
and thus we can clean up the messed-up (but valid) templates

@theopolisme
Contributor

Maybe just write a basic template parsing engine, tbh.
Harder than it sounds, though... but I guess for our purposes it wouldn't need to be so extensive.

@theopolisme
Contributor

Wrote a little function:

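function afcHelper_parseTemplate(wikicode, returntitle) {
    // Strip surrounding whitespace and the outer {{ }} braces
    var contents = $.trim(wikicode).replace(/(^\{\{|\}\}$)/g, '');
    // Naive split: breaks if a value contains a nested template or piped link
    var pieces = contents.split('|');
    var title = pieces.shift();
    var params = {};
    var increm = 1; // next positional parameter number
    $.each(pieces, function (index, piece) {
        if (piece.indexOf('=') !== -1) {
            // Named parameter: everything after the first "=" is the value
            var varparts = piece.split('=');
            var key = varparts.shift();
            var val = varparts.join('=');
            params[key] = val; // a repeated key overwrites the earlier value
        } else {
            // Positional parameter: stored under "1", "2", ...
            params[increm.toString()] = piece;
            increm++;
        }
    });
    if (returntitle) {
        return [title, params];
    }
    return params;
}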

Demo:

  • afcHelper_parseTemplate('{{AFC submission|ts=20130801124623|d|nn|declinets=20130815023926|decliner=Howicus|ts=20130801131437|u=Pippa.lewis|ns=5}}') returns the following object (key, value):
    • 1: "d"
    • 2: "nn"
    • decliner: "Howicus"
    • declinets: "20130815023926"
    • ns: "5"
    • ts: "20130801131437"
    • u: "Pippa.lewis"

Supports both positional and named parameters in the template (additionally, {{ex|foo}} and {{ex|1=foo}} both result in the same thing). Breaks on nested templates (like {{hello|{{help|ss}}}}), but I don't see when somebody would use a nested template in {{AFC submission}}...
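For reference, the nested-template breakage comes from splitting on every "|". A minimal sketch of a splitter that only cuts at pipes outside nested {{...}} or [[...]] might look like this (splitTopLevelPipes is a hypothetical name, not part of the helper script, and triple braces like {{{1}}} are not handled):

```javascript
// Hypothetical helper, not part of the script above: split template
// contents on "|" only when we are not inside a nested {{...}} or [[...]].
function splitTopLevelPipes(contents) {
    var pieces = [];
    var depth = 0;      // how many {{ or [[ are currently open
    var current = '';
    for (var i = 0; i < contents.length; i++) {
        var two = contents.slice(i, i + 2);
        if (two === '{{' || two === '[[') {
            depth++;
            current += two;
            i++;        // skip the second character of the pair
        } else if (two === '}}' || two === ']]') {
            depth = Math.max(0, depth - 1);
            current += two;
            i++;
        } else if (contents.charAt(i) === '|' && depth === 0) {
            // A pipe at depth 0 separates two parameters
            pieces.push(current);
            current = '';
        } else {
            current += contents.charAt(i);
        }
    }
    pieces.push(current);
    return pieces;
}
```

Swapping this in for contents.split('|') would keep {{help|ss}} together as a single positional value.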

@wikipedia-mabdul
Member Author

How about that "crazy" but valid stuff?

{{AFC submission|d|3=bla bla bla 
* list
*bullet
* use {{tl|cite web}} templates

~~~
}}

@theopolisme
Contributor

This is what I was talking about as far as complexity goes. It's not easy.

@wikipedia-mabdul
Member Author

yeah, I meant more that this isn't that untypical.

but, I don't understand: isn't

var val = varparts.join('=');

simply including the whole template as it is? That would mean it would work... (did you test the code anywhere? XD)

@theopolisme
Contributor

No, the problem isn't in THAT code but rather in whatever code we use to find the templates initially:

In this example:

{{AFC submission|d|3=bla bla bla 
* list
*bullet
* use {{tl|cite web}} templates
~~~
}}

The regex would match


{{AFC submission|d|3=bla bla bla 
* list
*bullet
* use {{tl|cite web}}

and then stop. We'd need to use some sort of method for keeping track of the number of "{"s and "}"s we encounter, then break them apart, etc., etc....
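A rough sketch of that brace-counting idea, assuming we already know where the opening "{{" starts (findTemplateEnd is a hypothetical name, and it ignores <nowiki> and triple braces):

```javascript
// Hypothetical helper: given the index of an opening "{{", return the index
// just past its matching "}}", or -1 if the template is never closed.
function findTemplateEnd(text, start) {
    var depth = 0;
    for (var i = start; i < text.length - 1; i++) {
        var two = text.slice(i, i + 2);
        if (two === '{{') {
            depth++;
            i++;                // skip the second "{"
        } else if (two === '}}') {
            depth--;
            i++;                // skip the second "}"
            if (depth === 0) {
                return i + 1;   // one past the closing "}}"
            }
        }
    }
    return -1;
}
```

With this, text.slice(start, findTemplateEnd(text, start)) would give the whole submission template, including the inner {{tl|cite web}}, instead of stopping at the first "}}".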

@theopolisme
Contributor

BTW, downgraded this from "critical" -- it was a malformed template.

@wikipedia-mabdul
Member Author

I have an idea and a workaround. I will come online later

@wikipedia-mabdul
Member Author

@theopolisme: how about using a function similar to the bottom of https://en.wikipedia.org/wiki/User:Ohconfucius/test/formatgeneral.js/core.js
(search for /// PROTECTION BY STRING SUBSTITUTION)

so we can replace | and {{}} within a parameter first and then revert them back after they are in the array. ^^
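Roughly what I have in mind, as a sketch only: it works on the contents of a template that has already been extracted, and the \x01N\x02 placeholder format and both function names are my own assumptions, not taken from the linked script.

```javascript
// Hypothetical sketch: hide innermost {{...}} behind placeholders so a naive
// split on "|" cannot cut through them, then put them back afterwards.
function protectNested(contents) {
    var stash = [];
    var innermost = /\{\{[^{}]*\}\}/g;   // a template with no braces inside
    var previous;
    do {
        previous = contents;
        contents = contents.replace(innermost, function (match) {
            stash.push(match);
            return '\x01' + (stash.length - 1) + '\x02';
        });
    } while (contents !== previous);     // repeat until nothing is left to hide
    return { text: contents, stash: stash };
}

function restoreNested(value, stash) {
    var placeholder = /\x01(\d+)\x02/;
    // Stashed templates can contain placeholders, so repeat until none remain.
    while (placeholder.test(value)) {
        value = value.replace(placeholder, function (m, index) {
            return stash[Number(index)];
        });
    }
    return value;
}

// Usage: protect, split, then restore each piece.
var guarded = protectNested('help me|how do I use {{tl|cite web}}?');
var pieces = guarded.text.split('|').map(function (piece) {
    return restoreNested(piece, guarded.stash);
});
```

Piped links like [[Page|label]] would need the same treatment.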

@theopolisme
Contributor

Uh...no...you still don't understand.

Just tell me how to parse the templates out of the following page:

Some text goes here.{{This is a template|1=yfdjkhjsl}}
Here we say some more

{{help me|I don't get how to use {{tl|cite web}}, since it's a confusing template.}}

Now do you understand why your method wouldn't work?

@theopolisme
Contributor

Stayed up way too late (around 3 am now...and I have to be up at 7 :/ ) working on a template parsing engine that I'm calling parser.js. It's pretty cool (nothing compared to mwparser, but still...for a raw JS implementation...)

Here's a demo string of what it can handle right now:

{{This is a template|1=yf {{tesdjkhjsl}} }}\nHere we sa{{heaven|head=e {{eggs|ea {{ah}} }} }}uy some more\n\n{{help me|I don\'t get how to use {{tl|cite web}}, since it\'s a confusing template.}}

TODO list for self

  • write documentation
  • add a parameter that tells it to only return "top level" matches -- eg, for '{{foo|{{foobar}}}}', return ['{{foo|{{foobar}}}}'], rather than the typical ['{{foo|{{foobar}}}}','{{foobar}}']
  • ignore "{"s inside <nowiki> tags and such
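For the "top level" item, one possible shape is sketched below. This is not the actual parser.js code; findTemplates and its topLevelOnly flag are hypothetical names, and <nowiki> is still not handled here.

```javascript
// Hypothetical sketch, not the actual parser.js: collect every {{...}} in
// the text with a stack of open positions; optionally keep only templates
// that are not nested inside another one.
function findTemplates(text, topLevelOnly) {
    var open = [];      // start indices of templates that are still open
    var found = [];     // entries of the form { start, depth, wikicode }
    for (var i = 0; i < text.length - 1; i++) {
        var two = text.slice(i, i + 2);
        if (two === '{{') {
            open.push(i);
            i++;
        } else if (two === '}}' && open.length > 0) {
            var start = open.pop();
            found.push({
                start: start,
                depth: open.length,              // 0 means top level
                wikicode: text.slice(start, i + 2)
            });
            i++;
        }
    }
    return found
        .filter(function (t) { return !topLevelOnly || t.depth === 0; })
        .sort(function (a, b) { return a.start - b.start; })  // outer first
        .map(function (t) { return t.wikicode; });
}
```

findTemplates('{{foo|{{foobar}}}}', true) gives just the outer template, while passing false also returns the nested '{{foobar}}'.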

Getting excited.

@theopolisme
Contributor

@wikipedia-mabdul

https://github.com/theopolisme/jsmwparser

Just run parser.js in your console to see the results for several test cases, and feel free to paste in your own. Try to make it break! (and then I'll try to fix it ;) )

@wikipedia-mabdul
Member Author

will do next week. this weekend is busy

@Technical-13
Contributor

Why are we still using implicit parameters in the first place? They are fairly unstable and a pain to deal with. I can get the templates updated as needed to make all of the parameters explicit (yes, we would still need backwards compatibility for a while to handle existing pages). It's as simple as replacing {{afc submission|||... with {{afc submission|1=|2=|...

@theopolisme
Contributor

@Technical-13 the implicit parameters are easily parsed in jsmwparser...no need to make it more complicated :/

https://github.com/theopolisme/jsmwparser/blob/master/parser.js#L90-L111

@Technical-13
Contributor

Problem is it doesn't catch malformed template usage like {{afc submission|ts=20110101010101|blank|d|ns=2|declinets=20130204020124|decliner=bob|user=fred}}. If all the parameters were explicit, it wouldn't matter how malformed the template was as far as the order of parameters goes. It would also deal with {{afc submission|ts=20110101010101|d|blank|1=t|ns=2|user=fred}}, which would crap out our current implementation.
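A small sketch of how such collisions could at least be detected, counting positional parameters as 1, 2, 3 and so on (findDuplicateParams is a hypothetical name; it assumes no nested templates or piped links, so a plain split on "|" is enough). As far as I know, MediaWiki itself uses the last value when a parameter is given more than once, so in the second example 1=t would win over d.

```javascript
// Hypothetical sketch: list parameter names that are assigned more than
// once, counting positional parameters as 1, 2, 3, ...
// Assumes no nested templates or links, so a plain split on "|" is enough.
function findDuplicateParams(wikicode) {
    var pieces = wikicode.trim().replace(/^\{\{|\}\}$/g, '').split('|');
    pieces.shift();                 // drop the template title
    var seen = {};
    var duplicates = [];
    var position = 1;               // next positional parameter number
    pieces.forEach(function (piece) {
        var eq = piece.indexOf('=');
        var key = eq === -1 ? String(position++) : piece.slice(0, eq).trim();
        if (Object.prototype.hasOwnProperty.call(seen, key) &&
                duplicates.indexOf(key) === -1) {
            duplicates.push(key);
        }
        seen[key] = true;
    });
    return duplicates;
}
```

For {{afc submission|ts=20110101010101|d|blank|1=t|ns=2|user=fred}} this reports "1" as assigned twice; for the template from the original report it reports "ts".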
