Documentation of file formats #9

Open
t-wissmann opened this issue Sep 8, 2020 · 2 comments


@t-wissmann

Where can I find documentation of the file formats used? Unfortunately, I can find documentation neither for the .gr files in the repo nor for the file formats generated by WriteGrammarToTextFile, such as .grammar (as described in the README).

Though I can guess most of what's in the .grammar files, I'm still a bit puzzled. I ran the command

java -cp BerkeleyParser-1.7.jar edu/berkeley/nlp/PCFGLA/WriteGrammarToTextFile arb_sm5.gr arb_sm5

and, looking at the content of arb_sm5.grammar, I'm wondering:

  • Does @ have a special meaning or is it just an ordinary character in names? Is there a difference between @.. and non-@ names?
  • Does the $_1/$_0-suffix have a special meaning?

(I also couldn't find any notes on the file format in the COLING-ACL 2006 and HLT_NAACL 2007 publications that are mentioned in the README.)

The reason I'm asking is that I'm considering supporting .gr or .grammar input files in my own project, CoPaR.

@t-wissmann
Author

Regarding .gr: I've noticed from ParserData.Load (ParserData.java, line 104f) that a .gr file is a gzipped Java object stream.
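
One way to check this observation without having the parser classes on the classpath is to decompress the file and look at the first four bytes, which for Java serialization are the stream magic 0xACED followed by stream version 5. The following is only a minimal sketch: the class name `GrFormatProbe` and the method `looksLikeGzippedObjectStream` are made up for illustration and are not part of BerkeleyParser. It says nothing about which classes are stored inside the stream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectStreamConstants;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipException;

// Hypothetical helper, not part of BerkeleyParser.
public class GrFormatProbe {

    /**
     * Returns true if the file is gzip-compressed and its decompressed
     * content starts with the Java serialization header (magic + version).
     */
    static boolean looksLikeGzippedObjectStream(Path file) throws IOException {
        try (InputStream raw = Files.newInputStream(file);
             DataInputStream in = new DataInputStream(new GZIPInputStream(raw))) {
            short magic = in.readShort();
            short version = in.readShort();
            return magic == ObjectStreamConstants.STREAM_MAGIC
                    && version == ObjectStreamConstants.STREAM_VERSION;
        } catch (ZipException | EOFException e) {
            // Not gzip data, or too short to contain the header.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GrFormatProbe arb_sm5.gr
        System.out.println(looksLikeGzippedObjectStream(Paths.get(args[0])));
    }
}
```

Actually deserializing the stream with `ObjectInputStream.readObject` would additionally require the BerkeleyParser jar on the classpath, since the serialized classes have to be resolvable.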

@t-wissmann
Author

Some transitions starting at ROOT_0 are duplicated, e.g.:

ROOT_0 -> ROOT_0 1.0
ROOT_0 -> ROOT_0 1.0

What does this mean? Can these duplicates be ignored, or do the weights of 1.0 add up to 2.0, so that the above transitions are equivalent to the following?

ROOT_0 -> ROOT_0 2.0
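
To see how widespread such duplicates are, the rules can be grouped by their text and the weights collected per rule. The sketch below assumes that every non-empty line of the .grammar file has the shape shown above, i.e. whitespace-separated symbols with the weight as the last token; the class name `DuplicateRules` and the method `weightsByRule` are made up for illustration. It does not settle whether duplicates should be ignored or summed; it only reports both the individual weights and their sum so that the two readings can be compared.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BerkeleyParser.
public class DuplicateRules {

    /**
     * Groups lines of the assumed form "LHS -> RHS... weight" by rule text
     * and collects every weight seen for that rule, in file order.
     */
    static Map<String, List<Double>> weightsByRule(List<String> lines) {
        Map<String, List<Double>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) {
                continue; // empty line or a line without a weight
            }
            // Assumption: the last token is the weight, the rest is the rule.
            double weight = Double.parseDouble(tokens[tokens.length - 1]);
            String rule = String.join(" ", Arrays.copyOf(tokens, tokens.length - 1));
            result.computeIfAbsent(rule, k -> new ArrayList<>()).add(weight);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DuplicateRules arb_sm5.grammar
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        weightsByRule(lines).forEach((rule, weights) -> {
            if (weights.size() > 1) {
                double sum = weights.stream().mapToDouble(Double::doubleValue).sum();
                System.out.println(rule + " occurs " + weights.size()
                        + " times, weights " + weights + ", sum " + sum);
            }
        });
    }
}
```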
