Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal compiler error "Impossible: LambdaStatSem.APP.argument type not arrow" when compiling MLton with MLKit #97

Closed
Tracked by #103
MatthewFluet opened this issue Jan 12, 2022 · 8 comments
Assignees
Labels

Comments

@MatthewFluet
Copy link
Contributor

I get an internal compiler error (Impossible: LambdaStatSem.APP.argument type not arrow) when compiling the current MLton sources (which recently added preliminary support for building with MLKit) with MLKit 4.6.0 (and also some other previous versions that I tried):

$ git clone https://github.com/MLton/mlton.git 
$ cd mlton/mlton
$ make mlkit-mlton
...
[reading source file:   ../lib/mlton/basic/random.sig]
[reading source file:   ../lib/mlton/basic/random.sml]
Impossible: LambdaStatSem.APP.argument type not arrow

rand a
...

I attach the full error message (make.mlkit-mlton.log). I didn't have much luck minimizing the random.sml file and its dependencies to produce a smaller example. In particular, the ICE didn't occur when I eliminated the use of the Trace.trace{,0} functions in random.sml.

@melsman
Copy link
Owner

melsman commented Jan 13, 2022

That is an interesting bug, which seems to be caused by a function (result of trace0 applied to two arguments) appearing in a file named random.sml being inlined into code that resides in another file called random.sml. The function being inlined contains a free identifier called rand (a function) and the underlying "name" (i.e., a stamp consisting of an integer and a file name) of this function becomes identical to a name associated with a variable (of type real) defined in the file where the trace0 function is inlined.

Notice also that the two files ../lib/stubs/mlton-stubs/random.sml and ../lib/mlton/basic/random.sml are referenced from the same mlb-file (i.e., mlton-mlkit.mlb).

A possible fix could be to include the path in all names and not only the file.mlb-file.sml string. Thus we could use the string formed by file.mlb-path.sml. Unfortunately, this solution would only be a half fix. There would still be a chance for collision, if two different mlb-files, located in different folders, but named the same, contain references to identically named (but different) files.

Notice, btw that an implementation file (.sml or .sig) is enforced to be included in an mlb-project at most once (this property is checked at toplevel for any program defined by an mlb-file).

Thus, a solution could be:

  1. Use the string formed by file.mlb-path.sml as part of a name, when a new identifier is created (Name.new()).
  2. Enforce at top-level for any program defined by an mlb-file that any implementation file is unique when identified by the string formed by file.mlb-path.sml.

Admittedly, this last restriction seems annoying, but I don't see any other solution if target code should be relocatable and if different projects should be able to make use of the result of compiling an mlb-file (e.g., precompiled basis library).

@MatthewFluet
Copy link
Contributor Author

Interesting analysis. It does seem that file.mlb-path.sml would suffice for this case, where there is a monolithic mlton-mlkit.mlb file.

With respect to the last restriction, I will note that MLton's source code organization has adopted a sources.mlb convention (inherited from SML/NJ's CM's sources.cm convention), where the mlb-file for any given library/directory is named sources.mlb. So, if we were to try to use MLton's shipped mlb files with MLKit (which we can't currently due to #74), I believe that we could run into the same problem, because the ../lib/stubs/mlton-stubs/sources.mlb would reference random.sml (no path, because it is the same directory as the containing sources.mlb) and the ../lib/mlton/basic/sources.mlb would also reference random.sml (again, no path).

I wonder if you could make the stamp more likely to be unique by using file.mlb-path.sml-<hash>, where <hash> is some hash of the .sml file contents. I'm not sure how aggressive MLKit's cut-off recompilation is, but it would seem that if a file's contents change, then its dependents need to be recompiled, so it's ok that the stamps have changed.

@melsman
Copy link
Owner

melsman commented Jan 13, 2022

Yes - I see the problem... Unfortunately, the <hash>-solution is problematic for MLKit's cut-off recompilation, which is indeed aggressive in case type information (and optimisation information) does not change... A possibility would be to take a hash value of the hosting mlb-file; then recompilation would be enforced only when the mlb-file changes and a static check of wellformedness can be performed up front (two identically named mlb-files containing identical content must be identical).

@MatthewFluet : Another solution that would work well with MLKit's cut-off recompilation scheme would be to allow for an mlkit-specific mlb-annotation in one of the mlb-files to specify a unique name to include instead of the hash? Technically, this last solution could be enforced (by a static check) also in the present case of the random.sml duplicate and it would lead to shorter labels in the generated assembler code...

@MatthewFluet
Copy link
Contributor Author

@melsman Yes, some kind of stamp annotation seems reasonable.

@melsman melsman self-assigned this Jan 16, 2022
@melsman melsman added the bug label Jan 16, 2022
@melsman
Copy link
Owner

melsman commented Jan 16, 2022

I went for the hash-solution. A hash value involving the content of the mlb-file, the name of the mlb-file, and the relative path to the source file is now used for distinguishing identifiers coming from different source files (an example hash value looks like dhqtTSEl3COsVLrTsVmdpq and is created using a modified MD5 digest algorithm that makes use of more characters than the ordinary hex-characters).

We still need to implement a static check of wellformedness of an mlb-project enforcing that two identically named mlb-files containing identical content must be identical.

It turns out that recompilation is not necessarily triggered for a source file when the hosting mlb-file changes - only when another source file on which the source file depends is changed, which is ok.

This fix, however, is not sufficient to have MLKit compile MLton, which is a huge test for the MLKit implementation and has already uncovered a few other bugs!! :)

At least two issues are holding back a successful compilation:

  1. When GC is enabled, regions variables of type bot (those associated with type variables) are in a few cases enforced into latent function effects and later unified with other bot region variables, which causes instantiation problems with multiple candidate region types for the same region variable. I currently don't have a good solution for solving this issue (the problems appears when compiling control/control-flags.sml, which makes use of the parser combinators from ../lib/mlton/basic/parse.sml).
  2. Elaboration in the MLKit (in particular for Modules) is slow. It takes many hours just to elaborate the MLton sources, which are heavily functorised. The situation may be improved slightly by organising the mlb-file somewhat using basis identifiers and expressing the dependencies differently than just listing the source files (see e.g. mlton.mlb for some help regarding dependencies). Increasing the efficiency of the Modules elaboration will be a huge task. Update: It turns out that the actual performance problem has to do with serialisation of static environments during static interpretation: When environments are serialised, it seems that quite some time is spend on discovering sharing (with parts of previously serialised environments)...

@MatthewFluet
Copy link
Contributor Author

@melsman Thanks for investigating!

In my experiments, I had encountered a couple of additional bugs beyond the random.sml one. Did you encounter one with ../lib/mlton/basic/property.fun? I had triggered an Impossible: CodeGenUtilX64.prefix_sm: DROPPED_RVAR_ATY not implemented. error after tweaking ../lib/mlton/basic/random.sml to avoid the Impossible: LambdaStatSem.APP.argument type not arrow error, but perhaps it was another manifestation of the stamp collision. After that one, I also encountered the control/control-flags.sml region variables error.

I went for the hash-solution. A hash value involving the content of the mlb-file, the name of the mlb-file, and the relative path to the source file is now used for distinguishing identifiers coming from different source files. ... It turns out that recompilation is not necessarily triggered for a source file when the hosting mlb-file changes - only when another source file on which the source file depends is changed, which is ok.

But, if the hosting mlb-file changes, won't the hash value (involving the content of the mlb-file) change? And that would require recompilation to update the identifier identities?

When GC is enabled, regions variables of type bot (those associated with type variables) are in a few cases enforced into latent function effects and later unified with other bot region variables, which causes instantiation problems with multiple candidate region types for the same region variable. I currently don't have a good solution for solving this issue (the problems appears when compiling control/control-flags.sml, which makes use of the parser combinators from ../lib/mlton/basic/parse.sml).

I wonder how well an MLKit compiled MLton would do without the GC enabled. What I'd be worried about is that there is a lot of implicit sharing between IR programs, both when translating from one IR to the next and also when optimizing via IR to IR transformations. As a whole-program compiler, the IR programs are already "big" and if the sharing forced a lot of data of the IR programs to be in the same long-lived region, then there might be memory issues.

Elaboration in the MLKit (in particular for Modules) is slow. It takes many hours just to elaborate the MLton sources, which are heavily functorised. The situation may be improved slightly by organising the mlb-file somewhat using basis identifiers and expressing the dependencies differently than just listing the source files (see e.g. mlton.mlb for some help regarding dependencies). Increasing the efficiency of the Modules elaboration will be a huge task. Update: It turns out that the actual performance problem has to do with serialisation of static environments during static interpretation: When environments are serialised, it seems that quite some time is spend on discovering sharing (with parts of previously serialised environments)...

Interesting. I guess with the heavily functorization, we are almost turning MLKit in to a whole-program compiler --- it seems that it doesn't actually generate any code until the functors are applied. It isn't strictly necessary in MLton --- just a convention for which we pay nothing when compiling MLton with MLton. But, it can be problematic for other compilers.

It probably wouldn't be too bad to write a script to generate sources.mlkit.mlb and export.mkit.sml files from the sources.mlb files, to perform the export rebinding via a file. Would that help with the serialization as well?

@melsman
Copy link
Owner

melsman commented Jan 18, 2022

In my experiments, I had encountered a couple of additional bugs beyond the random.sml one. Did you encounter one with ../lib/mlton/basic/property.fun? I had triggered an Impossible: CodeGenUtilX64.prefix_sm: DROPPED_RVAR_ATY not implemented. error after tweaking ../lib/mlton/basic/random.sml to avoid the Impossible: LambdaStatSem.APP.argument type not arrow error, but perhaps it was another manifestation of the stamp collision.

Yes, I also ran into this issue, which should now be fixed. It turns out that the drop-regions pass is so eager at dropping region parameters that it sometimes drops region parameters that will newer contain values, but which should be passed to a function because the function (for other instantiations) will allocate into the passed region (such cases seem to happen only with very serious use of higher-order functions)...

But, if the hosting mlb-file changes, won't the hash value (involving the content of the mlb-file) change? And that would require recompilation to update the identifier identities?

Hmm... I'm thinking that in practice, it will be ok that a changed hash value does not cause any recompilation by itself. The MLKit does not associate any stored information with the mlb-files, which just specify the source dependencies. Consistency is checked whenever a program is built, which may cause recompilation if some of the dependencies change. However, I speculate that there might be a scenario involving changes to mlb-files that could cause identifier clashes, but the scenario seems very very unlikely to happen in practice...

I wonder how well an MLKit compiled MLton would do without the GC enabled.

We definitely need the GC for MLton to work with the MLKit. I have an idea of how to solve the issue but it involves changing the region inference algorithm and requires associating ML type variables with parameterised regions that have region type RT_bot, so that such region types instead take the form RT_bot(tv). We then need to make sure that regions of such types are quantified together with the ML type variables, but never unified...

It probably wouldn't be too bad to write a script to generate sources.mlkit.mlb and export.mlkit.sml files from the sources.mlb files, to perform the export rebinding via a file. Would that help with the serialization as well?

I'm thinking that some manual tweaking may be necessary anyway as file-based functor export rebindings will require also the functor argument signature identifier to be in scope... What would really help the MLKit would be to avoid the heavy functorisation in cases where it is not really used for multiple instantiations or for serious abstraction guarantees (I guess this is not an option ;)

I believe I can solve some of the serialisation problems just by tweaking the serialisation code, but many of the environment serialisations are due to the serialisation of functor closures during static interpretation.

Thanks for providing this test case - I'm sure it will improve the MLKit down the road...

@melsman
Copy link
Owner

melsman commented Jan 20, 2022

I'm closing this issue as the issue reported here has been solved. Other issues have been created related to compiling MLton with MLKit, including a meta issue (#103) for tracking general progress...

@melsman melsman closed this as completed Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants