Add modern poisson2d solver & rename optimized.f90 #23

rouson · 2021-06-29T03:52:06Z

I hope the benefit of this pull request extends beyond this specific refactoring and serves as a guide for other codes contributed to this project. The aim is to demonstrate what modern Fortran can do to improve the expressiveness of code while maintaining and potentially improving performance.

In my local tests compiling with gfortran -Ofast, what was previously named optimized.f90 only performs about 5% faster than modern.f90 on average. In 2021, that hardly justifies restricting oneself to a small subset of Fortran 95 that is very nearly Fortran 77 + free format even in the name of performance. The modern version is more compact in some places, adds safety in some places (e.g., tighter variable scoping means fewer opportunities for mistakenly changing variables), and adds additional optimization opportunities now that compilers such as NVIDIA's compiler can offload do concurrent to a GPU. This means there's even more that can be explored with this new code than I've been able to investigate so far.

Refactorings

Use block for tighter variable scoping (reduces some clutter).
Use array statements with vector subscripts to write compact initializations.
Use do concurrent to express fine-grained parallelism and expose opportunities for optimizations such as vectorization, multithreading, or GPU offloading.
Replace some [magic numbers], e.g., initial delta.
Eliminate some unnecessary computation by initializing and updating only values that actually need initializing and updating.
Use more compact conditional logic.
Use a real kind specification based on precision requirements rather than the ambiguously defined double-precision representation of any one given compiler on any one given platform.
Eliminate unnecessary precision specifiers (_dp) when an expression is simply a constant that is exactly representable in any reasonable floating-point representation.
Separate the procedure interface into a module and the procedure definition into a submodule.**
Use a more descriptive variable names in a few places.
Use associate to ensure immutable state for what is essentially a runtime constant.
Add comments! (including some in FORD style)

** In larger projects, item 9 makes code much more user-friendly by isolation the application programmer interface (API) and reduces compile times by eliminating unnecessary compilation cascades.
[magic numbers]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants

Refactorings ------------ 1. Use BLOCK for tighter variable scoping (reduces some clutter). 2. Use array statements with vector subscripts to write compact initializations. 3. Use DO CONCURRENT to express fine-grained parallelism and expose opportunities for optimizations such as vectorization, multithreading, or GPU offloading. 4. Replace some magic values, e.g., initial delta. 5. Eliminate some unnecessary computation by initializing and updating only values that actually need initializing and updating. 6. Use more compact conditional logic. 7. Use a real kind specificaiton based on precision requirements rather than the ambiguously defined double-precision representation of any one given compiler on any one given platform. 8. Eliminate unnecessary precision specifiers ("_dp") when an expression is simply a constant that is exactly representable in any reasonable floating-point representation. 9. Separate the procedure interface into a module and the procedure definition into a submodule.** 10. Use a more descriptive variable names in a few places. 11. Add comments! (includeing some in FORD style) ** In larger projects, item 9 makes code much more user-friendly by isolation the application programmer interface (API) and reduces compile times by eliminating unnecessary compilation cascades.

This code is old-fashioned, verbose, and most importantly, less clealry expressive, particularly in terms of exposing optimization opportunties, which makes the file name a misnomer. By contrast, commit a7ccf54 introduces a modern.f90 refactored to expose concurrency more explicitly through the use of `do concurrent` and array statements, both of which express sets of operations that can be safely executed in any order, opening up opportunities for vectorization, multithreading, and GPU offloading. In timings on a 2.3 GHz 8-Core Intel Core i9, modern.f90 executes in roughly the same time -- possibly 5-10% longer based on informal ad hoc testing -- but is considerably more compact in many places, is arguably more clear and expressive.

poisson2d/modern.f90

certik · 2021-06-29T04:53:04Z

Thank you for this! Looks nice.

Eliminate unnecessary precision specifiers (_dp) when an expression is simply a constant that is exactly representable in any reasonable floating-point representation.

I see --- I think you are taking advantage of this here, so you wouldn't need my suggested changes above? But is that really true? I know that 1.0 is exactly representable, but I thought the compilers will still only treat it as single precision when assigning and not set the higher bits. But maybe not. But 0.01 I think is not exactly representable.

Either is fine. I'm a minimalist so I use the first one because the `_dp` isn't required when the real literal is exactly representable in the chosen floating-post representation, which presumably is the case for `1.` in any floating-point scheme. Co-authored-by: Ondřej Čertík <[email protected]>

I concur that `_dp` is required in this one. Co-authored-by: Ondřej Čertík <[email protected]>

certik · 2021-06-29T04:55:16Z

I see. I am a minimalist too, that's why I just use rho = 1.

certik

This is +1 from me. @milancurcic would you mind also please reviewing this?

rouson · 2021-06-29T04:56:33Z

@certik good points. I think I just accepted your changes, but I'm not sure. If so, feel free to merge this or let me know if you want me to merge it. If I didn't accept your changes correctly, let me know and I'll see if I can fix it.

Co-authored-by: Ondřej Čertík <[email protected]>

poisson2d/modern.f90

jvdp1 · 2021-06-29T06:52:19Z

poisson2d/modern.f90

+  interface 
+    pure real(dp) module function rho(x,y)
+      !! Poisson equation inhomogeneous term.
+      implicit none


Just a question for my knowledge: why is there an implicit none here? Is the implicit none in the module not covering it?

Yes, that is a common mistake that I also learned the hard way --- implicit none in a module propagates and applies to everything except interface, where you have to repeat it.... :(

Thank you @certik for your answer. I will have to check all my codes ;)

arunningcroc · 2021-06-29T13:23:12Z

I don't know how to suggest changes to files that weren't changed in the pull request, so I'll just write here. Since you renamed one of the Fortran files, can you also rename the corresponding row in the table in README.md?

Also, you could add a row to the table, with the following numbers:

M=100: 0.43 M=200: 8.24 M=300: 38.39

which are the numbers I got on my machine with the same settings as the other files (-Ofast, WSL2, same CPU, etc.)

certik · 2021-06-29T13:36:43Z

Ah, I didn't realize the file got renamed from optimized to archaic. @rouson, do you have a better name? The term archaic suggests (to me) that that style should not be used anymore, but I actually like that style, that's how I write most of my codes. I don't know if you like any of the terms such as simple, traditional, plain, direct, ... I don't want to pass judgement yet, I am hoping to first get all the examples up with lots of versions and later I think it will be easier to discuss the different styles and to agree (or agree to disagree) on the recommended "idiomatic style(s)". We can also just number the different Fortran versions/styles.

(To rephrase what I am trying to achieve: I absolutely want to have such a discussion about how modern Fortran codes should be written and arrive at more opinionated community agreed upon style(s). But in order to have a good discussion about this, I am hoping to have a few dozen examples, each with different styles. And only then have that discussion. I am afraid the discussion would not be productive if we do it now.)

rouson · 2021-06-29T15:11:41Z

@certik I like your suggestion to rename archaic.f90 to traditional.f90. I'll push a new commit that makes that change. On a related note, you can think of a better name for modern.f90, please let me know. There can of course be many different modern approaches. I chose modern.f90 for two reasons: (1) it uses features from more recent standards and (2) I'm hoping more Fortran programming moves in this direction for several reasons.

poisson2d/modern.f90

milancurcic · 2021-06-29T16:09:40Z

poisson2d/modern.f90

+          phi_prime(i,j) =   (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4._dp &
+                           + (dx/2._dp)*(dy/2._dp)/epsilon0*rho_sampled(i,j)


The _dp suffix for a few literal constants here is unnecessary (let the compiler coerce to correct precision). And, if you wanted to switch to single precision, you'd only need to change it in the declaration of phi and other arrays, and not fiddle with literal constants' precisions here.

Suggested change

phi_prime(i,j) = (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4._dp &

+ (dx/2._dp)*(dy/2._dp)/epsilon0*rho_sampled(i,j)

phi_prime(i,j) = (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4 &

+ (dx/2)*(dy/2)/epsilon0*rho_sampled(i,j)

I don't know the rules on expression evaluation well. I was guessing that the _dp is required to avoid getting a default-precision result for the division. I'm fine with changing it. but GitHub is marking this review comment as outdated so I think I'll have to make the change locally and push it.

I am a fan of concise code, so I agree with Milan. :)

In the above cases where 2 is an integer it doesn't matter. However when you are dealing with 0.213535 then things gets interesting since it first gets interpreted as a real(single), then gets cast to a real(double) which is not what you want. :)

But here 2 is sufficient int -> double is perfect representation.

Yes, I recommend to always write floating point numbers as 0.5_dp, whether or not they can be represented exactly. That way one cannot make a mistake and it's a simple rule to follow. Floating point divided by an integer is guaranteed to be a floating point, so there I prefer to just use an integer 2 instead of 2.0_dp.

poisson2d/modern.f90

milancurcic · 2021-06-29T16:19:51Z

If you like the whole-array form over the do concurrent one, you could rename the sources as "archaic" -> "imperative" and "modern" -> "array-oriented". I think the style-based naming communicates more than the era-based naming. Just an idea.

milancurcic · 2021-06-29T16:21:40Z

I don't know why yet, but modern.f90 in this PR, using gfortran-9.3.0 converges only for -O[01], but does not converge for -O{2,3,fast}. I think we need to fix it before merging.

awvwgk

Unfortunately, the code fails to compile with two of four compilers I tested (gfortran 11.1 and nvfortran 21.5 are okay, while nagfor 7.0 and ifort 2021.2 fail). At least the build with ifort seems mendable by putting the submodule in a separate file.

arunningcroc · 2021-06-29T16:59:58Z

I don't know why yet, but modern.f90 in this PR, using gfortran-9.3.0 converges only for -O[01], but does not converge for -O{2,3,fast}. I think we need to fix it before merging.

I seem to be able to converge it for M=100, 200, 300 with -Ofast, or at least it outputs a number of iterations in line with the C code (and Cython/python where available) using gfortran-9.3.0

milancurcic · 2021-06-29T17:10:36Z

I apologize, I gave incomplete information. I don't converge for M=50 (I made it small for shorter tests). Do the method and the problem require a minimum value of M for the method to converge?

arunningcroc · 2021-06-29T17:21:46Z

I apologize, I gave incomplete information. I don't converge for M=50 (I made it small for shorter tests). Do the method and the problem require a minimum value of M for the method to converge?

I get the same problem for M=50. I'm not sure if there's any general proof for what kind of grid is too small, but the other Fortran codes seem to work. So clearly the modern code is somehow generating a different code. The basic logic of it seems to be the same, though, and works for bigger M, so I wonder if the compiler isn't doing some wonky optimization because of the modern features that is simply buggy?

Co-authored-by: Milan Curcic <[email protected]>

Co-authored-by: Ondřej Čertík <[email protected]>

Co-authored-by: Milan Curcic <[email protected]>

rouson · 2021-06-29T18:58:40Z

When compiled with gfortran (installed via Homebrew on macOS Big Sur) 11.1.0 with no optimization , the code converges for me for all the M values that make sense: M>2. When compiled with gfortran -Ofast, the code converges for M>90.

I also tried the NAG compiler version 7.0, Build 7044, and found a compiler bug that I just reported to NAG. I'm pasting the bug report below.

I'm not sure what to do from here. Before refactoring code, it's always safest to have autoamated tests in place. I hope that any code contributed to this repository will include tests, which will probably need to be internal to the code, given the difficulty of setting up a multi-language test harness that covers so many languages.

Bug Report

% cat interface-not-found.f90 

module f_interface
contains
  pure real function f()
    f = 1.
  end function
end module 

  use f_interface
  implicit none

  integer i
  real f_sampled(1)
  
  associate(dy => 0.) 
    do concurrent(i=1:1)
      f_sampled(i) = f()
    end do
  end associate
end

% nagfor interface-not-found.f90 
NAG Fortran Compiler Release 7.0(Yurakucho) Build 7044
Error: interface-not-found.f90, line 19: Implicit type for F
Questionable: interface-not-found.f90, line 19: Variable F_SAMPLED set but never referenced
[NAG Fortran Compiler pass 1 error termination, 1 error, 1 warning]

arunningcroc · 2021-06-29T19:35:10Z

I've managed to isolate the problem. In a puzzling turn of events, it appears gfortran doesn't like the rho function and the submodule for some reason. Removing both the rewritten if-else as well as the submodule seems to fix this problem (but neither one on their own fixes it). That is, just use the old rho function in a module instead of the submodule, and for some reason the code works again.

Edit: but also, simply putting rho in a submodule and then calling it from a test program still doesn't cause any issues, and the code executes correctly. The loop seems to be required for the convergence issue to arise.

rouson · 2021-06-29T21:47:52Z

@arunningcroc could you either insert code edits on GitHub or submit a new pull request with working code? Most importantly, if you were the original author (or simply if you have thoughts on how to do so), could you include some sort of results-verification code? Possibly the best thing might be for each of the Fortran versions to be refactored into being functions instead of main programs. Assuming someone has verified that some version produces the desire solution, that version could be the reference version of the code. Then there can be just one Fortran main program that calls each version with cpu_time calls in-between for timings. The result produced by each refactored version can then be checked against the result of the reference version.

@certik if you don't mind introducing object-oriented programming, @everythingfunctional and I have developed a design pattern that captures everything I just wrote and specifies requirements in the form of deferred bindings on an abstract type we call the Oracle pattern, inspired by the common term "test oracle." It's actually a special case of a Template Method pattern. In addition to saving time explaining the same concepts in English, requiring contributors extend an oracle abstract derived type enforces the project requirements formally and operational and ensures uniformity across the project.

certik · 2021-06-29T22:08:57Z

@certik if you don't mind introducing object-oriented programming,

Absolutely. Please submit another PR with the OO code!

arunningcroc · 2021-06-30T05:41:04Z

@rouson
@arunningcroc could you either insert code edits on GitHub or submit a new pull request with working code? Most importantly, if you were the original author (or simply if you have thoughts on how to do so), could you include some sort of results-verification code? Possibly the best thing might be for each of the Fortran versions to be refactored into being functions instead of main programs. Assuming someone has verified that some version produces the desire solution, that version could be the reference version of the code. Then there can be just one Fortran main program that calls each version with cpu_time calls in-between for timings. The result produced by each refactored version can then be checked against the result of the reference version.

Yes, I'll try to get this done this afternoon. Also, I think I should probably write some sort of problem description pdf, explaining the theory and the numerical method in detail, so that anyone may easily implement it in their language of choice.

For the test cases, I know that the Python code should produce correct results, seeing as it is lifted almost wholesale from a computational physics textbook. However, there are also known analytical solutions to 2D Poisson and Laplace equations, and these might be valuable for testing as well. Presumably finding the correct solution to a few such cases will be a good guarantee that the code works as expected.

If I separate the Fortran code in to modules/functions and call them all from a single main program, should I do the same for the Python, C and Cython codes as well, so that each language is called from some file like "main.f90", "main.py", etc? Then anyone who wants to see all the timings of a particular language could always just execute the main file for that language, and one could imagine adding some bash/powershell script to run all of them consequentially.

rouson · 2021-06-30T17:34:21Z

@arunningcroc that all sounds great. The C code could be called by the Fortran main program via C bindings. I've heard that other languages also support C bindings so it might also be possible to call other languages from one Fortran main program.

rouson added 4 commits June 28, 2021 19:32

reduce poisson2d optimized.f90 problem size

39c006a

feat: make dy immutable via expression-association

648739c

certik reviewed Jun 29, 2021

View reviewed changes