
Performance regression 2.28 #1055

Closed
rok-cesnovar opened this issue Nov 7, 2021 · 11 comments


@rok-cesnovar
Member

rok-cesnovar commented Nov 7, 2021

Description:

A performance regression was identified on Discourse: https://discourse.mc-stan.org/t/speed-difference-between-rstan-and-cmdstan-for-a-simple-model/25113

The regression is most evident with a trivial model like:

```stan
data {
  int N;
}
parameters {
  vector[N] a;
}
model {
  a ~ normal(0,1);
}
```

Execution times with N=10000:

[image: execution time comparison]

It's much less evident with non-trivial models, but still, something to investigate.

Based on my investigation and investigation by the user rfc on Discourse (EDIT: @fcostin on Github) it has to do with writing output samples: https://discourse.mc-stan.org/t/speed-difference-between-rstan-and-cmdstan-for-a-simple-model/25113/23

It's probably due to the changes made to support running multiple chains.

EDIT: #987 was confirmed to be the issue.

Current Version:

v2.28.1

@fcostin

fcostin commented Nov 7, 2021

(I'm the same rfc from Discourse.)

I suspect it's the diagnostic output. With cmdstan 2.27.0, if you explicitly enable diagnostic output on the command line, the run takes a similar amount of time to cmdstan 2.28.1, whereas the running time of cmdstan 2.28.1 doesn't change whether or not the user specifies a diagnostic output file:

| cmdstan version | command | running time (sec) |
| --- | --- | --- |
| 2.27.0 | `./bench sample data file=bench-data.json random seed=1234` | 19.326 |
| 2.27.0 | `./bench sample data file=bench-data.json random seed=1234 output diagnostic_file=diagnostic.log` | 29.611 |
| 2.28.1 | `./bench sample data file=bench-data.json random seed=1234` | 28.493 |
| 2.28.1 | `./bench sample data file=bench-data.json random seed=1234 output diagnostic_file=diagnostic.log` | 29.531 |

Could it be possible that cmdstan 2.28.1 by default is doing all the hard work to format diagnostic output and then throwing it away?

I don't really understand the code, but this part of #987 looks a bit suspicious:

```cpp
} else {
  diagnostic_writers.emplace_back(
      std::make_unique<std::fstream>("", std::fstream::out), "# ");
}
```

@rok-cesnovar
Member Author

That is a very nice catch @fcostin!! You are spot on.

Commenting out all the diagnostic writes in the stan submodule gets the performance back. So we just need to figure out how to handle this in command.hpp.

@fcostin

fcostin commented Nov 7, 2021

It looks like the culprit is stan/src/stan/callbacks/unique_stream_writer.hpp:

https://github.com/stan-dev/stan/blob/0e4e2a9f8e4af6fcc6baae5070fc4acddbddd9ad/src/stan/callbacks/unique_stream_writer.hpp

In the case where the user does not specify a diagnostic_file, command.hpp is still setting the diagnostic writer to a unique_stream_writer.

Internally, unique_stream_writer is buffering all writes to a local std::stringstream streamer before writing to the output stream.

Curiously, removing the stringstream buffering doesn't suffice to fix the problem.

If I add an explicit bool toggle to unique_stream_writer to set it into no-op mode, and modify command.hpp to toggle on no-op mode when constructing the diagnostic writer when the user hasn't specified a diagnostic file, then I can observe speedup and running time in line with cmdstan 2.27.0 performance.

It looks like the default Stan writer is a no-op writer implementation, but I'm not familiar enough with C++'s type system to suggest a clean patch.

A cleaner way might be to make diagnostic_writers in command.hpp a vector of pointers to the base writer type, so it could contain a mix of different concrete writer types, and some could be swapped out for no-op writers.

Here's my dodgy patch:

```diff
diff --unified command_base.hpp command_hack.hpp
--- command_base.hpp	2021-11-08 09:14:58.254349445 +1100
+++ command_hack.hpp	2021-11-08 09:15:37.941928613 +1100
@@ -490,17 +490,20 @@
     if (!sig_figs_arg->is_default()) {
       (*unique_fstream.get()) << std::setprecision(sig_figs_arg->value());
     }
-    sample_writers.emplace_back(std::move(unique_fstream), "# ");
+    sample_writers.emplace_back(false, std::move(unique_fstream), "# ");
     if (diagnostic_file != "") {
       auto diagnostic_filename
           = diagnostic_name + name_iterator(i) + diagnostic_ending;
       diagnostic_writers.emplace_back(
+          false,
           std::make_unique<std::fstream>(diagnostic_filename,
                                          std::fstream::out),
           "# ");
     } else {
       diagnostic_writers.emplace_back(
-          std::make_unique<std::fstream>("", std::fstream::out), "# ");
+          true,
+          std::make_unique<std::fstream>("", std::fstream::out),
+          "# ");
     }
   }
   for (int i = 0; i < num_chains; i++) {
diff --unified unique_stream_writer_base.hpp unique_stream_writer_hack.hpp
--- unique_stream_writer_base.hpp	2021-11-08 09:12:18.815994333 +1100
+++ unique_stream_writer_hack.hpp	2021-11-08 09:11:41.792364077 +1100
@@ -26,14 +26,15 @@
    * @param[in] comment_prefix string to stream before each comment line.
    *  Default is "".
    */
-  explicit unique_stream_writer(std::unique_ptr<Stream>&& output,
+  explicit unique_stream_writer(const bool noop, std::unique_ptr<Stream>&& output,
                                 const std::string& comment_prefix = "")
-      : output_(std::move(output)), comment_prefix_(comment_prefix) {}
+      : noop_(noop), output_(std::move(output)), comment_prefix_(comment_prefix) {}
 
   unique_stream_writer();
   unique_stream_writer(unique_stream_writer& other) = delete;
   unique_stream_writer(unique_stream_writer&& other)
-      : output_(std::move(other.output_)),
+      : noop_(std::move(other.noop_)),
+        output_(std::move(other.output_)),
         comment_prefix_(std::move(other.comment_prefix_)) {}
   /**
    * Virtual destructor
@@ -70,10 +71,9 @@
    * Writes the comment_prefix to the stream followed by a newline.
    */
   void operator()() {
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
-    streamer << comment_prefix_ << std::endl;
-    *output_ << streamer.str();
+    if (noop_)
+      return;
+    *output_ << comment_prefix_ << std::endl;
   }
 
   /**
@@ -82,13 +82,13 @@
    * @param[in] message A string
    */
   void operator()(const std::string& message) {
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
-    streamer << comment_prefix_ << message << std::endl;
-    *output_ << streamer.str();
+    if (noop_)
+      return;
+    *output_ << comment_prefix_ << message << std::endl;
   }
 
  private:
+  bool noop_;
   /**
    * Output stream
    */
@@ -109,18 +109,15 @@
    */
   template <class T>
   void write_vector(const std::vector<T>& v) {
-    if (v.empty())
+    if (v.empty() or noop_)
       return;
     using const_iter = typename std::vector<T>::const_iterator;
     const_iter last = v.end();
     --last;
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
     for (const_iter it = v.begin(); it != last; ++it) {
-      streamer << *it << ",";
+      *output_ << *it << ",";
     }
-    streamer << v.back() << std::endl;
-    *output_ << streamer.str();
+    *output_ << v.back() << std::endl;
   }
 };
```

@rok-cesnovar
Member Author

Thanks @fcostin for the suggested fix.

I will let @SteveBronder chime in, as he knows more about the decisions that went into this.

@wds15
Contributor

wds15 commented Nov 8, 2021

So is the performance regression due to buffering output after all? That's done to avoid cluttered output under threading. Hopefully we can make the buffer more efficient as this thing is there for a good reason.

Also: is the performance regression seen for models that do more intense computations? How about a binomial likelihood with normalisation (e.g. the Bernoulli example model)? Or something more complicated like a negative binomial?

I have the impression that almost trivial models are the ones affected by this... which does not mean we should not fix it, but maybe not at "any cost"?

@rok-cesnovar
Member Author

rok-cesnovar commented Nov 8, 2021

So is the performance regression due to buffering output after all?

It's due to buffering the diagnostic output even when the diagnostic file is "". We essentially pay the cost of outputting diagnostics whether or not they are requested.

Is the performance regression seen for any model which does some more intense computations?

The more computation there is, the less obvious the performance hit is.

I am having the impression that almost trivial models are affected by this... which does not mean we should not fix it, but maybe not at "any cost"?

I agree that trivial models probably aren't the focal point, but I think the solution here should be trivial. Worst case, we add a boolean like @fcostin proposes.

The essential problem was that pre-2.28, this line produced a no-op diagnostic writer:

`std::fstream diagnostic_stream(`

while this line https://github.com/stan-dev/cmdstan/blob/v2.28.1/src/cmdstan/command.hpp#L502 does not do that.

@wds15
Contributor

wds15 commented Nov 8, 2021

Sounds like an easy matter then to disable the diagnostic output whenever it's not requested.

@fcostin

fcostin commented Nov 8, 2021

I agree with @rok-cesnovar's summary.

[buffering is] done to avoid cluttered output under threading. Hopefully we can make the buffer more efficient as this thing is there for a good reason.

That makes sense. The problem is not the buffer itself, but that diagnostic output is computed and written to the buffer even when the user has not requested any diagnostic output. If the user has requested diagnostic output, performance seems comparable to cmdstan 2.27.0.

When a user does not ask for diagnostic output on this extremely trivial model, cmdstan 2.28.1 executes approximately 50% more CPU instructions than necessary during sampling (!), and these extra instructions go to string-formatting floating point numbers for diagnostic output that is then thrown away, i.e. the additional 50% of work is pure waste.

My example patch is not minimal or the best solution -- there is no need to remove the buffering in unique_stream_writer.

It would be cleaner to modify command.hpp so that, when a user has not explicitly requested diagnostic output, it disables it by wiring in a no-op diagnostic writer. That should be easy, but it may require changing the type of the diagnostic_writers vector in command.hpp from a vector of unique_stream_writer to a vector of pointers to the base writer class: https://github.com/stan-dev/stan/blob/develop/src/stan/callbacks/writer.hpp

@wds15
Contributor

wds15 commented Nov 8, 2021

Oh dear…I reviewed this pr at the time and did not notice. Hopefully @SteveBronder can sort this out quickly.

@SteveBronder
Contributor

SteveBronder commented Nov 9, 2021

@fcostin thank you for the thorough review! Yes, I like your solution of making these a pointer up in command.hpp and having an empty_writer to make the writes a no-op. I can put up a PR for this tomorrow.

EDIT: Actually, I think we only need a constructor for writer that takes in a unique_ptr for the stream but does nothing.

@rok-cesnovar
Member Author

Closed by #1060
