
Performance regression 2.28 #1055

Closed
rok-cesnovar opened this issue Nov 7, 2021 · 11 comments


@rok-cesnovar
Member

rok-cesnovar commented Nov 7, 2021

Description:

A performance regression was identified on Discourse: https://discourse.mc-stan.org/t/speed-difference-between-rstan-and-cmdstan-for-a-simple-model/25113

The regression is most evident with a trivial model like:

```stan
data {
  int N;
}
parameters {
  vector[N] a;
}
model {
  a ~ normal(0,1);
}
```

Execution times with N=10000:

[image: execution time comparison]

It's much less evident with non-trivial models, but still, something to investigate.

Based on my investigation and investigation by the user rfc on Discourse (EDIT: @fcostin on Github) it has to do with writing output samples: https://discourse.mc-stan.org/t/speed-difference-between-rstan-and-cmdstan-for-a-simple-model/25113/23

It's probably due to the changes made to support running multiple chains.

EDIT: #987 was confirmed to be the issue.

Current Version:

v2.28.1

@fcostin

fcostin commented Nov 7, 2021

(I'm the same rfc from Discourse.)

I suspect it's the diagnostic output. With cmdstan 2.27.0, if you explicitly enable diagnostic output on the command line, the run takes a similar amount of time to cmdstan 2.28.1, whereas the running time of cmdstan 2.28.1 doesn't change whether or not the user specifies a diagnostic output file:

| cmdstan version | command | running time (sec) |
| --- | --- | --- |
| 2.27.0 | `./bench sample data file=bench-data.json random seed=1234` | 19.326 |
| 2.27.0 | `./bench sample data file=bench-data.json random seed=1234 output diagnostic_file=diagnostic.log` | 29.611 |
| 2.28.1 | `./bench sample data file=bench-data.json random seed=1234` | 28.493 |
| 2.28.1 | `./bench sample data file=bench-data.json random seed=1234 output diagnostic_file=diagnostic.log` | 29.531 |

Could it be possible that cmdstan 2.28.1 by default is doing all the hard work to format diagnostic output and then throwing it away?

I don't really understand the code, but this part of #987 looks a bit suspicious:

```cpp
} else {
  diagnostic_writers.emplace_back(
      std::make_unique<std::fstream>("", std::fstream::out), "# ");
}
```

@rok-cesnovar
Member Author

That is a very nice catch @fcostin!! You are spot on.

Commenting out all the diagnostic writes in the stan submodule gets the performance back. So we just need to figure out how to handle this in command.hpp.

@fcostin

fcostin commented Nov 7, 2021

It looks like the culprit is stan/src/stan/callbacks/unique_stream_writer.hpp:

https://github.com/stan-dev/stan/blob/0e4e2a9f8e4af6fcc6baae5070fc4acddbddd9ad/src/stan/callbacks/unique_stream_writer.hpp

In the case where the user does not specify a diagnostic_file, command.hpp is still setting the diagnostic writer to a unique_stream_writer.

Internally, unique_stream_writer is buffering all writes to a local std::stringstream streamer before writing to the output stream.

Curiously, removing the stringstream buffering doesn't suffice to fix the problem.

If I add an explicit bool toggle to unique_stream_writer to set it into no-op mode, and modify command.hpp to toggle on no-op mode when constructing the diagnostic writer when the user hasn't specified a diagnostic file, then I can observe speedup and running time in line with cmdstan 2.27.0 performance.

It looks like the default Stan writer is a no-op writer implementation, but I'm not familiar enough with C++'s type system to suggest a clean patch.

A cleaner way might be to make diagnostic_writers in command.hpp a vector of pointers to the base writer type, so it could contain a mix of different concrete writer types, and some could be swapped out for no-op writers.

Here's my dodgy patch:

```diff
diff --unified command_base.hpp command_hack.hpp
--- command_base.hpp	2021-11-08 09:14:58.254349445 +1100
+++ command_hack.hpp	2021-11-08 09:15:37.941928613 +1100
@@ -490,17 +490,20 @@
     if (!sig_figs_arg->is_default()) {
       (*unique_fstream.get()) << std::setprecision(sig_figs_arg->value());
     }
-    sample_writers.emplace_back(std::move(unique_fstream), "# ");
+    sample_writers.emplace_back(false, std::move(unique_fstream), "# ");
     if (diagnostic_file != "") {
       auto diagnostic_filename
           = diagnostic_name + name_iterator(i) + diagnostic_ending;
       diagnostic_writers.emplace_back(
+          false,
           std::make_unique<std::fstream>(diagnostic_filename,
                                          std::fstream::out),
           "# ");
     } else {
       diagnostic_writers.emplace_back(
-          std::make_unique<std::fstream>("", std::fstream::out), "# ");
+          true,
+          std::make_unique<std::fstream>("", std::fstream::out),
+          "# ");
     }
   }
   for (int i = 0; i < num_chains; i++) {
diff --unified unique_stream_writer_base.hpp unique_stream_writer_hack.hpp
--- unique_stream_writer_base.hpp	2021-11-08 09:12:18.815994333 +1100
+++ unique_stream_writer_hack.hpp	2021-11-08 09:11:41.792364077 +1100
@@ -26,14 +26,15 @@
    * @param[in] comment_prefix string to stream before each comment line.
    *  Default is "".
    */
-  explicit unique_stream_writer(std::unique_ptr<Stream>&& output,
+  explicit unique_stream_writer(const bool noop, std::unique_ptr<Stream>&& output,
                                 const std::string& comment_prefix = "")
-      : output_(std::move(output)), comment_prefix_(comment_prefix) {}
+      : noop_(noop), output_(std::move(output)), comment_prefix_(comment_prefix) {}
 
   unique_stream_writer();
   unique_stream_writer(unique_stream_writer& other) = delete;
   unique_stream_writer(unique_stream_writer&& other)
-      : output_(std::move(other.output_)),
+      : noop_(std::move(other.noop_)),
+        output_(std::move(other.output_)),
         comment_prefix_(std::move(other.comment_prefix_)) {}
   /**
    * Virtual destructor
@@ -70,10 +71,9 @@
    * Writes the comment_prefix to the stream followed by a newline.
    */
   void operator()() {
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
-    streamer << comment_prefix_ << std::endl;
-    *output_ << streamer.str();
+    if (noop_)
+      return;
+    *output_ << comment_prefix_ << std::endl;
   }
 
   /**
@@ -82,13 +82,13 @@
    * @param[in] message A string
    */
   void operator()(const std::string& message) {
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
-    streamer << comment_prefix_ << message << std::endl;
-    *output_ << streamer.str();
+    if (noop_)
+      return;
+    *output_ << comment_prefix_ << message << std::endl;
   }
 
  private:
+  bool noop_;
   /**
    * Output stream
    */
@@ -109,18 +109,15 @@
    */
   template <class T>
   void write_vector(const std::vector<T>& v) {
-    if (v.empty())
+    if (v.empty() or noop_)
       return;
     using const_iter = typename std::vector<T>::const_iterator;
     const_iter last = v.end();
     --last;
-    std::stringstream streamer;
-    streamer.precision(output_.get()->precision());
     for (const_iter it = v.begin(); it != last; ++it) {
-      streamer << *it << ",";
+      *output_ << *it << ",";
     }
-    streamer << v.back() << std::endl;
-    *output_ << streamer.str();
+    *output_ << v.back() << std::endl;
   }
 };
```

@rok-cesnovar
Member Author

Thanks @fcostin for the suggested fix.

I will let @SteveBronder chime in, as he knows more about the decisions that went into this.

@wds15
Contributor

wds15 commented Nov 8, 2021

So is the performance regression due to buffering output after all? That's done to avoid cluttered output under threading. Hopefully we can make the buffer more efficient as this thing is there for a good reason.

Also: is the performance regression seen for models that do more intense computations? How about a binomial likelihood with normalisation (e.g. the Bernoulli example model)? Or something more complicated like a negative binomial?

I have the impression that almost trivial models are the ones affected by this... which does not mean we should not fix it, but maybe not at "any cost"?

@rok-cesnovar
Member Author

rok-cesnovar commented Nov 8, 2021

So is the performance regression due to buffering output after all?

It's due to buffering the diagnostic output even when the diagnostic file is "". We essentially pay the cost of outputting diagnostics whether or not they are requested.

Is the performance regression seen for any model which does some more intense computations?

The more computation there is, the less obvious the performance hit is.

I am having the impression that almost trivial models are affected by this... which does not mean we should not fix it, but maybe not at "any cost"?

I agree that trivial models probably aren't the focal point, but I think the solution here should be trivial. Worst case, we add a boolean like @fcostin proposes.

The essential problem was that pre-2.28, this line produced a no-op diagnostic writer:

`std::fstream diagnostic_stream(`

while this line https://github.com/stan-dev/cmdstan/blob/v2.28.1/src/cmdstan/command.hpp#L502 does not do that.

@wds15
Contributor

wds15 commented Nov 8, 2021

Sounds like an easy matter then to disable the diagnostic output whenever it's not requested.

@fcostin

fcostin commented Nov 8, 2021

I agree with @rok-cesnovar's summary.

[buffering is] done to avoid cluttered output under threading. Hopefully we can make the buffer more efficient as this thing is there for a good reason.

That makes sense. The problem is not the buffer itself, but that diagnostic output is computed and written to the buffer even when the user has not requested any diagnostic output. If the user has requested diagnostic output, performance seems comparable to cmdstan 2.27.0.

When a user does not ask for diagnostic output on this extremely trivial model, cmdstan 2.28.1 executes approximately 50% more CPU instructions than necessary during sampling (!), and these extra instructions go to string-formatting floating point numbers for diagnostic output that is then thrown away, i.e. the additional 50% of work is pure waste.

My example patch is not minimal or the best solution -- there is no need to remove the buffering in unique_stream_writer.

It would be cleaner to modify command.hpp so that, when a user has not explicitly requested diagnostic output, it disables it by wiring in a no-op diagnostic writer. That should be easy, but it may require changing the type of the diagnostic_writers vector in command.hpp from a vector of unique_stream_writer to a vector of pointers to the base writer class: https://github.com/stan-dev/stan/blob/develop/src/stan/callbacks/writer.hpp

@wds15
Contributor

wds15 commented Nov 8, 2021

Oh dear…I reviewed this pr at the time and did not notice. Hopefully @SteveBronder can sort this out quickly.

@SteveBronder
Contributor

SteveBronder commented Nov 9, 2021

@fcostin thank you for the thorough review! Yes, I like your solution of making these a pointer up in command.hpp and having an empty_writer to make the writes a no-op. I can put up a PR for this tomorrow.

EDIT: Actually, I think we only need a constructor for writer that takes in a unique_ptr for the stream but does nothing.

@rok-cesnovar
Member Author

Closed by #1060
