Loading time is really slow with large thread count once again #293
Marian is not designed to work with memory shared across multiple graphs. mmap was used as a hack to achieve that in master, but it's not compatible with models that are prepared on demand (which we have to do if we want efficient multiplication on a wide range of hardware). Making graphs work with the same shared (constant) memory is doable, but hacky with the current implementation.
I hadn't thought about the pivoting use case. That would not work well with the std::thread() approach. std::async should still be safe, but I agree that it isn't ideal as you give away control (and what about freeing the …).

I'd have a small preference for not having …

For wasm it might be easier to have loading happen somewhere in …

One thing I was thinking of was just passing in empty translation requests, which would trigger that initial load. It would also solve the issue of having to retain the model instance until it is loaded, because the request would own a shared_ptr to it. But the batcher doesn't have a way to force me to send a request to each worker specifically. So that's a no-go.

I think the best way to go right now is to do initialisation (…
My understanding is that pivoting should not conflict with any edits here. If I understand correctly, you're concerned about the first translation for the user. I think going from lazy to eager is achievable externally by an empty translation and discard, but laziness has to be implemented internally. So I like the empty-translation approach followed by making things lazy, deferring the expensive op until it is absolutely required.
This sounds like a way to go. A factory for …
I remember visiting a few translateLocally threads where this was part of the discussion, and I understand there's value in this - especially to developers and enthusiasts. However, my thinking is that the average user will not care about wps or how things are happening - all they want is the translation in front of them. Let's open an issue for this and take it forward.
Yep, that was what I was thinking! Only I'm a bit sceptical that just an empty translation is sufficient, as I didn't see an easy way to guarantee that all workers receive a batch. The batcher doesn't seem to be aware of which worker is requesting a batch, so there is no easy way to get an empty batch to each of them. But just having lazy loading would already be a huge improvement on its own. We can worry about making it eager (but parallel!) via some tricks afterwards.
The main problem as it stands currently is that if you try to load translateLocally (especially on Windows) on a gaming machine with 24 cores, you get a very annoying and inexplicable 6-second lag until you can do the first translation. I don't quite see the merit of lazy initialisation: I think most of the time the user would want to translate more than n sentences, where n is the number of locally available threads, and initialising n workers in parallel would take about as much time as initialising one worker (discarding factors such as single-thread turbo boost).
It's a bit of a misnomer in this case, maybe. The problem right now is that all workers are initialised sequentially on the main thread. Delaying that initialisation until it's necessary is just an easy way to move the initialisation into the worker threads. The worker threads can then each do their own initialisation, so the initialisation can happen in parallel.
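A minimal sketch of that idea, assuming a hypothetical `Backend` type and `loadModel()` helper (not the actual bergamot-translator API): each worker loads its own backend on first use, so the expensive loads run concurrently in the worker threads instead of sequentially on the main thread.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-worker state; stands in for a graph plus scorers.
struct Backend {
  bool initialised = false;
  void loadModel() { /* expensive: build graph, load parameters */ }
};

void workerLoop(Backend &backend /*, BatchQueue &queue */) {
  // First use triggers the load inside the worker thread itself.
  if (!backend.initialised) {
    backend.loadModel();
    backend.initialised = true;
  }
  // ... pull batches and translate with this backend ...
}

int main() {
  std::size_t numWorkers = std::thread::hardware_concurrency();
  if (numWorkers == 0) numWorkers = 1;
  std::vector<Backend> backends(numWorkers);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < numWorkers; ++i)
    workers.emplace_back(workerLoop, std::ref(backends[i]));
  for (auto &worker : workers) worker.join();
}
```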
One thing I'm worried about is that when you have 24 threads and 24 workers but a very small workload (i.e. you just started typing in translateLocally and you have translate-as-you-type turned on), will those sentences end up in the mostly-idle worker that has the model already loaded, or will they end up at one of the other 23 workers that first have to do the (lazy) loading of the model and only then provide a translation? Context: translateLocally currently still makes sure there is only one translation request at a time, but such a request could of course contain multiple batches.
We should put energy into solving the underlying problem by loading the model once and sharing the memory across threads, rather than adding kludges on top.
Until this is done, though, can we have a quick fix?
From what I've gathered so far:
So solving the loading speed regression would mean loading in the worker threads, as there is no way to avoid initialising N graphs and N scorers. Either through lazy loading, or through a somewhat more complex threadpool (there's one in marian's 3rd_party/threadpool.h) that can handle arbitrary tasks. Marian just opens a new threadpool for loading and another for translating, which guarantees that loading has finished before any translation tasks are started. It's pretty similar to my …

Sidetracked: I noticed that we now run a graph with a certain device id always in the same thread, but I can't find any code in tensor/cpu/backend.h that specifically calls for any affinity with that device id; it's just used to seed the random generator.

There might be opportunities to reduce memory consumption a bit by hooking the loading from memory bundles into the mmap loading path. Maybe that's already happening, not sure yet. Edit: when loading from a bundle, it's already using the same …

Slightly related to this, there is the reduce-memory-consumption-by-sharing-a-workspace work that might speed up loading a second model, because we can re-use the workspace.
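For the eager-but-parallel route (loading in worker threads without waiting for the first translation), a rough sketch using plain `std::async`; `loadOneBackend` is a stand-in for whatever builds one graph and scorer ensemble, not an existing function:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder for the per-replica load (graph construction, parameter load,
// scorer setup). In bergamot-translator terms this would be one loadBackend(idx).
void loadOneBackend(std::size_t idx) { (void)idx; /* expensive work here */ }

void loadAllBackendsInParallel(std::size_t replicas) {
  std::vector<std::future<void>> pending;
  pending.reserve(replicas);
  for (std::size_t idx = 0; idx < replicas; ++idx)
    pending.push_back(std::async(std::launch::async, loadOneBackend, idx));
  // Block until every backend is ready, so no translation task can race a load.
  for (auto &f : pending) f.get();
}

int main() { loadAllBackendsInParallel(4); }
```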
Quick test in translateLocally:
Looks like parallel loading benefits from using the bundle already.

bundle-load (patch for translateLocally):

diff --git a/src/MarianInterface.cpp b/src/MarianInterface.cpp
index 378b08402941b2a45f8907dc4109d0fb39a8759f..0b9d2af438c3bff09286788cb2e1ac048e79fb61 100644
--- a/src/MarianInterface.cpp
+++ b/src/MarianInterface.cpp
@@ -112,7 +113,10 @@ MarianInterface::MarianInterface(QObject *parent)
// service is done with it, which it is since all translation
// requests are effectively blocking in this thread.
auto modelConfig = makeOptions(modelChange->config_file, modelChange->settings);
- model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, marian::bergamot::MemoryBundle{}, modelChange->settings.cpu_threads);
+
+ auto bundle = marian::bergamot::getMemoryBundleFromConfig(modelConfig);
+
+ model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, std::move(bundle), modelChange->settings.cpu_threads);
} else if (input) {
if (model) {
std::future<int> wordCount = std::async(countWords, *input); // @TODO we're doing an "unnecessary" string copy here (necessary because we std::move input into service->translate)

parallel-init (patch for bergamot-translator):

diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 6d2169494ee1ec1d66c0275b56a17a6d4e1e16ff..f9400751bed3c316db1f06ec5b0c6229ea2bec61 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -107,6 +107,7 @@ class TranslationModel {
Graph graph;
ScorerEnsemble scorerEnsemble;
+ bool initialised{false};
};
// ShortlistGenerator is purely const, we don't need one per thread.
diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp
index 9d2eb0cdb73526584d53e5cc2e32facfffc9650e..df2d413f966fd2000a42580058438457570d460a 100644
--- a/src/translator/translation_model.cpp
+++ b/src/translator/translation_model.cpp
@@ -44,21 +44,17 @@ TranslationModel::TranslationModel(const Config &options, MemoryBundle &&memory
srcIdx, trgIdx, shared_vcb);
}
}
-
- for (size_t idx = 0; idx < replicas; idx++) {
- loadBackend(idx);
- }
}
void TranslationModel::loadBackend(size_t idx) {
auto &graph = backend_[idx].graph;
auto &scorerEnsemble = backend_[idx].scorerEnsemble;
@@ -172,6 +168,12 @@ Ptr<marian::data::CorpusBatch> TranslationModel::convertToMarianBatch(Batch &bat
void TranslationModel::translateBatch(size_t deviceId, Batch &batch) {
auto &backend = backend_[deviceId];
+
+ if (!backend.initialised) {
+ loadBackend(deviceId);
+ backend.initialised = true;
+ }
+
BeamSearch search(options_, backend.scorerEnsemble, vocabs_.target());
Histories histories = search.search(backend.graph, convertToMarianBatch(batch));
batch.completeBatch(histories);
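One note on the patch above: the plain `bool initialised` flag is only safe if a given `deviceId` is always served by the same worker thread (which the earlier observation about device ids and threads suggests, but hasn't been verified here). If that guarantee ever went away, a per-backend `std::once_flag` would keep the lazy load race-free; a hedged sketch with placeholder types, not the real class:

```cpp
#include <cstddef>
#include <mutex>

// Stand-in for the per-replica state from the diff above.
struct Backend {
  std::once_flag loadOnce;  // replaces the plain bool flag
  // Graph graph; ScorerEnsemble scorerEnsemble;  // real members elided
};

// Placeholder for the expensive per-replica load.
void loadBackend(std::size_t /*idx*/) { /* build graph, load parameters */ }

void translateBatch(Backend &backend, std::size_t deviceId) {
  // call_once runs loadBackend(deviceId) exactly once per backend,
  // even if two threads were ever to race on the same deviceId.
  std::call_once(backend.loadOnce, loadBackend, deviceId);
  // ... run beam search against backend.graph ...
}
```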
The bergamot-translator should do the bundle loading inside its constructor. Once it reads the config file, it has access to the memory information and can just load it before initialising workers. There is no reason why the bundle loading should happen outside bergamot-translator.
Could we please have a simple fix until the proper one? I would like to push a build that doesn't take 8 seconds to start on my PC...
I've opened #303, which is basically the diff above. I've got no better short-term solution (nor long-term, really).
Yes please, that would be helpful. I didn't add it to the pull request yet because I couldn't see an easy way of checking whether the MemoryBundle argument was a default-constructed one or not. Ideally I could just overload the TranslationModel constructor for the didn't-pass-in-a-memory-bundle case, but the bundle is not the last argument; thread count is, which you do want to pass in the translateLocally use case.
diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 3e79fdb..1861bc9 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -58,6 +58,10 @@ class TranslationModel {
/// ShortlistGenerator, Vocabs and SentenceSplitter.
TranslationModel(const Config& options, MemoryBundle&& memory = MemoryBundle{}, size_t replicas = 1);
+ // Path for translateLocally
+ TranslationModel(const Config& options, size_t replicas = 1)
+ : TranslationModel(options, getMemoryBundleFromConfig(options), replicas) {}
+
/// Make a Request to be translated by this TranslationModel instance.
/// @param [in] requestId: Unique identifier associated with this request, available from Service.
/// @param [in] source: Source text to be translated. Ownership is accepted and eventually returned to the client in
Something like this?
Exactly like that. Does that compile? If you only pass in options, how does the compiler know which of the two constructors to use? (Don't know why I hadn't thought of that; I somehow had it in my head that all constructors had to have the same arguments, just with more or fewer of them optional.)
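To make the overload question concrete (a standalone illustration with dummy types, not the real classes): when the `MemoryBundle&&` parameter has a default argument, a one-argument call matches both constructors and the compiler reports an ambiguity; dropping that default, as in the diff below, leaves exactly one viable overload for a single-argument call.

```cpp
#include <cstddef>

struct Config {};
struct MemoryBundle {};

struct Ambiguous {
  // Both constructors accept a single Config argument once defaults kick in.
  Ambiguous(const Config&, MemoryBundle&& = MemoryBundle{}, std::size_t = 1) {}
  Ambiguous(const Config&, std::size_t = 1) {}
};

struct Resolved {
  // No default for the bundle: a one-argument call has exactly one match.
  Resolved(const Config&, MemoryBundle&&, std::size_t = 1) {}
  Resolved(const Config&, std::size_t = 1) {}
};

int main() {
  Config config;
  // Ambiguous broken(config);              // error: call of overloaded constructor is ambiguous
  Resolved lazy(config);                     // picks the (Config, size_t) overload
  Resolved eager(config, MemoryBundle{}, 4); // explicit bundle and replica count
  (void)lazy; (void)eager;
}
```

The class itself compiles either way; the ambiguity only shows up at a single-argument call site, which is why removing the default on the memory parameter resolves it.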
diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 3e79fdb..3c8de81 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -6,6 +6,7 @@
#include "batch.h"
#include "batching_pool.h"
+#include "byte_array_util.h"
#include "cache.h"
#include "common/utils.h"
#include "data/shortlist.h"
@@ -56,7 +57,11 @@ class TranslationModel {
/// @param [in] options: Marian options object.
/// @param [in] memory: MemoryBundle object holding memory buffers containing parameters to build MarianBackend,
/// ShortlistGenerator, Vocabs and SentenceSplitter.
- TranslationModel(const Config& options, MemoryBundle&& memory = MemoryBundle{}, size_t replicas = 1);
+ TranslationModel(const Config& options, MemoryBundle&& memory, size_t replicas = 1);
+
+ // Path for translateLocally
+ TranslationModel(const Config& options, size_t replicas = 1)
+ : TranslationModel(options, getMemoryBundleFromConfig(options), replicas) {}
/// Make a Request to be translated by this TranslationModel instance.
/// @param [in] requestId: Unique identifier associated with this request, available from Service.

The above compiles, still unverified as a fix though.
This was identified as a bergamot-translator issue in XapaJIaMnu/translateLocally#76.

A nice solution may involve shared model memory across worker threads (avoiding intgemm/shortlist preprocessing placing stuff in the graph [needs verification]). This memory will be owned by TranslationModel. Everything transient will remain in the workspace, and the workspace will be attached to the worker. This issue is closely related to #257.

A temporary workaround provided by @jelmervdl is: …

Unsure about putting std::thread or std::async within TranslationModel; the threading and delegations should ideally be within Service. As part of resolving this, we should ideally check in something on var through BRT which checks that model-loading speeds remain unaffected hereafter.
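A hedged sketch of the ownership split described above, with placeholder types rather than actual marian/bergamot classes, and assuming the shared parameter blob really can stay read-only (the intgemm/shortlist caveat above still needs verification): the model owns one immutable copy of the parameters, while each worker keeps only its own mutable workspace.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Immutable, loaded once, shared by every worker (placeholder type).
using SharedModelMemory = std::shared_ptr<const std::vector<char>>;

// Mutable scratch space; one per worker, never shared.
struct Workspace {
  std::vector<char> scratch;
  explicit Workspace(std::size_t bytes) : scratch(bytes) {}
};

struct Worker {
  SharedModelMemory model;  // refcounted view of the single parameter blob
  Workspace workspace;      // transient tensors live here only
  Worker(SharedModelMemory m, std::size_t workspaceBytes)
      : model(std::move(m)), workspace(workspaceBytes) {}
};

int main() {
  auto modelMemory = std::make_shared<const std::vector<char>>(/* parameter bytes */);
  std::vector<Worker> workers;
  for (std::size_t i = 0; i < 4; ++i)
    workers.emplace_back(modelMemory, 64 * 1024 * 1024);  // 64 MiB workspace each
}
```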