-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCSFilesystem creation is surprisingly slow #1768
Comments
aha, further testing indicates that creating |
url_to_fs
is surprisingly slow
specifying # /// script
# requires-python = ">=3.11"
# dependencies = [
# "fsspec",
# "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time
url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
if __name__ == '__main__':
start = time()
_ = url_to_fs(url, token='anon')
elapsed = time() - start
print(elapsed) so I'm guessing the credentials search process is consuming most of the 25ish seconds I observed with the first version of this script. |
The default behaviour (unless otherwise specified) is to try various credential verification mechanisms in turn, until one works. anon is the fallback when everything else has failed. Unfortunately, there are many possibilities; maybe there's an opportunity to put very short timeouts on processes that ought to be fast, like _connect_cloud, which queries the GCP metadata service and is only available from within google's walls. It would be worthwhile putting timing around the methods in gcsfs.credentials.GoogleCredentials.connect to figure out if this is caused by one particular problematic method. For me, I don't see the problem:
(I did the following to get the times: --- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -229,6 +229,8 @@ class GoogleCredentials:
self._connect_token(method)
elif method is None:
for meth in ["google_default", "cache", "cloud", "anon"]:
+ import time
+ t = time.time()
try:
self.connect(method=meth)
logger.debug("Connected with method %s", meth)
@@ -242,6 +244,7 @@ class GoogleCredentials:
# Reset credentials if they were set but the authentication failed
# (reverts to 'anon' behavior)
self.credentials = None
+ print("FAIL", meth, time.time()-t)
else:
# Since the 'anon' connection method should always succeed, ) |
I noticed the same issue in universal-pathlib where the gcs tests where very slow, and it only occurred without a working internet connection, because then all the timeouts accumulated. setting (the test suite runs against a gcs docker container) |
I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network. |
@ap-- and I are both in germany, so perhaps regional connectivity is part of the story here |
Thanks folks for bringing this up. It keeps biting us in different environments for a while now. I'm able to reproduce it easily at home and in the office and in many other different places.
@martindurant I think the difference should be here - https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/compute_engine/_metadata.py#L207-L227 For me, I'm getting immediately into: except exceptions.TransportError as e:
_LOGGER.warning(
"Compute Engine Metadata server unavailable on "
"attempt %s of %s. Reason: %s",
attempt,
retry_count,
e,
) with a message: It means it goes into retry with an exponential backoff. In your case, for some reason From our brief research (@0x2b3bfa0 can also add details) - it's not that easy to hack it out of the exponential backoff. It would be actually interesting to see what does it return in your case. |
I was running from home, not GCP. I don't know whether it gave a different, non-retryable error. Since we completely control the compute_engine auth, we can add any timeout or retry arge in GoogleCredentials._connect_cloud . IF within GCP, this should always happen quickly. |
I'm not sure I fully understand the exact meaning. Do you mean we can change the lib itself ( I don't know from the top of my head and from that brief research a simple way to change the retry count or backoff logic outside. But we can check one more time. What we can do is to change the order or avoid metadata server if we are not on GCE (how does it determine that btw?). |
I had mean the latter; but I'm not sure that configuring the Session of Request can do what we want - the retry logic is actually within google.auth.compute_engine._metadata.get(). We would need to override the Request's --- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -36,6 +36,11 @@ client_config = {
}
+class TimeoutRequest(Request):
+ def __call__(self, *args, **kwargs):
+ return super().__call__(*args, timeout=1, **kwargs)
+
+
class GoogleCredentials:
def __init__(self, project, access, token, check_credentials=None):
self.scope = "https://www.googleapis.com/auth/devstorage." + access
@@ -93,7 +98,7 @@ class GoogleCredentials:
self.credentials = gauth.compute_engine.Credentials()
try:
with requests.Session() as session:
- req = Request(session)
+ req = TimeoutRequest(session)
self.credentials.refresh(req)
except gauth.exceptions.RefreshError as error:
raise ValueError("Invalid gcloud credentials") from error Of course, I'd need someone to verify that cloud auth still works! Otherwise, we could either vendor that (nasty!) code, or perhaps run with an explicit timeout (would require a thread or signal). |
The previous default timeout is 120s (x5 retries?) |
Yeah, I wonder though what was the logic behind setting up those timeouts on the compute engine side (we can break the things if we just override them). May be if we check what error code is returned and why on your machine we can find a better way? Another alternative can be - check if we can detect that we are running on GCE and enable cloud auth only in that case? |
Apparently, Why does |
This didn't capture all the ways people wanted to connect
Sorry, where is this? We could skip the cloud connect if we know it will fail. |
😅
|
It look like |
this snippet takes ~24s to run on my laptop in Germany:
26s seems a bit long for string parsing -- clearly something else is going on inside this function -- but I'm a bit surprised that there's anything else going on side here, given the apparent simplicity of the URL. Is this expected behavior, or something that should be fixed?
The text was updated successfully, but these errors were encountered: