Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCSFilesystem creation is surprisingly slow #1768

Open
d-v-b opened this issue Jan 8, 2025 · 16 comments
Open

GCSFilesystem creation is surprisingly slow #1768

d-v-b opened this issue Jan 8, 2025 · 16 comments

Comments

@d-v-b
Copy link
Contributor

d-v-b commented Jan 8, 2025

this snippet takes ~24s to run on my laptop in Germany:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time
url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url)
    elapsed = time() - start
    print(elapsed)

26s seems a bit long for string parsing -- clearly something else is going on inside this function -- but I'm a bit surprised that there's anything else going on side here, given the apparent simplicity of the URL. Is this expected behavior, or something that should be fixed?

@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 8, 2025

aha, further testing indicates that creating GCSFilesystem is the culprit. Any ideas how how to speed this up?

@d-v-b d-v-b changed the title url_to_fs is surprisingly slow GCSFilesystem creation is surprisingly slow Jan 8, 2025
@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 8, 2025

specifying token=anon reduces runtime to a fraction of a second:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time

url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url, token='anon')
    elapsed = time() - start
    print(elapsed)

so I'm guessing the credentials search process is consuming most of the 25ish seconds I observed with the first version of this script.

@martindurant
Copy link
Member

the credentials search process is consuming most of the 25ish seconds

The default behaviour (unless otherwise specified) is to try various credential verification mechanisms in turn, until one works. anon is the fallback when everything else has failed. Unfortunately, there are many possibilities; maybe there's an opportunity to put very short timeouts on processes that ought to be fast, like _connect_cloud, which queries the GCP metadata service and is only available from within google's walls. It would be worthwhile putting timing around the methods in gcsfs.credentials.GoogleCredentials.connect to figure out if this is caused by one particular problematic method. For me, I don't see the problem:

2025-01-08 10:16:43,247 - gcsfs.credentials - DEBUG - connect -- Connection with method "google_default" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 77, in _connect_google_default
    credentials, project = gauth.default(scopes=[self.scope])
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 615, in default
    credentials, project_id = checker()
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 608, in <lambda>
    lambda: _get_explicit_environ_credentials(quota_project_id=quota_project_id),
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 228, in _get_explicit_environ_credentials
    credentials, project_id = load_credentials_from_file(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 116, in load_credentials_from_file
    raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File /Users/mdurant/Downloads/it-4397-aol-71cec3094fa0.json was not found.
FAIL google_default 0.0012540817260742188
2025-01-08 10:16:43,249 - gcsfs.credentials - DEBUG - connect -- Connection with method "cache" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 104, in _connect_cache
    raise ValueError("No cached tokens")
ValueError: No cached tokens
FAIL cache 0.00011086463928222656
2025-01-08 10:16:43,298 - gcsfs.credentials - DEBUG - connect -- Connection with method "cloud" failed
Traceback (most recent call last):
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
    self._retrieve_info(request)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
    info = _metadata.get_service_account_info(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
    return get(request, path, params={"recursive": "true"})
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 163, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 97, in _connect_cloud
    self.credentials.refresh(req)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 117, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 99, in _connect_cloud
    raise ValueError("Invalid gcloud credentials") from error
ValueError: Invalid gcloud credentials
FAIL cloud 0.050685882568359375
2025-01-08 10:16:43,299 - gcsfs.credentials - DEBUG - connect -- Connected with method anon

(I did the following to get the times:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -229,6 +229,8 @@ class GoogleCredentials:
             self._connect_token(method)
         elif method is None:
             for meth in ["google_default", "cache", "cloud", "anon"]:
+                import time
+                t = time.time()
                 try:
                     self.connect(method=meth)
                     logger.debug("Connected with method %s", meth)
@@ -242,6 +244,7 @@ class GoogleCredentials:
                     # Reset credentials if they were set but the authentication failed
                     # (reverts to 'anon' behavior)
                     self.credentials = None
+                    print("FAIL", meth, time.time()-t)
             else:
                 # Since the 'anon' connection method should always succeed,

)

@ap--
Copy link
Contributor

ap-- commented Jan 9, 2025

I noticed the same issue in universal-pathlib where the gcs tests where very slow, and it only occurred without a working internet connection, because then all the timeouts accumulated.

setting token='anon' restored the short runtime of the test suite, when running without access to the internet: fsspec/universal_pathlib@d034637

(the test suite runs against a gcs docker container)

@martindurant
Copy link
Member

I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 9, 2025

@ap-- and I are both in germany, so perhaps regional connectivity is part of the story here

@shcheklein
Copy link

Thanks folks for bringing this up. It keeps biting us in different environments for a while now. I'm able to reproduce it easily at home and in the office and in many other different places.


I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network.

@martindurant I think the difference should be here - https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/compute_engine/_metadata.py#L207-L227

For me, I'm getting immediately into:

except exceptions.TransportError as e:
            _LOGGER.warning(
                "Compute Engine Metadata server unavailable on "
                "attempt %s of %s. Reason: %s",
                attempt,
                retry_count,
                e,
            )

with a message: Failed to resolve 'metadata.google.internal' ([Errno 8] nodename nor servname provided, or not known)"))'

It means it goes into retry with an exponential backoff.

In your case, for some reason metadata.google.internal can be resolved into something? Or may be you are running it on GCE? it's hard to tell. I've seen something like this before with some DNS providers (AT&T in the US for example is famous for intercepting unknown DNS requests and do some weird stuff).

From our brief research (@0x2b3bfa0 can also add details) - it's not that easy to hack it out of the exponential backoff. It would be actually interesting to see what does it return in your case.

@martindurant
Copy link
Member

I was running from home, not GCP. I don't know whether it gave a different, non-retryable error. Since we completely control the compute_engine auth, we can add any timeout or retry arge in GoogleCredentials._connect_cloud . IF within GCP, this should always happen quickly.

@shcheklein
Copy link

Since we completely control the compute_engine auth, we can add any timeout or retry arge in GoogleCredentials._connect_cloud

I'm not sure I fully understand the exact meaning. Do you mean we can change the lib itself (compute_engine) or do you mean that we can change the way we work with it (GoogleCredentials._connect_cloud)?

I don't know from the top of my head and from that brief research a simple way to change the retry count or backoff logic outside. But we can check one more time. What we can do is to change the order or avoid metadata server if we are not on GCE (how does it determine that btw?).

@martindurant
Copy link
Member

Do you mean we can change the lib itself (compute_engine) or do you mean that we can change the way we work with it (GoogleCredentials._connect_cloud)?

I had mean the latter; but I'm not sure that configuring the Session of Request can do what we want - the retry logic is actually within google.auth.compute_engine._metadata.get(). We would need to override the Request's __call__ to set the default timeout

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -36,6 +36,11 @@ client_config = {
 }


+class TimeoutRequest(Request):
+    def __call__(self, *args, **kwargs):
+        return super().__call__(*args, timeout=1, **kwargs)
+
+
 class GoogleCredentials:
     def __init__(self, project, access, token, check_credentials=None):
         self.scope = "https://www.googleapis.com/auth/devstorage." + access
@@ -93,7 +98,7 @@ class GoogleCredentials:
         self.credentials = gauth.compute_engine.Credentials()
         try:
             with requests.Session() as session:
-                req = Request(session)
+                req = TimeoutRequest(session)
                 self.credentials.refresh(req)
         except gauth.exceptions.RefreshError as error:
             raise ValueError("Invalid gcloud credentials") from error

Of course, I'd need someone to verify that cloud auth still works!

Otherwise, we could either vendor that (nasty!) code, or perhaps run with an explicit timeout (would require a thread or signal).

@martindurant
Copy link
Member

The previous default timeout is 120s (x5 retries?)

@shcheklein
Copy link

Yeah, I wonder though what was the logic behind setting up those timeouts on the compute engine side (we can break the things if we just override them). May be if we check what error code is returned and why on your machine we can find a better way?

Another alternative can be - check if we can detect that we are running on GCE and enable cloud auth only in that case?

@0x2b3bfa0
Copy link

Apparently, gcsfs is trying to use metadata server authentication regardless of the is_on_gce return value:

https://github.com/fsspec/gcsfs/blob/8a83230c671b9f1c1f500791e6b2b1810f8919b9/gcsfs/credentials.py#L93

Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?

@martindurant
Copy link
Member

Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?

This didn't capture all the ways people wanted to connect

is_on_gce

Sorry, where is this? We could skip the cloud connect if we know it will fail.

@0x2b3bfa0
Copy link

This didn't capture all the ways people wanted to connect

😅

Sorry, where is this?

https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68

@martindurant
Copy link
Member

Sorry, where is this?

https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68

It look like ping() in that module also calls the same endpoint with exponential backoff, so I don't suppose it gets us anything. We could copy the idea, I suppose, but I think a reasonable timeout on _connect_cloud is probably more straight-forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants