Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fast registration removes datetime metadata from files #4711

Open
2 tasks done
deadlycoconuts opened this issue Jan 10, 2024 · 6 comments
Open
2 tasks done

[BUG] Fast registration removes datetime metadata from files #4711

deadlycoconuts opened this issue Jan 10, 2024 · 6 comments
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working flytekit FlyteKit Python related issue waiting for reporter Used for when we need input from the bug reporter

Comments

@deadlycoconuts
Copy link

deadlycoconuts commented Jan 10, 2024

Describe the bug

Using pyflyte register to fast register a workflow strips the datetime metadata from all the files that are packaged into the .tar.gz file. This doesn't happen when using pyflyte package or flytectl register.

Flyte version: 1.10.6
Flytekit version: 1.10.1
Python version: 3.10.13

Expected behavior

Using pyflyte register to fast register a workflow should preserve the datetime metadata from all the files that are packaged into the .tar.gz file.

Additional context to reproduce

No response

Screenshots

Files that are zipped using fast registration have their datetime metadata stripped (all the datetime fields default to 1 Jan 1970 7.30 AM):
Screenshot 2024-01-10 at 6 01 37 PM

Files that are packaged using pyflyte register do not encounter this issue:
Screenshot 2024-01-10 at 6 03 48 PM

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@deadlycoconuts deadlycoconuts added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 10, 2024
Copy link

dosubot bot commented Jan 10, 2024

🤖

Hey there @deadlycoconuts! I'm Dosu, your friendly code enthusiast bot. Whether it's squashing bugs, answering queries, or assisting you become a contributor, I'm here to help. While the human maintainers are off refueling on coffee, I'm your go-to bot! Let's debug this together, shall we? 🐞💻

Thank you for bringing this issue to our attention. It seems like the pyflyte register command is not preserving the datetime metadata during the packaging process. This behavior is not expected and it's not clear why this is happening based on the current information available in the repository.

The pyflyte register command works by compiling tasks and workflows into their serialized representation as described in the Flyte Specification language. This process involves defining the tasks and workflows, compiling them into their serialized representation, and registering them with FlyteAdmin. However, the specific details of how file packaging and metadata preservation are handled during this process are not explicitly documented in the repository.

Unfortunately, without more information, it's difficult to determine the exact cause of this issue. It would be helpful if you could provide more details about your environment and the exact steps you're taking when you encounter this issue. This could include information such as:

  • The version of Flyte you're using
  • The operating system and version
  • The Python version
  • The exact commands you're running
  • Any relevant logs or error messages

This additional information will help us better understand the issue and work towards a resolution. Thank you for your cooperation. Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@eapolinario eapolinario added flytekit FlyteKit Python related issue backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issues has not yet been looked at by the Maintainers labels Jan 25, 2024
@eapolinario
Copy link
Contributor

This was originally added to ensure that the pre-signed URL used to upload the compressed file used during fast registration was deterministic, otherwise, simply opening a file might change its metadata and consequently change the hash used during fast registration.

One of the original intents behind this feature is to allow clients to leverage the case where they are about to overwrite an existing file in the blob store and use the (local) hash to decide not to upload the file (since it's guaranteed that the metadata matches).

Before we decide to go one way or the other, can you talk about the problem that not having the original metadata in the files is causing you, @deadlycoconuts , @ChungYujoyce?

@deadlycoconuts
Copy link
Author

Thanks @ChungYujoyce for the PR and @eapolinario for taking a look at this. The reason why we brought this up was because one of our users was attempting to zip certain files in his/her Flyte task but the package zipfile that performed the zipping was unable to accept datetimes before 1980 (the default year after the datetime metadata has been stripped is 1970):

ValueError: ZIP does not support timestamps before 1980

While it's a library-specific error on the metadata, this is quite an unexpected difference in behaviour for regular registration vs fast registration from the user's perspective - one approach leads to an error but not the other.

@ChungYujoyce
Copy link

@eapolinario Can you hash the files without using any time data? In this case, hashing should be time-agnostic, only the changes matter. We don't need to include the time and then strip out. pls correct me if I'm wrong.

@kumare3
Copy link
Contributor

kumare3 commented Apr 20, 2024

The hashing was hashing the tarfile and tarfile with timestamp in the metadata affects this.
This will also cause registration failures, as you will see conflicts.

Not sure where the zone is used?

@eapolinario
Copy link
Contributor

eapolinario commented Apr 26, 2024

@ChungYujoyce , I noticed that you shared an excerpt of a python exception, correct? From that it looks like you're using ZipFile to parse the zip file. Are you running python a python >= 3.8? If so, Zipfile introduced an argument callled strict_timestamps to handle this case. From the docs:

The strict_timestamps argument, when set to False, allows to zip files older than 1980-01-01 at the cost of setting the timestamp to 1980-01-01. Similar behavior occurs with files newer than 2107-12-31, the timestamp is also set to the limit.

Can you try that?

@eapolinario eapolinario added the waiting for reporter Used for when we need input from the bug reporter label Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working flytekit FlyteKit Python related issue waiting for reporter Used for when we need input from the bug reporter
Projects
None yet
Development

No branches or pull requests

4 participants