load_user_data (or similar) #83

Closed
m-mohr opened this issue Sep 13, 2019 · 13 comments

m-mohr (Member) commented Sep 13, 2019

A new process was proposed during the 3rd-year planning: load_user_data (or similar).
It should load user-uploaded data and convert it into a data cube, similar to load_collection and load_result.
We need to check how to communicate to users which file formats are allowed to be uploaded. (Change /output_formats to /file_formats and add a list of supported formats for loading as a data cube?)
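
For illustration only, an invocation of such a process might look like the sketch below. Nothing here had been agreed on at this point; the process name is the working title from this issue, and the path and format arguments are hypothetical:

{
  "process_id": "load_user_data",
  "arguments": {
    "path": "/uploads/sentinel2_subset.nc",
    "format": "netCDF"
  }
}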

mkadunc (Member) commented Sep 13, 2019

👍 to using the existing endpoint

bossie commented Oct 10, 2019

Since it returns an image collection, maybe it makes sense to name it load_user_collection, analogous to load_collection. Any thoughts?

m-mohr (Member, Author) commented Oct 10, 2019

It doesn't return an image collection, but a data cube. load_user_data says where to load it from (the user workspace), which is consistent with load_collection (loads data made available via the collections endpoints) and load_result (loads a job result).
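
For context, the existing loaders already share this shape; minimal sketches (the collection and job IDs below are made up, extents nulled for brevity):

{
  "process_id": "load_collection",
  "arguments": {
    "id": "SENTINEL2_L2A",
    "spatial_extent": null,
    "temporal_extent": null
  }
}

{
  "process_id": "load_result",
  "arguments": {
    "id": "4da21a7b-fb8a-4f9e-8b0a-0a5f3d2e6c11"
  }
}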

bossie commented Oct 10, 2019

Maybe a general load_data process scales better if one wants to load a data cube from data that is neither uploaded to the user workspace nor the result of a batch job, e.g. data in S3. In that case the process might look like this:

{
  "process_id": "load_data",
  "arguments": {
    "format": "GTiff",
    "source": "S3",
    "options": {
      "uri": "s3://bucket/prefix",
      "more_options": "here"
    }
  }
}

m-mohr (Member, Author) commented Oct 10, 2019

I feel that this is a bit too generic, and it will be hard to document all the options. Wouldn't it be easier to use if we defined more processes for specific use cases? For example, load_s3_data, load_gcs_data, and so on?
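
A dedicated process could promote the generic options into named, documentable parameters. A purely hypothetical sketch (none of these parameter names were specified anywhere at this point):

{
  "process_id": "load_s3_data",
  "arguments": {
    "format": "GTiff",
    "bucket": "my-bucket",
    "prefix": "sentinel2/2019/",
    "region": "eu-central-1"
  }
}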

bossie commented Oct 11, 2019

We have a use case where we want to load GeoTIFFs from disk, so I'd like to add something like this:

{
  "process_id": "load_disk_data",
  "arguments": {
    "format": "GTiff",
    "glob_pattern": "/data/MTDA/CGS_S2/CGS_S2_FAPAR/2019/04/24/*/*/10M/*_FAPAR_10M_V102.tif",
    "options": {
      "date_regex": "_(\\d{4})(\\d{2})(\\d{2})T"
    }
  }
}

m-mohr (Member, Author) commented Oct 11, 2019

@bossie Go ahead and define such a function. I don't think this is a function for the process catalogue, though, as users usually won't know anything about the internal structure of your disks. I think this function would be a good start for a list of proprietary process extensions we could list somewhere here.

jdries (Contributor) commented Oct 11, 2019

Hi Matthias,
in fact the use case is that the user has put the data there themselves, so they do know the structure. It is basically the same as a user managing their files in object storage, except that we use good old NFS.
That's why we thought this might be a candidate for a generic process.

m-mohr (Member, Author) commented Oct 11, 2019

Go ahead and define such a function. I don't know what you need, so it is better if you make a proposal we can discuss. The process looks relatively complicated (regex etc.), so I'm not sure whether it might be too much for the "core". Also, I'm not sure whether this process is limited to your driver or whether other back-ends would also make use of it. I think we should discuss this process separately. In general, we should not discuss all kinds of loading functions in this single issue, but open a separate issue for each of them; otherwise it gets complicated to follow and manage.

m-mohr added the help wanted label on Nov 22, 2019
m-mohr added this to the v1.0 milestone on Nov 22, 2019
jdries (Contributor) commented Dec 12, 2019

Telco conclusion: hold off on a standardized definition until other back-ends (want to) implement this.
Meanwhile, here is the current VITO process definition:
http://openeo.vgt.vito.be/openeo/0.4.0/processes/load_disk_data

m-mohr (Member, Author) commented Dec 13, 2019

Thanks for the conclusion, @jdries. I'm not sure you discussed what this issue was originally about, though. load_user_data (but we may choose another name, maybe load_uploaded_files?) was already accepted as a solution at the Rome meeting for importing files from the uploaded files, and the API has already changed /output_formats to /file_formats so that supported input file formats are listed as well.

For the other processes that import from non-API sources: I would clearly separate and define functions such as import_s3 (or load_s3), import_nfs, import_gcs etc. (names to be discussed) whenever required. For this I'd propose opening separate issues or PRs for discussion. Edit: see #105
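
For reference, the /file_formats response separates input from output formats, roughly along these lines (entries abridged and illustrative):

{
  "input": {
    "GTiff": {
      "title": "GeoTIFF",
      "gis_data_types": ["raster"],
      "parameters": {}
    }
  },
  "output": {
    "GTiff": {
      "title": "GeoTIFF",
      "gis_data_types": ["raster"],
      "parameters": {}
    },
    "netCDF": {
      "title": "Network Common Data Form",
      "gis_data_types": ["raster"],
      "parameters": {}
    }
  }
}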

m-mohr added a commit to Open-EO/openeo-api that referenced this issue on Dec 13, 2019:
…pecify the data cube loading/storing mechanism in /file_formats, see Open-EO/openeo-processes#83
m-mohr added the work in progress label and removed the help wanted label on Dec 13, 2019
m-mohr added a commit that referenced this issue on Dec 13, 2019
m-mohr (Member, Author) commented Dec 13, 2019

See PR #106 for a proposal of load_uploaded_files.
See issue #105 for everything related to "non-API" imports.
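
For illustration, a call to the proposed load_uploaded_files might look like this (the path below is made up; see PR #106 for the authoritative parameter list, which may change during review):

{
  "process_id": "load_uploaded_files",
  "arguments": {
    "paths": ["scenes/S2_FAPAR_20190424.tif"],
    "format": "GTiff"
  }
}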

m-mohr (Member, Author) commented Dec 17, 2019

The PR has been merged.

m-mohr closed this as completed on Dec 17, 2019