Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forge should get an option to append an optional .janno column with source-package information #251

Open
nevrome opened this issue May 19, 2023 · 7 comments

Comments

@nevrome
Copy link
Member

nevrome commented May 19, 2023

This was raised a couple of months ago by @TCLamnidis and I thought I quickly formalize it into an issue. Let me know if there already is an issue about that in some of our repositories.

The idea is to add an option to forge to document forge operations on a sample/individual level in the .janno file, specifically with a new column that documents the source package the individual is extracted from.

That means a forge operation like this

trident forge -d ./PacWithInd1 -d ./PacWithInd2 -f "<Ind1>,<Ind2>" -o outPac

should create a package outPac with a .janno file looking maybe something like this:

Poseidon_ID    ...    Source_Packages
Ind1           ...    PacWithInd1 -> outPac
Ind2           ...    PacWithInd2 -> outPac

The new column Source_Packages does not necessarily have to be included into the Poseidon schema -- we can discuss that. Also the name and the syntax are up for discussion.

If a package with this column is forged again, then we could simply append the next step in the column, so for example like this:

Poseidon_ID    ...    Source_Packages
Ind1           ...    PacWithInd1 -> outPac -> outPac2
Ind2           ...    PacWithInd2 -> outPac -> outPac2

Is this what you had in mind, Thiseas?

@TCLamnidis
Copy link
Member

roughly, yes. I was only looking for the original file, maybe with a version attached, rather than a complete reconstruction of the forge history of an individual (which I fear might look a bit messy after 3-4 forges).
That said, this approach would more than cover my suggestion so 👍 from me conceptually.

@stschiff
Copy link
Member

Yes, that's great.

What I had in mind was mirroring Thiseas idea, which was just to record this the very first time, so that only the very original source packages are recorded. This could simply be implemented by forge checking if the column Source_Packages is already present in a row, and if so, do not alter it.

The chain is a pretty cool idea. I am worried that it might become unwieldy though if people do many derived forges. Or is that just a too unlikely scenario?

Another disadvantage of the chain syntax is that we cannot simply use this column to filter on a given source package or so.

How about we just drop the very last arrow? Then we would get after the first forge

Poseidon_ID    ...    Source_Packages
Ind1           ...    PacWithInd1
Ind2           ...    PacWithInd2

which would then at least for the 90% of cases where there is only one forge make the output very easily understandable and filterable?

@nevrome
Copy link
Member Author

nevrome commented May 27, 2023

Yes - that sounds like a good compromise.

The arrow doesn't fit too well to our other columns anyway. We should probably just work with a list column, like the ones we already have. So the first forge would produce

Poseidon_ID    ...    Source_Packages
Ind1           ...    PacWithInd1-1.0.1
Ind2           ...    PacWithInd2-2.2.4

and the second one

Poseidon_ID    ...    Source_Packages
Ind1           ...    PacWithInd1-1.0.1;outPac-0.1.0
Ind2           ...    PacWithInd2-2.2.4;outPac-0.1.0

Which reminds me that we should consolidate how exactly we want to pair package names and package versions in one string, @stschiff. You came up with something like this for #234 already, but maybe we should think about it again to make sure it is future-proof.

@TCLamnidis
Copy link
Member

Might be getting a bit more complicated, but I think something like PacWithInd1{1.0.1};outPac{0.1.0} might be more easy to parse downstream?

@stschiff
Copy link
Member

stschiff commented Jun 5, 2023

Yeah, we discussed this with Clemens. I think most languages/repos actually use the hyphon-version (at least that's what I see in python and Haskell I think). What about R?

I would prefer the simple hyphenated version, tbh.

@TCLamnidis
Copy link
Member

In that case, do we have a restriction that package names cannot contain hyphens?

@stschiff
Copy link
Member

stschiff commented Jun 7, 2023

No we do not need that. It's still parsable, because you just have to make a regex that if a package ends with something like -[0-9]+\.[0-9]+\.[0-9]+\.([0-9]+){0,1} then that's the version.

I am really scared of using curly brackets to be part of paths/file_names/URLs (in the end we need to think that in the future we might want to use them also in URLs). I think that's not really an option. I think such names should be restricted to alphanumerics, dots and hyphens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants