Merge of Tjdett and Microtardis branches #4
base: master
Conversation
Most site policies can now be customised with option constants:

    ALLOW_EXPERIMENT_CREATION = True         # Should we create new experiments
    ALLOW_EXPERIMENT_TITLE_MATCHING = True   # If there's no id, is the title enough to match on
    ALLOW_UNIDENTIFIED_EXPERIMENT = True     # If there's no title/id, should we process it as "uncategorized"?
    DEFAULT_UNIDENTIFIED_EXPERIMENT_TITLE = "Uncategorized Data"
    ALLOW_UNNAMED_DATASETS = True            # If a dataset has no title, should we ingest it with a default name
    DEFAULT_UNNAMED_DATASET_TITLE = '(assorted files)'
    ALLOW_USER_CREATION = True               # If experiments belong to unknown users, create them?
    # Can existing datasets be updated? If not, we ignore updates. To cause a new dataset to be
    # created, the incoming feed must have a unique EntryID for the dataset (eg, hash of its contents).
    ALLOW_UPDATING_DATASETS = True
    # If a datafile is modified, do we re-harvest it (creating two copies)? Else, we ignore the
    # update. False is not recommended.
    ALLOW_UPDATING_DATAFILES = True
    # If files are served as /user/instrument/experiment/dataset/datafile/moredatafiles
    # then 'datafile' is at depth 5. This is so we can maintain directory structure that
    # is significant within a dataset. Set to -1 to assume the last directory.
    DATAFILE_DIRECTORY_DEPTH = 9

Major changes include:
- Support for updated datasets
- Support for updated datafiles
- Store datafile modification dates
- Matching on experiment title
- Lots of useful logging at debug/info/warn levels as appropriate
- Comments everywhere

Known issues/limitations:
- Doesn't yet handle directory structures within datasets
- Probably crashes if the feed doesn't contain modification dates for datafiles
- Title matches are case-sensitive
- Matching against existing title-less experiments is untested
- Sometimes duplicate entries appear in DatasetParameterSet and crash everything (possibly caused by strange debug scenarios, i.e. simultaneous celery beats)
- I didn't add any tests.
- I only tested against my customised atom dataset provider (for all I know, Picasa import is now horribly broken)
- Makes use of cringeworthy non-standard additions to the Atom feed. Sorry.
- A few debuggy comments left in code.
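To make the intent of these constants concrete, here is a minimal sketch of how a title-based experiment lookup might consult them. The helper name and the import path for IngestOptions are assumptions for the example, not code from this branch:

    # Hypothetical illustration only -- not code from this pull request.
    from tardis.tardis_portal.models import Experiment
    from atom_ingest.options import IngestOptions  # assumed import path

    def match_experiment_by_title(title):
        # Only fall back to title matching when the site policy allows it.
        if not title or not IngestOptions.ALLOW_EXPERIMENT_TITLE_MATCHING:
            return None
        # Note: as listed under known issues, title matches are case-sensitive.
        matches = Experiment.objects.filter(title=title)
        return matches[0] if matches else None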
Also adds support for subdirectories within datasets. Requires a change to tardis_portal.util.get_local_time() to work properly. Also includes some nicer log messages (instead of "134584 seconds" etc.).
…stamps for dataset updates that I don't understand.
…', possibly skipped datasets.
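The "nicer log messages" mentioned above refer to reporting elapsed time readably rather than as a raw second count. A formatter along these lines would do it (a sketch only, not necessarily this branch's actual human_time implementation):

    # Sketch: render elapsed seconds readably, e.g. 134584 -> "1 day, 13:23:04".
    import datetime

    def human_time(seconds):
        return str(datetime.timedelta(seconds=int(seconds)))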
@task(name="atom_ingest.walk_feeds", ignore_result=True)
def walk_feeds(*feeds):
    for feed in feeds:
        walk_feed.delay(feed)
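For context, walk_feeds just fans each configured feed out into its own walk_feed task. Invoked from a Celery beat schedule or a shell, it would look something like this (the feed URL is made up for the example):

    # Queue a crawl of one or more Atom feeds; each feed is walked in its own task.
    walk_feeds.delay("http://instrument.example.org/atom/datasets.xml")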
'''
# This function is now in MyTardis core - but not including the 'local copy' mode. What to do?
Replicating the functionality here isn't appropriate. Even if it wasn't dangerous replication of functionality, it skips the new file verification behaviour.
Is there a reason why you can't change the data file URLs when the dataset & datafiles are created by AtomPersister? (The use of tardis:// urls also has to be a mistake, as the core no longer knows how to handle them.)
So, what are you suggesting exactly?
This method appears to differ from the core implementation simply in which URL it fetches.
Is there a reason why that functionality can't occur in process_enclosure? e.g.
    def _get_enclosure_url(self, enclosure):
        # Rewrite the enclosure URL for local (e.g. NFS) transfer when enabled.
        if IngestOptions.USE_LOCAL_TRANSFERS:
            return enclosure.href.replace(IngestOptions.URL_BASE_TO_REPLACE,
                                          IngestOptions.LOCAL_SOURCE_PATH)
        else:
            return enclosure.href

    def process_enclosure(self, dataset, enclosure):
        ...
        datafile = Dataset_File(url=self._get_enclosure_url(enclosure),
                                filename=filename,
                                dataset=dataset)
        ...
No reason at all - just time. I wrote the below hack pretty quickly - but after your comments about the performance of HTTP transfers compared to NFS, I'm not sure it's even necessary. Anyway, I'll see how I go with your suggestion.
Done.
            dataset, getattr(enclosure, 'href', "<no href!>")))
        logging.getLogger(__name__).error("Enclosure: {0}".format(json.dumps(enclosure)))
    for media_content in getattr(entry, 'media_content', []):
        self.process_media_content(dataset, media_content)
self.process_media_content is gone along with the Picasa feed functionality, as it did not support hash digests. You should no longer be calling it.
Thanks.
        format(enclosure.href, self.human_time(timediff)))
    if IngestOptions.HIDE_REPLACED_DATAFILES:
        # Mark all older versions of file as hidden. (!)
        from tardis.microtardis.models import Dataset_Hidden
Please at least wrap this in a try...except block to handle the absence of MicroTardis.
True. Although the default, in options.py, is not to use this feature - and there's a comment pointing out you need MicroTardis right next to it.
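For illustration, the guard suggested above could look something like the sketch below. Where exactly it sits in the ingest code, and the warning fallback, are assumptions rather than part of this branch:

    # Sketch: tolerate MyTardis deployments without the MicroTardis app installed.
    import logging

    try:
        from tardis.microtardis.models import Dataset_Hidden
    except ImportError:
        Dataset_Hidden = None

    if IngestOptions.HIDE_REPLACED_DATAFILES and Dataset_Hidden is not None:
        # Mark all older versions of the file as hidden, as before.
        ...
    elif IngestOptions.HIDE_REPLACED_DATAFILES:
        logging.getLogger(__name__).warning(
            "HIDE_REPLACED_DATAFILES is set but MicroTardis is not available")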
@steveandroulakis I found a number of cases in this pull-request where old MyTardis functionality is referenced. Could you please take a look and see if there are any additional issues.
@tjdett Aside from testing the functionality, I'm admittedly a bit stuck for time while I work on 3.0 (and do my 'actual' job at Monash) -- could you please highlight some of the old parts for me and I'll follow it up. I'll be much freer once the 3.0 code list is done (closer than you may think!). Thanks.
@steveandroulakis I'm not an active maintainer for this app, so the following comments are merely suggestions. The final call is yours for this and all subsequent pull-requests. You're in a better position to evaluate the long-term interests of the MyTardis project when performing the merge.... As mentioned in comments, it's doubtful this is necessary:
Personally, I believe pull-requests incorporating new features should also include tests for those features (even if they're not particularly comprehensive). Without them, there's no guarantee that any of those features will stay working. Not having them also makes refactoring difficult (and as noted in the comments, I think there's plenty of potential for that).
@tjdett Hi Tim, Thanks for the prompt reply, even though there's no requirement for you to do so. Much appreciated. Steve
I'm coming in late, but I'd like to support Tim (and the MyTardis submission guidelines) in saying that pull requests should include unit tests for new and modified functionality. This is particularly important for me where I'm tracking HEAD and I don't have resources for repeated user / acceptance testing.
I've more or less tested this and it more or less works.