
Merge of Tjdett and Microtardis branches #4

Open
wants to merge 29 commits into master

Conversation


@stevage commented Aug 31, 2012

I've more or less tested this and it more or less works.

Steve Bennett and others added 25 commits March 14, 2012 15:02
Most site policies can now be customised with option constants:

    ALLOW_EXPERIMENT_CREATION = True         # Should we create new experiments?
    ALLOW_EXPERIMENT_TITLE_MATCHING = True   # If there's no id, is the title enough to match on?
    ALLOW_UNIDENTIFIED_EXPERIMENT = True     # If there's no title/id, should we process it as "uncategorized"?
    DEFAULT_UNIDENTIFIED_EXPERIMENT_TITLE = "Uncategorized Data"
    ALLOW_UNNAMED_DATASETS = True            # If a dataset has no title, should we ingest it with a default name?
    DEFAULT_UNNAMED_DATASET_TITLE = '(assorted files)'
    ALLOW_USER_CREATION = True               # If experiments belong to unknown users, create them?
    # Can existing datasets be updated? If not, we ignore updates. To cause a new dataset to be created,
    # the incoming feed must have a unique EntryID for the dataset (eg, a hash of its contents).
    ALLOW_UPDATING_DATASETS = True
    # If a datafile is modified, do we re-harvest it (creating two copies)? Else, we ignore the update.
    # False is not recommended.
    ALLOW_UPDATING_DATAFILES = True

    # If files are served as /user/instrument/experiment/dataset/datafile/moredatafiles
    # then 'datafile' is at depth 5. This is so we can maintain directory structure that
    # is significant within a dataset. Set to -1 to assume the last directory.
    DATAFILE_DIRECTORY_DEPTH = 9
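
To make the depth setting concrete, here is a rough illustration of how a path could be split at DATAFILE_DIRECTORY_DEPTH. The helper below is hypothetical and not part of this pull request; it only shows the intended interpretation of the setting.

    # Hypothetical illustration of DATAFILE_DIRECTORY_DEPTH, not the actual ingest code.
    def split_datafile_path(href, depth):
        """Split a path into (identifying components, subpath kept within the dataset)."""
        parts = href.strip('/').split('/')
        if depth == -1:
            depth = len(parts)  # assume the last component is the datafile itself
        # Components before 'depth' identify user/instrument/experiment/dataset;
        # everything from 'depth' onwards is directory structure preserved inside the dataset.
        return parts[:depth - 1], '/'.join(parts[depth - 1:])

    # e.g. with depth 5:
    #   split_datafile_path('/user/instrument/experiment/dataset/sub/dir/file.tif', 5)
    #   returns (['user', 'instrument', 'experiment', 'dataset'], 'sub/dir/file.tif')
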
Major changes include:
- Support for updated datasets
- Support for updated datafiles
- Store datafile modification dates
- Matching on experiment title
- Lots of useful logging at debug/info/warn levels as appropriate
- Comments everywhere

Known issues/limitations:
- Doesn't yet handle directory structures within datasets
- Probably crashes if feed doesn't contain modification dates for datafiles
- Title matches are case-sensitive
- Matching against existing title-less experiments is untested
- Sometimes duplicate entries appear in DatasetParameterSet and crash everything
  (possibly caused by strange debug scenarios - ie, simultaneous celery beats)
- I didn't add any tests.
- I only tested against my customised atom dataset provider (for all I know, Picasa import is now horribly broken)
- Makes use of cringeworthy non-standard additions to the Atom feed. Sorry.
- A few debuggy comments left in code.
Also, support subdirectories within datasets.
Requires a change to tardis_portal.util.get_local_time() to work properly.
Also some nicer log messages (instead of 134584 seconds etc)
…stamps for dataset updates that I don't understand.
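
One of the commit messages above mentions friendlier durations in the logs instead of raw second counts like "134584 seconds". A minimal sketch of such a helper follows; the human_time name appears in a later hunk of this diff, but the body shown here is only an assumption about what it does.

    # Sketch only: the real human_time in this branch may differ.
    def human_time(seconds):
        """Render a duration in seconds as a rough human-readable string."""
        seconds = int(seconds)
        for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
            if seconds >= size:
                count = seconds // size
                return "{0} {1}{2}".format(count, unit, "s" if count != 1 else "")
        return "{0} seconds".format(seconds)

    # human_time(134584) -> "1 day"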


@task(name="atom_ingest.walk_feeds", ignore_result=True)
def walk_feeds(*feeds):
    for feed in feeds:
        walk_feed.delay(feed)

'''
# This function is now in MyTardis core - but not including the 'local copy' mode. What to do?
Contributor

Replicating the functionality here isn't appropriate. Even if duplicating it weren't dangerous in itself, this copy skips the new file verification behaviour.

Is there a reason why you can't change the data file URLs when the dataset & datafiles are created by AtomPersister?

(The use of tardis:// URLs also has to be a mistake, as the core no longer knows how to handle them.)

Author

So, what are you suggesting exactly?

Contributor

This method appears to differ from the core implementation simply in which URL it fetches.

Is there a reason why that functionality can't occur in process_enclosure? eg.

def _get_enclosure_url(self, enclosure):
    if IngestOptions.USE_LOCAL_TRANSFERS:
        return enclosure.href.replace(IngestOptions.URL_BASE_TO_REPLACE,
                                      IngestOptions.LOCAL_SOURCE_PATH)
    else:
        return enclosure.href

def process_enclosure(self, dataset, enclosure):
    ...
    datafile = Dataset_File(url=self._get_enclosure_url(enclosure),
                            filename=filename,
                            dataset=dataset)
    ...
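
For this to work, a couple of entries would also be needed in the ingest options. The names below come straight from the snippet above; the defaults and comments are illustrative only, not part of this pull request.

    # Illustrative defaults only; option names are taken from the snippet above.
    USE_LOCAL_TRANSFERS = False                        # Fetch enclosures via a local path instead of HTTP?
    URL_BASE_TO_REPLACE = "http://example.com/files/"  # Prefix to strip from enclosure hrefs (hypothetical value)
    LOCAL_SOURCE_PATH = "/mnt/instrument-data/"        # Local path substituted for that prefix (hypothetical value)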

Author

No reason at all - just time. I wrote the below hack pretty quickly - but after your comments about the performance of HTTP transfers compared to NFS, I'm not sure it's even necessary. Anyway, I'll see how I go with your suggestion.

Author

Done.

dataset, getattr(enclosure, 'href', "<no href!>")))
logging.getLogger(__name__).error("Enclosure: {0}".format(json.dumps(enclosure)))
for media_content in getattr(entry, 'media_content', []):
    self.process_media_content(dataset, media_content)
Contributor

self.process_media_content is gone along with the Picasa feed functionality, as it did not support hash digests. You should no longer be calling it.

Author

Thanks.

@ghost assigned steveandroulakis Sep 2, 2012
format(enclosure.href, self.human_time(timediff)))
if IngestOptions.HIDE_REPLACED_DATAFILES:
    # Mark all older versions of file as hidden. (!)
    from tardis.microtardis.models import Dataset_Hidden
Contributor

Please at least wrap this in a try...except block to handle the absence of MicroTardis.
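
A minimal sketch of the guard being suggested, assuming the option and import names shown in the diff hunk above; the hide-replaced-files logic itself is elided.

    # Sketch only: skip the hide-replaced-files behaviour if MicroTardis isn't installed.
    if IngestOptions.HIDE_REPLACED_DATAFILES:
        try:
            from tardis.microtardis.models import Dataset_Hidden
        except ImportError:
            logging.getLogger(__name__).warning(
                "HIDE_REPLACED_DATAFILES is set but MicroTardis is not installed; skipping.")
        else:
            pass  # mark older versions of the file as hidden, as in the diff above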

Author

True. Although the default, in options.py, is not to use this feature - and there's a comment pointing out you need MicroTardis right next to it.

@tjdett
Contributor

tjdett commented Sep 2, 2012

@steveandroulakis I found a number of cases in this pull-request where old MyTardis functionality is referenced.

Could you please take a look and see if there are any additional issues.

@steveandroulakis
Contributor

@tjdett Aside from testing the functionality, I'm admittedly a bit stuck for time while I work on 3.0 (and do my 'actual' job at Monash) -- could you please highlight some of the old parts for me and I'll follow it up.

I'll be much freer once the 3.0 code list is done (closer than you may think!). Thanks.

@tjdett
Contributor

tjdett commented Sep 6, 2012

@steveandroulakis I'm not an active maintainer for this app, so the following comments are merely suggestions. The final call is yours for this and all subsequent pull-requests. You're in a better position to evaluate the long-term interests of the MyTardis project when performing the merge....

As mentioned in comments, it's doubtful this is necessary:
https://github.com/mytardis/mytardis-app-atom/pull/4/files#L1R32

time_util.py is probably unnecessary duplication - it should at least be checked against core util.py to see if it's compatible before it's merged in.

make_local_copy needs to be removed entirely:
https://github.com/mytardis/mytardis-app-atom/pull/4/files#L3R30

Personally, I believe pull-requests incorporating new features should also include tests for those features (even if they're not particularly comprehensive). Without them, there's no guarantee that any of those features will stay working. Not having them also makes refactoring difficult (and as noted in the comments, I think there's plenty of potential for that).
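
For what it's worth, even a very small test along these lines would pin down the new options. The import path and test below are assumptions about the app's layout, not code from this pull request.

    # Assumed import path and option names; adjust to the app's actual layout.
    from django.test import TestCase

    class IngestOptionsTestCase(TestCase):

        def test_experiment_creation_enabled_by_default(self):
            from tardis.apps.atom.options import IngestOptions  # assumed path
            self.assertTrue(IngestOptions.ALLOW_EXPERIMENT_CREATION)

        def test_unnamed_datasets_get_a_default_title(self):
            from tardis.apps.atom.options import IngestOptions  # assumed path
            self.assertTrue(IngestOptions.ALLOW_UNNAMED_DATASETS)
            self.assertEqual(IngestOptions.DEFAULT_UNNAMED_DATASET_TITLE, '(assorted files)')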

@steveandroulakis
Contributor

@tjdett Hi Tim,

Thanks for the prompt reply, even though there's no requirement for you to do so. Much appreciated.

Steve

@crawley
Contributor

crawley commented Sep 7, 2012

I'm coming in late, but I'd like to support Tim (and the MyTardis submission guidelines) in saying that pull requests should include unit tests for new and modified functionality. This is particularly important for me because I'm tracking HEAD and don't have the resources for repeated user/acceptance testing.
