RFC79: Incremental Upload of Data Entries (#48)
* Add clinical_attribute_meta records to the seed mini

To make the dataset look like real data in the database

* Implement sample attribute rewriting flag

* Add --overwrite-existing for the rest of test cases

Apparently, the flag does not change anything.
But we add it anyway so that the tests cover "incremental" data upload.

* Test that mutations stay after updating the sample attributes

* Add overwrite-existing support for mutations data

* Fix --overwrite-existing flag description for importer of profile data

* Add loader command to update case list with sample ids

Adding to the _all case list and to case lists specified with command arguments is supported.

* Add option to remove sample ids from the remaining case lists

From case lists that are not the _all case list and are not specified with the --add-to-case-lists option.

* Make removing sample ids from unmentioned case lists the default behaviour

* Make the update case list command read case list files

* Fix test clinical data headers

* Test incremental patient upload

* Add flag to reload patient clinical attributes

* Add TODO comment to remove MIXED_ATTRIBUTES data type

with a reference to the ticket

* WIP adapt py script to incremental upload

* Fix java.sql.SQLException: Generated keys not requested

* Clean alteration_driver_annotation during mutations inc. upload

* Fix validator and importer py scripts for inc. upload

* Add test/demo data for incremental loading of study_es_0 study

* Rename and move incremental tests to incementalTest folder

* Update TODO comment on how to deal with multiple sample files

* Move study_es_0_inc to the new test data folder

* Fix removing patient attributes on samples inc. upload

* Change study_es_0_inc to contain more diverse data

We changed the data to work for the demo.
Mutation numbers did not change on the demo.

* Specify that data_directory is for incremental data

* Disambiguate clinical data constant names

Before, it was easy to confuse sample and clinical_sample (attributes) code
with patient and clinical_patient (attributes) code.

* Remove unnecessary TODO comments

* Remove MSK copyright mistakenly copy-pasted

* Fix comment of UpdateCaseListsSampleIds.run() method

* Make --overwrite-existing flag description more generic

This flag is for the command that uploads molecular profile data.

* Add TODO comments for possible reuse of the code

* Update case lists for multiple clinical sample files

Potentially for different studies

* Extract and reuse common logic to read and validate case lists

* Fix TestIntegrationTest

- change location of the files
- make sure assertions could work on the seed mini db
- get rid of absent cbioportal dependencies

* Revert RESOURCE_DEFINITION_DICTIONARY initialisation to empty set

* Minor improvements. Apply PR feedback

* Make tests fail the build. Propagate exit status of tests correctly

* Write "Validation complete" only in case of successful validation

* Add python tests for incremental/full data import

* Add unit test for incremental data validation

* Test rough order of importer commands. Remove sorting in the script to guarantee that order

* Extract smaller functions from the big one in py script

Make process_data_directory(...) smaller

* Refactor tab delim. data importer

- Calculate number of lines in the file in the loader
- Remove unused imports and fields
- Reuse constructors
- Reuse common parsing logic in tab delimiter importer
- Show full stacktrace, which helps in finding where tests errored out

* Implement incremental upload of mRNA data

* Add RPPA test

* Add normal sample to test data to test skipping

* Add rows with more columns than in the header, to be skipped

* Skip rows that don't have enough sample columns

* Test for invalid entrez id

* Extract common code from inc. tab. delim. tests

* Implement incremental upload of CNA data via tab. delim. loader

* Blank out values for genes not mentioned in the file

* Remove unused code

* Throw unsupported operation exception for GENESET_SCORE incremental upload

* Add generic assay data incremental upload test

* Fix integration tests

* Make tab. delimiter data uploader transactional

* Check for illegal state in tab delim. data update

It's dangerous, as we would otherwise further mess up the data in the row

* Wire incremental tab delim. data upload to cli commands

* Expand README with section on how to run incremental upload

* Address TODOs in tab delim. importer

* Add more data types to incremental data upload folder

* Remove obsolete TODO comment

* Reuse genetic_profile record if it exists in db already

Do it for all data types, not only MAF

* Test incremental upload of tab delim. data types from umbrella script

- Split the big tab. delim. test into multiple tests based on data type.

- Use ImportProfileData instead of ImportTabDelimData for testing.
  - We cover more logic with such tests.
  - This is more stable interface. ImportTabDelimData can be refactored.

* Move counting lines of file inside generic assay patient-level data uploader

* Give error that generic assay patient-level data is not supported

* Clean sample_cna_event regardless of whether it has alteration_driver_annotation rows or not

* Fix cbioportalImport script execution

The args variable was not declared.

* Remove unneeded spring context initialisation

It caused various errors to occur.

* Make error message more informative when gene panel is not found

Do not throw an NPE, but an NSEE with an error message that mentions the panel id

* Add more genes to the mini seed to load study_es_0

* Make study_es_0_inc data pass validation

* Document in README how to load study_es_0 study

* Implement incremental upload for timeline data

* Implement incremental upload of CNA DISCRETE long data

* Add data type sanity check for tsv upload

* Move storing/dedup logic of genetic alteration values to importer

* Move all inc. upload logic for tab delim. data types to GeneticAlterationImporter

* Add CNA DISCRETE LONG to study_es0_inc test dataset

* Remove unused code

* Make validation pass for CNA long and study_es_0_inc data

* Implement incremental upload for gene panel matrix

The uploader was already working in an incremental manner;
I only had to add tests for it.
I did, however, have to implement incremental upload of the gene panel matrix
from different data (CNA, mutations) uploaders.

* Make validation of study_es_0_inc data pass

* Implement incremental upload of structural variants data

I removed DaoGeneticProfileSamples.addGeneticProfileSamples(geneticProfileId, orderedSampleList);
as it does not seem to be needed.
It does not make sense to store samples in genetic_profile_samples if you don't use the genetic_alteration table at all.

* Implement incremental upload of CNA segmented data

* Make it explicit that the timeline uploader supports bulk mode only

* Fix number of columns in SV tsv data file

* Update paragraph on inc. upload in README

* Rename validation method to better describe its purpose

To really validate entrez id, we need to look it up

* Fix cleaning alteration_driver_annotation table for specific sample

* DRY tab separated value string parsing

* Reuse FileUtil.isInfoLine(String line) throughout the code

* Extract ensuring header and row match to tsv utility class

* Simplify delete sql. Rely on cascade delete instead.

* Generalise overwrite-existing flag description to make it more accurate

* Rename updateMode to isIncrementalUpdateMode flag

* Improve description of overwrite-existing flag for gene panel profile map

* Implement a more optimal way to update sample profile

* Optimize code by always using batch upsert for sample profile

* Recognise that SEG importer always uses bulkLoad

* Organise bulk mode flushing for SEG importer

* Ignore case for the bulkLoad load mode option, as everywhere else in the code

* Add comma to README

* Improve order comments for INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES

* Add join by GENETIC_PROFILE_ID column for sample_cna_event and alteration_driver_annotation tables

* Check for inconsistency in sample ids and values while reading genetic alterations

* Make the name of the method that initialises a transaction clearer

* Remove TODOs that were done

* Rename isInfoLine util. method to isDataLine

I got feedback that "info line" sounds like the header metadata lines starting with #

* Simplify code by using inheritance instead of composition

* Optimize removing genetic alterations

By removing them for the whole genetic profile at once.

One SQL statement instead of N.

* Access inherited variables with this. instead of super.

The confusion that triggered the change: "The use of super. indicates that the subclass also declares one with the same name, but you are trying to not set that somehow?"

* Remove unused code from DaoSampleList.addSampleList()

* Remove extra semicolons at the end of java statements

* Rename upsertSampleProfiles to upsertSampleToProfileMapping

method in DaoSampleProfile

* Use java 8 way to convert typed list to array in GeneticAlterationIncrementalImporter

* Improve doc comments for TsvUtil.isDataLine(String line)

* Rename method to updateCaseLists and document it better

* Remove DEFINED_CANCER_TYPES global variable

* Add docstring to sample attribute remove methods

Make it explicit that the function will delete any matching records "if they exist"

* Add docstring to method to update fraction genome altered clinical attribute

Specify that sampleIds is optional and can be set to null

* Make DAO constants that hold SQL private

Increase encapsulation.

* Stop doing rows math, it's just a status!

* Adopt C style of incrementing JDBC parameters

* Improve wording in error message

* Remove unused method of genetic alteration importer

* Extract db communicating methods out of the constructor

Introduce an initialise() method.

* Improve time complexity from N^2 to N

* Use American English for method names

---------

Co-authored-by: pieterlukasse <[email protected]>
forus and pieterlukasse authored Jul 16, 2024
1 parent efcc1d2 commit e7cfb7b
Showing 181 changed files with 5,468 additions and 1,394 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/validate-python.yml
@@ -14,7 +14,7 @@ jobs:
- name: 'Validate tests'
working-directory: ./cbioportal-core
run: |
docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/bash -c '
docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/sh -c '
cd cbioportal-core &&
pip install -r requirements.txt &&
source test_scripts.sh'
./test_scripts.sh'
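The hunk above swaps `/bin/bash` for `/bin/sh` and replaces the bash-only `source` builtin. As a quick sketch of the portable options under a POSIX `/bin/sh` (the `chmod` note is an assumption about the script's permissions):

```bash
# `source` is a bash-ism; the POSIX spelling runs the script in the current shell:
. test_scripts.sh

# Or execute the script as its own process, as the workflow now does
# (requires the executable bit: chmod +x test_scripts.sh):
./test_scripts.sh
```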
67 changes: 54 additions & 13 deletions README.md
@@ -9,6 +9,59 @@ This repo contains:
## Inclusion in main codebase
The `cbioportal-core` code is currently included in the final Docker image during the Docker build process: https://github.com/cBioPortal/cbioportal/blob/master/docker/web-and-data/Dockerfile#L48

## Running in docker

Build docker image with:
```bash
docker build -t cbioportal-core .
```

### Example of how to load `study_es_0` study

Import gene panels

```bash
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenePanel.pl --data /data/study_es_0/data_gene_panel_testpanel1.txt
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenePanel.pl --data /data/study_es_0/data_gene_panel_testpanel2.txt
```

Import gene sets and supplementary data

```bash
docker run -it -v $(pwd)/src/test/resources/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenesetData.pl --data /data/genesets/study_es_0_genesets.gmt --new-version msigdb_7.5.1 --supp /data/genesets/study_es_0_supp-genesets.txt
```

Import gene set hierarchy data

```bash
docker run -it -v $(pwd)/src/test/resources/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenesetHierarchy.pl --data /data/genesets/study_es_0_tree.yaml
```

Import study

```bash
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
python importer/metaImport.py -s /data/study_es_0 -p /data/api_json_system_tests -o
```

### Incremental upload of data

To add or update specific patient, sample, or molecular data in an already loaded study, you can perform an incremental upload. This process is quicker than reloading the entire study.

To execute an incremental upload, use the -d (or --data_directory) option instead of -s (or --study_directory). Here is an example command:
```bash
docker run -it -v $(pwd)/data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core python importer/metaImport.py -d /data/study_es_0_inc -p /data/api_json -o
```
**Note:**
While the directory should adhere to the standard cBioPortal file formats and study structure, incremental uploads are not supported for all data types.
For instance, uploading study metadata, resources, or GSVA data incrementally is currently unsupported.

This method ensures efficient updates without the need for complete study reuploads, saving time and computational resources.
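For illustration, an incremental data directory might look like the sketch below. The file names are hypothetical; the optional `case_lists/` folder, when present next to clinical sample data, is used to update case lists with the uploaded sample ids:

```
study_es_0_inc/
├── meta_clinical_samples.txt
├── data_clinical_samples.txt
├── meta_mutations.txt
├── data_mutations.txt
└── case_lists/
    └── cases_sequenced.txt
```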

## How to run integration tests

This section guides you through the process of running integration tests by setting up a cBioPortal MySQL database environment using Docker. Please follow these steps carefully to ensure your testing environment is configured correctly.
@@ -78,7 +131,7 @@ After you are done with the setup, you can build and test the project.

1. Execute tests through the provided script:
```bash
source test_scripts.sh
./test_scripts.sh
```

2. Build the loader jar using Maven (includes testing):
@@ -119,15 +172,3 @@ The script will search for `core-*.jar` in the root of the project:
python scripts/importer/metaImport.py -s tests/test_data/study_es_0 -p tests/test_data/api_json_unit_tests -o
```

## Running in docker

Build docker image with:
```bash
docker build -t cbioportal-core .
```

Example of how to start the loading:
```bash
docker run -it -v $(pwd)/data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core python importer/metaImport.py -s /data/study_es_0 -p /data/api_json -o
```

3 changes: 3 additions & 0 deletions pom.xml
@@ -252,6 +252,9 @@
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.21.0</version>
<configuration>
<trimStackTrace>false</trimStackTrace>
</configuration>
<executions>
<execution>
<id>default-test</id>
155 changes: 122 additions & 33 deletions scripts/importer/cbioportalImporter.py
@@ -12,6 +12,7 @@
import logging
import re
from pathlib import Path
from typing import Dict, Tuple

# configure relative imports if running as a script; see PEP 366
# it might be passed as empty string by certain tooling to mark a top level module
@@ -39,6 +40,8 @@
from .cbioportal_common import ADD_CASE_LIST_CLASS
from .cbioportal_common import VERSION_UTIL_CLASS
from .cbioportal_common import run_java
from .cbioportal_common import UPDATE_CASE_LIST_CLASS
from .cbioportal_common import INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES


# ------------------------------------------------------------------------------
@@ -101,8 +104,17 @@ def remove_study_id(jvm_args, study_id):
args.append("--noprogress") # don't report memory usage and % progress
run_java(*args)

def update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir = None):
args = jvm_args.split(' ')
args.append(UPDATE_CASE_LIST_CLASS)
args.append("--meta")
args.append(meta_filename)
if case_lists_file_or_dir:
args.append("--case-lists")
args.append(case_lists_file_or_dir)
run_java(*args)

def import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None):
def import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None, incremental = False):
args = jvm_args.split(' ')

# In case the meta file is already parsed in a previous function, it is not
@@ -133,6 +145,10 @@ def import_study_data(jvm_args, meta_filename, data_filename, update_generic_ass
importer = IMPORTER_CLASSNAME_BY_META_TYPE[meta_file_type]

args.append(importer)
if incremental:
if meta_file_type not in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
raise NotImplementedError("This type does not support incremental upload: {}".format(meta_file_type))
args.append("--overwrite-existing")
if IMPORTER_REQUIRES_METADATA[importer]:
args.append("--meta")
args.append(meta_filename)
@@ -212,11 +228,20 @@ def process_command(jvm_args, command, meta_filename, data_filename, study_ids,
else:
raise RuntimeError('Your command uses both -id and -meta. Please, use only one of the two parameters.')
elif command == IMPORT_STUDY_DATA:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
elif command == IMPORT_CASE_LIST:
import_case_list(jvm_args, meta_filename)

def process_directory(jvm_args, study_directory, update_generic_assay_entity = None):
def get_meta_filenames(data_directory):
meta_filenames = [
os.path.join(data_directory, meta_filename) for
meta_filename in os.listdir(data_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~'))]
return meta_filenames

def process_study_directory(jvm_args, study_directory, update_generic_assay_entity = None):
"""
Import an entire study directory based on meta files found.
@@ -241,12 +266,7 @@
cna_long_filepair = None

# Determine meta filenames in study directory
meta_filenames = (
os.path.join(study_directory, meta_filename) for
meta_filename in os.listdir(study_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~')))
meta_filenames = get_meta_filenames(study_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:
@@ -353,53 +373,53 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
raise RuntimeError('No sample attribute file found')
else:
meta_filename, data_filename = sample_attr_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import resource definitions for resource data
if resource_definition_filepair is not None:
meta_filename, data_filename = resource_definition_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import sample definitions for resource data
if sample_resource_filepair is not None:
meta_filename, data_filename = sample_resource_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, import everything else except gene panel, structural variant data, GSVA and
# z-score expression. If in the future more types refer to each other, (like
# in a tree structure) this could be programmed in a recursive fashion.
for meta_filename, data_filename in regular_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import structural variant data
if structural_variant_filepair is not None:
meta_filename, data_filename = structural_variant_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import cna data
if cna_long_filepair is not None:
meta_filename, data_filename = cna_long_filepair
import_study_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])
import_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])

# Import expression z-score (after expression)
for meta_filename, data_filename in zscore_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import GSVA genetic profiles (after expression and z-scores)
if gsva_score_filepair is not None:

# First import the GSVA score data
meta_filename, data_filename = gsva_score_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Second import the GSVA p-value data
meta_filename, data_filename = gsva_pvalue_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

if gene_panel_matrix_filepair is not None:
meta_filename, data_filename = gene_panel_matrix_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import the case lists
case_list_dirname = os.path.join(study_directory, 'case_lists')
@@ -412,6 +432,72 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
# enable study
update_study_status(jvm_args, study_id)

def get_meta_filenames_by_type(data_directory) -> Dict[str, Tuple[str, Dict]]:
"""
Read all meta files in the data directory and return meta information (filename, content) grouped by type.
"""
meta_file_type_to_meta_files = {}

# Determine meta filenames in study directory
meta_filenames = get_meta_filenames(data_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:

# Parse meta file
meta_dictionary = cbioportal_common.parse_metadata_file(
meta_filename, logger=LOGGER)

# Retrieve meta file type
meta_file_type = meta_dictionary['meta_file_type']
if meta_file_type is None:
# invalid meta file, let's die
raise RuntimeError('Invalid meta file: ' + meta_filename)
if meta_file_type not in meta_file_type_to_meta_files:
meta_file_type_to_meta_files[meta_file_type] = []

meta_file_type_to_meta_files[meta_file_type].append((meta_filename, meta_dictionary))
return meta_file_type_to_meta_files

def import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files):
"""
Load all data types that are available and support incremental upload
"""
for meta_file_type in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
if meta_file_type not in meta_file_type_to_meta_files:
continue
meta_pairs = meta_file_type_to_meta_files[meta_file_type]
for meta_pair in meta_pairs:
meta_filename, meta_dictionary = meta_pair
data_filename = os.path.join(data_directory, meta_dictionary['data_filename'])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, meta_dictionary, incremental=True)

def update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files):
"""
Updates case lists if clinical sample data is provided.
The command takes the case_lists/ folder as an optional argument.
If the folder exists, case lists will be updated accordingly.
"""
if MetaFileTypes.SAMPLE_ATTRIBUTES in meta_file_type_to_meta_files:
case_list_dirname = os.path.join(data_directory, 'case_lists')
sample_attributes_metas = meta_file_type_to_meta_files[MetaFileTypes.SAMPLE_ATTRIBUTES]
for meta_pair in sample_attributes_metas:
meta_filename, meta_dictionary = meta_pair
LOGGER.info('Updating case lists with sample ids', extra={'filename_': meta_filename})
update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir=case_list_dirname if os.path.isdir(case_list_dirname) else None)

def process_data_directory(jvm_args, data_directory, update_generic_assay_entity = None):
"""
Incremental import of data directory based on meta files found.
"""

meta_file_type_to_meta_files = get_meta_filenames_by_type(data_directory)

not_supported_meta_types = meta_file_type_to_meta_files.keys() - INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES
if not_supported_meta_types:
raise NotImplementedError("These types do not support incremental upload: {}".format(", ".join(not_supported_meta_types)))
import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files)
update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files)

def usage():
# TODO : replace this by usage string from interface()
@@ -435,26 +521,27 @@ def check_files(meta_filename, data_filename):
print('data-file cannot be found:' + data_filename, file=ERROR_FILE)
sys.exit(2)

def check_dir(study_directory):
def check_dir(data_directory):
# check existence of directory
if not os.path.exists(study_directory) and study_directory != '':
print('Study cannot be found: ' + study_directory, file=ERROR_FILE)
if not os.path.exists(data_directory) and data_directory != '':
print('Directory cannot be found: ' + data_directory, file=ERROR_FILE)
sys.exit(2)

def add_parser_args(parser):
parser.add_argument('-s', '--study_directory', type=str, required=False,
help='Path to Study Directory')
data_source_group = parser.add_mutually_exclusive_group()
data_source_group.add_argument('-s', '--study_directory', type=str, help='Path to Study Directory')
data_source_group.add_argument('-d', '--data_directory', type=str, help='Path to Data Directory')
parser.add_argument('-jvo', '--java_opts', type=str, default=os.environ.get('JAVA_OPTS'),
help='Path to specify JAVA_OPTS for the importer. \
(default: gets the JAVA_OPTS from the environment)')
(default: gets the JAVA_OPTS from the environment)')
parser.add_argument('-jar', '--jar_path', type=str, required=False,
help='Path to scripts JAR file')
help='Path to scripts JAR file')
parser.add_argument('-meta', '--meta_filename', type=str, required=False,
help='Path to meta file')
parser.add_argument('-data', '--data_filename', type=str, required=False,
help='Path to Data file')

def interface():
def interface(args=None):
parent_parser = argparse.ArgumentParser(description='cBioPortal meta Importer')
add_parser_args(parent_parser)
parser = argparse.ArgumentParser()
@@ -484,7 +571,7 @@ def interface():
# TODO - add same argument to metaimporter
# TODO - harmonize on - and _

parser = parser.parse_args()
parser = parser.parse_args(args)
if parser.command is not None and parser.subcommand is not None:
print('Cannot call multiple commands')
sys.exit(2)
@@ -547,14 +634,16 @@ def main(args):

# process the options
jvm_args = "-Dspring.profiles.active=dbcp " + args.java_opts
study_directory = args.study_directory

# check if DB version and application version are in sync
check_version(jvm_args)

if study_directory != None:
check_dir(study_directory)
process_directory(jvm_args, study_directory, args.update_generic_assay_entity)
if args.data_directory is not None:
check_dir(args.data_directory)
process_data_directory(jvm_args, args.data_directory, args.update_generic_assay_entity)
elif args.study_directory is not None:
check_dir(args.study_directory)
process_study_directory(jvm_args, args.study_directory, args.update_generic_assay_entity)
else:
check_args(args.command)
check_files(args.meta_filename, args.data_filename)
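A usage sketch of the reworked entry point and its now mutually exclusive options (paths are illustrative, and it is assumed the script can be invoked directly, analogous to the `metaImport.py` examples in the README):

```bash
# Full study import, as before:
python importer/cbioportalImporter.py -s /data/study_es_0

# Incremental import of a data directory (new in this commit); the script
# fails fast if the directory contains meta types that do not support
# incremental upload:
python importer/cbioportalImporter.py -d /data/study_es_0_inc
```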