Various stuff (#24)
mboudet authored Dec 21, 2022
1 parent 1add948 commit a05133a
Showing 14 changed files with 1,242 additions and 121 deletions.
11 changes: 8 additions & 3 deletions CHANGELOG.md
@@ -7,21 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

This changelog was started for release 0.0.3.

## [0.0.3] - Unreleased
## [0.0.3] - 21/11/2022

### Added

- empty_ok_if key for validator
- empty_ok_unless key for validator
- empty_ok_if key for validator & templates
- empty_ok_unless key for validator & templates
- readme key for validator
- unique key for validator
- expected_rows key for templates
- logs parameters for templates
- na_ok key for validators & templates
- skip_generation key for validators & templates
- skip_validation key for validators & templates

### Fixed

- Bug for SetValidator when using number values
- Fixed regex for GPS

### Changed

- Better validation for integers
- Refactor validation in excel for most validators (to include unique & na_ok)
44 changes: 26 additions & 18 deletions README.md
@@ -1,8 +1,8 @@
# Checkcel

Checkcel is a generation & validation tool for CSV/ODS/XLSX/XLS files.
Basic validations (sets, whole, decimals, unicity, emails, dates) are included, but also ontologies validation.
(Using the [OLS API](https://www.ebi.ac.uk/ols/index))
Basic validations (sets, whole numbers, decimals, unicity, emails, dates, regex) are included, as well as ontology validation.
(Using the [OLS API](https://www.ebi.ac.uk/ols/index), and the [INRAE thesaurus](https://consultation.vocabulaires-ouverts.inrae.fr))

Checkcel works with either python templates or json/yml files for the generation and validation.
Examples are available [here](https://github.com/mboudet/checkcel_templates) or in the [example folder](examples/).
@@ -98,6 +98,7 @@ Checkcel(
sheet="0"
).load_from_json_file(your_json_template_file).validate()

# You can access the logs from python with the 'logs' key of the Checkcel class
```
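
A minimal sketch of reading those logs after a run, assuming `Checkcel` is importable from the `checkcel` package and that `logs` is a plain list on the instance (as the `checkplate.py` changes further down initialise it). The file names are placeholders, and the constructor arguments other than `sheet` are elided in the hunk above, so they are passed here as a simple placeholder:

```python
from checkcel import Checkcel

# Placeholder arguments: only 'sheet' appears in the hunk above;
# see the full constructor call in the README for the real argument names.
checker = Checkcel(
    "my_data.xlsx",
    sheet="0"
).load_from_json_file("my_template.json")

checker.validate()

# Validation messages accumulate in the 'logs' list of the instance
for entry in checker.logs:
    print(entry)
```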

# Templates
@@ -108,8 +109,12 @@ In all cases, you will need to at least include a list of validators and associa
* *metadata*: A list of column names. This will create a metadata sheet with these columns, without validation on them
* *expected_rows* (Default 0): Number of *data* rows expected
* *empty_ok* (Default False): Whether to accept empty values as valid
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity
* *ignore_case* (Default False): whether to ignore the case
* *na_ok* (Default False): whether to allow NA (or n/a) values as valid
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity in python
* *ignore_case* (Default False): whether to ignore the case (when relevant) before checking validity in python
* *skip_generation* (Default False): whether to skip generating the excel validation rules (when generating files) for all validators
* *skip_validation* (Default False): whether to skip the python validation for all validators
* *unique* (Default False): whether to require unicity for all validators

The validator-related parameters (empty_ok, na_ok, ignore_space, ignore_case, skip_generation, skip_validation and unique) will affect all the validators (when relevant), but can be overridden at the validator level (e.g., you can set 'empty_ok' to True for all columns, but set it to False for a specific validator), as sketched below.
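
For illustration, here is a minimal sketch of a Python template combining some of these options. It assumes, as in the linked template examples, that a template is a class exposing `validators` and these options as attributes (the `load_from_python_file` changes below read them with `getattr`); the import paths, column names and values are hypothetical:

```python
from collections import OrderedDict

# Hypothetical import paths, for illustration only
from checkcel import Checkplate
from checkcel.validators import SetValidator, IntValidator, DateValidator


class MyTemplate(Checkplate):
    # Template-level options; individual validators can override them
    empty_ok = True
    na_ok = True
    expected_rows = 50
    metadata = ["Operator", "Comment"]

    validators = OrderedDict([
        # Override empty_ok for this column only
        ("Sample type", SetValidator(valid_values=["soil", "water"], empty_ok=False)),
        ("Replicate", IntValidator(min=1, max=10)),
        ("Collection date", DateValidator(day_first=True)),
    ])
```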

@@ -155,66 +160,69 @@ All validators (except NoValidator) have these options available. If relevant, t
* The dict keys must be column names, and the values lists of 'rejected values'. The current column will accept empty values if the related column's value is **not** in the list of rejected values
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity
* *ignore_case* (Default False): whether to ignore the case
* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel yet, except if there are not other validation (ie TextValidator and RegexValidator in some cases))
* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel for 'Set-type' validators (set, linked-set, ontology, vocabulaireOuvert))
* *na_ok* (Default False): whether to allow NA (or n/a) values as valid.
* *skip_generation* (Default False): whether to skip the excel validation for this validator (for file generation)
* *skip_validation* (Default False): whether to skip the python validation for this validator

*As excel validation for non-empty values is unreliable, the non-emptiness cannot be properly enforced in excel files*
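
For example, a small sketch combining several of these per-validator options. The import path and column names are hypothetical, and the `empty_ok_unless` behaviour shown in the comment is assumed to follow the 'rejected values' semantics described above:

```python
from collections import OrderedDict

# Hypothetical import path and column names
from checkcel.validators import SetValidator, FloatValidator

validators = OrderedDict([
    ("Matrix", SetValidator(valid_values=["soil", "water", "other"], unique=False)),
    # Assumption: 'Depth' may be left empty as long as 'Matrix' is *not* "soil";
    # na_ok additionally lets "NA" / "n/a" pass as valid
    ("Depth", FloatValidator(min=0, empty_ok_unless={"Matrix": ["soil"]}, na_ok=True)),
])
```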

### Validator-specific options

* NoValidator (always True)
* **No in-file validation generated**
* TextValidator(empty_ok=False)
* TextValidator(**kwargs)
* **No in-file validation generated** (unless *unique* is set)
* IntValidator(min="", max="", empty_ok=False)
* IntValidator(min="", max="", **kwargs)
* Validate that a value is an integer
* *min*: Minimal value allowed
* *max*: Maximal value allowed
* FloatValidator(min="", max="", empty_ok=False)
* FloatValidator(min="", max="", **kwargs)
* Validate that a value is a float
* *min*: Minimal value allowed
* *max*: Maximal value allowed
* SetValidator(valid_values=[], empty_ok=False)
* SetValidator(valid_values=[], **kwargs)
* Validate that a value is part of a set of allowed values
* *valid_values*: list of valid values
* LinkedSetValidator(linked_column="", valid_values={}, empty_ok=False)
* LinkedSetValidator(linked_column="", valid_values={}, **kwargs)
* Validate that a value is part of a set of allowed values, in relation to another column value.
* Eg: Valid values for column C will be '1' or '2' if column B value is 'Test', else '3' or '4'
* *linked_column*: Linked column name
* *valid_values*: Dict with the *linked_column* values as keys, and list of valid values as values
* Ex: {"Test": ['1', '2'], "Test2": ['3', '4']}
* EmailValidator(empty_ok=False)
* DateValidator(day_first=True, empty_ok=False, before=None, after=None)
* EmailValidator(**kwargs)
* DateValidator(day_first=True, before=None, after=None, **kwargs)
* Validate that a value is a date.
* *day_first* (Default True): Whether to consider the day as the first part of the date for ambiguous values.
* *before*: Latest date allowed
* *after*: Earliest date allowed
* TimeValidator(empty_ok=False, before=None, after=None)
* TimeValidator(before=None, after=None, **kwargs)
* Validate that a value is a time of the day
* *before*: Latest value allowed
* *after*: Earliest value allowed
* UniqueValidator(unique_with=[], empty_ok=False)
* UniqueValidator(unique_with=[], **kwargs)
* Validate that a column has only unique values.
* *unique_with*: List of column names if you need a tuple of column values to be unique.
* Ex: *I want the tuple (value of column A, value of column B) to be unique*
* OntologyValidator(ontology, root_term="", empty_ok=False)
* OntologyValidator(ontology, root_term="", **kwargs)
* Validate that a term is part of an ontology, using the [OLS API](https://www.ebi.ac.uk/ols/index) for validation
* *ontology* needs to be a short-form ontology name (ex: ncbitaxon)
* *root_term* can be used if you want to make sure your terms are *descendants* of a specific term
* (Should be used when generating validated files using big ontologies)
* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", empty_ok=False)
* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", **kwargs)
* Validate that a term is part of the INRAE (default) or IRSTEA thesaurus
* **No in-file validation generated** *unless using root_term*
* *root_term*: Same as OntologyValidator.
* *lang*: Language for the queried terms *(en or fr)*
* *labellang*: Language for the returned labels (i.e., the generated in-file validation). Defaults to the *lang* value.
* *vocab*: Vocabulary used. Either 'thesaurus-inrae' or 'thesaurus-irstea'.
* GPSValidator(empty_ok=False, format="DD", only_long=False, only_lat=False)
* GPSValidator(format="DD", only_long=False, only_lat=False, **kwargs)
* Validate that a term is a valid GPS coordinate
* **No in-file validation generated**
* *format*: Expected GPS format. Valid values are *dd* (decimal degrees, default value) or *dms* (degree minutes seconds)
* *only_long*: Expect only a longitude
* *only_lat*: Expect only a latitude
* RegexValidator(regex, excel_formulat="", empty_ok=False)
* RegexValidator(regex, excel_formula="", **kwargs)
* Validate that a term matches a specific regex
* **No in-file validation generated** *unless using excel_formula*
* *excel_formula*: Custom rules for in-file validation. [Examples here](http://www.contextures.com/xlDataVal07.html). See the combined sketch after this list.
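
As referenced in the RegexValidator item, here is a combined sketch of a few validator-specific options. The regex, the excel formula (including its cell-reference convention, which this diff does not show), and the column names are all assumptions for illustration:

```python
from collections import OrderedDict

# Hypothetical import path and column names
from checkcel.validators import RegexValidator, DateValidator, TextValidator, UniqueValidator

validators = OrderedDict([
    # Accession IDs such as "AB-0001"; the excel_formula is a hand-written
    # in-file rule in the style of the linked Contextures examples, and its
    # cell reference (A2) is an assumption, not a documented convention
    ("Accession", RegexValidator(
        regex=r"^[A-Z]{2}-\d{4}$",
        excel_formula='=AND(LEN(A2)=7,EXACT(LEFT(A2,2),UPPER(LEFT(A2,2))))',
    )),
    ("Collection date", DateValidator(day_first=True)),
    ("Plot", TextValidator()),
    # The (Site, Plot) pair must be unique across rows
    ("Site", UniqueValidator(unique_with=["Plot"])),
])
```
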
13 changes: 8 additions & 5 deletions checkcel/checkerator.py
@@ -42,7 +42,7 @@ def generate(self):
if isinstance(validator, OntologyValidator) or isinstance(validator, VocabulaireOuvertValidator):
if not ontology_sheet:
ontology_sheet = wb.create_sheet(title="Ontologies")
data_validation = validator.generate(get_column_letter(current_data_column), get_column_letter(current_ontology_column), ontology_sheet)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_ontology_column), ontology_sheet)
current_ontology_column += 1
elif isinstance(validator, SetValidator):
# Total size, including separators must be < 256
@@ -52,25 +52,28 @@
data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_set_column), set_sheet)
current_set_column += 1
else:
data_validation = validator.generate(get_column_letter(current_data_column))
data_validation = validator.generate(get_column_letter(current_data_column), column_name)
set_columns[column_name] = get_column_letter(current_data_column)
elif isinstance(validator, LinkedSetValidator):
if not set_sheet:
set_sheet = wb.create_sheet(title="Sets")
data_validation = validator.generate(get_column_letter(current_data_column), set_columns, column_name, get_column_letter(current_set_column), set_sheet, wb)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, set_columns, get_column_letter(current_set_column), set_sheet, wb)
current_set_column += 1
set_columns[column_name] = get_column_letter(current_data_column)
elif isinstance(validator, UniqueValidator):
data_validation = validator.generate(get_column_letter(current_data_column), column_dict)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, column_dict)
else:
data_validation = validator.generate(get_column_letter(current_data_column))
data_validation = validator.generate(get_column_letter(current_data_column), column_name)
if data_validation:
data_sheet.add_data_validation(data_validation)
current_data_column += 1
for sheet in wb.worksheets:
for column_cells in sheet.columns:
length = (max(len(self.as_text(cell.value)) for cell in column_cells) + 2) * 1.2
sheet.column_dimensions[get_column_letter(column_cells[0].column)].width = length

if self.freeze_header:
data_sheet.freeze_panes = "A2"
wb.save(filename=self.output)

def as_text(self, value):
23 changes: 19 additions & 4 deletions checkcel/checkplate.py
@@ -15,19 +15,24 @@

class Checkplate(object):
""" Base class for templates """
def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None):
def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None, na_ok=False, unique=False, skip_generation=False, skip_validation=False, freeze_header=False):
self.metadata = metadata
self.logger = logs.logger
self.validators = validators or getattr(self, "validators", {})
self.logs = []
# Will be overriden by validators config
self.empty_ok = empty_ok
self.na_ok = na_ok
self.unique = unique
self.skip_generation = skip_generation
self.skip_validation = skip_validation
self.ignore_case = ignore_case
self.ignore_space = ignore_space
self.expected_rows = expected_rows
self.freeze_header = freeze_header
# self.trim_values = False
for validator in self.validators.values():
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)

def debug(self, message):
self.logger.debug(message)
@@ -69,9 +74,14 @@ def load_from_python_file(self, file_path):
self.metadata = getattr(custom_class, 'metadata', [])
self.validators = deepcopy(custom_class.validators)
self.empty_ok = getattr(custom_class, 'empty_ok', False)
self.na_ok = getattr(custom_class, 'na_ok', False)
self.unique = getattr(custom_class, 'unique', False)
self.skip_generation = getattr(custom_class, 'skip_generation', False)
self.skip_validation = getattr(custom_class, 'skip_validation', False)
self.ignore_case = getattr(custom_class, 'ignore_case', False)
self.ignore_space = getattr(custom_class, 'ignore_space', False)
self.expected_rows = getattr(custom_class, 'expected_rows', 0)
self.freeze_header = getattr(custom_class, 'freeze_header', False)
try:
self.expected_rows = int(self.expected_rows)
except ValueError:
@@ -80,7 +90,7 @@
)

for key, validator in self.validators.items():
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)
return self

def load_from_json_file(self, file_path):
@@ -136,9 +146,14 @@ def _load_from_dict(self, data):
return exits.UNAVAILABLE

self.empty_ok = data.get("empty_ok", False)
self.na_ok = data.get("na_ok", False)
self.ignore_case = data.get('ignore_case', False)
self.ignore_space = data.get('ignore_space', False)
self.expected_rows = data.get('expected_rows', 0)
self.unique = data.get('unique', False)
self.skip_generation = data.get('skip_generation', False)
self.skip_validation = data.get('skip_validation', False)
self.freeze_header = data.get('freeze_header', False)
try:
self.expected_rows = int(self.expected_rows)
except ValueError:
@@ -161,7 +176,7 @@ def _load_from_dict(self, data):
try:
validator_class = getattr(validators, validator['type'])
val = validator_class(**options)
val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)
except AttributeError:
self.error(
"{} is not a valid Checkcel Validator".format(validator['type'])