Various stuff (#24)
mboudet authored Dec 21, 2022
1 parent 1add948 commit a05133a
Showing 14 changed files with 1,242 additions and 121 deletions.
11 changes: 8 additions & 3 deletions CHANGELOG.md
@@ -7,21 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

This changelog was started for release 0.0.3.

## [0.0.3] - Unreleased
## [0.0.3] - 21/11/2022

### Added

- empty_ok_if key for validator
- empty_ok_unless key for validator
- empty_ok_if key for validator & templates
- empty_ok_unless key for validator & templates
- readme key for validator
- unique key for validator
- expected_rows key for templates
- logs parameters for templates
- na_ok key for validators & templates
- skip_generation key for validators & templates
- skip_validation key for validators & templates

### Fixed

- Bug for SetValidator when using number values
- Fixed regex for GPS

### Changed

- Better validation for integers
- Refactor validation in excel for most validators (to include unique & na_ok)
44 changes: 26 additions & 18 deletions README.md
@@ -1,8 +1,8 @@
# Checkcel

Checkcel is a generation & validation tool for CSV/ODS/XLSX/XLS files.
Basic validations (sets, whole, decimals, unicity, emails, dates) are included, but also ontologies validation.
(Using the [OLS API](https://www.ebi.ac.uk/ols/index))
Basic validations (sets, whole numbers, decimals, unicity, emails, dates, regex) are included, as well as ontology validation.
(Using the [OLS API](https://www.ebi.ac.uk/ols/index), and the [INRAE thesaurus](https://consultation.vocabulaires-ouverts.inrae.fr))

Checkcel works with either python templates or json/yml files for the generation and validation.
Examples are available [here](https://github.com/mboudet/checkcel_templates) or in the [example folder](examples/).
@@ -98,6 +98,7 @@ Checkcel(
sheet="0"
).load_from_json_file(your_json_template_file).validate()

# You can access the logs from python with the 'logs' key of the Checkcel class
```
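
A minimal sketch of reading those logs after a run, assuming `Checkcel` is importable from the `checkcel` package and that `logs` is a plain list on the instance (as the `checkplate.py` changes further down initialise it). The file names are placeholders, and the constructor arguments other than `sheet` are elided in the hunk above, so they are passed here as a simple placeholder:

```python
from checkcel import Checkcel

# Placeholder arguments: only 'sheet' appears in the hunk above;
# see the full constructor call in the README for the real argument names.
checker = Checkcel(
    "my_data.xlsx",
    sheet="0"
).load_from_json_file("my_template.json")

checker.validate()

# Validation messages accumulate in the 'logs' list of the instance
for entry in checker.logs:
    print(entry)
```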

# Templates
@@ -108,8 +109,12 @@ In all cases, you will need to at least include a list of validators and associa
* *metadata*: A list of column names. This will create a metadata sheet with these columns, without validation on them
* *expected_rows* (Default 0): Number of *data* rows expected
* *empty_ok* (Default False): Whether to accept empty values as valid
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity
* *ignore_case* (Default False): whether to ignore the case
* *na_ok* (Default False): whether to allow NA (or n/a) values as valid
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity in python
* *ignore_case* (Default False): whether to ignore the case (when relevant) before checking validity in python
* *skip_generation* (Default False): whether to skip generating the excel validation rules (when generating files) for all validators
* *skip_validation* (Default False): whether to skip the python validation for all validators
* *unique* (Default False): whether to require unicity for all validators

The validator-related parameters (empty_ok, na_ok, ignore_space, ignore_case, skip_generation, skip_validation and unique) will affect all the validators (when relevant), but can be overridden at the validator level (e.g., you can set 'empty_ok' to True for all columns, but set it to False for a specific validator), as sketched below.
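
For illustration, here is a minimal sketch of a Python template combining some of these options. It assumes, as in the linked template examples, that a template is a class exposing `validators` and these options as attributes (the `load_from_python_file` changes below read them with `getattr`); the import paths, column names and values are hypothetical:

```python
from collections import OrderedDict

# Hypothetical import paths, for illustration only
from checkcel import Checkplate
from checkcel.validators import SetValidator, IntValidator, DateValidator


class MyTemplate(Checkplate):
    # Template-level options; individual validators can override them
    empty_ok = True
    na_ok = True
    expected_rows = 50
    metadata = ["Operator", "Comment"]

    validators = OrderedDict([
        # Override empty_ok for this column only
        ("Sample type", SetValidator(valid_values=["soil", "water"], empty_ok=False)),
        ("Replicate", IntValidator(min=1, max=10)),
        ("Collection date", DateValidator(day_first=True)),
    ])
```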

@@ -155,66 +160,69 @@ All validators (except NoValidator) have these options available. If relevant, t
* The dict keys must be column names, and the values lists of 'rejected values'. The current column will accept empty values if the related column's value is **not** in the list of rejected values
* *ignore_space* (Default False): whether to trim the values for spaces before checking validity
* *ignore_case* (Default False): whether to ignore the case
* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel yet, except if there are not other validation (ie TextValidator and RegexValidator in some cases))
* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel for 'Set-type' validators (set, linked-set, ontology, vocabulaireOuvert))
* *na_ok* (Default False): whether to allow NA (or n/a) values as valid.
* *skip_generation* (Default False): whether to skip the excel validation for this validator (for file generation)
* *skip_validation* (Default False): whether to skip the python validation for this validator

*As excel validation for non-empty values is unreliable, the non-emptiness cannot be properly enforced in excel files*
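
For example, a small sketch combining several of these per-validator options. The import path and column names are hypothetical, and the `empty_ok_unless` behaviour shown in the comment is assumed to follow the 'rejected values' semantics described above:

```python
from collections import OrderedDict

# Hypothetical import path and column names
from checkcel.validators import SetValidator, FloatValidator

validators = OrderedDict([
    ("Matrix", SetValidator(valid_values=["soil", "water", "other"], unique=False)),
    # Assumption: 'Depth' may be left empty as long as 'Matrix' is *not* "soil";
    # na_ok additionally lets "NA" / "n/a" pass as valid
    ("Depth", FloatValidator(min=0, empty_ok_unless={"Matrix": ["soil"]}, na_ok=True)),
])
```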

### Validator-specific options

* NoValidator (always True)
* **No in-file validation generated**
* TextValidator(empty_ok=False)
* TextValidator(**kwargs)
* **No in-file validation generated** (unless *unique* is set)
* IntValidator(min="", max="", empty_ok=False)
* IntValidator(min="", max="", **kwargs)
* Validate that a value is an integer
* *min*: Minimal value allowed
* *max*: Maximal value allowed
* FloatValidator(min="", max="", empty_ok=False)
* FloatValidator(min="", max="", **kwargs)
* Validate that a value is a float
* *min*: Minimal value allowed
* *max*: Maximal value allowed
* SetValidator(valid_values=[], empty_ok=False)
* SetValidator(valid_values=[], **kwargs)
* Validate that a value is part of a set of allowed values
* *valid_values*: list of valid values
* LinkedSetValidator(linked_column="", valid_values={}, empty_ok=False)
* LinkedSetValidator(linked_column="", valid_values={}, **kwargs)
* Validate that a value is part of a set of allowed values, in relation to another column value.
* Eg: Valid values for column C will be '1' or '2' if column B value is 'Test', else '3' or '4'
* *linked_column*: Linked column name
* *valid_values*: Dict with the *linked_column* values as keys, and list of valid values as values
* Ex: {"Test": ['1', '2'], "Test2": ['3', '4']}
* EmailValidator(empty_ok=False)
* DateValidator(day_first=True, empty_ok=False, before=None, after=None)
* EmailValidator(**kwargs)
* DateValidator(day_first=True, before=None, after=None, **kwargs)
* Validate that a value is a date.
* *day_first* (Default True): Whether to consider the day as the first part of the date for ambiguous values.
* *before*: Latest date allowed
* *after*: Earliest date allowed
* TimeValidator(empty_ok=False, before=None, after=None)
* TimeValidator(before=None, after=None, **kwargs)
* Validate that a value is a time of the day
* *before*: Latest value allowed
* *after*: Earliest value allowed
* UniqueValidator(unique_with=[], empty_ok=False)
* UniqueValidator(unique_with=[], **kwargs)
* Validate that a column has only unique values.
* *unique_with*: List of column names if you need a tuple of column values to be unique.
* Ex: *I want the tuple (value of column A, value of column B) to be unique*
* OntologyValidator(ontology, root_term="", empty_ok=False)
* OntologyValidator(ontology, root_term="", **kwargs)
* Validate that a term is part of an ontology, using the [OLS API](https://www.ebi.ac.uk/ols/index) for validation
* *ontology* needs to be a short-form ontology name (ex: ncbitaxon)
* *root_term* can be used if you want to make sure your terms are *descendants* of a specific term
* (Should be used when generating validated files using big ontologies)
* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", empty_ok=False)
* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", **kwargs)
* Validate that a term is part of the INRAE (default) or IRSTEA thesaurus
* **No in-file validation generated** *unless using root_term*
* *root_term*: Same as OntologyValidator.
* *lang*: Language for the queried terms *(en or fr)*
* *labellang*: Language for the returned labels (i.e., the generated in-file validation). Defaults to the *lang* value.
* *vocab*: Vocabulary used. Either 'thesaurus-inrae' or 'thesaurus-irstea'.
* GPSValidator(empty_ok=False, format="DD", only_long=False, only_lat=False)
* GPSValidator(format="DD", only_long=False, only_lat=False, **kwargs)
* Validate that a term is a valid GPS coordinate
* **No in-file validation generated**
* *format*: Expected GPS format. Valid values are *dd* (decimal degrees, default value) or *dms* (degree minutes seconds)
* *only_long*: Expect only a longitude
* *only_lat*: Expect only a latitude
* RegexValidator(regex, excel_formulat="", empty_ok=False)
* RegexValidator(regex, excel_formula="", **kwargs)
* Validate that a term matches a specific regex
* **No in-file validation generated** *unless using excel_formula*
* *excel_formula*: Custom rules for in-file validation. [Examples here](http://www.contextures.com/xlDataVal07.html). See the combined sketch after this list.
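
As referenced in the RegexValidator item, here is a combined sketch of a few validator-specific options. The regex, the excel formula (including its cell-reference convention, which this diff does not show), and the column names are all assumptions for illustration:

```python
from collections import OrderedDict

# Hypothetical import path and column names
from checkcel.validators import RegexValidator, DateValidator, TextValidator, UniqueValidator

validators = OrderedDict([
    # Accession IDs such as "AB-0001"; the excel_formula is a hand-written
    # in-file rule in the style of the linked Contextures examples, and its
    # cell reference (A2) is an assumption, not a documented convention
    ("Accession", RegexValidator(
        regex=r"^[A-Z]{2}-\d{4}$",
        excel_formula='=AND(LEN(A2)=7,EXACT(LEFT(A2,2),UPPER(LEFT(A2,2))))',
    )),
    ("Collection date", DateValidator(day_first=True)),
    ("Plot", TextValidator()),
    # The (Site, Plot) pair must be unique across rows
    ("Site", UniqueValidator(unique_with=["Plot"])),
])
```
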
13 changes: 8 additions & 5 deletions checkcel/checkerator.py
@@ -42,7 +42,7 @@ def generate(self):
if isinstance(validator, OntologyValidator) or isinstance(validator, VocabulaireOuvertValidator):
if not ontology_sheet:
ontology_sheet = wb.create_sheet(title="Ontologies")
data_validation = validator.generate(get_column_letter(current_data_column), get_column_letter(current_ontology_column), ontology_sheet)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_ontology_column), ontology_sheet)
current_ontology_column += 1
elif isinstance(validator, SetValidator):
# Total size, including separators must be < 256
@@ -52,25 +52,28 @@
data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_set_column), set_sheet)
current_set_column += 1
else:
data_validation = validator.generate(get_column_letter(current_data_column))
data_validation = validator.generate(get_column_letter(current_data_column), column_name)
set_columns[column_name] = get_column_letter(current_data_column)
elif isinstance(validator, LinkedSetValidator):
if not set_sheet:
set_sheet = wb.create_sheet(title="Sets")
data_validation = validator.generate(get_column_letter(current_data_column), set_columns, column_name, get_column_letter(current_set_column), set_sheet, wb)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, set_columns, get_column_letter(current_set_column), set_sheet, wb)
current_set_column += 1
set_columns[column_name] = get_column_letter(current_data_column)
elif isinstance(validator, UniqueValidator):
data_validation = validator.generate(get_column_letter(current_data_column), column_dict)
data_validation = validator.generate(get_column_letter(current_data_column), column_name, column_dict)
else:
data_validation = validator.generate(get_column_letter(current_data_column))
data_validation = validator.generate(get_column_letter(current_data_column), column_name)
if data_validation:
data_sheet.add_data_validation(data_validation)
current_data_column += 1
for sheet in wb.worksheets:
for column_cells in sheet.columns:
length = (max(len(self.as_text(cell.value)) for cell in column_cells) + 2) * 1.2
sheet.column_dimensions[get_column_letter(column_cells[0].column)].width = length

if self.freeze_header:
data_sheet.freeze_panes = "A2"
wb.save(filename=self.output)

def as_text(self, value):
23 changes: 19 additions & 4 deletions checkcel/checkplate.py
@@ -15,19 +15,24 @@

class Checkplate(object):
""" Base class for templates """
def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None):
def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None, na_ok=False, unique=False, skip_generation=False, skip_validation=False, freeze_header=False):
self.metadata = metadata
self.logger = logs.logger
self.validators = validators or getattr(self, "validators", {})
self.logs = []
# Will be overriden by validators config
self.empty_ok = empty_ok
self.na_ok = na_ok
self.unique = unique
self.skip_generation = skip_generation
self.skip_validation = skip_validation
self.ignore_case = ignore_case
self.ignore_space = ignore_space
self.expected_rows = expected_rows
self.freeze_header = freeze_header
# self.trim_values = False
for validator in self.validators.values():
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)

def debug(self, message):
self.logger.debug(message)
@@ -69,9 +74,14 @@ def load_from_python_file(self, file_path):
self.metadata = getattr(custom_class, 'metadata', [])
self.validators = deepcopy(custom_class.validators)
self.empty_ok = getattr(custom_class, 'empty_ok', False)
self.na_ok = getattr(custom_class, 'na_ok', False)
self.unique = getattr(custom_class, 'unique', False)
self.skip_generation = getattr(custom_class, 'skip_generation', False)
self.skip_validation = getattr(custom_class, 'skip_validation', False)
self.ignore_case = getattr(custom_class, 'ignore_case', False)
self.ignore_space = getattr(custom_class, 'ignore_space', False)
self.expected_rows = getattr(custom_class, 'expected_rows', 0)
self.freeze_header = getattr(custom_class, 'freeze_header', False)
try:
self.expected_rows = int(self.expected_rows)
except ValueError:
@@ -80,7 +90,7 @@
)

for key, validator in self.validators.items():
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)
return self

def load_from_json_file(self, file_path):
@@ -136,9 +146,14 @@ def _load_from_dict(self, data):
return exits.UNAVAILABLE

self.empty_ok = data.get("empty_ok", False)
self.na_ok = data.get("na_ok", False)
self.ignore_case = data.get('ignore_case', False)
self.ignore_space = data.get('ignore_space', False)
self.expected_rows = data.get('expected_rows', 0)
self.unique = data.get('unique', False)
self.skip_generation = data.get('skip_generation', False)
self.skip_validation = data.get('skip_validation', False)
self.freeze_header = data.get('freeze_header', False)
try:
self.expected_rows = int(self.expected_rows)
except ValueError:
@@ -161,7 +176,7 @@ def _load_from_dict(self, data):
try:
validator_class = getattr(validators, validator['type'])
val = validator_class(**options)
val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space)
val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation)
except AttributeError:
self.error(
"{} is not a valid Checkcel Validator".format(validator['type'])