change taking path to taking buffer #166

Open
frehburg opened this issue Oct 8, 2024 · 2 comments
Comments

@frehburg
Contributor

frehburg commented Oct 8, 2024

No description provided.

@frehburg
Contributor Author

frehburg commented Oct 8, 2024

@ielis I could not find any convenient way to take a buffer instead of a path. How would you deal with it in a method such as this:

def load_data_using_data_model(
        path: Union[str, Path],
        data_model: DataModel,
        column_names: Dict[str, str],
        compliance: Literal['lenient', 'strict'] = 'lenient',
) -> DataSet:
    """Loads data from a file using a DataModel definition

    List a column for each field of the `DataModel` in the `column_names` dictionary. The keys of the dictionary should
    be {id}_column for each field and the values should be the name of the column in the file.

    E.g.:
    ```python
    data_model = DataModel("Test data model", [DataField(name="Field 1", value_set=ValueSet())])
    column_names = {"field_1_column": "column_name_in_file"}
    load_data_using_data_model("data.csv", data_model, column_names)
    ```

    :param path: Path to a formatted CSV or Excel file
    :param data_model: DataModel to use for reading the file
    :param column_names: A dictionary mapping from the id of each field of the `DataModel` to the name of a
                        column in the file
    :param compliance: Compliance level to enforce when reading the file. If 'lenient', the file can have extra fields
                        that are not in the DataModel. If 'strict', the file must have all fields in the DataModel.
    :return: DataSet holding one DataModelInstance per row of the file
    """
    if isinstance(path, Path):
        pass
    elif isinstance(path, str):
        path = Path(path)
    else:
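        raise ValueError(f'Path must be a string or Path object, not {type(path)}')

    # NOTE: the step that reads `path` into the DataFrame `df` used below is not shown in this snippet.
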
    # check column_names is in the correct format
    if isinstance(column_names, MappingProxyType):
        column_names = dict(column_names)
    for f in data_model.fields:
        if f.id not in column_names and f.id + "_column" not in column_names:
            raise ValueError(f"Column name for field id: {f.id} name: {f.name} not found in column_names dictionary,"
                             f" list it with the key '{f.id}_column'")
        elif f.id + "_column" in column_names:
            column_names[f.id] = column_names.pop(f.id + "_column")

    data_model_instances = []

    for i in range(len(df)):  # TODO: also iterate over non-tabular data
        values = []
        for f in data_model.fields:
            column_name = column_names[f.id]

            pandas_value = loc_default(df, row_index=i, column_name=column_name)

            if not pandas_value or (isinstance(pandas_value, float) and math.isnan(pandas_value)):
                continue

            value_str = str(pandas_value)
            value = parsing.parse_value(value_str=value_str, resources=data_model.resources, compliance=compliance)
            values.append(DataFieldValue(row_no=i, field=f, value=value))
        data_model_instances.append(
            DataModelInstance(
                row_no=i,
                data_model=data_model,
                values=values,
                compliance=compliance)
        )

    return DataSet(data_model=data_model, data=data_model_instances)

I would welcome your feedback.
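
As a side note on the `column_names` check in the snippet above, the `{id}_column` convention can be sketched in isolation. `normalize_column_names` is a hypothetical helper, not part of the project; it assumes field ids are plain strings and returns a new dictionary instead of popping keys from the one the caller passed in. Unlike the original loop, it keeps only the keys that belong to a field.

```python
from types import MappingProxyType
from typing import Dict, Iterable, Mapping


def normalize_column_names(
        field_ids: Iterable[str],
        column_names: Mapping[str, str],
) -> Dict[str, str]:
    """Map each field id to its column name (hypothetical helper).

    Accepts keys given either as '<id>' or as '<id>_column'.
    The input mapping is never modified.
    """
    normalized: Dict[str, str] = {}
    for field_id in field_ids:
        suffixed = f"{field_id}_column"
        if suffixed in column_names:
            # Same precedence as the snippet above: '<id>_column' wins over a bare '<id>' key.
            normalized[field_id] = column_names[suffixed]
        elif field_id in column_names:
            normalized[field_id] = column_names[field_id]
        else:
            raise ValueError(
                f"Column name for field id: {field_id} not found in column_names dictionary,"
                f" list it with the key '{suffixed}'"
            )
    return normalized
```

Because the input is typed as `Mapping`, a `MappingProxyType` works without the explicit `dict(...)` conversion.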

@ielis
Collaborator

ielis commented Oct 8, 2024

I have a similar routine here:

https://github.com/monarch-initiative/gpsea/blob/develop/src/gpsea/util.py

and a few tests here (but only for the buffer part...):

https://github.com/monarch-initiative/gpsea/blob/develop/tests/test_util.py
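
The general idea of accepting either a path or an already-open handle can be sketched with the standard library alone. `open_text_handle` below is a hypothetical helper written for illustration; it is not the gpsea routine linked above, and its name, signature, and error type are assumptions. It opens and closes the file itself when given a path, and leaves a caller-supplied buffer open.

```python
import io
import typing
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_text_handle(
        source: typing.Union[str, Path, typing.IO],
        encoding: str = "utf-8",
) -> typing.Iterator[typing.TextIO]:
    """Yield a text handle for a path, a text buffer, or a binary buffer (hypothetical helper)."""
    if isinstance(source, (str, Path)):
        # We opened the file, so we are responsible for closing it.
        with open(source, "r", encoding=encoding) as fh:
            yield fh
    elif isinstance(source, io.TextIOBase):
        # Already a text handle; the caller keeps ownership and closes it.
        yield source
    elif isinstance(source, (io.RawIOBase, io.BufferedIOBase)):
        # Binary handle: decode on the fly, then detach so the caller's buffer stays open.
        wrapper = io.TextIOWrapper(source, encoding=encoding)
        try:
            yield wrapper
        finally:
            wrapper.detach()
    else:
        raise ValueError(f"Unsupported source type: {type(source)}")
```

A loader written against this helper would accept `"data.csv"`, `Path("data.csv")`, `io.StringIO(...)`, or a file opened with `open(..., "rb")`, and the row-by-row logic would only ever see a text handle. That also makes the buffer case easy to test without touching the file system.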
