Add LogScaler transformer #930

Open
frances-h opened this issue Jan 13, 2025 · 0 comments
Labels
feature request Request for a new feature

Problem Description

Currently, the ScalarRange, ScalarInequality, Positive, and Negative constraints all scale the data using log-based transformers. Since the SDV already enforces min/max values by simpler means, these constraints can be deprecated. However, it would be helpful to move the scaling logic into the RDT library, so that it may be used independently of the constraints in the future.

Expected behavior

Create a new RDT transformer called LogScaler that transforms the data by applying a log. Its functionality is equivalent to the ScalarInequality constraint (as well as Positive and Negative).

Parameters

  • missing_value_replacement (object): Same as FloatFormatter
  • missing_value_generation (str or None): Same as FloatFormatter
  • constant (float): The constant to set as the 0-value for the log-based transform. Defaults to 0 (the 0-value of the data is not modified).
  • invert (bool): Whether to invert the data with respect to the constant value. If False, do not invert the data (all values will be greater than the constant value). If True, invert the data (all the values will be less than the constant value). Defaults to False.
  • learn_rounding_scheme (bool): Same as FloatFormatter

Implementation Notes

  • Apply logic similar to that used in ScalarInequality
  • During fit:
    • Learn needed information about missing values and rounding scheme.
    • Validate that we can actually take the log of the data.
  • During transform:
    • Fill in missing values
    • Transform the data
      • If invert is False (default): transformed_data = log(data - constant)
      • If invert is True: transformed_data = log(constant - data)
    • If there are any issues with taking the log, raise a descriptive error explaining what went wrong:
      • Example: Error: Unable to apply a log transform to column 'capital-gains' due to a non-positive value (-1).
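
The transform step above could be sketched roughly as follows. This is a minimal illustration using NumPy only, not the RDT implementation: the helper names `log_scale` and `log_unscale` are hypothetical, the reverse step is an assumption (the issue only describes the forward transform), and missing values are assumed to have been filled beforehand.

```python
import numpy as np


def log_scale(column, constant=0.0, invert=False, name='column'):
    """Hypothetical helper mirroring the transform step described above."""
    values = np.asarray(column, dtype=float)
    # Shift relative to the constant; flip the direction when invert=True.
    shifted = (constant - values) if invert else (values - constant)
    bad = shifted <= 0
    if bad.any():
        # Report the first offending original value in the error message.
        offending = values[bad][0]
        raise ValueError(
            f"Unable to apply a log transform to column '{name}' "
            f"due to a non-positive value ({offending:g})."
        )
    return np.log(shifted)


def log_unscale(transformed, constant=0.0, invert=False):
    """Assumed inverse of log_scale (not specified in the issue)."""
    exp = np.exp(np.asarray(transformed, dtype=float))
    return (constant - exp) if invert else (exp + constant)
```

With `constant=-1e-5`, as in the usage example below, a column containing zeros stays valid because `log(0 - (-1e-5))` is defined.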

Intended usage: A user would call update_transformers to replace the FloatFormatter with the LogScaler for any columns that would benefit from this.

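from sdv.single_table import GaussianCopulaSynthesizer
# LogScaler is the transformer proposed in this issue; this import path is assumed.
from rdt.transformers import LogScaler

# `metadata` and `data` are the user's existing metadata object and training table.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
# Replace the default FloatFormatter for this column with the LogScaler.
synthesizer.update_transformers({
    'capital-gains': LogScaler(missing_value_replacement=0, constant=-1e-5)
})
synthesizer.fit(data)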
@frances-h frances-h added the feature request Request for a new feature label Jan 13, 2025
@npatki npatki changed the title Add LogScalar transformer Add LogScaler transformer Jan 13, 2025
@frances-h frances-h changed the title Add LogScaler transformer Add LogScalar transformer Jan 13, 2025
@frances-h frances-h changed the title Add LogScalar transformer Add LogScaler transformer Jan 13, 2025