
Add Schema Reflection API #211

Merged (19 commits) on Nov 19, 2023

Conversation

@Saeris (Contributor) commented Oct 9, 2023

This PR is a proposed implementation of a schema reflection API, submitted for consideration and refinement.

Adding a reflection API would enable development of third-party utility libraries for Valibot such as a mock generation library similar to @anatine/zod-mock, which is what initially inspired this effort.

At present, it would be difficult if not entirely impossible to gather the metadata about a Valibot schema definition required to generate mock data that would pass schema validation. This is because Valibot's architecture uses anonymous arrow functions as the return values of Validation functions and provides no access to the given Pipe in the Schema object, so many of the required values, such as function arguments and internal variables, cannot be accessed at all once defined.
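
To sketch the problem (illustrative only, assuming the pre-PR minLength implementation):

// Illustration: the closed-over values of a validation function are
// unreachable from the outside.
const check = minLength(5);
check.name;                 // '' (the returned arrow function is anonymous)
(check as any).requirement; // undefined (requirement lives only in the closure)
// Accessing check.arguments throws a TypeError in strict mode.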

In this PR I've added what I believe to be a minimal API which would expose this Schema and Pipe metadata. Let's consider some examples.

First, let's take a look at the Email validation function, which doesn't accept a user-defined requirement:

export function email<TInput extends string>(error?: ErrorMessage) {
  const kind = 'email' as const;
  const requirement =
    /^[\w+-]+(?:\.[\w+-]+)*@[\da-z]+(?:[.-][\da-z]+)*\.[a-z]{2,}$/i;
  const message = error ?? ('Invalid email' as const);
  return Object.assign(
    (input: TInput): PipeResult<TInput> =>
      !requirement.test(input)
        ? getPipeIssues(kind, message, input)
        : getOutput(input),
    {
      kind,
      requirement,
      message,
    }
  );
}

Note that kind, requirement, and message were all extracted to variables to keep the internals DRY, and that Object.assign() is being used to assign each of these values as static properties of the returned validator function.

This exposes two of the most critical pieces of information about a validation check: what kind of check is being performed and what condition needs to be met for success.

Since the function being returned here is anonymous, even if we had access to the Pipe array in a Schema, each Pipe Item wouldn't have a name property to identify what validation was used. Additionally, because of the security limitations of strict mode, this function's arguments property also cannot be accessed.

With this setup, we're safely exposing all of this information in a way that lets these values be read without risk of modifying the runtime behavior of the validation itself.
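
With that in place, a consumer such as a mock generator can read the metadata directly. A minimal sketch (assuming the default message):

// Sketch: reading the metadata attached via Object.assign().
const check = email();
check.kind;        // 'email'
check.requirement; // the RegExp the validator tests against
check.message;     // 'Invalid email' (or the user-supplied error)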

Note

Originally I had considered wrapping this "metadata" object in Object.freeze() to ensure that it was read-only at runtime, but I don't think it provides much benefit for the performance tradeoff.

Let's take a look at one more Validation example:

export function minLength<
  TInput extends string | any[],
  const TRequirement extends number
>(requirement: TRequirement, error?: ErrorMessage) {
  const kind = 'min_length' as const;
  const message = error ?? ('Invalid length' as const);
  return Object.assign(
    (input: TInput): PipeResult<TInput> =>
      input.length < requirement
        ? getPipeIssues(kind, message, input)
        : getOutput(input),
    {
      kind,
      requirement,
      message,
    }
  );
}

User-defined requirements work roughly the same, except for one important detail: const TRequirement extends number. The const modifier on this generic instructs TypeScript to treat the user-supplied value as narrowly as possible. This means that the return type of minLength() will show the user-defined values it was supplied rather than a wider requirement: number type.
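
For example (illustrative):

const check = minLength(5);
// Thanks to the const modifier, check.requirement is typed as the
// literal 5 rather than the wider number type.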

Note

For this initial implementation I wanted to introduce as few changes to the existing types as possible, so there's room for improvement still.

Next let's take a look at a Schema function:

export type StringSchema<TOutput = string> = BaseSchema<string, TOutput> & {
  schema: 'string';
  checks: PipeMeta[];
};

// ...

export function string(
  arg1?: ErrorMessage | Pipe<string>,
  arg2?: Pipe<string>
): StringSchema {
  const [error, pipe] = getDefaultArgs(arg1, arg2);

  return {
    schema: 'string',

    async: false,

    checks: getChecks(pipe),

    _parse(input, info) {
      // ...
    },
  };
}

New here is the addition of the checks property, which always contains an array of PipeMeta objects: the values we attached to each of the Validation functions. For every Schema that accepts a Pipe argument, a checks property was added to the return value to expose metadata about the Pipe array.

Notably, every Transformation was left untouched, so Transformations don't provide any metadata; checks only contains metadata from Validation functions.
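
A sketch of what this looks like from the consumer's side (assuming the default error messages):

// Sketch: checks surfaces Validation metadata from the pipe.
const schema = string([minLength(3), maxLength(10)]);
schema.checks;
// [
//   { kind: 'min_length', requirement: 3, message: 'Invalid length' },
//   { kind: 'max_length', requirement: 10, message: 'Invalid length' }
// ]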

Last, here's another Schema example:

export type ObjectSchema<
  TObjectShape extends ObjectShape,
  TOutput = ObjectOutput<TObjectShape>
> = BaseSchema<ObjectInput<TObjectShape>, TOutput> & {
  schema: 'object';
  object: TObjectShape;
  /* entries: [key: PropertyKey, value: SchemaMeta][]; // Removed */
  checks: PipeMeta[];
};

// ...

export function object<TObjectShape extends ObjectShape>(
  object: TObjectShape,
  arg2?: Pipe<ObjectOutput<TObjectShape>> | ErrorMessage,
  arg3?: Pipe<ObjectOutput<TObjectShape>>
): ObjectSchema<TObjectShape> {
  const [error, pipe] = getDefaultArgs(arg2, arg3);

  let cachedEntries: [string, BaseSchema<any>][];

  return {
    schema: 'object',

    object,

    async: false,

    /* entries: getEntries(object), // Removed */

    checks: getChecks(pipe),

    _parse(input, info) {
        // ...
    },
  };
}

Note

After further exploration writing a mocking library based on this PR, I no longer think the entries prop / SchemaMeta type are necessary, so I've removed them.

(Since removed; see the note above.) For iterable schemas (intersection, object, tuple, and union), there was also a new entries property, which either returned an array of key/value pairs or simply a SchemaMeta[] array. This property helped with introspection on nested schema types.

This is what the PipeMeta object looks like (the removed SchemaMeta type is kept below, commented out, for reference):

export type PipeMeta<
  TKind extends string = string,
  TRequirement = unknown
> = Readonly<{
  kind: TKind;
  message: ErrorMessage;
  requirement?: TRequirement;
}>;

/* Removed
export type SchemaMeta =
  | {
      schema: string;
    }
  | {
      schema: 'literal';
      literal: string | number | bigint | boolean | symbol;
    }
  | {
      schema: 'any' | 'array' | 'map' | 'record' | 'set' | 'special';
      checks: PipeMeta[];
    }
  | {
      schema: 'intersection' | 'union';
      entries: SchemaMeta[];
    }
  | {
      schema: 'tuple';
      entries: SchemaMeta[];
      checks: PipeMeta[];
    }
  | {
      schema: 'object';
      entries: [key: PropertyKey, value: SchemaMeta][];
      checks: PipeMeta[];
    };
   */

These are all the fields which comprise the reflection API. I think there's room for some further simplification and improvements to the type definitions. Names for properties/types aren't set in stone, so feedback on those is welcomed and expected.

In this first pass, I added a bunch of new tests to most of the affected schemas to validate the property additions and the utility functions getChecks() and getEntries().

There may be a few things I missed or haven't considered too. My goal here was adding just enough information about schemas to feed into a library such as Faker to generate mock data for tests, which is what I'm currently doing for my Zod-based schemas in @discordkit/client. I can see this being useful for other use cases as well, such as JSON Schema or OpenAPI spec code generation tools.

Finally I want to acknowledge that I understand that this project is part of your Bachelor's thesis and that there wasn't a contributing guide that I could find, so I understand if you don't have much time for reviewing PRs or would prefer to pursue your own implementation. This was a fun weekend project for me and at the end of the day I'm just hoping this is more inspiring than just opening another feature request issue.

@fabian-hiller (Owner) commented:

Thanks a lot for this PR. Unfortunately, I don't have time to read everything right now. However, this comment I wrote this morning seems to be related. As soon as I find the time, I will read everything carefully and answer you in detail.

@Saeris (Contributor Author) commented on this diff (from the ip validation):

  : getOutput(input);
const kind = 'ip' as const;
const requirement =
/^((((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4})|((([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))))$/;

Here I combined the two regular expressions into a single value. It's a little hard to read, but essentially it just introduces an | (OR) between the two so that this can be exposed as a single requirement value instead of two.
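
The same idea in simplified, illustrative form (these are not the actual patterns):

// An alternation merges two requirements into a single value.
const ipv4 = /^\d{1,3}(\.\d{1,3}){3}$/;
const ipv6 = /^[\da-f:]+$/i;
const requirement = new RegExp(`(${ipv4.source})|(${ipv6.source})`, 'i');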

@Saeris (Contributor Author) commented Oct 11, 2023

Just wanted to pop in with an update:

Since opening this I started working on a refactor of @anatine/zod-mock to work with this new API and I have some great coverage of most types already!

In working on this I realized that the entries property isn't necessary and so after a little more testing with this mocking library implementation, I'll probably simplify this PR a bit to remove the unnecessary bits.

@fabian-hiller (Owner) commented:

Thanks a lot for your contribution! I will answer you by the weekend at the latest.

@Saeris (Contributor Author) commented Oct 12, 2023

No rush!

@fabian-hiller (Owner) commented:

Can you comment on my thoughts in this comment? Instead of a function with properties, what do you think about choosing a structure similar to the schemas using an object with a ._parse method?

@fabian-hiller (Owner) commented:

This is my answer on #191 (comment)

If a validation function returns an object with ._parse, this has the advantage that the implementation and structure are uniform with our schemas. The other advantage I see is that .checks is not needed this way, and therefore we can speed up the initialization of a schema. Instead of .checks I would add the property .pipe, which is one-to-one the same as the pipeline passed as an argument. This way no additional calculations are needed. .pipe has almost the same content as .checks. Basically, we have the same idea, just with two different implementations.

According to my idea, minLength would then be implemented as follows:

export function minLength<TInput extends string | any[]>(
  requirement: number,
  message: ErrorMessage = 'Invalid length'
): MinLengthPipe<TInput> {
  return {
    kind: 'validation',
    name: 'min_length',
    requirement,
    message,
    _parse(input) {
      return input.length < requirement
        ? getPipeIssues('min_length', message, input)
        : getOutput(input);
    },
  };
}

Calling string([minLength(10)]) would then return the following data structure:

const schema = {
  kind: 'schema',
  name: 'string',
  async: false,
  pipe: [{
    kind: 'validation',
    name: 'min_length',
    requirement: 10,
    message: 'Invalid length',
    _parse(input) {
      // ...
    },
  }],
  _parse(input, info) {
    // ...
  },
};

This is just a first idea. I have not made a decision yet and am open to any ideas.

@Saeris (Contributor Author) commented Oct 13, 2023

I think I'd rather go the opposite direction, get rid of _parse() entirely and adopt the Object.assign() format across the library.

One thing I'm concerned about is that _parse doesn't compress well because it's a property name. If the default behavior of each of the Schemas, Transforms, Validations, etc. was to be invoked as a function, that would shave additional bytes off a compiled app using Valibot. Additional space savings could be had by aliasing Object.assign() behind a utility function, let's call it assign(). However, I'm not an expert on this subject; I just know that mangling property names is considered dangerous.

There's also the question of how much refactoring either approach would require. It seems to me that your proposal would necessitate additional refactors to every Transform to keep the Pipe API consistent. In setting out with this PR, I aimed to be purely additive and not introduce breaking changes to the existing API, so if possible I'd like to keep the scope of changes here to a minimum.


I did some reading on JSON Schema to get a better sense of how this might relate to #23, and I don't think it's a good idea to tightly tie the metadata properties of Schemas, Validations, Transforms, etc. to the JSON Schema format; we should instead pursue writing a schema serializer. You can read up on the JSON Schema spec here.

What is important to keep in mind, I think, is to avoid the use of type, title, and description in the metadata for these functions, as they are keywords with specific meaning inside JSON Schema definitions. This also relates to the naming of various validation functions, as they could have 1:1 matches with the validation keywords that JSON Schema uses.

One such example is maxLength, which in Valibot applies to both strings and arrays, but which JSON Schema divides into maxLength and maxItems. How to align those sorts of differences might be better explored in a different PR, but in theory a serializer can take the existing validations and translate them to the appropriate JSON Schema validation constraints. In either case, a serializer would need to work on a Schema and translate its pipe into the appropriate validation keywords. I would not attempt to set those properties on schemas themselves, as I think it would just make them heavier.
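
Illustratively, a serializer could branch on the parent schema when mapping a check (checkToJsonSchema is a hypothetical name, not part of this PR):

// Hypothetical: map a Valibot check to the equivalent JSON Schema keyword.
function checkToJsonSchema(schemaName: string, check: PipeMeta): object {
  if (check.kind === 'max_length') {
    return schemaName === 'string'
      ? { maxLength: check.requirement }
      : { maxItems: check.requirement };
  }
  // ... handle the remaining kinds
  return {};
}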


I think I could get behind exposing the entire pipe instead of checks, but my concern is that because Pipe can be either a Validation or a Transform, the user will then have to filter pipe on their end. That wouldn't be that difficult to do, just something worth considering.

However, I don't think repurposing kind here to distinguish between schema, validation, transform etc is necessary, because that can be inferred via the shape of the return type. TypeScript on its own can already tell us what we're working with because for example, all schema generators return a type that extends BaseSchema. To me this doesn't seem important at runtime, when really this is metadata that is mainly useful at authoring time.


For now, this is where my thoughts are on where to go from here.

tl;dr: I'd prefer to keep this PR narrowly scoped and save larger refactors for another PR, and we should be mindful of the tradeoffs at hand.

@fabian-hiller (Owner) commented Oct 13, 2023

One thing I'm concerned about is that _parse doesn't compress well...

I think ._parse can be compressed very well, because the word "parse" also appears in the parse and safeParse methods, and these are needed to use a schema. Also, I suspect it takes less bytes than Object.assign(). In practice, however, the difference will be so small that it will not matter in the end, and it is not a deciding point for me.

There's also to consider how much refactoring would be required by either approach...

We won't be able to prevent breaking changes while we're still in v0 and working on major features. I don't care about the effort either, because the best end product is what matters to me.

What is important to keep in mind, I think, is to avoid the use of type, title, and description in the metadata for these functions, as they are keywords with specific meaning inside of JSON schema definitions...

Why would you consider the JSON schema definition? It doesn't matter for Valibot and doesn't play a role, or am I misunderstanding something here?

One such example is maxLength, which in Valibot applies to both strings and arrays, but in JSON schema...

I would develop Valibot completely independent of the JSON Schema definition, since serialization is not a relevant feature of Valibot, as it negates Valibot's advantages in terms of bundle size. If serialization and JSON Schema are important, I recommend using TypeBox instead of Valibot.

I think I could get behind exposing the entire pipe instead of checks, but my concern is that because Pipe can be either a Validation or a Transform, the user will then have to filter pipe on their end. That wouldn't be that difficult to do, just something worth considering.

Yes, I agree, we should take that into account. However, since transformations are part of the Valibot pipeline, developers working with Valibot should be aware of this anyway. Exposing the pipeline as .pipe has the additional advantage that the pipeline can be modified afterwards, for example with methods like removePipe or overwritePipe.
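
A rough sketch of what such a helper could look like (removePipe and overwritePipe are only ideas at this point, not implemented APIs):

// Hypothetical: swap out a schema's pipeline after creation.
function overwritePipe<TSchema extends { pipe: unknown[] }>(
  schema: TSchema,
  pipe: TSchema['pipe']
): TSchema {
  return { ...schema, pipe };
}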

However, I don't think repurposing kind here to distinguish between schema, validation, transform etc...

Yes, with a schema it is not necessarily required. But for the pipeline I think it is, since transformations do not have .requirement and .message. Or do you have an alternative idea for this?


For now, the disadvantages of Object.assign() outweigh an object with ._parse in my opinion:

  • The code is harder for beginners to understand in usage and implementation
  • If you can call a schema directly, we can't indicate that it's an internal API like we can with the underscore in ._parse
  • Also, the difference between schema(input) and safeParse(schema, input) is unclear, which makes the API more complicated for users
  • Object.assign() takes more CPU time than creating a simple JavaScript object
  • Can TypeScript correctly type Object.assign()? If not, this is another drawback.

@fabian-hiller self-assigned this on Oct 13, 2023
@fabian-hiller added the enhancement (New feature or request) and priority (This has priority) labels on Oct 13, 2023
@Saeris (Contributor Author) commented Oct 14, 2023

Please take a look over the latest commit. I went and refactored everything to use a uniform return shape with a _parse() method.

I was going to continue advocating for Object.assign(), but I spent some time this morning researching the performance impact it could have, and while there might be one when assigning a lot of properties or executing frequently, I figured it wasn't worth the risk. I did, however, do a refactor last night with this implementation if you want to compare to this branch: https://github.com/Saeris/valibot/tree/feat/return-parse-function-from-schemas (there were some test errors with comparable() that I punted on resolving; the code works fine otherwise).

Important to note, still, is that _parse and every other property name is unchanged by minification, whereas function names are mangled. You can run tsup with minify: true to compare the outputs and observe this behavior.


As far as my JSON schema comments go, you can ignore those. I assumed because of #23 that there was desire to support it somehow with this PR.


For reference, Object.assign() works just fine with TypeScript. Under the hood, it takes the types of each argument and returns an intersection of all of them. You can see for yourself in your IDE that the types come through exactly as expected.
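
For example:

const fn = Object.assign((input: string) => input, { kind: 'email' as const });
// typeof fn: ((input: string) => string) & { kind: 'email' }
fn('test'); // still callable as before
fn.kind;    // and the metadata is statically typed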


Yes, with a schema it is not necessarily required. But for the pipeline I think it is, since transformations do not have .requirement and .message. Or do you have an alternative idea for this?

I think it's rather simple to pattern match on properties unique to Schema, Validation, or Transform to determine which type an item is. getChecks() uses a utility function that does exactly this for Validations: async is unique to Schema, message is unique to Validation, etc.
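
A sketch of such structural checks (hypothetical guards, not the PR's exact utilities):

// Hypothetical type guards based on properties unique to each shape.
const isSchema = (item: object): boolean => 'async' in item;
const isValidation = (item: object): boolean => 'message' in item && !('async' in item);
const isTransformation = (item: object): boolean => !isSchema(item) && !isValidation(item);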

Calling a night on contributions for now. Will catch up on any feedback tomorrow.

@fabian-hiller (Owner) commented:

Thank you for your contribution. I'll try to look at it tomorrow and think about it. I expect to give you feedback next week, with the goal of merging the improvements in the following days.

Note

The @internal TSDoc comment signals that a type is for internal use only. TypeScript can strip these types from the emitted type definitions to hide these internal APIs from third-party developers.
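
For illustration, assuming the stripInternal compiler option is enabled (the type name here is made up):

// tsconfig.json: { "compilerOptions": { "declaration": true, "stripInternal": true } }

/** @internal */
export type InternalPipeState = {
  // this declaration is omitted from the emitted .d.ts files
};
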
@Saeris marked this pull request as ready for review on October 14, 2023.
@fabian-hiller (Owner) commented:

Short update: I'll try to review everything tomorrow or on the weekend. The goal is to release a new minor version next week.

@fabian-hiller (Owner) commented:

I have started to investigate and improve the code. I'll let you know as soon as I'm further along.

@fabian-hiller (Owner) commented:

I expect to share my changes here in the middle or end of this week so we can discuss the final details.

@fabian-hiller (Owner) commented Nov 15, 2023

@Saeris and @lo1tuma can you review my changes?

@Saeris (Contributor Author) commented Nov 15, 2023

I took a look over your latest commits (da3ca63, 775d539 and 5838643) and everything looks good to me.

A concern I have with _parse always reading values like pipe from the returned object is that they are no longer statically analyzable, because runtime logic could change their values. But that would be an exception rather than the norm, so I suppose it's okay; it's really a "don't change this unless you know what you are doing" kind of feature. Generally, I don't like this kind of unpredictability in something like a data validation schema, but then it's also not a feature I see myself using.

But otherwise, I like exposing the error messages as well as the validation pipelines. That should help with form input validation.

I'm good to ship this if you are! Thanks for all the effort in reviewing @fabian-hiller!

@fabian-hiller (Owner) commented:

Thanks a lot for the work you put into this PR! I will wait 1 to 3 days and then do the final review to merge the changes.

@lo1tuma (Contributor) commented Nov 16, 2023

I had a brief look at the changes. I still think it's a bummer that requirement itself is not typed. If this is only a first iteration and there are plans to improve the precision of its type, I'm fine with it. But as it is right now, I would probably not use it.

@fabian-hiller (Owner) commented Nov 16, 2023

@lo1tuma can you describe in detail what you mean (e.g. with some sample code) and how you would improve it?

@lo1tuma (Contributor) commented Nov 17, 2023

What I mean is that on the Issue type, requirement is defined as an optional unknown. So when I want to analyze an issue like this:

if (issue.type === 'starts_with') {
    someStringOperation(issue.requirement); // requirement is unknown, but I would expect string
}

I would need to do further runtime checks to verify what type requirement has. Ideally, Issue could be a discriminated union, but right now I think that is hard to implement because we also have custom validations and therefore arbitrary values of issue.type. One idea would be to not allow custom validations to overwrite type and instead give issue.type a known value in this case, for example 'custom'.
I think this is a bit tricky to solve and there might be other solutions, so maybe it is better to do it as a follow-up.
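
A sketch of what such a discriminated union could look like (hypothetical follow-up material, not part of this PR):

// Hypothetical: narrow requirement's type via a discriminant on type.
type Issue =
  | { type: 'starts_with'; requirement: string }
  | { type: 'min_length'; requirement: number }
  | { type: 'custom'; requirement?: unknown };

function handle(issue: Issue) {
  if (issue.type === 'starts_with') {
    issue.requirement.toUpperCase(); // narrowed to string
  }
}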

@fabian-hiller (Owner) commented:

Yes, I agree that we should address this in a follow-up. I plan to integrate Valibot more into Modular Forms and will probably find a lot of necessary improvements to Valibot along the way.

@fabian-hiller (Owner) commented:

I will think about your feedback and probably create an issue after this one is merged so that we do not forget about this in the long run.

@fabian-hiller (Owner) commented Nov 19, 2023

Thanks again for creating and contributing to this PR!

@fabian-hiller merged commit db57a67 into fabian-hiller:main on Nov 19, 2023. 10 checks passed.
@ciscoheat (Contributor) commented:

This looks very promising, especially for my idea that Superforms v2 should be able to use most validation libraries through typeschema, with more or less complete support for each library's validation format. Zod is the benchmark, being fully introspectable.

I would be happy to use Valibot as the test library for experimentation, so is there an example of how to use the Reflection API? This is how I do it with Zod: https://github.com/ciscoheat/sveltekit-superforms/blob/main/src/lib/schemaEntity.ts#L293-L358

It's not difficult once the metadata is easily accessible. So an example of that, together with how to access some other properties like nullable, optional, and default value, would get me very far.

@fabian-hiller (Owner) commented:

I think the code for Valibot is very similar to the code for Zod. The biggest difference will be that you extract all the details of a data type from the pipeline. I have created an issue so that we add a guide on this to the documentation in the long run.
