feat: add mask segments (#385)
* feat(segment): create empty mask

* feat(segment): implement mask

* docs(segment): changelog and readme
adrienaury authored Jan 15, 2025
1 parent f2acba3 commit 1873072
Showing 8 changed files with 242 additions and 4 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -16,7 +16,8 @@ Types of changes

## [1.30.0]

- `Added` mask `partition` to handle fields containing different types of values by applying distinct transformations
- `Added` mask `partitions` to handle fields containing different types of values by applying distinct transformations
- `Added` mask `segments` to allow transformations on specific parts of a field's value using regular expressions to capture subgroups

## [1.29.1]

26 changes: 26 additions & 0 deletions README.md
@@ -166,6 +166,7 @@ The following types of masks can be used :
* [`pipe`](#pipe) is a mask to handle complex nested array structures, it can read an array as an object stream and process it with a sub-pipeline.
* [`apply`](#apply) process selected data with a sub-pipeline.
* [`partitions`](#partitions) will rely on conditions to identify specific cases.
* [`segments`](#segments) allows transformations on specific parts of a field's value, using the capture groups of a regular expression.
* [`luhn`](#luhn) can generate valid numbers using the Luhn algorithm (e.g. french SIRET or SIREN).
* [`markov`](#markov) can generate pseudo text based on a sample text.
* [`findInCSV`](#findincsv) get one or multiple csv lines which matched with Json entry value from CSV files.
@@ -1099,6 +1100,31 @@ The partition mask will rely on conditions to identify specific cases and apply

[Return to list of masks](#possible-masks)

### Segments

[![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFNMkkBaJCEAGxAGMAXGMcq7pAKwngAHXKwAWyFFAAmWHnkJcedECWwg4rCIqVIwKkAA8JAPQAKACgD8pgDxNWrcBAB8AbQCC1AFoBdAN4AzAC+AJRWtlJQJFCabgAM1ACc-sEhACSyOrogggy4zCDaWfaOkEVZNEgAZlVo5RXcBCAAngBiYDDYAKJwwBKtrWgA+l0AcgDCAEoAmqYAKgCSAPKjQwDSXdOZDUpSnbjEEu4AQuMAIl2tAOIAEgsAUmsAMgCyo0umAIqTAMpzAKoANQA6gANaZebZZSLRTT1HS0Gp1Sg7JRNNodbq9fqDEYTGbzZarDZbFGo7h7PCHVBxNAAJgCABYAKwANgA7AAORJYIA&i=N4KABGBECWAmkC4oAUCCAhAwgRgEwGZIQBfIA)

The segments mask allows transformations on specific parts of a field's value. It uses a regular expression with named capture groups to split the value, then applies a separate list of masks to each captured group. Example configuration:

```yaml
  - selector:
      jsonpath: "id"
    mask:
      segments:
        regex: "^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$"
        replace:
          letters:
            - ff1:
                keyFromEnv: "FF1_ENCRYPTION_KEY"
                domain: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          digits:
            - ff1:
                keyFromEnv: "FF1_ENCRYPTION_KEY"
                domain: "0123456789"
```

[Return to list of masks](#possible-masks)

### FindInCSV

[![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFMkkBaJCEAGxAGMAXGMcyrpAKwngAOuFgAtkKYgDMYWbnkIRO3aklwATNUnGzlNScTUBJOAGEAygDUlyrgFcwUcSJYsBigPTuSUCCwB03qK2AEa2dGBM8CwgcP6R2O64YNje9IwQ7mgAnAAswUySkgDMAKwADGVoIADsIMElRbgATLgAHME5OVXtTUwAbO5guADu7llNTRX5Zbh91UVqJUwgTWhoM7jqOSWdIGp9TDmVZZJ9rdXuAjAEINjwfkwQwDo2lCAAHrisALLCTGIUV7cR7AZAAcgA3hCABQGD5IPyoAAqAE8BCAkBgAJRIAA+SHoMGG4CQAF9SWDAUC3rEwCjxFC-Cw0SAAPpockvV48L5MJJqazUkHgxkAOVw2Ax+MJxLAZIpVOpMRYdIZEL8cAlUpl4E51K4ipsH3RrD24mEVEY+BYVHgIC5NhEIHU4GQKtsIENyhVUGwbrAHswQA&i=N4KABGBEAuCeAOBTA+gRkgLigMwJYCdFIAacKAOwEMBbIrSAY0v1vIBNF9IQBfIA)
2 changes: 2 additions & 0 deletions internal/app/pimo/pimo.go
@@ -58,6 +58,7 @@ import (
"github.com/cgi-fr/pimo/pkg/regex"
"github.com/cgi-fr/pimo/pkg/remove"
"github.com/cgi-fr/pimo/pkg/replacement"
"github.com/cgi-fr/pimo/pkg/segment"
"github.com/cgi-fr/pimo/pkg/sequence"
"github.com/cgi-fr/pimo/pkg/sha3"
"github.com/cgi-fr/pimo/pkg/statistics"
@@ -345,6 +346,7 @@ func injectMaskFactories() []model.MaskFactory {
sha3.Factory,
apply.Factory,
partition.Factory,
segment.Factory,
}
}

8 changes: 7 additions & 1 deletion pkg/model/model.go
@@ -247,6 +247,11 @@ type PartitionType struct {
Then []MaskType `yaml:"then" json:"then" jsonschema_description:"list of masks to execute if the condition is active"`
}

type SegmentType struct {
	Regex   string                `yaml:"regex" json:"regex" jsonschema_description:"regex used to create segments using group captures, groups must be named"`
	Replace map[string][]MaskType `yaml:"replace" json:"replace" jsonschema_description:"list of masks to execute for each group"`
}

type MaskType struct {
Add Entry `yaml:"add,omitempty" json:"add,omitempty" jsonschema:"oneof_required=Add,title=Add Mask,description=Add a new field in the JSON stream"`
AddTransient Entry `yaml:"add-transient,omitempty" json:"add-transient,omitempty" jsonschema:"oneof_required=AddTransient,title=Add Transient Mask" jsonschema_description:"Add a new temporary field, that will not show in the JSON output"`
@@ -286,7 +291,8 @@ type MaskType struct {
Sequence SequenceType `yaml:"sequence,omitempty" json:"sequence,omitempty" jsonschema:"oneof_required=Sequence,title=Sequence Mask" jsonschema_description:"Generate a sequenced ID that follows specified format"`
Sha3 Sha3Type `yaml:"sha3,omitempty" json:"sha3,omitempty" jsonschema:"oneof_required=Sha3,title=Sha3 Mask" jsonschema_description:"Generate a variable-length crytographic hash (collision resistant)"`
Apply ApplyType `yaml:"apply,omitempty" json:"apply,omitempty" jsonschema:"oneof_required=Apply,title=Apply Mask" jsonschema_description:"Call external masking file"`
Partition []PartitionType `yaml:"partitions,omitempty" json:"partitions,omitempty" jsonschema:"oneof_required=Partition,title=Partition Mask" jsonschema_description:"Identify specific cases and apply a defined list of masks for each case"`
Partition []PartitionType `yaml:"partitions,omitempty" json:"partitions,omitempty" jsonschema:"oneof_required=Partition,title=Partitions Mask" jsonschema_description:"Identify specific cases and apply a defined list of masks for each case"`
Segment SegmentType `yaml:"segments,omitempty" json:"segments,omitempty" jsonschema:"oneof_required=Segment,title=Segments Mask" jsonschema_description:"Allow transformations on specific parts of a field's value"`
}

type Masking struct {
2 changes: 1 addition & 1 deletion pkg/partition/partition.go
@@ -94,7 +94,7 @@ func execPipeline(pipeline model.Pipeline, e model.Entry) (model.Entry, error) {
}

func (me MaskEngine) Mask(e model.Entry, context ...model.Dictionary) (model.Entry, error) {
log.Info().Msg("Mask partition")
log.Info().Msg("Mask partitions")

// exec all partitions
for _, partition := range me.partitions {
138 changes: 138 additions & 0 deletions pkg/segment/segment.go
@@ -0,0 +1,138 @@
package segment

import (
	"hash/fnv"
	"regexp"
	"strings"
	tmpl "text/template"

	"github.com/cgi-fr/pimo/pkg/model"
	"github.com/rs/zerolog/log"
)

type MaskEngine struct {
	re        *regexp.Regexp
	pipelines map[string]model.Pipeline
	seed      int64
	seeder    model.Seeder
}

func buildDefinition(masks []model.MaskType, globalSeed int64) model.Definition {
	definition := model.Definition{
		Version:   "1",
		Seed:      globalSeed,
		Functions: nil,
		Masking:   []model.Masking{},
		Caches:    nil,
	}

	for _, mask := range masks {
		definition.Masking = append(definition.Masking, model.Masking{
			Selector: model.SelectorType{Jsonpath: "."},
			Mask:     mask,
		})
	}

	return definition
}

// NewMask returns a MaskEngine built from a segment configuration
func NewMask(segment model.SegmentType, caches map[string]model.Cache, fns tmpl.FuncMap, seed int64, seeder model.Seeder, seedField string) (MaskEngine, error) {
	// an invalid regex is reported as an error rather than a panic
	re, err := regexp.Compile(segment.Regex)
	if err != nil {
		return MaskEngine{}, err
	}

	pipelines := map[string]model.Pipeline{}

	// build one sub-pipeline per named group
	for groupname, masks := range segment.Replace {
		definition := buildDefinition(masks, seed)
		pipeline := model.NewPipeline(nil)
		pipeline, _, err = model.BuildPipeline(pipeline, definition, caches, fns, "", "")
		if err != nil {
			return MaskEngine{}, err
		}

		pipelines[groupname] = pipeline
	}

	return MaskEngine{
		re:        re,
		pipelines: pipelines,
		seed:      seed,
		seeder:    seeder,
	}, nil
}

// replace rewrites the named groups captured in `value` using the functions of the `replacements` map.
// Text outside the groups is copied as-is; a captured group with no entry in `replacements` is not written to the result.
func replace(value string, re *regexp.Regexp, replacements map[string]func(string) (string, error)) (string, error) {
	result := &strings.Builder{}

	matchIndexes := re.FindStringSubmatchIndex(value)
	groupNames := re.SubexpNames()

	writeCount := 0
	for i := 2; i < len(matchIndexes); i += 2 {
		groupNumber := i / 2
		groupName := groupNames[groupNumber]
		startIndex := matchIndexes[i]
		endIndex := matchIndexes[i+1]
		if startIndex < 0 {
			// the group did not participate in the match (e.g. an optional group)
			continue
		}
		capturedValue := value[startIndex:endIndex]

		result.WriteString(value[writeCount:startIndex])
		writeCount = endIndex

		if replacement, exists := replacements[groupName]; exists {
			masked, err := replacement(capturedValue)
			if err != nil {
				return value, err
			}
			result.WriteString(masked)
		}
	}
	result.WriteString(value[writeCount:])

	return result.String(), nil
}

func (me MaskEngine) Mask(e model.Entry, context ...model.Dictionary) (model.Entry, error) {
	log.Info().Msg("Mask segments")

	replacements := map[string]func(string) (string, error){}

	// each named group is masked by running its captured text through the group's sub-pipeline
	for groupname, pipeline := range me.pipelines {
		replacements[groupname] = func(match string) (string, error) {
			var result []model.Entry
			err := pipeline.
				WithSource(model.NewSourceFromSlice([]model.Dictionary{model.NewDictionary().With(".", match)})).
				AddSink(model.NewSinkToSlice(&result)).
				Run()
			if err != nil {
				return match, err
			}
			return result[0].(string), nil
		}
	}

	result, err := replace(e.(string), me.re, replacements)
	if err != nil {
		return e, err
	}

	return result, nil
}

// Factory creates a mask from a configuration
func Factory(conf model.MaskFactoryConfiguration) (model.MaskEngine, bool, error) {
	if len(conf.Masking.Mask.Segment.Regex) > 0 {
		seeder := model.NewSeeder(conf.Masking.Seed.Field, conf.Seed)

		// set different seeds for different jsonpaths
		h := fnv.New64a()
		h.Write([]byte(conf.Masking.Selector.Jsonpath))
		conf.Seed += int64(h.Sum64()) //nolint:gosec
		mask, err := NewMask(conf.Masking.Mask.Segment, conf.Cache, conf.Functions, conf.Seed, seeder, conf.Masking.Seed.Field)
		if err != nil {
			return mask, true, err
		}
		return mask, true, nil
	}
	return nil, false, nil
}
37 changes: 36 additions & 1 deletion schema/v1/pimo.schema.json
@@ -590,6 +590,12 @@
"partitions"
],
"title": "Partition"
},
{
"required": [
"segments"
],
"title": "Segment"
}
],
"properties": {
@@ -790,8 +796,13 @@
"$ref": "#/$defs/PartitionType"
},
"type": "array",
"title": "Partition Mask",
"title": "Partitions Mask",
"description": "Identify specific cases and apply a defined list of masks for each case"
},
"segments": {
"$ref": "#/$defs/SegmentType",
"title": "Segments Mask",
"description": "Allow transformations on specific parts of a field's value"
}
},
"additionalProperties": false,
@@ -1030,6 +1041,30 @@
"additionalProperties": false,
"type": "object"
},
"SegmentType": {
"properties": {
"regex": {
"type": "string",
"description": "regex used to create segments using group captures, groups must be named"
},
"replace": {
"additionalProperties": {
"items": {
"$ref": "#/$defs/MaskType"
},
"type": "array"
},
"type": "object",
"description": "list of masks to execute for each group"
}
},
"additionalProperties": false,
"type": "object",
"required": [
"regex",
"replace"
]
},
"SelectorType": {
"properties": {
"jsonpath": {
30 changes: 30 additions & 0 deletions test/suites/masking_segment.yml
@@ -0,0 +1,30 @@
name: segment mask
testcases:
  - name: simple segmentation test
    steps:
      - script: |-
          cat > masking.yml <<EOF
          version: "1"
          seed: 42
          masking:
            - selector:
                jsonpath: "id"
              mask:
                segments:
                  regex: "^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$"
                  replace:
                    letters:
                      - ff1:
                          keyFromEnv: "FF1_ENCRYPTION_KEY"
                          domain: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    digits:
                      - ff1:
                          keyFromEnv: "FF1_ENCRYPTION_KEY"
                          domain: "0123456789"
          EOF
      - script: |-
          echo '{"id": "PABC123"}' | FF1_ENCRYPTION_KEY="70NZ2NWAqk9/A21vBPxqlA==" pimo
        assertions:
          - result.code ShouldEqual 0
          - result.systemoutjson.id ShouldEqual PVBR675
          - result.systemerr ShouldBeEmpty
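
The expected value `PVBR675` keeps the shape of the input `PABC123`: FF1 is a format-preserving encryption mode, so each masked segment stays within its configured domain and keeps its length. Here is a small Go sketch of that property check, using only the values from the test case. The helper `sameFormat` is hypothetical and only checks the shape; it does not perform FF1 encryption.

```go
package main

import (
	"fmt"
	"regexp"
)

// idFormat is the pattern used in the test's masking.yml.
var idFormat = regexp.MustCompile(`^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$`)

// sameFormat (hypothetical helper) reports whether both the original and the
// masked identifier match idFormat and have the same length.
func sameFormat(original, masked string) bool {
	return idFormat.MatchString(original) &&
		idFormat.MatchString(masked) &&
		len(original) == len(masked)
}

func main() {
	fmt.Println(sameFormat("PABC123", "PVBR675")) // values from the test case: prints "true"
	fmt.Println(sameFormat("PABC123", "PVBR67"))  // truncated output: prints "false"
}
```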
