feat: parquet support #334

Merged 22 commits, Nov 20, 2024

1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ Types of changes

## [1.28.0]

- `Added` new action for parquet files (experimental feature)
- `Added` mock command to intercept HTTP requests/responses to a web service and apply maskings
- `Added` time in JSON logs generated by `--log-json` flag

71 changes: 71 additions & 0 deletions README.md
@@ -1439,6 +1439,77 @@ After executing the command with the correct configuration, here is the expected

[Return to list of masks](#possible-masks)

### Parsing Parquet files

Warning: Parquet support is still experimental. We are considering migrating this feature to a new dataconnector type in LINO, so it might be dropped from PIMO in a future release.

To mask data in a Parquet file, pass the input file, the output file, and a masking configuration to the `parquet` command:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

#### Example

Assume the Parquet file `data.parquet` has the following table structure:

| agency | agency_number | name | account_type | account_number | annual_income |
|--------------|---------------|--------|--------------|----------------|---------------|
| NewYork | 0032 | Doe | classic | 12345 | 50000 |
| SanFrancisco | 7894 | Smith | saving | 67890 | 60000 |

#### Masking Configuration (`masking.yml`)

```yaml
version: "1"
seed: 42

masking:
  - selector:
      jsonpath: "agency_number" # mask agency_number column
    mask:
      template: '{{MaskRegex "[0-9]{4}$"}}'

  - selector:
      jsonpath: "name" # mask name column
    mask:
      randomChoiceInUri: "pimo://nameFR"

  - selector:
      jsonpath: "account_type" # mask account_type column
    mask:
      randomChoice:
        - "classic"
        - "saving"
        - "securitie"

  - selector:
      jsonpath: "account_number" # mask account_number column
    masks:
      - incremental:
          start: 1
          increment: 1
      - template: "{{.account_number}}"
```

#### Resulting Masked Parquet File

After executing the command:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

The `maskedData.parquet` file will contain the following masked data:

| agency | agency_number | name | account_type | account_number | annual_income |
|--------------|---------------|----------|--------------|----------------|---------------|
| NewYork | 2308 | Rolande | saving | 1 | 50000 |
| SanFrancisco | 9724 | Matéo | securitie | 2 | 60000 |

This example demonstrates how to mask specific columns using PIMO, applying random choices, regular expressions, and incremental masking.
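
As a rough mental model of the incremental and random-choice rules used in this example, the following plain Go sketch applies per-column masks to in-memory rows. It is not PIMO's implementation: the `Row` and `Mask` types and the `incremental`, `randomChoice`, and `apply` functions are hypothetical names introduced only for illustration, and the seeded generator here will not reproduce the values PIMO picks for the same seed.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Row is a hypothetical stand-in for one Parquet record, keyed by column name.
type Row map[string]string

// Mask is a hypothetical function type returning the replacement value for a column.
type Mask func(row Row) string

// incremental returns a mask that yields start, start+step, start+2*step, ...
func incremental(start, step int) Mask {
	next := start
	return func(Row) string {
		v := next
		next += step
		return fmt.Sprint(v)
	}
}

// randomChoice returns a mask that picks one of choices from a seeded source,
// so the same seed always produces the same sequence of picks.
func randomChoice(seed int64, choices ...string) Mask {
	rng := rand.New(rand.NewSource(seed))
	return func(Row) string {
		return choices[rng.Intn(len(choices))]
	}
}

// apply runs each column's mask over every row and leaves other columns untouched.
func apply(rows []Row, masks map[string]Mask) {
	for _, row := range rows {
		for col, m := range masks {
			row[col] = m(row)
		}
	}
}

func main() {
	rows := []Row{
		{"agency": "NewYork", "account_type": "classic", "account_number": "12345", "annual_income": "50000"},
		{"agency": "SanFrancisco", "account_type": "saving", "account_number": "67890", "annual_income": "60000"},
	}
	apply(rows, map[string]Mask{
		"account_number": incremental(1, 1),
		"account_type":   randomChoice(42, "classic", "saving", "securitie"),
	})
	for _, r := range rows {
		fmt.Println(r["agency"], r["account_type"], r["account_number"], r["annual_income"])
	}
}
```

Columns without a mask, such as `agency` and `annual_income`, pass through unchanged, which matches the result table above.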

[Return to list of masks](#possible-masks)

## `pimo://` scheme

24 changes: 24 additions & 0 deletions cmd/pimo/main.go
@@ -74,6 +74,8 @@ var (
	serve             string
	maxBufferCapacity int
	profiling         string
	parquetInput      string
	parquetOutput     string
)

func main() {
@@ -187,6 +189,26 @@ There is NO WARRANTY, to the extent permitted by law.`, version, commit, buildDa
	xmlCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
	rootCmd.AddCommand(xmlCmd)

	// Add command for parquet transformer
	parquetCmd := &cobra.Command{
		Use:   "parquet input_parquet_file output_parquet_file",
		Short: "Parsing and masking a parquet file",
		Args:  cobra.ExactArgs(2),
		Run: func(cmd *cobra.Command, args []string) {
			initLog()
			if len(catchErrors) > 0 {
				skipLineOnError = true
				skipLogFile = catchErrors
			}
			parquetInput = args[0]
			parquetOutput = args[1]

			run(cmd)
		},
	}
	parquetCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
	rootCmd.AddCommand(parquetCmd)

	rootCmd.AddCommand(&cobra.Command{
		Use: "flow",
		Run: func(cmd *cobra.Command, args []string) {
@@ -254,6 +276,8 @@ func run(cmd *cobra.Command) {
CachesToDump: cachesToDump,
CachesToLoad: cachesToLoad,
XMLCallback: len(serve) > 0,
ParquetInput: parquetInput,
ParquetOutput: parquetOutput,
}

var pdef model.Definition
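
The `parquet` command above relies on `cobra.ExactArgs(2)` to reject any invocation that does not supply exactly an input path and an output path, then copies the two positional arguments into `ParquetInput` and `ParquetOutput`. Here is a minimal stdlib-only sketch of that validation and wiring. The `parquetPaths` type and the `parseParquetArgs` function are hypothetical names used only for illustration, and the error text is made up rather than cobra's. The PR itself delegates the check to cobra.

```go
package main

import "fmt"

// parquetPaths is a hypothetical holder mirroring the ParquetInput and
// ParquetOutput fields added to the config in this PR.
type parquetPaths struct {
	Input  string
	Output string
}

// parseParquetArgs accepts exactly two positional arguments,
// the input file followed by the output file.
func parseParquetArgs(args []string) (parquetPaths, error) {
	if len(args) != 2 {
		return parquetPaths{}, fmt.Errorf("expected 2 arguments (input and output file), got %d", len(args))
	}
	return parquetPaths{Input: args[0], Output: args[1]}, nil
}

func main() {
	// Fixed arguments keep the sketch runnable without a command line.
	paths, err := parseParquetArgs([]string{"data.parquet", "maskedData.parquet"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("would mask %s into %s\n", paths.Input, paths.Output)
}
```

Returning the paths as a value instead of writing to package-level variables makes the function easy to test in isolation. The PR keeps package-level variables, in line with the other variables in the same `var` block.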
25 changes: 22 additions & 3 deletions go.mod
@@ -6,6 +6,7 @@ require (
github.com/CGI-FR/xixo v0.1.8
github.com/Masterminds/sprig/v3 v3.3.0
github.com/adrienaury/zeromdc v0.1.1
github.com/apache/arrow/go/v12 v12.0.1
github.com/capitalone/fpe v1.2.1
github.com/goccy/go-json v0.10.3
github.com/goccy/go-yaml v1.12.0
@@ -28,35 +29,53 @@ require (

require (
dario.cat/mergo v1.0.1 // indirect
github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c // indirect
github.com/Masterminds/goutils v1.1.1 // indirect
github.com/Masterminds/semver/v3 v3.3.0 // indirect
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/apache/thrift v0.16.0 // indirect
github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/buger/jsonparser v1.1.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/fatih/color v1.13.0 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/golang-jwt/jwt v3.2.2+incompatible // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/google/flatbuffers v2.0.8+incompatible // indirect
github.com/google/gxui v0.0.0-20151028112939-f85e0a97b3a4 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/huandu/xstrings v1.5.0 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/klauspost/asmfmt v1.3.2 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/klauspost/cpuid/v2 v2.0.9 // indirect
github.com/labstack/gommon v0.4.2 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 // indirect
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 // indirect
github.com/mitchellh/copystructure v1.2.0 // indirect
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/pierrec/lz4/v4 v4.1.21 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/shopspring/decimal v1.4.0 // indirect
github.com/smartystreets/goconvey v1.6.4 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasttemplate v1.2.2 // indirect
github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect
golang.org/x/net v0.24.0 // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
golang.org/x/mod v0.19.0 // indirect
golang.org/x/net v0.27.0 // indirect
golang.org/x/sync v0.8.0 // indirect
golang.org/x/sys v0.25.0 // indirect
golang.org/x/time v0.5.0 // indirect
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 // indirect
golang.org/x/tools v0.23.0 // indirect
golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f // indirect
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 // indirect
google.golang.org/grpc v1.49.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)