Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(taps): Generate fake data with stream maps #2170

Merged
merged 35 commits into from
Jan 31, 2024
Merged

feat(taps): Generate fake data with stream maps #2170

merged 35 commits into from
Jan 31, 2024

Conversation

ReubenFrankel
Copy link
Contributor

@ReubenFrankel ReubenFrankel commented Jan 22, 2024

Usage

The fake property (a Faker instance):

config:
  stream_maps:
    users:
      name: fake.name()

The Faker instance can be configured through faker_config:

Seed

config:
  faker_config:
    seed: 0

Locale

config:
  faker_config:
    locale: en_GB

Multiple locales

config:
  faker_config:
    locale:
    - en_GB
    - es_MX
    - ja_JP

Closes #1009



📚 Documentation preview 📚: https://meltano-sdk--2170.org.readthedocs.build/en/2170/

Copy link

codspeed-hq bot commented Jan 22, 2024

CodSpeed Performance Report

Merging #2170 will not alter performance

Comparing ReubenFrankel:1009-stream-maps-faker (ca5997e) with main (41ca39e)

Summary

✅ 6 untouched benchmarks

@ReubenFrankel ReubenFrankel marked this pull request as ready for review January 22, 2024 23:54
Copy link

codecov bot commented Jan 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (41ca39e) 88.46% compared to head (ca5997e) 88.46%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2170   +/-   ##
=======================================
  Coverage   88.46%   88.46%           
=======================================
  Files          54       54           
  Lines        4716     4716           
  Branches      917      917           
=======================================
  Hits         4172     4172           
  Misses        383      383           
  Partials      161      161           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@edgarrmondragon edgarrmondragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ReubenFrankel!

TLDR: What do you think about making this an optional dependency and add it as an extra both to singer-sdk and the 3 cookiecutters.

Pros: The site packages for most tap/target installations saves on the ~10MB (uncompressed) of faker. I've been meaning to slim down the SDK dependencies, and though the biggest offender is sqlalchemy, faker is not small either.

Cons: Folks have to either install faker alongside the tap/target or install with the extra if it's supported, e.g. tap-github[faker].

Having docs for this feature in https://sdk.meltano.com/en/latest/stream_maps.html that call out that users should install faker would be great.

Thanks!

@ReubenFrankel
Copy link
Contributor Author

@edgarrmondragon Sure, we can go with the extra approach. A couple of questions:

  1. Cons: Folks have to either install faker alongside the tap/target or install with the extra if it's supported, e.g. tap-github[faker].

    Would that not mean that the tap developer just has to specify singer-sdk[faker] as a dependency (indirectly through the cookiecutter), and otherwise from a Meltano perspective the pip_url doesn't have to change for a plugin?

  2. Are you happy with the stream_map_config approach for configuring the Faker instance? This does mean faker.seed and faker.locale become reserved config values for a tap using the faker extra. Alternatively, we could make it a top-level config option for taps with the stream-maps capability - like how the batch capability exposes all the batch_config options (although this might have to be a separate capability if faker becomes an extra).

@edgarrmondragon
Copy link
Collaborator

  1. Would that not mean that the tap developer just has to specify singer-sdk[faker] as a dependency (indirectly through the cookiecutter), and otherwise from a Meltano perspective the pip_url doesn't have to change for a plugin?

That's correct, that'd also work 👍

2. Are you happy with the stream_map_config approach for configuring the Faker instance? This does mean faker.seed and faker.locale become reserved config values for a tap using the faker extra. Alternatively, we could make it a top-level config option for taps with the stream-maps capability - like how the batch capability exposes all the batch_config options (although this might have to be a separate capability if faker becomes an extra).

Yeah. I think I'd like them better as top-level configs.

@edgarrmondragon
Copy link
Collaborator

@ReubenFrankel there seems to be a performance regression in setting up a mapper, reported by codspeed:
Screenshot 2024-01-29 at 10 34 32 a m

Do you think this drop in performance wouldn't be there if faker is not installed? (i.e. if users don't need fake and thus are not requesting faker be installed)

@ReubenFrankel
Copy link
Contributor Author

@ReubenFrankel there seems to be a performance regression in setting up a mapper, reported by codspeed: Screenshot 2024-01-29 at 10 34 32 a m

Do you think this drop in performance wouldn't be there if faker is not installed? (i.e. if users don't need fake and thus are not requesting faker be installed)

When faker is not installed, both of these exceptions should be suppressed:

sdk/singer_sdk/mapper.py

Lines 271 to 272 in 3bad517

with contextlib.suppress(ImportError):
self._init_faker_instance()

sdk/singer_sdk/mapper.py

Lines 336 to 337 in 3bad517

with contextlib.suppress(AttributeError):
names["fake"] = self.fake

I would have thought that without faker installed, this would cause a very minimal performance delta for what is essentially two try-except statements.

@edgarrmondragon
Copy link
Collaborator

edgarrmondragon commented Jan 29, 2024

@ReubenFrankel I've pushed 4e7ea15 to test performance without faker and it seems to get worse 😅. I confirmed by running locally with poetry run pytest --benchmark-only tests.

What do you think?

EDIT: Perhaps the the benchmark test in

def test_bench_simple_map_transforms(
is not correctly set up to truly capture mapper performance because mapper initialization gets in the mix?

@edgarrmondragon
Copy link
Collaborator

edgarrmondragon commented Jan 30, 2024

@edgarrmondragon What do you think of these docs additions?

@ReubenFrankel Should we mention that faker should be injected in the virtual environment by the end user?

We should also add faker_config to

Property(
"stream_map_config",
ObjectType(),
description="User-defined config values to be used within map expressions.",
),
.

@ReubenFrankel
Copy link
Contributor Author

ReubenFrankel commented Jan 30, 2024

Ah, yeah good call. I'll add both those. Regarding updating the capability:

  1. Should we conditionally include the faker_config property on importlib.util.find_spec("faker"), or just add it regardless?

    STREAM_MAPS_CONFIG = PropertiesList(
        Property(
            "stream_maps",
            ObjectType(),
            description=(
                "Config object for stream maps capability. "
                "For more information check out "
                "[Stream Maps](https://sdk.meltano.com/en/latest/stream_maps.html)."
            ),
        ),
        Property(
            "stream_map_config",
            ObjectType(),
            description="User-defined config values to be used within map expressions.",
        ),
        *[
            Property(
                "faker_config",
                ObjectType(
                    Property(
                        "seed",
                        OneOf(NumberType, StringType, BooleanType),
                        description=(
                            "[Faker generator seed]"
                            "(https://faker.readthedocs.io/en/master/index.html?highlight=seed#seeding-the-generator)"
                        ),
                    ),
                    Property(
                        "locale",
                        OneOf(StringType, ArrayType(StringType)),
                        description=(
                            "[Faker local normalization]"
                            "(https://faker.readthedocs.io/en/master/fakerclass.html#locale-normalization)"
                        ),
                    ),
                ),
                description=(
                    "Config for the [`Faker`](https://faker.readthedocs.io/en/master/) "
                    "instance variable `fake` used within map expressions.",
                ),
            ),
        ]
        if importlib.util.find_spec("faker")
        else [],
    ).to_dict()
  2. Sort of unrelated, but is this documentation out-of-date? As a developer, I don't think I've ever defined those settings in a plugin I've built using the SDK, since they are automatically added given the stream_maps capability

    sdk/docs/stream_maps.md

    Lines 93 to 96 in fa888be

    To support inline mapping functions, the developer only needs to declare two plugin settings,
    called `stream_maps` and `stream_map_config`, and declare both settings as `object` type. (For example:
    `Property("stream_maps, ObjectType())` if using the python helper classes or
    `"stream_maps": {"type": "object"}` if using native JSON Schema declarations.)

I'll also take a look at updating the cookiecutter tap and target templates to support singer-sdk[faker].

@edgarrmondragon
Copy link
Collaborator

1. Should we conditionally include the faker_config property on importlib.util.find_spec("faker"), or just add it regardless?

@ReubenFrankel I think it's ok to just add it regardless. We can reiterate in the field description that it requires injecting faker.

2. Sort of unrelated, but is this documentation out-of-date? As a developer, I don't think I've ever defined those settings in a plugin I've built using the SDK, since they are automatically added given the stream_maps capability

Ah, good catch. Feel free to get rid of that!

@ReubenFrankel
Copy link
Contributor Author

@edgarrmondragon This is ready for another look. There is something wrong with the current duckdb version for CI checks - looks like it no longer exists on PyPI?

sdk/poetry.lock

Line 547 in 63bb73b

version = "0.9.3.dev2938"

@edgarrmondragon
Copy link
Collaborator

@edgarrmondragon This is ready for another look. There is something wrong with the current duckdb version for CI checks - looks like it no longer exists on PyPI?

sdk/poetry.lock

Line 547 in 63bb73b

version = "0.9.3.dev2938"

Fixed on main by #2206 😮‍💨

@edgarrmondragon edgarrmondragon changed the title feat: Generate fake data with stream maps feat(taps): Generate fake data with stream maps Jan 31, 2024
Copy link
Collaborator

@edgarrmondragon edgarrmondragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ReubenFrankel!

@edgarrmondragon edgarrmondragon added this pull request to the merge queue Jan 31, 2024
Merged via the queue into meltano:main with commit 2fbe530 Jan 31, 2024
31 checks passed
@ReubenFrankel
Copy link
Contributor Author

@edgarrmondragon Thanks for finishing off the tests and docs. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: enhance mapper to produce realistic test data
2 participants