Add sideinputs to the RunInference Transform #25200

Merged on Feb 2, 2023 (54 commits)
Changes from 46 commits
Commits
e54a2de
Add model pcoll param to the RunInference Ptransform
AnandInguva Jan 6, 2023
443120a
Add sklearn side input example
AnandInguva Jan 12, 2023
50b58b1
Add ModeMetadata and some refactoring
AnandInguva Jan 13, 2023
3cb7074
refactor _convert_to_result and add it to the utils.py
AnandInguva Jan 19, 2023
f614a3d
Add tag to the RunInference DoFn
AnandInguva Jan 19, 2023
43b5ca6
Add enable_side_input_loading flag
AnandInguva Jan 20, 2023
c42b903
Add helper functions
AnandInguva Jan 20, 2023
e2c2833
Add doc string, refactor utils code
AnandInguva Jan 21, 2023
f4a6c2b
Fix pytorch inference tests
AnandInguva Jan 21, 2023
530f61f
Fix up sklearn inference
AnandInguva Jan 23, 2023
9bd8d8f
Remove logging
AnandInguva Jan 24, 2023
676600a
Add thread Lock when there is an update to side input
AnandInguva Jan 25, 2023
0a2a56e
Check if side input is EmptySideInput
AnandInguva Jan 25, 2023
7a1ed15
Add unit test for side input loading
AnandInguva Jan 25, 2023
eae1837
Remove examples
AnandInguva Jan 25, 2023
af19536
Add log when side input path is updated
AnandInguva Jan 25, 2023
3d0821e
Add test to Dataflow
AnandInguva Jan 26, 2023
fe434df
Refactor side input loading code
AnandInguva Jan 26, 2023
2a3e6a4
Add documentation, changelog
AnandInguva Jan 27, 2023
dd6d494
Add Singleton view doc
AnandInguva Jan 27, 2023
037c80a
Fix whitespace, tests
AnandInguva Jan 27, 2023
f9a61a1
fix weird spacing
AnandInguva Jan 27, 2023
d2edaa9
Remove beam website udpate
AnandInguva Jan 27, 2023
73ed494
Revert "fix weird spacing"
AnandInguva Jan 27, 2023
ae91ffe
Add WatchFilePattern transform
AnandInguva Jan 27, 2023
b796560
undo changes to beam wesbite
AnandInguva Jan 27, 2023
ca178a2
Pass side inputs only in streaming mode
AnandInguva Jan 27, 2023
0d96c77
Revert "Add WatchFilePattern transform"
AnandInguva Jan 27, 2023
2d97fdb
Add lines to website page
AnandInguva Jan 27, 2023
cf3893f
Add test
AnandInguva Jan 28, 2023
9295dce
Add unit test to catch --streaming flag and Singleton SideInput
AnandInguva Jan 30, 2023
3ce25f8
Addressing PR comments
AnandInguva Jan 30, 2023
912e36f
Add logic to detect windows on side inputs
AnandInguva Jan 30, 2023
0191cdd
Add more tests
AnandInguva Jan 30, 2023
a860b98
Remove redundant code
AnandInguva Jan 30, 2023
fba66b3
Update test
AnandInguva Jan 31, 2023
babcd24
Fix lint
AnandInguva Jan 31, 2023
016bbc5
Add postcommit markers.
AnandInguva Jan 31, 2023
88cb09d
Remove `and`
AnandInguva Jan 31, 2023
53f5a7c
fixup lint
AnandInguva Jan 31, 2023
d1abcc0
Modify message
AnandInguva Jan 31, 2023
b2c40db
Add check for default model
AnandInguva Jan 31, 2023
60b33fa
Update message
AnandInguva Jan 31, 2023
036f321
Add validates runner
AnandInguva Jan 31, 2023
5b40d1f
Fix test
AnandInguva Jan 31, 2023
0b58f10
Add PipelineVisitor for RunInference during construction time
AnandInguva Feb 1, 2023
bcdb871
Address comments based on PR
AnandInguva Feb 1, 2023
e6f8eaa
Remove restriction on the side inputs
AnandInguva Feb 1, 2023
ec3ca9b
Remove/add tests
AnandInguva Feb 1, 2023
6d07b6f
Modify logging
AnandInguva Feb 1, 2023
25c662c
Add tests
AnandInguva Feb 1, 2023
f7ab2d7
Merge branch 'master' into model-updates-api
damccorm Feb 2, 2023
0d325fe
Add 2.46.0 change log
AnandInguva Feb 2, 2023
637d3c3
fix typo
AnandInguva Feb 2, 2023
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -69,6 +69,8 @@
present in 2.43.0 (up to 1.8.0_342, 11.0.16, 17.0.2 for respective Java versions). This is accompanied
by an explicit re-enabling of TLSv1 and TLSv1.1 for Java 8 and Java 11.
* Add UDF metrics support for Samza portable mode.
* RunInference PTransform will accept Singleton SideInputs in Python SDK. ([#24042](https://github.com/apache/beam/issues/24042))
Contributor

Wait, actually this won't be released in 2.45 - could you please create a new section for 2.46? (same thing Luke did here - 8ec0568#diff-d975bf659606195d2165918f93e1cf680ef68ea3c9cab994f033705fea8238b2)

Contributor Author

Ah yes, thanks for catching it.



## Breaking Changes

@@ -0,0 +1,168 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
Used for internal testing. No backwards compatibility.
"""

import argparse
import logging
import time
from typing import Iterable
from typing import Optional
from typing import Sequence

import apache_beam as beam
from apache_beam.ml.inference import base
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.transforms import trigger
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.userstate import CombiningValueStateSpec


# Create some fake models that return different inference results.
class FakeModelDefault:
  def predict(self, example: int) -> int:
    return example


class FakeModelAdd(FakeModelDefault):
  def predict(self, example: int) -> int:
    return example + 1


class FakeModelSub(FakeModelDefault):
  def predict(self, example: int) -> int:
    return example - 1


class FakeModelHandlerReturnsPredictionResult(
    base.ModelHandler[int, base.PredictionResult, FakeModelDefault]):
  def __init__(self, clock=None, model_id='model_default'):
    self.model_id = model_id
    self._fake_clock = clock

  def load_model(self):
    if self._fake_clock:
      self._fake_clock.current_time_ns += 500_000_000  # 500ms
    if self.model_id == 'model_add.pkl':
      return FakeModelAdd()
    elif self.model_id == 'model_sub.pkl':
      return FakeModelSub()
    return FakeModelDefault()

  def run_inference(
      self,
      batch: Sequence[int],
      model: FakeModelDefault,
      inference_args=None) -> Iterable[base.PredictionResult]:
    for example in batch:
      yield base.PredictionResult(
          model_id=self.model_id,
          example=example,
          inference=model.predict(example))

  def update_model_path(self, model_path: Optional[str] = None):
    # Keep the current model id when no new path is supplied.
    self.model_id = model_path if model_path else self.model_id

def run(argv=None, save_main_session=True):
  parser = argparse.ArgumentParser()
  first_ts = time.time()
  side_input_interval = 60
  main_input_interval = 20
  # Give Dataflow some time to start.
  last_ts = first_ts + 1200
  mid_ts = (first_ts + last_ts) / 2

  _, pipeline_args = parser.parse_known_args(argv)
  options = PipelineOptions(pipeline_args)
  options.view_as(SetupOptions).save_main_session = save_main_session

  test_pipeline = beam.Pipeline(options=options)

  class GetModel(beam.DoFn):
    def process(self, element) -> Iterable[base.ModelMetdata]:
      # Emit model_sub for the first half of the run, model_add afterwards.
      if time.time() > mid_ts:
        yield base.ModelMetdata(
            model_id='model_add.pkl', model_name='model_add')
      else:
        yield base.ModelMetdata(
            model_id='model_sub.pkl', model_name='model_sub')

  class _EmitSingletonSideInput(beam.DoFn):
    COUNT_STATE = CombiningValueStateSpec('count', combine_fn=sum)

    def process(self, element, count_state=beam.DoFn.StateParam(COUNT_STATE)):
      _, path = element
      counter = count_state.read()
      if counter == 0:
        count_state.add(1)
        yield path

  def validate_prediction_result(x: base.PredictionResult):
    model_id = x.model_id
    if model_id == 'model_sub.pkl':
      assert (x.example == 1 and x.inference == 0)

    if model_id == 'model_add.pkl':
      assert (x.example == 1 and x.inference == 2)

    if model_id == 'model_default':
      assert (x.example == 1 and x.inference == 1)

  side_input = (
      test_pipeline
      | "SideInputPColl" >> PeriodicImpulse(
          first_ts, last_ts, fire_interval=side_input_interval)
      | "GetModelId" >> beam.ParDo(GetModel())
      | "AttachKey" >> beam.Map(lambda x: (x, x))
      # PeriodicImpulse has a start timestamp earlier than the moment the
      # Dataflow pipeline begins processing data, so it can fire several
      # times at once and produce an Iterable instead of a singleton.
      # The _EmitSingletonSideInput DoFn ensures that each unique path is
      # emitted only once.
      | "GetSingleton" >> beam.ParDo(_EmitSingletonSideInput())
      | "ApplySideInputWindow" >> beam.WindowInto(
          window.GlobalWindows(),
          trigger=trigger.Repeatedly(trigger.AfterProcessingTime(1)),
          accumulation_mode=trigger.AccumulationMode.DISCARDING))

  model_handler = FakeModelHandlerReturnsPredictionResult()
  inference_pcoll = (
      test_pipeline
      | "MainInputPColl" >> PeriodicImpulse(
          first_ts,
          last_ts,
          fire_interval=main_input_interval,
          apply_windowing=True)
      | beam.Map(lambda x: 1)
      | base.RunInference(
          model_handler=model_handler, model_metadata_pcoll=side_input))

  _ = inference_pcoll | "AssertPredictionResult" >> beam.Map(
      validate_prediction_result)

  _ = inference_pcoll | "Logging" >> beam.Map(logging.info)

  test_pipeline.run().wait_until_finish()
Contributor

The with Pipeline() as pipeline syntax is generally preferred to a manual .run().wait_until_finish() call.

Contributor Author

changed it. thanks
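
The suggestion above relies on `beam.Pipeline` acting as a context manager that runs the pipeline and waits for it to finish when the `with` block exits cleanly. A minimal stand-in sketch of that pattern, using only the standard library; `FakePipeline` and `FakeResult` are hypothetical names for illustration and are not Beam classes:

```python
class FakeResult:
  """Hypothetical stand-in for a pipeline result object."""
  def __init__(self):
    self.finished = False

  def wait_until_finish(self):
    self.finished = True
    return 'DONE'


class FakePipeline:
  """Hypothetical stand-in that mimics the run-on-exit pattern."""
  def __init__(self):
    self.result = None

  def run(self):
    return FakeResult()

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, traceback):
    # Run and wait only when the body finished without raising.
    if exc_type is None:
      self.result = self.run()
      self.result.wait_until_finish()
    return False  # never swallow exceptions


# Manual style: the caller must remember both calls.
manual = FakePipeline()
manual_result = manual.run()
manual_result.wait_until_finish()

# Context-manager style: both calls happen when the block exits.
with FakePipeline() as pipeline:
  pass  # transforms would be applied to `pipeline` here
```

With the context-manager form, an exception raised while building the pipeline skips the run step instead of launching a partially built job.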



if __name__ == '__main__':
  run()
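
The flow this test pipeline exercises (a side input delivers a new model path, the handler records it through `update_model_path`, and the next `load_model` call returns the matching model) can be sketched without Beam. This is an illustration under assumptions: `SketchHandler`, `MODELS`, and `apply_update` are hypothetical names, and the reload step is a simplification rather than a description of what RunInference does internally.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry mapping a model path to a prediction function,
# mirroring the FakeModelAdd / FakeModelSub classes in the test file.
MODELS: Dict[str, Callable[[int], int]] = {
    'model_add.pkl': lambda x: x + 1,
    'model_sub.pkl': lambda x: x - 1,
}


class SketchHandler:
  def __init__(self, model_id: str = 'model_default'):
    self.model_id = model_id

  def load_model(self) -> Callable[[int], int]:
    # Unknown ids fall back to the identity model, like FakeModelDefault.
    return MODELS.get(self.model_id, lambda x: x)

  def update_model_path(self, model_path: Optional[str] = None) -> None:
    # Keep the current id when no new path is supplied.
    self.model_id = model_path if model_path else self.model_id


def apply_update(handler: SketchHandler,
                 model_path: Optional[str]) -> Callable[[int], int]:
  """Hypothetical helper: record the new path, then reload the model."""
  handler.update_model_path(model_path)
  return handler.load_model()


handler = SketchHandler()
results: List[Tuple[str, int]] = []
for path in [None, 'model_sub.pkl', 'model_add.pkl']:
  model = apply_update(handler, path)
  results.append((handler.model_id, model(1)))
```

For the input `1`, the three steps yield the same (model id, inference) pairs that `validate_prediction_result` asserts on: the default model returns 1, `model_sub.pkl` returns 0, and `model_add.pkl` returns 2.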
@@ -18,7 +18,6 @@
 
 # pytype: skip-file
 
-import re
 import unittest
 from io import StringIO
 
@@ -44,40 +43,40 @@
 
 def check_torch_keyed_model_handler():
   expected = '''[START torch_keyed_model_handler]
-('first_question', PredictionResult(example=tensor([105.]), inference=tensor([523.6982])))
-('second_question', PredictionResult(example=tensor([108.]), inference=tensor([538.5867])))
-('third_question', PredictionResult(example=tensor([1000.]), inference=tensor([4965.4019])))
-('fourth_question', PredictionResult(example=tensor([1013.]), inference=tensor([5029.9180])))
+('first_question', PredictionResult(example=tensor([105.]), inference=tensor([523.6982]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt'))
+('second_question', PredictionResult(example=tensor([108.]), inference=tensor([538.5867]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt'))
+('third_question', PredictionResult(example=tensor([1000.]), inference=tensor([4965.4019]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt'))
+('fourth_question', PredictionResult(example=tensor([1013.]), inference=tensor([5029.9180]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt'))
 [END torch_keyed_model_handler] '''.splitlines()[1:-1]
   return expected
 
 
 def check_sklearn_keyed_model_handler(actual):
   expected = '''[START sklearn_keyed_model_handler]
-('first_question', PredictionResult(example=[105.0], inference=array([525.])))
-('second_question', PredictionResult(example=[108.0], inference=array([540.])))
-('third_question', PredictionResult(example=[1000.0], inference=array([5000.])))
-('fourth_question', PredictionResult(example=[1013.0], inference=array([5065.])))
+('first_question', PredictionResult(example=[105.0], inference=array([525.]), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl'))
+('second_question', PredictionResult(example=[108.0], inference=array([540.]), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl'))
+('third_question', PredictionResult(example=[1000.0], inference=array([5000.]), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl'))
+('fourth_question', PredictionResult(example=[1013.0], inference=array([5065.]), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl'))
 [END sklearn_keyed_model_handler] '''.splitlines()[1:-1]
   assert_matches_stdout(actual, expected)
 
 
 def check_torch_unkeyed_model_handler():
   expected = '''[START torch_unkeyed_model_handler]
-PredictionResult(example=tensor([10.]), inference=tensor([52.2325]))
-PredictionResult(example=tensor([40.]), inference=tensor([201.1165]))
-PredictionResult(example=tensor([60.]), inference=tensor([300.3724]))
-PredictionResult(example=tensor([90.]), inference=tensor([449.2563]))
+PredictionResult(example=tensor([10.]), inference=tensor([52.2325]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt')
+PredictionResult(example=tensor([40.]), inference=tensor([201.1165]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt')
+PredictionResult(example=tensor([60.]), inference=tensor([300.3724]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt')
+PredictionResult(example=tensor([90.]), inference=tensor([449.2563]), model_id='gs://apache-beam-samples/run_inference/five_times_table_torch.pt')
 [END torch_unkeyed_model_handler] '''.splitlines()[1:-1]
   return expected
 
 
 def check_sklearn_unkeyed_model_handler(actual):
   expected = '''[START sklearn_unkeyed_model_handler]
-PredictionResult(example=array([20.], dtype=float32), inference=array([100.], dtype=float32))
-PredictionResult(example=array([40.], dtype=float32), inference=array([200.], dtype=float32))
-PredictionResult(example=array([60.], dtype=float32), inference=array([300.], dtype=float32))
-PredictionResult(example=array([90.], dtype=float32), inference=array([450.], dtype=float32))
+PredictionResult(example=array([20.], dtype=float32), inference=array([100.], dtype=float32), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl')
+PredictionResult(example=array([40.], dtype=float32), inference=array([200.], dtype=float32), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl')
+PredictionResult(example=array([60.], dtype=float32), inference=array([300.], dtype=float32), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl')
+PredictionResult(example=array([90.], dtype=float32), inference=array([450.], dtype=float32), model_id='gs://apache-beam-samples/run_inference/five_times_table_sklearn.pkl')
 [END sklearn_unkeyed_model_handler] '''.splitlines()[1:-1]
   assert_matches_stdout(actual, expected)

@@ -103,22 +102,14 @@ def test_check_torch_keyed_model_handler(self, mock_stdout):
     runinference.torch_keyed_model_handler()
     predicted = mock_stdout.getvalue().splitlines()
     expected = check_torch_keyed_model_handler()
-    actual_stdout = [line.split(':')[0] for line in predicted]
-    replace_fn = lambda x: re.sub(r"<UnbindBackward\d*>", "<UnbindBackward>", x)
-    actual_stdout = [replace_fn(x) for x in actual_stdout]
-    expected_stdout = [line.split(':')[0] for line in expected]
-    self.assertEqual(actual_stdout, expected_stdout)
+    self.assertEqual(predicted, expected)
 
   @pytest.mark.uses_pytorch
   def test_check_torch_unkeyed_model_handler(self, mock_stdout):
     runinference.torch_unkeyed_model_handler()
     predicted = mock_stdout.getvalue().splitlines()
     expected = check_torch_unkeyed_model_handler()
-    actual_stdout = [line.split(':')[0] for line in predicted]
-    replace_fn = lambda x: re.sub(r"<UnbindBackward\d*>", "<UnbindBackward>", x)
-    actual_stdout = [replace_fn(x) for x in actual_stdout]
-    expected_stdout = [line.split(':')[0] for line in expected]
-    self.assertEqual(actual_stdout, expected_stdout)
+    self.assertEqual(predicted, expected)
 
 
 if __name__ == '__main__':
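
The `check_*` helpers build their expected output by slicing a triple-quoted block, and the updated tests compare the captured stdout against it line for line. A small standard-library sketch of that slicing; the marker name and the sample lines are made up for illustration and do not come from the Beam snippets:

```python
expected_block = '''[START sample]
PredictionResult(example=1, inference=2, model_id='a.pkl')
PredictionResult(example=2, inference=3, model_id='a.pkl')
[END sample] '''

# splitlines() yields four lines; [1:-1] drops the START and END markers.
expected = expected_block.splitlines()[1:-1]

# Stand-in for what mock_stdout.getvalue() would return in the tests.
captured_stdout = (
    "PredictionResult(example=1, inference=2, model_id='a.pkl')\n"
    "PredictionResult(example=2, inference=3, model_id='a.pkl')\n")
predicted = captured_stdout.splitlines()

lines_match = predicted == expected
```

Because the comparison is exact, any change to the printed form of a result, such as the new `model_id` field, has to be reflected in the expected block.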