Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make OT binding support explicit timestamps. #824

Conversation

AndreiPashkin
Copy link

@AndreiPashkin AndreiPashkin commented May 8, 2020

What does this pull request do?

OpenTracing API allows to explicitly set start and end timestamps of spans but the implementation in this project does not allow that. That patch fixes it.

Closes #1126

@cla-checker-service
Copy link

cla-checker-service bot commented May 8, 2020

💚 CLA has been signed

@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch from fc0541b to af058fc Compare May 8, 2020 20:29
@apmmachine
Copy link
Contributor

apmmachine commented May 8, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-13T16:27:38.163+0000

  • Duration: 31 min 37 sec

  • Commit: d16b3cc

Test stats 🧪

Test Results
Failed 0
Passed 10007
Skipped 8977
Total 18984

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

Copy link
Contributor

@basepi basepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me but I'm going to wait on a review from @beniwohli because he wrote the opentracing bridge. (Also need to wait on the tests)

@basepi basepi requested a review from beniwohli May 8, 2020 21:49
@basepi
Copy link
Contributor

basepi commented May 8, 2020

jenkins run the tests please

@basepi
Copy link
Contributor

basepi commented May 8, 2020

Also, @AndreiPashkin, please sign the CLA when you get a chance.

@AndreiPashkin
Copy link
Author

AndreiPashkin commented May 9, 2020

Also, @AndreiPashkin, please sign the CLA when you get a chance.

Hmm, but I've signed it, here is the PDF:
https://www.dropbox.com/s/oufsg1mc17hlqvy/Please_DocuSign_Individual_Contributor_Licen%20%281%29.pdf?dl=0

Why doesn't the check turn green?

I'm also puzzled - how can I run tests so that I can use PDB for debugging locally? Could you please advise? I'd like to add one test case.

@basepi
Copy link
Contributor

basepi commented May 11, 2020

Looks like CLA went green, maybe it was delayed.

I'm also puzzled - how can I run tests so that I can use PDB for debugging locally?

I don't tend to use PDB, but running the tests should be as simple as installing the requirements file that is most relevant (at minimum, reqs-base.txt), and then run py.test to run all the tests. I'm guessing PDB can attach to pytest runs without issue -- most of my debugging is either via logging or via the VSCode debugger.

Copy link
Contributor

@beniwohli beniwohli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Ideally, we'd have a test for this, but I'm fine with merging and adding tests after that.

@AndreiPashkin
Copy link
Author

Let me actually write a test.

@AndreiPashkin
Copy link
Author

AndreiPashkin commented May 16, 2020

@basepi, @beniwohli, I've observed some strange when tested my changes.

Here is my test-script:

import logging
import sys
import time


from elasticapm import Client
from elasticapm.contrib.opentracing import Tracer


client = Client({
    'SERVICE_NAME': 'test-with_changes',
    'SERVER_URL': 'http://localhost:8888'
})
tracer = Tracer(client_instance=client)

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

START = time.time()

span_1 = tracer.start_span('parent', start_time=START)
span_2 = tracer.start_span('child', child_of=span_1, start_time=START + 2)
span_2.finish(finish_time=START + 10)
span_1.finish(finish_time=START + 15)

When I install the latest release with pip install -I elastic-apm and execute it - I see such a picture:
image
Both parent and child are visible but they have incorrect timestamps recorded (and making it recorded correctly is the purpose of this PR).

When I install my changes (pip install -I https://github.com/AndreiPashkin/apm-agent-python/archive/explicit-timestamps-opentracing.zip) and re-run my script (changing the service name to better observe the differences) - I see the following:
image

  1. Only the parent is seen.
  2. Timestamps have gigantic values.

I've also captured payload that was sent to the Elastic APM instance in both cases.

Without changes:

{"metadata": {"service": {"name": "test-without_changes", "environment": null, "version": null, "agent": {"name": "python", "version": "5.6.0"}, "language": {"name": "python", "version": "3.7.7"}, "runtime": {"name": "CPython", "version": "3.7.7"}}, "process": {"pid": 30721, "ppid": 2071, "argv": ["test_elastic.py"], "title": null}, "system": {"hostname": "andrei-yoga514", "architecture": "x86_64", "platform": "linux"}}}
{"span": {"id": "da6e65e8787355b7", "transaction_id": "f5c2f6d6740a0cd4", "trace_id": "4366c9a4a3cbc6c54abbf84e10d4d8a5", "parent_id": "f5c2f6d6740a0cd4", "name": "child", "type": "code", "subtype": "custom", "action": null, "sync": true, "timestamp": 1589642748818084, "duration": 0.07035498856566846}}
{"transaction": {"id": "f5c2f6d6740a0cd4", "trace_id": "4366c9a4a3cbc6c54abbf84e10d4d8a5", "name": "parent", "type": "custom", "duration": 0.24842699349392205, "result": "", "timestamp": 1589642748818004, "sampled": true, "span_count": {"started": 1, "dropped": 0}, "context": {"tags": {}}}}

With changes:

{"metadata": {"service": {"name": "test-with_changes", "environment": null, "version": null, "agent": {"name": "python", "version": "5.6.0"}, "language": {"name": "python", "version": "3.7.7"}, "runtime": {"name": "CPython", "version": "3.7.7"}}, "process": {"pid": 31801, "ppid": 2071, "argv": ["test_elastic.py"], "title": null}, "system": {"hostname": "andrei-yoga514", "architecture": "x86_64", "platform": "linux"}}}
{"span": {"id": "a1a2d1d77c17595e", "transaction_id": "0af7966685c1e27a", "trace_id": "7adccfb5c60bad87b80e290334e6ee53", "parent_id": "0af7966685c1e27a", "name": "child", "type": "code", "subtype": "custom", "action": null, "sync": true, "timestamp": 3179173112604596, "duration": 8000.0, "stacktrace": [{"abs_path": "test_elastic.py", "filename": "test_elastic.py", "module": "__main__", "function": "<module>", "lineno": 21, "library_frame": false}]}}
{"transaction": {"id": "0af7966685c1e27a", "trace_id": "7adccfb5c60bad87b80e290334e6ee53", "name": "parent", "type": "custom", "duration": 1589530226905.7947, "result": "", "timestamp": 1589642898698801, "sampled": true, "span_count": {"started": 1, "dropped": 0}, "context": {"tags": {}}}}

Does anyone have any ideas about what could be wrong?

@AndreiPashkin
Copy link
Author

AndreiPashkin commented May 16, 2020

@beniwohli, can you also please explain this code:
http://github.com/elastic/apm-agent-python/blob/master/elasticapm/traces.py#L140-L140

Why is start provided both self.timestamp and self.start_time are assigned to it but if start is not provided self.timestamp is assigned epoch timestamp and self.start_time is assigned result of time.perf_time() which is usually equals to time.monotonic()?

@AndreiPashkin AndreiPashkin requested a review from beniwohli May 17, 2020 01:14
@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch from af058fc to 36cece9 Compare May 21, 2020 20:51
limit=limit
)
return (
clocks_measurements.destination +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break after binary operator

if self.__measurements_ttl is None or self.__measurements_ttl == 0:
return
if ((self.__clocks_measurement_cache[0] + self.__measurements_ttl) >
time.time()):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visually indented line with same indent as next logical line

return
if self.__measurements_ttl is None or self.__measurements_ttl == 0:
return
if ((self.__clocks_measurement_cache[0] + self.__measurements_ttl) >

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break after binary operator

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@basepi, I'm not sure, how should I re-format this line? Without breaks it is going to become too long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little surprised that black isn't reformatting this for you...

Our enforced line length limit is actually 120 columns, so I think you can just do this on one line. I try to keep my code down to around 90 columns, but if it hurts readability I think it's not worth chasing.

src_now = dst_now = None
epsilon = sys.float_info.max
i = 0
for i in range(limit):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loop control variable 'i' not used within the loop body. If this is intended, start the name with an underscore.

@AndreiPashkin AndreiPashkin requested a review from basepi May 21, 2020 21:14
@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch 3 times, most recently from 5e29e6f to c3034ef Compare May 24, 2020 19:29
@@ -124,10 +125,15 @@ def set_tag(self, key, value):
return self

def finish(self, finish_time=None):
if (finish_time is not None and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break after binary operator

Copy link
Contributor

@beniwohli beniwohli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndreiPashkin thanks a lot for your contribution!

Just as a quick rationale why we didn't ensure start / end timestamps to be monotonic when explicitly provided: these arguments have been introduced mainly for testing, so that we can override durations and test for exact values. They've not been intended to use publicly, exactly because mixing monotonic and non-monotonic time sources is problematic.

While I think your TimestampConversionUtil is a smart way to kinda reconcile the two time sources, it does add a lot of complexity, and it's not bulletproof as far as I can tell (e.g. the system clock could jump right after you measured the delta, but before applying it to the non-monotonic timestamp).

As such, I'm a bit wary to introduce this into a major code path of the agent. However, I'd be open to it if you moved the usage of TimestampConversionUtil to the opentracing bridge.


SYSTEM_TO_MONOTONIC = TimestampConversionUtil(
time.time,
time.monotonic,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.monotonic is not available on Python 2.7. To ensure Python 2.7 compatibility, timeit.default_timer is the best option. On Python 3, it defaults to time.perf_counter, on Python 2 to the best platform dependent time source (non-monotonic)

@AndreiPashkin
Copy link
Author

AndreiPashkin commented Jun 6, 2020

@beniwohli

it's not bulletproof as far as I can tell (e.g. the system clock could jump right after you measured the delta, but before applying it to the non-monotonic timestamp).

But if system clock is jumping then the agent is going to send skewed data in any case - so there is no bullet-proofness against that in current master already and my PR does not change that.
TimestampConversionUtil simply converts timestamps between clocks without adding or removing anything in regards to protection against clock jumps.

As such, I'm a bit wary to introduce this into a major code path of the agent. However, I'd be open to it if you moved the usage of TimestampConversionUtil to the opentracing bridge.

Should I add another private argument for system-clock timestamp and pass it from the OT bridge along with the monotonic timestamp?

@AndreiPashkin
Copy link
Author

@beniwohli
Could you please reply on my prev. comment?

@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch from c3034ef to 4df630f Compare August 31, 2020 06:45
@apmmachine
Copy link
Contributor

apmmachine commented Oct 27, 2020

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 9189
Skipped 6837
Total 16026

@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch 2 times, most recently from e8170e1 to 3e18dd7 Compare March 6, 2021 08:12
:return: the started transaction object
"""
if self.config.is_recording:
return self.tracer.begin_transaction(transaction_type, trace_parent=trace_parent, start=start)
return self.tracer.begin_transaction(transaction_type, trace_parent=trace_parent, start=start, start_timestamp=start_timestamp)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long (139 > 120 characters)

@@ -102,7 +106,7 @@ def start_span(
if parent_context and parent_context.span and not parent_context.span.is_transaction
else None
)
span = transaction._begin_span(operation_name, None, parent_span_id=parent_span_id)
span = transaction._begin_span(operation_name, None, parent_span_id=parent_span_id, start=start_time_monotonic, start_timestamp=start_time)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long (151 > 120 characters)

trace_parent = parent_context.trace_parent if parent_context else None
transaction = self._agent.begin_transaction("custom", trace_parent=trace_parent)
transaction = self._agent.begin_transaction("custom", trace_parent=trace_parent, start=start_time_monotonic, start_timestamp=start_time)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long (148 > 120 characters)

@@ -123,12 +123,15 @@ def set_failure(self):

class Transaction(BaseSpan):
def __init__(
self, tracer, transaction_type="custom", trace_parent=None, is_sampled=True, start=None, sample_rate=None
self, tracer, transaction_type="custom", trace_parent=None, is_sampled=True, start=None, sample_rate=None, start_timestamp=None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long (135 > 120 characters)

@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch 3 times, most recently from 0f07ae6 to 5283ce1 Compare March 7, 2021 08:20
A simple class that makes possible to take a timestamp from one type
of clock and convert it to as if it was taken from another type of
clock.
For example it makes possible to convert timestamps taken from
the monotonic clock to representation in terms of the system clock.
OpenTracing API allows to explicitly set start and end timestamps of
spans but the implementation in this project does not allow that. That
patch fixes it.
@AndreiPashkin AndreiPashkin force-pushed the explicit-timestamps-opentracing branch from 5283ce1 to c6db1a7 Compare March 7, 2021 11:47
@AndreiPashkin
Copy link
Author

@basepi, @beniwohli, could you please look into it?

@basepi
Copy link
Contributor

basepi commented Jun 10, 2021

@beniwohli As far as I can tell, the new TimestampConversionUtil is only used within the OT bridge, and even then only if a start time is defined, so I don't think this is actually going to introduce that additional complexity across core transaction/span duration measurements.

For my sanity, @AndreiPashkin, can you give me the tl;dr of why that class is necessary? Are we just trying to avoid the unlikely, but possible chance that we could measure the difference between the clocks at the point of a daylight saving change, and thus convert with an hour mistake in the duration? Are there other examples of jumps in non-monotonic clocks that we're trying to detect/prevent?

@elastic elastic deleted a comment from houndci-bot Jul 19, 2021
@basepi
Copy link
Contributor

basepi commented Oct 29, 2021

Hey @AndreiPashkin -- I'd still like to get this in. I ended up writing a similar (but a bit simpler) timestamp conversion util for some AWS lambda work I was doing. It was just merged in #1366. Can you merge from upstream/master and switch to that new util? We can then get this merged.

Thanks!

@basepi
Copy link
Contributor

basepi commented Aug 25, 2022

Closing this due to lack of reply.

@basepi basepi closed this Aug 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OpenTracing binding ignores start_time argument
5 participants