Support version skew between Antrea Agent and Flow Aggregator #6777

Closed
antoninbas opened this issue Oct 29, 2024 · 2 comments · Fixed by #6912
Assignees
Labels
area/flow-visibility/aggregator Issues or PRs related to Flow Aggregator area/flow-visibility Issues or PRs related to flow visibility support in Antrea kind/feature Categorizes issue or PR as related to a new feature.

Comments

@antoninbas
Contributor

antoninbas commented Oct 29, 2024

Describe the problem/challenge you have
At the moment, we do not support version skew between Antrea Agent and Flow Aggregator. Let's take a concrete example:
In Antrea v2.0, we introduced a new Information Element (IE), egressNodeName (#6012). It so happened that this was introduced in a major version release, but we also routinely introduce new IEs in minor version releases. If one tries to update the Antrea Agent (from v1.15 to v2.0) before the Flow Aggregator, the existing Flow Aggregator (v1.15) will reject the new IPFIX templates sent by the new Antrea Agent (v2.0) - see https://github.com/vmware/go-ipfix/blob/main/pkg/collector/process.go. Updating the Flow Aggregator first does not work either: the aggregation process reuses the IPFIX data "record" received from the Agent, and that record will not match the template sent by the Flow Aggregator to the external IPFIX collector.

For large clusters, a rolling update of the antrea-agent DaemonSet can take a while, so a version mismatch between some Agents and the Flow Aggregator is expected and that situation will remain until the update completes.

Describe the solution you'd like
We should tolerate some version skew between the Antrea Agent and the Flow Aggregator (N-2/N+2), for graceful updates and to ensure that connection data can still be exported during the update window.

For example:

  1. The Flow Aggregator could gracefully discard unknown IEs in the records received from the Agents, in order to support "newer" Agents
  2. The Flow Aggregator could add missing IEs using a default value in the records received from the Agents, in order to support "older" Agents

Because item 2) is more problematic than 1) (what's an appropriate default value?), we could specify that in order to achieve a graceful update, the Flow Aggregator should be updated last. In that case, we would have version(FlowAggregator) <= version(Agent), and 1) would be sufficient. Once the FlowAggregator is itself updated, it will be able to "forward" the newly introduced IEs.
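
As a rough illustration of item 1), the sketch below drops every field whose name the (older) Flow Aggregator does not recognize. It is not taken from the Antrea or go-ipfix code: the `Field` type, the `dropUnknown` helper, and the `known` set are hypothetical simplifications, and the field names are only used as sample data.

```go
package main

import "fmt"

// Field is a hypothetical, simplified stand-in for a decoded IPFIX field.
type Field struct {
	Name  string
	Value any
}

// dropUnknown returns the fields whose names appear in known, preserving
// their original order. Fields with unrecognized names are discarded.
func dropUnknown(fields []Field, known map[string]bool) []Field {
	kept := make([]Field, 0, len(fields))
	for _, f := range fields {
		if known[f.Name] {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	// Assumed set of IEs known to an "older" Flow Aggregator.
	known := map[string]bool{"sourcePodName": true, "destinationPodName": true}

	// Record sent by a "newer" Agent, with one additional IE.
	record := []Field{
		{Name: "sourcePodName", Value: "client"},
		{Name: "destinationPodName", Value: "server"},
		{Name: "egressNodeName", Value: "node-1"},
	}

	for _, f := range dropUnknown(record, known) {
		fmt.Printf("%s=%v\n", f.Name, f.Value)
	}
}
```

With version(FlowAggregator) <= version(Agent), a filter of this kind would be enough, since an older Agent never has to be padded with default values.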

@antoninbas antoninbas added kind/feature Categorizes issue or PR as related to a new feature. area/flow-visibility Issues or PRs related to flow visibility support in Antrea area/flow-visibility/aggregator Issues or PRs related to Flow Aggregator labels Oct 29, 2024
@antoninbas
Contributor Author

@tnqn @heanlan @yuntanghsu for visibility

@antoninbas
Contributor Author

Example reproduction (updating Agent first):

helm install -n kube-system antrea antrea/antrea --version=v1.15.2 --set featureGates.FlowExporter=true --set flowExporter.enable=true
helm install -n flow-aggregator --create-namespace flow-aggregator antrea/flow-aggregator --version=v1.15.2 --set flowLogger.enable=true
kubectl apply -f https://github.com/antrea-io/antrea/releases/download/v2.1.0/antrea-crds.yml
helm upgrade -n kube-system antrea antrea/antrea --version=v2.1.0

The FlowAggregator will panic once:

E1031 22:21:12.586562       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: Information element with elementID 157 in registry with enterpriseID 56506 cannot be found."
panic: runtime error: index out of range [3] with length 1

goroutine 281 [running]:
encoding/binary.bigEndian.Uint32(...)
	/usr/local/go/src/encoding/binary/binary.go:157
github.com/vmware/go-ipfix/pkg/entities.DecodeAndCreateInfoElementWithValue(0xc00040f960, {0xc000402594, 0x1, 0x50?})
	/go/pkg/mod/github.com/vmware/[email protected]/pkg/entities/ie.go:307 +0xa76
github.com/vmware/go-ipfix/pkg/collector.(*CollectingProcess).decodeDataSet(0xc0005700f0, 0xc000b3b920, 0x2ba9db8?, 0x0?)
	/go/pkg/mod/github.com/vmware/[email protected]/pkg/collector/process.go:311 +0x2ad
github.com/vmware/go-ipfix/pkg/collector.(*CollectingProcess).decodePacket(0xc0005700f0, 0xc000584180?, {0xc00012c0b0, 0xe})
	/go/pkg/mod/github.com/vmware/[email protected]/pkg/collector/process.go:208 +0x405
github.com/vmware/go-ipfix/pkg/collector.(*CollectingProcess).handleTCPClient.func1()
	/go/pkg/mod/github.com/vmware/[email protected]/pkg/collector/tcp.go:90 +0x2c5
created by github.com/vmware/go-ipfix/pkg/collector.(*CollectingProcess).handleTCPClient in goroutine 280
	/go/pkg/mod/github.com/vmware/[email protected]/pkg/collector/tcp.go:70 +0x1bf

The panic occurs when decoding a data set. The (new) Agent reuses the same template ID, but its new template is rejected because of an unknown registry element (egressNodeName). When a data set is then received, it is decoded according to the old template, which leads to the panic.
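
The stack trace shows `binary.bigEndian.Uint32` indexing byte 3 of a 1-byte slice, which is consistent with a value being decoded using the wrong field length. The sketch below reproduces that condition in isolation and shows a length check that turns it into an error instead of a panic. `decodeUint32` is a hypothetical helper written for this illustration, not the go-ipfix API and not necessarily how the library was fixed.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errShortBuffer = errors.New("buffer too short for uint32")

// decodeUint32 is a hypothetical guarded decoder: it validates the buffer
// length before calling binary.BigEndian.Uint32, which panics on slices
// shorter than 4 bytes.
func decodeUint32(buf []byte) (uint32, error) {
	if len(buf) < 4 {
		return 0, errShortBuffer
	}
	return binary.BigEndian.Uint32(buf), nil
}

func main() {
	// A 1-byte value decoded as if it were a 4-byte field, as can happen
	// when a data set is decoded against a stale template.
	if _, err := decodeUint32([]byte{0x01}); err != nil {
		fmt.Println("decode error:", err)
	}

	// A well-formed 4-byte value decodes normally.
	v, err := decodeUint32([]byte{0x00, 0x00, 0x01, 0x00})
	fmt.Println(v, err)
}
```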

After the FlowAggregator restarts, the template is never accepted and all messages end up being dropped.

E1031 22:21:28.008510       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: Information element with elementID 157 in registry with enterpriseID 56506 cannot be found."
E1031 22:21:28.009507       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:28.034283       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:28.034458       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:33.005261       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:35.143176       1 tcp.go:79] "Error when retrieving message length" err="remote error: tls: bad certificate"
E1031 22:21:40.336647       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: Information element with elementID 157 in registry with enterpriseID 56506 cannot be found."
E1031 22:21:40.336880       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:40.337823       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:40.339180       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:42.599769       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:45.338420       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:45.339168       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:45.339384       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:21:47.594854       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:52.598860       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:21:55.707240       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:00.709295       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:00.709554       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:00.711745       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:02.597536       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:02.597697       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:05.736245       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:05.739705       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:07.603212       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:07.604518       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:12.604668       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:15.706549       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:20.708864       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:22.602384       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:25.713077       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:27.600238       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:27.708349       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 801257890 does not exist"
E1031 22:22:30.704750       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:30.817125       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: template 256 with obsDomainID 3502291508 does not exist"
E1031 22:22:32.518646       1 tcp.go:92] "Error when decoding packet" err="error in decoding message: Information element with elementID 157 in registry with enterpriseID 56506 cannot be found."

@antoninbas antoninbas self-assigned this Oct 31, 2024
@antoninbas antoninbas added this to the Antrea v2.3 release milestone Nov 5, 2024
antoninbas added a commit to antoninbas/antrea that referenced this issue Jan 8, 2025
When a new IPFIX Information Element (IE) is introduced, a version
mismatch between the Agent and the Flow Aggregator can be
problematic. A "new" Agent can send an IE which is unknown to the "old"
Flow Aggregator, or the "new" Flow Aggregator may expect an IE which is
not sent by an "old" Agent.

Prior to this change, we required the list of IEs sent by the Agent to
be the same as the list of IEs expected by the Flow Aggregator. This is
impossible to ensure during upgrade, as it may take a long time for all
Agents in the cluster to be upgraded.

After this change, Agents and Flow Aggregator can be upgraded in any
order (although we would recommend the Flow Aggregator to be upgraded
last). To achieve this, we introduce a new "process" between IPFIX
collection and aggregation in the Flow Aggregator: the
"preprocessor". The preprocessor is in charge of processing messages
received from the IPFIX collector, prior to handing records over to the
aggregation process. At the moment, its only task is to ensure that all
records have the expected fields. If a record has extra fields, they
will be discarded. If some fields are missing, they will be "appended"
to the record with a "zero" value. For example, we will use 0 for
integral types, "" for strings, 0.0.0.0 for IPv4 address, etc. Note that
we are able to keep the implementation simple by assuming that a record
either has missing fields or extra fields (not a combination of both),
and that such fields are always at the tail of the field list. This
assumption is based on implementation knowledge of the FlowExporter and
the FlowAggregator. When we introduce a new IE, it always comes after
all existing IEs, and we never deprecate / remove an existing IE across
versions.
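
The tail-based normalization described above can be sketched as follows. This is a simplified illustration under the stated assumption (missing or extra fields only ever appear at the tail), not the actual preprocessor code: `Field`, `Expected`, and `normalize` are hypothetical names, and the zero values are supplied by the caller.

```go
package main

import "fmt"

// Field is a hypothetical, simplified stand-in for a decoded IPFIX field.
type Field struct {
	Name  string
	Value any
}

// Expected describes one field the Flow Aggregator expects, together with
// the "zero" value to use when an older Agent does not send it.
type Expected struct {
	Name string
	Zero any
}

// normalize makes record match the expected field list. It assumes that
// record and expected share a common prefix and differ only at the tail.
func normalize(record []Field, expected []Expected) []Field {
	n := len(expected)
	if len(record) >= n {
		// Newer Agent: discard the extra fields at the tail. The capacity
		// is capped so later appends cannot overwrite the caller's slice.
		return record[:n:n]
	}
	// Older Agent: append the missing fields with their zero values.
	out := make([]Field, 0, n)
	out = append(out, record...)
	for _, e := range expected[len(record):] {
		out = append(out, Field{Name: e.Name, Value: e.Zero})
	}
	return out
}

func main() {
	expected := []Expected{
		{Name: "sourcePodName", Zero: ""},
		{Name: "egressNodeName", Zero: ""},
	}

	older := []Field{{Name: "sourcePodName", Value: "client"}}
	newer := []Field{
		{Name: "sourcePodName", Value: "client"},
		{Name: "egressNodeName", Value: "node-1"},
		{Name: "someFutureElement", Value: uint32(1)}, // made-up IE name
	}

	fmt.Println(normalize(older, expected))
	fmt.Println(normalize(newer, expected))
}
```

Because only the tail is inspected, the cost per record is proportional to the number of missing fields, and no per-field registry lookup is needed.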

Note that when the preprocessor adds a missing field, it is no longer
possible to determine whether the field was originally missing, or was
sent by the Agent with a zero value. This is why we recommend upgrading
the Flow Aggregator last (to avoid this situation altogether). However,
we do not believe that it is a significant drawback based on current
usage.

Fixes antrea-io#6777

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Jan 9, 2025