Time gap detector for AIS position messages.
Features:
- ✅ Gaps detection core process.
- ✅ Gaps detection pipeline.
- ✅ Command-line interface.
- ✅ JSON inputs/outputs.
- ✅ BigQuery inputs/outputs.
- ✅ Apache Beam integration.
- ✅ Incremental (daily) processing.
- ✅ Full backfill processing.
Table of contents:
- Introduction
- Definition of gap
- Some results
- ⚠️ Important note on grouping messages by `ssvid`
- ⚠️ Important note on not filtering `overlapping_and_short`
- Usage
- Implementation details
- References
Not all AIS messages that are broadcast by vessels are recorded by receivers, for technical reasons such as signal interference in crowded waters, spatial variability of terrestrial reception, spatial and temporal variability of satellite reception, and dropped signals as vessels move from terrestrial coverage to areas of poor satellite reception. As a result, it is not uncommon to see gaps in AIS data lasting hours or even days [1].
Other causes of AIS gaps are:
- The AIS transponder is turned off or otherwise disabled while at sea.
- The AIS transponder has a malfunction.
- The ship's systems are powered off while the vessel is at anchor or in port.
AIS gap detection is essential to identify possible intentional disabling events, which can obscure illegal activities such as unauthorized fishing or unauthorized transshipments [1][2].
We create an AIS gap event when the period of time between consecutive AIS positions from a single vessel exceeds a configured threshold in hours. The start and end position messages of the gap are called the `OFF` and `ON` messages, respectively.

When the period of time between the last known position and the last time of the current day exceeds the threshold, we create an open gap event. In that case, the gap will not have an `ON` message until it is closed in the future, when new data arrives.
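The sketch below only illustrates this definition; it is not the package's implementation (see the `GapDetector` usage example further down). The `sketch_detect_gaps` function and the message/gap dictionaries are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the gap definition, not the actual implementation.
# `messages` must be time-ordered positions of a single vessel, each with a
# datetime "timestamp"; `threshold` is the configured time threshold.
def sketch_detect_gaps(messages, threshold, end_of_day):
    gaps = []
    for off, on in zip(messages, messages[1:]):
        if on["timestamp"] - off["timestamp"] > threshold:
            # Closed gap: both the OFF and ON messages are known.
            gaps.append({"OFF": off, "ON": on})
    if messages and end_of_day - messages[-1]["timestamp"] > threshold:
        # Open gap: no ON message yet; it gets closed when new data arrives.
        gaps.append({"OFF": messages[-1], "ON": None})
    return gaps

gaps = sketch_detect_gaps(
    messages=[
        {"timestamp": datetime(2024, 1, 1, 0)},
        {"timestamp": datetime(2024, 1, 1, 8)},  # 8 h after the previous one
    ],
    threshold=timedelta(hours=6),
    end_of_day=datetime(2024, 1, 2, 0),
)
print(gaps)  # one closed gap (00:00 -> 08:00) and one open gap (from 08:00)
```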
Input position messages are filtered by the `good_seg2` field of the segments table in order to remove noise. This denoising process happens in the segment pipeline.
These are some results for 2021-2023.
The gap detection pipeline fetches AIS position messages, filters them by `good_seg2`, and groups them by `ssvid`. Since we know different vessels can broadcast with the same `ssvid`, this can potentially lead to a situation in which we have a gap with an `OFF` message from one vessel and an `ON` message from another vessel (or vice versa). This can happen because, currently, the gap detection process just orders all the messages from the same `ssvid` by timestamp and then evaluates each pair of consecutive messages. In consequence, it will just pick the last `OFF` (or the first `ON`) message in the chain when constructing a gap, and we could have "inconsistent" gaps in the sense described above. We believe those will be a very small fraction of the total gaps, and we aim to find a solution to this problem in the future. One option could be to use, e.g., `vessel_id`, which has a much higher chance of separating messages from different vessels that broadcast under the same `ssvid`.

For analyses, this requires taking care when using gaps for `ssvids` that have multiple `vessel_ids`. For usage in products, a choice has to be made of which of the (potentially) two `vessel_ids` to use, or whether to attribute a gap to both `vessel_ids` (in which case there would have to be a distinction between the `OFF` and `ON` messages), and also of how to highlight that the gap might refer to multiple vessels.
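To make the trade-off concrete, here is a hypothetical sketch (the `group_messages` helper and the values are illustrative, not part of the package) of how the grouping key changes which messages end up in the same group:

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Hypothetical helper: group position messages by a key and order each group
# by timestamp before evaluating consecutive pairs of messages.
def group_messages(messages, key):
    ordered = sorted(messages, key=itemgetter(key, "timestamp"))
    return {k: list(g) for k, g in groupby(ordered, key=itemgetter(key))}

# Two vessels broadcasting under the same ssvid (illustrative values).
messages = [
    {"ssvid": "226013750", "vessel_id": "A", "timestamp": datetime(2024, 1, 1, 0)},
    {"ssvid": "226013750", "vessel_id": "B", "timestamp": datetime(2024, 1, 1, 1)},
    {"ssvid": "226013750", "vessel_id": "A", "timestamp": datetime(2024, 1, 1, 9)},
]

# Current behavior: one interleaved group, so a gap could pair an OFF message
# from vessel A with an ON message from vessel B.
print(group_messages(messages, key="ssvid"))

# Possible alternative mentioned above: vessel_id separates the two vessels.
print(group_messages(messages, key="vessel_id"))
```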
Gaps are generated including AIS position messages that are inside `overlapping_and_short` segments. This was a deliberate choice, because removing `overlapping_and_short` segments can create gaps where there actually shouldn't be any. An analysis of gaps generated with messages from 2021, 2022 and 2023, with and without the `overlapping_and_short` filter, showed that the results are very similar on aggregate.

However, within products and many of our analyses we remove:
- segments that are `overlapping_and_short`, and
- `vessel_ids` if all their segments are `overlapping_and_short` (by using `product_vessel_info_summary`).

This means that showing gaps on a map can lead to inconsistent results, since the track or even the vessel might not exist where a gap starts and/or ends.
We still don't have a package on PyPI.
First, clone the repository.
Create virtual environment and activate it:
python -m venv .venv
. ./.venv/bin/activate
Install dependencies:
make install
Make sure you can run unit tests:
make test
Make sure you can build the docker image:
make build
In order to be able to connect to BigQuery, authenticate and configure the project:
make gcp
The gap detection core process takes as input a list of AIS messages. Mandatory fields are:
- `ssvid`
- `timestamp`
- `lat`
- `lon`
- `receiver_type`

Other fields are optional; if passed, they are included in the gap event.
Note
Currently, the algorithm takes about (3.62 ± 0.03) seconds to process 10M messages. Tested on an i7-1355U 5.0 GHz processor. You can check the profiling results.
import json
from datetime import timedelta, datetime

from pipe_gaps.core import GapDetector

messages = [
    {
        "ssvid": "226013750",
        "msgid": "295fa26f-cee9-1d86-8d28-d5ed96c32835",
        "timestamp": datetime(2024, 1, 1, 0).timestamp(),
        "receiver_type": "terrestrial",
        "lat": 30.5,
        "lon": 50.6,
    },
    {
        "ssvid": "226013750",
        "msgid": "295fa26f-cee9-1d86-8d28-d5ed96c32835",
        "timestamp": datetime(2024, 1, 1, 1).timestamp(),
        "receiver_type": "terrestrial",
        "lat": 30.5,
        "lon": 50.6,
    }
]

gd = GapDetector(threshold=timedelta(hours=0, minutes=50))
gaps = gd.detect(messages)
print(json.dumps(gaps, indent=4))
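In this example the two messages are one hour apart, which exceeds the 50-minute threshold, so a single gap event should be detected and printed as JSON.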
The gaps detection pipeline is described in the following diagram:
flowchart LR;
subgraph tables [**BigQuery Tables**]
A[(messages)]
B[(segments)]
C[(gaps)]
end
subgraph inputs [**Inputs**]
D[/**AIS** Messages/]
E[/Open gaps/]
end
F[Detect Gaps]
subgraph outputs [**Outputs**]
direction TB
G[/gaps/]
end
A ==> D
B ==> D
C ==> E
D ==> F
E ==> F
F ==> outputs
outputs ==> C
The schema for the output gap events table is defined in pipe_gaps/pipeline/schemas/ais-gaps.json.
When an open gap is closed, a new gap event is created. This means that we are using a persistence pattern that matches the slowly changing dimension type 2 (always add new rows). In consequence, the output table can contain two gap events with the same `gap_id`: the old open gap and the current closed, active gap. The versioning of gaps is done with a timestamp field `version`, with second precision.
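For illustration, a gap that was stored first as an open gap and later closed would appear as two rows sharing the same `gap_id` (hypothetical identifier and timestamps):

| `gap_id` | `version` |
|---|---|
| a1b2c3 | 2024-01-01 00:05:30 |
| a1b2c3 | 2024-01-15 07:42:10 |

The first row is the old open gap; the second, with a greater `version`, is the current closed gap and the only active one.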
To query all active gaps, you just need to query the last version of every `gap_id`. For example:
SELECT *
FROM (
  SELECT
    *,
    MAX(version) OVER (PARTITION BY gap_id) AS last_version,
  FROM `world-fishing-827.scratch_tomas_ttl30d.pipe_ais_gaps_filter_no_overlapping_and_short`
)
WHERE version = last_version
Another alternative:
SELECT *
FROM `world-fishing-827.scratch_tomas_ttl30d.pipe_ais_gaps_filter_no_overlapping_and_short`
QUALIFY ROW_NUMBER() OVER (PARTITION BY gap_id ORDER BY version DESC) = 1
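If you prefer to run this query from Python, here is a minimal sketch using the google-cloud-bigquery client library (assuming it is installed and your credentials are configured, e.g. via `make gcp`):

```python
from google.cloud import bigquery

# Sketch: fetch only the active (latest) version of every gap event.
client = bigquery.Client(project="world-fishing-827")

query = """
SELECT *
FROM `world-fishing-827.scratch_tomas_ttl30d.pipe_ais_gaps_filter_no_overlapping_and_short`
QUALIFY ROW_NUMBER() OVER (PARTITION BY gap_id ORDER BY version DESC) = 1
"""

for row in client.query(query).result():
    print(row["gap_id"], row["version"])
```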
The easiest way of running the gaps detection pipeline is to use the provided command-line interface:
(.venv) $ pipe-gaps
usage: pipe-gaps [-h] [-c ] [-v] [--no-rich-logging] [--only-render] [--pipe-type ] [-i ] [-s ] [--bq-input-messages ] [--bq-input-segments ] [--bq-input-open-gaps ] [--bq-output-gaps ]
[--open-gaps-start-date ] [--filter-not-overlapping-and-short ] [--filter-good-seg ] [--skip-open-gaps] [--mock-db-client | --no-mock-db-client]
[--save-json | --no-save-json] [--work-dir ] [--ssvids ] [--date-range ] [--min-gap-length ] [--window-period-d ] [--eval-last | --no-eval-last] [--n-hours-before ]
Detects time gaps in AIS position messages.
The definition of a gap is configurable by a time threshold 'min-gap-length'.
For more information, check the documentation at
https://github.com/GlobalFishingWatch/pipe-gaps/tree/develop.
You can provide a configuration file or command-line arguments.
The latter take precedence, so if you provide both, command-line arguments
will overwrite options in the config file provided.
Besides the arguments defined here, you can also pass any pipeline option
defined for Apache Beam PipelineOptions class. For more information, see
https://cloud.google.com/dataflow/docs/reference/pipeline-options#python.
options:
-h, --help show this help message and exit
-c , --config-file JSON file with pipeline configuration (default: None).
-v, --verbose Set logger level to DEBUG.
--no-rich-logging Disable rich logging (useful for production environments).
--only-render Only render command-line call equivalent to provided config file.
general pipeline configuration:
--pipe-type Pipeline type: ['naive', 'beam'].
-i , --json-input-messages JSON file with input messages [Useful for development].
-s , --json-input-open-gaps JSON file with open gaps [Useful for development].
--bq-input-messages BigQuery table with input messages.
--bq-input-segments BigQuery table with input segments.
--bq-input-open-gaps BigQuery table with open gaps.
--bq-output-gaps BigQuery table in which to store the gap events.
--open-gaps-start-date Fetch open gaps starting from this date, e.g., '2012-01-01'.
--filter-not-overlapping-and-short Fetch messages that do not belong to 'overlapping_and_short' segments.
--filter-good-seg Fetch messages that belong to 'good_seg2' segments.
--skip-open-gaps If passed, pipeline will not fetch open gaps [Useful for development].
--mock-db-client, --no-mock-db-client If passed, mocks the DB client [Useful for development].
--save-json, --no-save-json If passed, saves the results in JSON file [Useful for development].
--work-dir Directory to use as working directory.
--ssvids Detect gaps for this list of ssvids, e.g., '412331104,477334300'.
--date-range Detect gaps within this date range, e.g., '2024-01-01,2024-01-02'.
gap detection process:
--min-gap-length Minimum time difference (hours) to start considering gaps.
--window-period-d Period (in days) of time windows used to parallelize the process.
--eval-last, --no-eval-last If passed, evaluates last message of each SSVID to create an open gap.
--n-hours-before Count messages this amount of hours before each gap.
Example:
pipe-gaps -c config/sample-from-file.json --min-gap-length 1.3
Caution
Note that the word separator in parameter names is a hyphen (-) and not an underscore (_).
Caution
Date ranges are inclusive for the start date and exclusive for the end date.
Note
The pipe-type can be "naive" (without parallelization; useful for development) or "beam" (allows parallelization through Apache Beam & Google Dataflow).
Note
Any option passed to the CLI that is not explicitly declared will not be parsed by the main CLI; instead, it will be passed to the pipeline, which will parse it internally. For example, you can pass any option available in the Apache Beam PipelineOptions. For those parameters, the convention is to use underscores instead of hyphens as the word separator.
This is an example of a JSON config file:
{
    "bq_input_messages": "pipe_ais_v3_published.messages",
    "bq_input_segments": "pipe_ais_v3_published.segs_activity",
    "bq_input_open_gaps": "pipe_ais_v3_published.product_events_ais_gaps",
    "bq_output_gaps": "scratch_tomas_ttl30d.pipe_ais_gaps",
    "open_gaps_start_date": "2023-12-31",
    "ssvids": [
        "412331104",
        "477334300"
    ],
    "date_range": [
        "2024-01-01",
        "2024-01-02"
    ],
    "min_gap_length": 6,
    "n_hours_before": 12,
    "window_period_d": 1,
    "pipeline_options": {
        "project": "world-fishing-827",
        "runner": "direct"
    }
}
Caution
Inside `pipeline_options` you can put, for example, any Apache Beam PipelineOptions. In that case, be careful with flags: use the option name and not the flag name, e.g., put `"use_public_ips": false` instead of `"no_use_public_ips": true`, otherwise the parameter will be ignored. This is different from using the CLI, where you would pass the `--no_use_public_ips` argument.
You can see more configuration examples here.
The pipeline is implemented over a (mostly) generic structure that supports:
1. Grouping all main inputs by SSVID and, optionally, overlapping time intervals (TI) with an arbitrary period. For example, you can group by time windows of size 31 days with a period of 30 days, meaning that each time window will have an overlap of 1 day (see the sketch after this list).
2. Grouping side inputs by SSVID.
3. Constructing boundaries (the first and last N AIS messages of each group) and grouping them by SSVID.
4. Processing the main input groups from 1.
5. Processing the boundaries from 3 together with the side inputs from 2, both grouped by SSVID.
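As a minimal sketch of the windowing in item 1 (not the pipeline's actual code; `time_windows` is a hypothetical helper), 31-day windows emitted every 30 days overlap by one day:

```python
from datetime import date, timedelta

# Hypothetical sketch: windows of `size_d` days emitted every `period_d` days
# overlap by (size_d - period_d) days.
def time_windows(start: date, end: date, size_d: int = 31, period_d: int = 30):
    windows = []
    current = start
    while current < end:
        windows.append((current, min(current + timedelta(days=size_d), end)))
        current += timedelta(days=period_d)
    return windows

for window in time_windows(date(2024, 1, 1), date(2024, 4, 1)):
    print(window)  # consecutive windows share 1 day of messages
```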
Below is a diagram that describes this workflow.
Note
In the case of the Apache Beam integration with the Dataflow runner, the groups can be processed in parallel across different workers.
| Module | Description |
|---|---|
| `ais-gaps.py` | Encapsulates the AIS gaps query. |
| `ais-messages.py` | Encapsulates the AIS position messages query. |
| `core.py` | Defines the core PTransform that integrates `detect_gaps.py` into Apache Beam. |
| `detect_gaps.py` | Defines the `DetectGaps` class (the core processing step of the pipeline). |
| `gap_detector.py` | Defines the lower-level `GapDetector` class that computes gaps in a list of AIS messages. |
flowchart TD;
A[Read Main Inputs] ==> |**AIS Messages**| B[Group Inputs]
C[Read Side Inputs] ==> |**Open Gaps**| D[Group Side Inputs]
subgraph **Detect Gaps**
B ==> |**AIS Messages <br/> by SSVID & TI**| E[Process Groups]
B ==> |**AIS Messages <br/> by SSVID & TI**| F[Process Boundaries]
D ==> |**Open Gaps <br/> by SSVID**| F
E ==> |**Gaps inside groups**| H[Join Outputs]
F ==> |**Gaps between groups <br/> New open gaps <br/> Closed open gaps**| H
end
H ==> |**New closed gaps <br/> New open gaps <br/> Closed open gaps**| K[Write Outputs]
[1] Welch H., Clavelle T., White T. D., Cimino M. A., Van Osdel J., Hochberg T., et al. (2022). Hot spots of unseen fishing vessels. Sci. Adv. 8 (44), eabq2109. doi: 10.1126/sciadv.abq2109
[2] Ford J. H., Bergseth B., Wilcox C. (2018). Chasing the fish oil—Do bunker vessels hold the key to fisheries crime networks? Front. Mar. Sci. 5, 267.