Commit b9ef5a3 (1 parent: af60d20), showing 88 changed files with 1,578 additions and 1 deletion.
@@ -0,0 +1,2 @@
old-dc.yml
.DS_Store
@@ -0,0 +1,41 @@
## Streaming data
+ Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records in small sizes.
+ In this project, we use Kafka as the streaming data source and PyFlink to process the stream.
+ The data is generated from the NYC taxi data in the data lake (you can get the data from `data/stream/stream.parquet`).
### Streaming data source
+ NYC taxi streaming data is generated based on data from the data lake.
+ Each newly created data sample is stored in a table in PostgreSQL.
+ Debezium then acts as a connector to PostgreSQL and scans the table to check whether the database has newly updated data.
+ Newly created data is pushed to the corresponding topics in Kafka.
+ Any consumer can receive messages from the topics it subscribes to.
#### How-to guide
First, change directory to `stream_processing/kafka`.
+ Run `bash run.sh register_connector configs/postgresql-cdc.json` to send the PostgreSQL config to Debezium.
+ Run `python create_table.py` to create a new table in PostgreSQL.
+ Run `python insert_table.py` to insert data into the table (a sketch of these scripts follows this section).
+ We can access the Kafka control center at port 9021 to check the results.
+ Then click the **Topics** tab to list all existing topics on Kafka.
+ **nyc_taxi.public.nyc_taxi** is the topic created in my case.
+ Choose **Messages** to observe the streaming messages.
+ Finally, you can build and push the Kafka producer image for streaming data:
```
cd stream_processing/kafka
docker build -t nyc_producer:latest .
docker image tag nyc_producer:latest ${name}/nyc_producer:latest
docker push ${name}/nyc_producer:latest # ${name} is your Docker Hub username
```
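For orientation, here is a minimal sketch of what `create_table.py` and `insert_table.py` might look like, assuming `psycopg2` is installed and abridging the columns to a few fields from the Avro schema shown later; the actual scripts may differ:

```python
import time

import pandas as pd
import psycopg2

# Connection details follow the compose file below (db/user/password "k6")
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="k6", user="k6", password="k6"
)
conn.autocommit = True
cur = conn.cursor()

# create_table.py: create a table matching a few NYC taxi fields (abridged)
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS nyc_taxi (
        nyc_taxi_id SERIAL PRIMARY KEY,
        tpep_pickup_datetime TEXT,
        trip_distance REAL,
        total_amount REAL
    )
    """
)

# insert_table.py: replay rows from the data lake extract as a slow stream
df = pd.read_parquet("data/stream/stream.parquet")
for _, row in df.iterrows():
    cur.execute(
        "INSERT INTO nyc_taxi (tpep_pickup_datetime, trip_distance, total_amount) "
        "VALUES (%s, %s, %s)",
        (
            str(row["tpep_pickup_datetime"]),
            float(row["trip_distance"]),
            float(row["total_amount"]),
        ),
    )
    time.sleep(1)  # throttle to simulate live traffic
```

Because Debezium watches this table, every `INSERT` becomes a change event on the `nyc_taxi.public.nyc_taxi` topic.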
### Streaming processing
+ To handle this streaming data source, either Kafka or PyFlink can be used; in this project, we use PyFlink to process the data.
#### How-to guide
+ `cd stream_processing/scripts`
+ `python datastream_api.py && python window_datastream_api.py`
+ These scripts extract the necessary fields from each message and aggregate the data to serve many purposes (see the sketch after this section).
+ Processed data samples are stored in Kafka in the specified sinks.
+ **nyc_taxi.sink.datastream** and **nyc_taxi.sink_window.datastream** are the defined sink and window sink in my case.
+ `python kafka_consumer.py`
+ Messages from the sink topics are stored and used for further processing (analysis, visualization, cost prediction, ...).
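As a reference point, a minimal sketch of the kind of job `datastream_api.py` might run, using the pre-1.17 PyFlink Kafka connector API and assuming JSON-encoded messages (the real pipeline registers Avro schemas, which would need a proper deserialization schema) plus the Kafka connector JAR on Flink's classpath; field and topic names follow this README:

```python
import json

from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer


def extract_fields(raw: str) -> str:
    # Debezium wraps each change event in an envelope; the new row image
    # lives under payload.after (assuming JSON converters, not Avro)
    event = json.loads(raw)
    after = event.get("payload", {}).get("after", {})
    keep = {k: after.get(k) for k in ("tpep_pickup_datetime", "trip_distance", "total_amount")}
    return json.dumps(keep)


env = StreamExecutionEnvironment.get_execution_environment()
# The Kafka connector JAR must be available to the job, e.g.:
# env.add_jars("file:///opt/flink/lib/flink-sql-connector-kafka.jar")

source = FlinkKafkaConsumer(
    topics="nyc_taxi.public.nyc_taxi",
    deserialization_schema=SimpleStringSchema(),
    properties={"bootstrap.servers": "localhost:9092", "group.id": "nyc-demo"},
)
sink = FlinkKafkaProducer(
    topic="nyc_taxi.sink.datastream",
    serialization_schema=SimpleStringSchema(),
    producer_config={"bootstrap.servers": "localhost:9092"},
)

env.add_source(source).map(extract_fields, output_type=Types.STRING()).add_sink(sink)
env.execute("nyc_taxi_datastream")
```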
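Similarly, `kafka_consumer.py` could be as simple as the following `kafka-python` loop; the group id and the downstream handling are placeholders:

```python
import json

from kafka import KafkaConsumer

# Subscribe to the processed sink topic defined above
consumer = KafkaConsumer(
    "nyc_taxi.sink.datastream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="nyc-sink-reader",  # placeholder group id
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Hand the record off for analysis / visualization / cost prediction
    print(message.topic, message.offset, record)
```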
@@ -0,0 +1,22 @@
FROM python:3.8-slim

# Copy app handler code
COPY kafka_producer/produce.py produce.py
COPY kafka_producer/generate_schemas.py generate_schemas.py
COPY kafka_producer/streamming_data.parquet streamming_data.parquet
COPY run.sh .

# Install dependencies (the duplicate pandas pin is dropped; 1.5.3 was the
# effective version since it was installed last)
RUN pip3 install kafka-python==2.0.2
RUN pip3 install avro==1.11.1
RUN pip3 install pyarrow==10.0.1
RUN pip3 install python-schema-registry-client==2.4.1
RUN pip3 install pymongo==4.5.0
RUN pip3 install pandas==1.5.3

# Generate a random schema
RUN chmod +x /run.sh && ./run.sh generate_schemas

CMD [ "python", "-u", "produce.py", "--mode", "setup", "--bootstrap_servers", "broker:29092"]
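For context, a stripped-down sketch of what `produce.py` might do, serializing to JSON for brevity (the installed `avro` and `python-schema-registry-client` packages suggest the real script produces Avro-encoded messages against the schema registry); the topic name is a placeholder:

```python
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:29092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

# Replay the bundled parquet file as a stream of messages
df = pd.read_parquet("streamming_data.parquet")
for _, row in df.iterrows():
    producer.send("nyc_taxi_raw", row.to_dict())  # placeholder topic name
    time.sleep(1)  # throttle to mimic live traffic

producer.flush()
```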
@@ -0,0 +1,88 @@ | ||
{ | ||
"doc": "Sample schema to help you get started.", | ||
"fields": [ | ||
{ | ||
"name": "nyc_taxi_id", | ||
"type": "int" | ||
}, | ||
{ | ||
"name": "created", | ||
"type": "string" | ||
}, | ||
{ | ||
"name": "vendorid", | ||
"type": "int" | ||
}, | ||
{ | ||
"name": "tpep_pickup_datetime", | ||
"type": "string" | ||
}, | ||
{ | ||
"name": "tpep_dropoff_datetime", | ||
"type": "string" | ||
}, | ||
{ | ||
"name": "passenger_count", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "trip_distance", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "ratecodeid", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "store_and_fwd_flag", | ||
"type": "string" | ||
}, | ||
{ | ||
"name": "pulocationid", | ||
"type": "int" | ||
}, | ||
{ | ||
"name": "dolocationid", | ||
"type": "int" | ||
}, | ||
{ | ||
"name": "payment_type", | ||
"type": "int" | ||
}, | ||
{ | ||
"name": "fare_amount", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "extra", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "mta_tax", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "tip_amount", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "tolls_amount", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "improvement_surcharge", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "total_amount", | ||
"type": "float" | ||
}, | ||
{ | ||
"name": "congestion_surcharge", | ||
"type": "float" | ||
} | ||
], | ||
"name": "nyctaxi", | ||
"namespace": "example.avro", | ||
"type": "record" | ||
} |
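This Avro schema can be registered with the schema registry (exposed on port 8081 in the compose file below) via `python-schema-registry-client`, which the producer image installs. A sketch, where the file path and subject name are assumptions:

```python
import json

from schema_registry.client import SchemaRegistryClient, schema

client = SchemaRegistryClient(url="http://localhost:8081")

# Load the schema shown above (path is an assumption)
with open("nyc_taxi.avsc") as f:
    avro_schema = schema.AvroSchema(json.load(f))

# Subject name is a placeholder; TopicNameStrategy would use "<topic>-value"
schema_id = client.register("nyc_taxi-value", avro_schema)
print("registered schema id:", schema_id)
```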
stream_processing/flink/configs/connect-timescaledb-sink.json (12 additions, 0 deletions)
@@ -0,0 +1,12 @@
{
  "name": "nyctaxi-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "sink_nyctaxi_0",
    "connection.url": "jdbc:postgresql://host.docker.internal:5432/k6",
    "connection.user": "k6",
    "connection.password": "k6",
    "auto.create": true
  }
}
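This payload matches what Kafka Connect's standard REST API expects at `POST /connectors` (the `connect` service in the compose file below exposes port 8083). One way to register it, assuming the `requests` package is available:

```python
import json

import requests

with open("stream_processing/flink/configs/connect-timescaledb-sink.json") as f:
    payload = json.load(f)

# Kafka Connect's REST endpoint; adjust the host if not running locally
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```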
@@ -0,0 +1,158 @@
---
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    # hostname: zookeeper
    container_name: flink-zookeeper
    ports:
      - "2181:2181"
    healthcheck:
      test: echo srvr | nc zookeeper 2181 || exit 1
      start_period: 10s
      retries: 20
      interval: 10s
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  # Kafka broker
  broker:
    image: confluentinc/cp-server:7.5.0
    # hostname: broker
    container_name: flink-broker
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    healthcheck:
      test: nc -z localhost 9092 || exit -1
      start_period: 15s
      interval: 5s
      timeout: 10s
      retries: 10
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  # For managing Avro schemas
  schema-registry:
    image: confluentinc/cp-schema-registry:7.5.0
    # hostname: schema-registry
    container_name: flink-schema-registry
    depends_on:
      - broker
    ports:
      - "8081:8081"
    healthcheck:
      start_period: 10s
      interval: 10s
      retries: 20
      test: curl --user superUser:superUser --fail --silent --insecure http://localhost:8081/subjects --output /dev/null || exit 1
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: 'broker:29092'
      SCHEMA_REGISTRY_LISTENERS: http://0.0.0.0:8081

  # For connecting to offline store
  connect:
    image: confluentinc/cp-kafka-connect:7.5.0
    # hostname: connect
    container_name: flink-connect
    depends_on:
      broker:
        condition: service_healthy
      schema-registry:
        condition: service_healthy
      zookeeper:
        condition: service_healthy
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: 'broker:29092'
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONNECT_PLUGIN_PATH: '/usr/share/java,/etc/kafka-connect/jars'
    volumes:
      - $PWD/data_ingestion/kafka_connect/jars/:/etc/kafka-connect/jars

  # Confluent control center to manage Kafka
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.5.0
    # hostname: control-center
    container_name: flink-control-center
    depends_on:
      - broker
      - schema-registry
      - connect
    ports:
      - "9021:9021"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9021/healthcheck"] # Adjust the URL and options as needed
      interval: 30s
      timeout: 10s
      retries: 3
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_CONNECT_CONNECT-DEFAULT_CLUSTER: 'connect:8083'
      # CONTROL_CENTER_KSQL_KSQLDB1_URL: "http://ksqldb-server:8088"
      # CONTROL_CENTER_KSQL_KSQLDB1_ADVERTISED_URL: "http://localhost:8088"
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      # CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONTROL_CENTER_CONNECT_HEALTHCHECK_ENDPOINT: '/connectors'
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      # PORT: 9021

  # Offline store
  timescaledb:
    image: timescale/timescaledb:latest-pg13
    command: postgres -c shared_preload_libraries=timescaledb
    container_name: flink-timescaledb
    ports:
      - "5432:5432"
    healthcheck:
      test: ['CMD', 'psql', '-U', 'k6', '-c', 'SELECT 1']
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      - PGDATA=/var/lib/postgresql/data/timescaledb
      - POSTGRES_DB=k6
      - POSTGRES_USER=k6
      - POSTGRES_PASSWORD=k6
    volumes:
      - pgdata:/var/lib/postgresql/data

  # Simulation of sending messages to Kafka topics
  kafka_producer:
    build:
      context: data_ingestion
      dockerfile: kafka_producer/Dockerfile
    depends_on:
      broker:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
    container_name: flink-kafka-producer

volumes:
  # Volume for TimescaleDB
  pgdata:
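With this compose file saved (its path in the repo is not shown here), the whole stack can be brought up with the usual `docker compose up -d` and torn down with `docker compose down`.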