The goal of this project is to create a very lightweight framework to quickly generate data for demo purposes. The framework uses Akka to simplify concurrency and messaging, and is simple to extend.
This project provides a generic simulation framework which is capable for simulating data based on configurations. This simulator have capbilities to display the simulated data in console for verifying if the simulated data is correct or not . At present this have implementation for pushing the simulated data to kafka or to a file. You are welcome to contribute to create collectors which can push the data to other sources like kafka or blob storages.
You need to have jdk 1.8 installed in the machine where you would like to run the simulator. Please note jre is not enough for executing all the features provided by this simulator. Refer the link on how to install jdk in your machine
For pushing the data to kafka you need to have a kafka cluster and broker nodes and zookeeper should be accessible from the node where you are running the simulator.
- Setup maven to build binaries from the source. maven download
- Download the source code from github. DataSimulator
- Execute the following maven command to build the binaries.
mvn package -DskipTests
- The binaries are created with in the distribution/target/data-simulator-x.x.x-bin.zip
- Unzip this binary to get started .
Imply employees can download the binaries from s3 location: s3://imply-ps-tools/datasimulator/
Generic simulator have the capbility to
- infer the schema from a json file and generate the data based on that.
- This has the capability to generate a required cardinality based on the metadata mentioned in the config.properties file . See the below section to to get more details on how to configure the metadata to generate the required cardinality
- Generate the data based on the data type and condition required.
Running the simulator in multiple machine (Cluster deploy and run)
Install the simulator in all the nodes ansible-playbook -u ec2-user --private-key ~/Documents/tt_support_NV.pem install.yaml
start simulator in all nodes ansible-playbook -u ec2-user --private-key ~/Documents/tt_NV.key ops.yaml --tags "start" Stop the simulator in all nodes ansible-playbook -u ec2-user --private-key ~/Documents/tt_NV.key ops.yaml --tags "stop" Restart simulator in all nodes ansible-playbook -u ec2-user --private-key ~/Documents/tt_NV.key ops.yaml --tags "start,stop"
Property | Details | Default |
---|---|---|
sim.kafka.bootstrap.servers | Kafka bootstrap server ip/hostname with port no eg:localhost:9092 |
This parameter is considered only if the Collector is com.tijo.streaming.impl.collectors.UnsecuredKafkaEventCollector mentioned in the start script |
sim.kafka.topicName | The data will be pushed to the topic mentioned | This parameter is considered only if the Collector is com.tijo.streaming.impl.collectors.UnsecuredKafkaEventCollector mentioned in the start script |
sim.generic.infer.packageName | Used by inferjson and refering to java package naming. | none |
sim.generic.infer.jsonPath | Name of the sample json file used for infering the schema | The infered java code will be created with the same name as json file name , excluding the extension |
sim.generic.infer.generateCode | True if json to java code generation is required. This can be set to false if any modification is done in the java code generated in earlier tries. | true |
sim.generic.infer.pojoOutputDir | Folder path for generating the java code corresponding to the sample json | pojoOutput |
sim.cardinality.generator.folder | If the metadata configuration is to use fixed no of cardinatiles , this will be generated and saved as file in this folder. | none |
sim.file.maxrows | If the collector is to save the data to file , then this property is limit the no of records in the file | none |
sim.file.outputdir | The folder in which the generated files are saved | none |
sim.file.fileNameLength | random file names are generated for storing the files in disk. The length of the file name is controlled with this | none |
sim.generic.eventClass | Name of the class for the json file . If infer json is used to generate the java class, same class name need to be mentioned here. | none |
sim.generic.metadata | Refer the below metata data section | none |
Type | Details | Options |
---|---|---|
FIXED |
Used to create random data from a fixed list of values. | 1. "file":name of the file where the list of data is saved. 2. "limit" :the size of the data ( this is used when gencardinality.sh is used to create the list of random data) 3. "cardinality": cardinality of the data |
STRING |
Used to generate random String | 1. "limit" : the length of the string to be generated |
INT |
Used for creating random integer value | 1. "limit" : Max value of the Integer to be generated. This should not be more than Integer.MAX_VALUE. |
LONG |
Used for creating random Long value | 1. "limit" : Max value of the Long to be generated. This should not be more than Long.MAX_VALUE. |
DOUBLE |
Used for creating random Double value | 1. "limit" : Max value of the Double to be generated. This should not be more than Double.MAX_VALUE. |
DATE_RANGE |
Used to generate date from a start time to end time . | 1. "start" : start time considering for simulation, 2. "end": end time in date format mentioned . eg : {"dimension" : "start_time" ,"type":"DATE_RANGE","start":"2010-06-01 00:00:00","end":"2020-06-01 00:00:00","limit":10000,"format":"yyyy-MM-dd mm:HH:ss"} |
DATE |
Generate the current time at which the simulation happens. | With the option limit you can give a variance to simulate the late arrival of data. eg : {"dimension" : "time" ,"type":"DATE","limit":10000,"format":"yyyy-MM-dd HH:mm:ss"} |
SCRIPT |
Make use of Beanshell where java code can be written . The value inthe variable data is returned by the simulator |
eg : "script": "import javax.xml.bind.DatatypeConverter; byte[] myBytes = row.get(\\\"deviceid\\\").getBytes(\\\"UTF-8\\\"); data=DatatypeConverter.printHexBinary(myBytes).substring(1,6);"} |
LOOKUP |
Can be used to lookup in a file, based on a referenced dimension . The referenced dimesion should be defined before the lookup dimension | eg : {"dimension" : "country" ,"type":"LOOKUP","file":"lookupfile.csv","referenceDim":"city"} |
Please refer the example configs to write a your own metadata.
This simulator can approximately generate 50,000 data points in a 8 core machine with high speed Hard disk. There is no limit to generate data to push the data to kafka by running the simulator to multiple machine.
If you want to simulate specific scenarios , please refer the advance guide. advance simulation