Synthetic data generators for creating data sets with Gaussian distribution. The code was taken from the official website and slightly modified for modern compilers.
You'll need the g++ compiler.
- Debian:
$ apt-get install build-essential
should be enough. Then run:
$ make
mult_generator currently does not take any parameters; all settings are hard-coded as constants:
#define DIM 2 // dimensionality of the data
#define NUM 40 // number of clusters
#define MAXMU 10 // mean in each dimension is in range [0,MAXMU]
#define MINMU -10
#define MINSIGMA 0
#define MAXSIGMA 20*sqrt(DIM) // standard deviation (to be added on top
// of row sum in each dimension is in range [0,MAXSIGMA]
#define MAXSIZE 100 // size of each cluster is in range [MINSIZE,MAXSIZE]
#define MINSIZE 10
#define RUNS 10 // number of data sets to be generated
To generate the data sets, simply run:
$ ./mult_generator
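As a rough illustration of how constants like the ones above can drive a Gaussian cluster generator, here is a minimal sketch. It is not the code of mult_generator: the function name `generate`, the `Point` struct, and the use of a single fixed spherical standard deviation `sigma` (instead of the MINSIGMA/MAXSIGMA range) are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Constants named after the ones listed above; how the real generator
// consumes them is an assumption of this sketch.
constexpr int DIM = 2;          // dimensionality of the data
constexpr int NUM = 40;         // number of clusters
constexpr double MINMU = -10.0; // lower bound for a cluster mean
constexpr double MAXMU = 10.0;  // upper bound for a cluster mean
constexpr int MINSIZE = 10;     // smallest cluster size
constexpr int MAXSIZE = 100;    // largest cluster size

// Hypothetical point type: coordinates plus the index of the source cluster.
struct Point {
    double x[DIM];
    int label;
};

// Draw NUM clusters. Each cluster gets a random mean in [MINMU, MAXMU) per
// dimension and a random size in [MINSIZE, MAXSIZE]. Every coordinate is
// sampled from a normal distribution with the same sigma (simplification).
std::vector<Point> generate(unsigned seed, double sigma = 1.0) {
    assert(sigma > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> mu_dist(MINMU, MAXMU);
    std::uniform_int_distribution<int> size_dist(MINSIZE, MAXSIZE);

    std::vector<Point> data;
    for (int c = 0; c < NUM; ++c) {
        double mu[DIM];
        for (int d = 0; d < DIM; ++d) mu[d] = mu_dist(rng);

        const int size = size_dist(rng);
        for (int i = 0; i < size; ++i) {
            Point p;
            p.label = c;
            for (int d = 0; d < DIM; ++d) {
                std::normal_distribution<double> gauss(mu[d], sigma);
                p.x[d] = gauss(rng);
            }
            data.push_back(p);
        }
    }
    return data;
}
```

Calling `generate(42)` returns one labelled data set; calling it RUNS times with different seeds would give several independent data sets.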
Ellipsoid generator
$ ./elly [-k <nclust>] [-d <dimension>] [-s <seed>]
where all parameters are optional and:
<nclust> is a positive int >= 2
<dimension> is a positive int >= 2
<seed> is a long int.
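The constraints above can be checked before any data is generated. The following sketch shows one way to do that; it is not taken from the elly sources, and the names `EllyOptions`, `parse_long`, and `valid` are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical container for the three documented options.
struct EllyOptions {
    int nclust;     // -k, must be >= 2
    int dimension;  // -d, must be >= 2
    long seed;      // -s, any long int
};

// Parse a base-10 integer. Returns false on empty input, trailing
// characters, or a value that does not fit into a long.
bool parse_long(const std::string& text, long& out) {
    if (text.empty()) return false;
    char* end = nullptr;
    errno = 0;
    const long value = std::strtol(text.c_str(), &end, 10);
    if (errno != 0 || end == text.c_str() || *end != '\0') return false;
    out = value;
    return true;
}

// Check the documented constraints on the option values.
bool valid(const EllyOptions& opts) {
    return opts.nclust >= 2 && opts.dimension >= 2;
}
```

A command-line front end could call `parse_long` on each option argument and reject the run when `valid` returns false.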
CURE data set generator. For more details, see Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. "CURE: an efficient clustering algorithm for large databases." ACM SIGMOD Record. Vol. 27. No. 2. ACM, 1998.
The distribution of data points is only approximated.
$ ./cure -n <npoints> [-d <dimension>] [-s <seed>] [-l <x/y min>] [-m <x/y max>] [-t type of data]
where:
-l is the minimal x/y value
-m is the maximal x/y value
-t is the type of data set; values 0-2 are currently supported
The disk-in-disk data set consists of two clusters: a circle and an annulus around it.
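To make the disk-in-disk shape concrete, here is a minimal sketch that samples an inner disk and a surrounding annulus. It is not the sampling code used by cure: the function names, the radii (inner disk of radius 1, annulus between 2 and 3), the origin-centred layout, and the even split between the two clusters are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Hypothetical 2D point with a cluster label (0 = inner disk, 1 = annulus).
struct Point2D {
    double x;
    double y;
    int label;
};

// Sample one point uniformly by area from the ring r_in <= r < r_out around
// the origin. With r_in == 0 the ring degenerates into a full disk.
Point2D sample_ring(std::mt19937& rng, double r_in, double r_out, int label) {
    assert(0.0 <= r_in && r_in < r_out);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    // Drawing r^2 uniformly and taking the square root gives a uniform
    // density over the area instead of clumping points near the centre.
    const double r2 = r_in * r_in + unit(rng) * (r_out * r_out - r_in * r_in);
    const double r = std::sqrt(r2);
    const double theta = two_pi * unit(rng);
    return {r * std::cos(theta), r * std::sin(theta), label};
}

// Generate n points, alternating between the inner disk and the annulus.
// The radii below are assumptions, not values taken from cure.
std::vector<Point2D> disk_in_disk(int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<Point2D> points;
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            points.push_back(sample_ring(rng, 0.0, 1.0, 0));
        } else {
            points.push_back(sample_ring(rng, 2.0, 3.0, 1));
        }
    }
    return points;
}
```

The gap between the two radii ranges keeps the clusters separated, which is what makes this shape hard for centroid-based methods and easy to verify visually.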
- Julia Handl
- Joshua Knowles
- John Burkardt
- Tomas Barton