Datagristle is a toolbox of tough and flexible data connectors and analyzers.
It is an interactive mix of ETL and data analysis, optimized for
rapid analysis and manipulation of a wide variety of data.
It's neither an enterprise ETL tool nor an enterprise analysis, reporting, or data mining tool. It's intended to be an easily adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities necessary to do 80% of the work. Its open source Python codebase allows it to be easily extended with custom code to handle that always-challenging last 20%.
Current Status: Strong support for easy analysis and simple transformations of csv files.
#Next Steps:
- attractive PDF output of gristle_determinator.py
- metadata database population
#Its objectives include:
- multi-platform (Unix, Linux, Mac OS, Windows with effort)
- multi-language (primarily Python)
- free - no cripple-licensing
- primary audience is data analysts who program, rather than non-technical analysts
- primary environment is the command line rather than Windows, a graphical desktop, or Eclipse
- extensible
- allow a bi-directional iteration between ETL & data analysis
- can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools.
#Installation
- Using pip (preferred) or easy_install:
```pip install datagristle```
```easy_install datagristle```
- Or download the tarball from PyPI
#Dependencies
- Python 2.6 or Python 2.7
#Mature Utilities Provided in This Release:
- gristle_determinator.py
    - Identifies file formats, generates metadata, prints file analysis report
    - This is the most mature utility. The other utilities also use it, so you generally do not need to enter file structure info.
- gristle_freaker.py
    - Produces a frequency distribution of multiple columns from input file.
- gristle_slicer.py
    - Used to extract a subset of columns and rows out of an input file.
- gristle_viewer.py
    - Shows one record from a file at a time - formatted based on metadata.
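
To give a feel for the kind of work gristle_determinator.py and gristle_freaker.py automate, here is a minimal sketch using only the Python standard library. It is not datagristle's code or API: the function names `sniff_dialect` and `column_frequency` and the sample data are hypothetical, and it is written for a current Python 3 interpreter rather than the Python 2 versions listed under Dependencies.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: pipe-delimited with a header row.
SAMPLE = "id|state|amount\n1|CO|10\n2|NY|25\n3|CO|7\n"


def sniff_dialect(text):
    """Guess the csv dialect of a text sample with the stdlib Sniffer.

    The candidate delimiters are an assumption made for this sketch.
    """
    return csv.Sniffer().sniff(text, delimiters=",|\t;")


def column_frequency(text, dialect, col_index, has_header=True):
    """Count how often each value occurs in one column."""
    reader = csv.reader(io.StringIO(text), dialect)
    if has_header:
        next(reader, None)  # skip the header row
    # Rows too short to contain the column are ignored.
    return Counter(row[col_index] for row in reader if len(row) > col_index)


if __name__ == "__main__":
    dialect = sniff_dialect(SAMPLE)
    print("delimiter:", repr(dialect.delimiter))
    for value, count in column_frequency(SAMPLE, dialect, 1).most_common():
        print(value, count)
```

Detecting the dialect once and passing it to later steps mirrors the point made above: once the file structure is known, the other steps do not need it entered by hand.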
#Immature Utilities Provided in This Release:
- gristle_differ.py
    - Shows differences between two files
- gristle_file_converter.py
    - Converts a csv from one dialect to another. Can handle multi-character field delimiters as well as record delimiters.
- gristle_filter.py
    - Applies simple filter logic to file.
- gristle_scalar.py
    - Performs scalar operations (min, max, avg, count unique, etc) on a file
- gristle_validator.py
    - Validates a file - currently just confirms number of fields for each row.
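
The field-count check described for gristle_validator.py can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not the utility's implementation: the function name `find_bad_rows`, its parameters, and the sample data are hypothetical, and it targets a current Python 3 interpreter.

```python
import csv
import io


def find_bad_rows(text, expected_fields, delimiter=","):
    """Return (record_number, field_count) for rows with the wrong field count.

    Record numbers start at 1 and include the header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        (rec_num, len(row))
        for rec_num, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


if __name__ == "__main__":
    # Hypothetical input: record 3 is short, record 4 is long.
    data = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
    for rec_num, count in find_bad_rows(data, expected_fields=3):
        print("record %d has %d fields" % (rec_num, count))
```

Reporting the record number together with the observed count makes it easy to jump to the offending row in the source file.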
#Future utilities:
- gristle_metadata.py
    - Manages metadata - allows users to query, add, update, delete file, field, transformation, reporting descriptions.
- gristle_generator
    - Generates test data based on gristle metadata
- gristle_validator
    - Confirms validity of database and file structure and contents.
- gristle_file_joiner.py
    - joins two files on their common keys and produces a new file
- gristle_grouper.py
    - reads a file, aggregates on a given set of fields, produces a new file
- gristle_db_loader.py
    - loads a file into a database
- gristle_db_extractor.py
    - extracts data from a database into a file
- gristle_field_merge.py
    - prints the matched values from multiple files side by side along with counts
#Licensing
- Gristle uses the BSD license - see the separate LICENSE file for further information