
Releases: robertzhao2002/database-management-system

v4 (12/1/2022)


Optimizing Selections

Union Find

We used a union-find data structure to implement selection pushing. We use the visitor pattern to read the join condition and transitively collect the equality and inequality conditions on attributes. Each attribute is mapped to a union-find element in the union-find data structure, which performs interval arithmetic on those conditions so that selection filters out more unnecessary data. With this, we can construct a new, tighter selection condition, which removed the need for a separate join visitor.
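As a sketch of the idea (class and method names here are illustrative, not the repository's actual API), each attribute maps to a union-find element whose root carries a merged [lower, upper] interval: an attribute equality merges two sets and intersects their bounds, while a comparison against a constant tightens the interval for the whole set.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a union-find over attribute names, where each root
// carries the merged [lower, upper] bounds used for selection pushdown.
public class AttributeUnionFind {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, long[]> bounds = new HashMap<>(); // root -> {lower, upper}

    public String find(String attr) {
        parent.putIfAbsent(attr, attr);
        bounds.putIfAbsent(attr, new long[] {Long.MIN_VALUE, Long.MAX_VALUE});
        String p = parent.get(attr);
        if (!p.equals(attr)) {
            p = find(p);          // path compression
            parent.put(attr, p);
        }
        return p;
    }

    // R.A = S.B: merge the two sets and intersect their bounds
    public void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (ra.equals(rb)) return;
        long[] ba = bounds.get(ra), bb = bounds.get(rb);
        bb[0] = Math.max(ba[0], bb[0]);
        bb[1] = Math.min(ba[1], bb[1]);
        parent.put(ra, rb);
    }

    // R.A <= c: tighten the upper bound for the whole set
    public void setUpper(String a, long c) { long[] b = bounds.get(find(a)); b[1] = Math.min(b[1], c); }

    // R.A >= c: tighten the lower bound for the whole set
    public void setLower(String a, long c) { long[] b = bounds.get(find(a)); b[0] = Math.max(b[0], c); }

    public long[] boundsOf(String a) { return bounds.get(find(a)); }
}
```

After visiting all conjuncts, reading boundsOf for each attribute yields the pushed-down selection range for its base table.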

Selection Implementation

Based on the selectivity of the conditions and the cost of using an index, we choose either an index scan or a normal scan. This means we no longer need to tell our DBMS which type of scan to use through an input file.
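The decision can be sketched with a textbook cost model; the constants and formulas below (e.g. 3 I/Os for the root-to-leaf descent) are a common approximation, not necessarily the exact model in the repository.

```java
// Illustrative cost comparison between a full scan and an index scan.
public class ScanCostSketch {
    // a full scan reads every page of the relation once
    public static double fullScanCost(int pages) {
        return pages;
    }

    public static double indexScanCost(boolean clustered, int pages, int tuples,
                                       int leaves, double selectivity) {
        double traversal = 3; // assumed root-to-leaf descent cost
        return clustered
            ? traversal + pages * selectivity                          // clustered: read matching pages in order
            : traversal + leaves * selectivity + tuples * selectivity; // unclustered: roughly one I/O per match
    }

    public static boolean useIndex(boolean clustered, int pages, int tuples,
                                   int leaves, double selectivity) {
        return indexScanCost(clustered, pages, tuples, leaves, selectivity) < fullScanCost(pages);
    }
}
```

With a clustered index and 1% selectivity the index wins easily; an unclustered index with 50% selectivity loses to a plain scan.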

Optimizing Joins

Join Order

We used dynamic programming to figure out the join costs for all possible subsets of tables, as well as the optimal join order for each subset. For a subset that is 1 table larger, we simply add the cost of the current subset to the cost of joining in the new table. This information is populated into a table, where each entry contains the best join order, its associated cost, the size of the join (including the root), and the V-values for each attribute in the join. We also leverage our union-find to extract sets of equal attributes.
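A hedged sketch of the subset DP, simplified to single-table cardinalities with a crude cross-product size estimate; the real implementation refines the size estimate by dividing by the V-values of the joined attributes.

```java
import java.util.*;

// Subset DP for join ordering: plans are keyed by a bitmask over the tables,
// and extending a subset by one table costs the subset's cost plus the size
// of its intermediate result.
public class JoinOrderSketch {
    public static List<Integer> bestOrder(long[] sizes) {
        int n = sizes.length;
        double[] cost = new double[1 << n];
        long[] size = new long[1 << n];
        int[] last = new int[1 << n];       // last table joined, to rebuild the order
        Arrays.fill(cost, Double.POSITIVE_INFINITY);
        for (int i = 0; i < n; i++) {       // base case: single tables are free
            cost[1 << i] = 0;
            size[1 << i] = sizes[i];
        }
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < 2) continue;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) == 0) continue;
                int prev = mask ^ (1 << i);
                double c = cost[prev] + size[prev]; // pay for the intermediate result
                if (c < cost[mask]) {
                    cost[mask] = c;
                    size[mask] = size[prev] * sizes[i]; // crude cross-product estimate
                    last[mask] = i;
                }
            }
        }
        // walk back through `last` to recover the join order
        List<Integer> order = new ArrayList<>();
        int mask = (1 << n) - 1;
        while (Integer.bitCount(mask) > 1) {
            order.add(0, last[mask]);
            mask ^= 1 << last[mask];
        }
        order.add(0, Integer.numberOfTrailingZeros(mask));
        return order;
    }
}
```

Under this cost model the smallest relations are joined first, which is the behavior the DP is meant to discover.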

Join Type

This was a relatively simple change. If the join condition is an equi-join, we use Sort-Merge Join; otherwise, we use Block-Nested Loop Join. We removed in-memory sort, so we now use external sort exclusively, and we removed Tuple-Nested Loop Join.

Other

Printing Query Plans

We gave all operators a toString() method, so we can easily output which operations our query plan uses.
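For illustration, a toString()-style printer can recurse over the children, indenting one level per depth; the operator names and tree shape below are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of plan printing via toString(): each operator prints its own
// name, then its children one indent level deeper.
public class PlanPrinter {
    static class Operator {
        final String name;
        final List<Operator> children = new ArrayList<>();

        Operator(String name, Operator... kids) {
            this.name = name;
            for (Operator k : kids) children.add(k);
        }

        String print(int depth) {
            StringBuilder sb = new StringBuilder("-".repeat(depth)).append(name).append('\n');
            for (Operator c : children) sb.append(c.print(depth + 1));
            return sb.toString();
        }

        @Override public String toString() { return print(0); }
    }

    public static void main(String[] args) {
        Operator plan = new Operator("SMJ[R.A = S.B]",
                new Operator("ExternalSort[R.A]", new Operator("Scan[R]")),
                new Operator("ExternalSort[S.B]", new Operator("Scan[S]")));
        System.out.print(plan);
    }
}
```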

Removing Benchmarking

Because we are no longer comparing independent variables in this project milestone, we removed all benchmarking-related files. ad27f1b

Package Reorganization

We created a new package for our classes that handle optimizing the query plan. The old visitors package has been removed. ad27f1b

v3 (11/4/2022)


Data Indexing

Our DBMS now supports data indexing to allow for faster scan operations. We use a B+ tree, a balanced tree that stores useful information about the ordering of the data. Furthermore, we support clustered indexing, which pre-processes the data by sorting it before any query evaluation. We order the tuples using a Record Identifier (RID), implemented in RID.java.
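A minimal sketch of what an RID might look like, assuming it is a (pageId, tupleId) pair ordered by physical file position; the field names are illustrative, not necessarily those in RID.java.

```java
// A record identifier: a (pageId, tupleId) pair locating a tuple in the
// binary data file. Ordering by (pageId, tupleId) matches physical file
// order, which is what clustering relies on.
public class Rid implements Comparable<Rid> {
    public final int pageId;
    public final int tupleId;

    public Rid(int pageId, int tupleId) {
        this.pageId = pageId;
        this.tupleId = tupleId;
    }

    @Override public int compareTo(Rid o) {
        if (pageId != o.pageId) return Integer.compare(pageId, o.pageId);
        return Integer.compare(tupleId, o.tupleId);
    }
}
```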

Bulk Loading and Serializing

TreeIndexBuilder.java handles the logic for serializing a B+ tree to a file using the bulk-loading algorithm. We first determine whether or not to sort the data by checking if our configuration file specifies a clustered index.

Next, we get the tuples by table and attribute name and return a sorted list of DataEntry.java objects. DataEntry.java holds the RID and the key, which we use as the comparator when building our B+ tree.

Finally, we use NodeWriter.java to write the nodes at the correct pages (byte locations) in the index file, following the bulk-loading algorithm.
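One fiddly part of bulk loading is laying out the leaves. Here is a sketch, assuming leaves hold up to 2d entries and following the usual bulk-loading convention that an underfull last leaf is avoided by redistributing the final two leaves evenly; whether the repository uses exactly this convention is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Leaf-layout step of bulk loading: split k sorted entries into leaves of
// capacity 2d each; if the last leaf would end up with fewer than d
// entries, share the remainder evenly between the last two leaves.
public class LeafLayout {
    public static List<Integer> leafSizes(int numEntries, int d) {
        List<Integer> sizes = new ArrayList<>();
        int full = 2 * d;
        int remaining = numEntries;
        while (remaining > 0) {
            if (remaining > full && remaining < full + d) {
                // splitting off a full leaf would leave an underfull one:
                // split the remaining entries evenly across two leaves
                int half = remaining / 2;
                sizes.add(half);
                sizes.add(remaining - half);
                remaining = 0;
            } else {
                int take = Math.min(full, remaining);
                sizes.add(take);
                remaining -= take;
            }
        }
        return sizes;
    }
}
```

With d = 2 (capacity 4), 9 entries become leaves of 4, 2, and 3 entries: no leaf falls below d.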

Tree Deserialization and Index Scan Operator

TreeDeserializer.java does the dirty work of an indexed scan. Because the serialized file stores tuples in sorted order, we know exactly where to look when performing a selection, so we check fewer tuples while filtering. We use a lowKey and a highKey to limit the range of tuples we need to scan through.

IndexScanOperator.java uses TreeDeserializer.java to obtain the correct tuples of a scan operation.
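The range-scan idea can be sketched with a TreeMap standing in for the deserialized leaf layer; the class below is illustrative, not IndexScanOperator.java itself.

```java
import java.util.Iterator;
import java.util.TreeMap;

// With entries sorted by key, a range selection only touches keys in
// [lowKey, highKey]; everything outside the range is never read.
public class IndexScanSketch {
    private final Iterator<Integer> it;

    public IndexScanSketch(TreeMap<Integer, Integer> leafEntries, int lowKey, int highKey) {
        // subMap positions us at the first key >= lowKey and stops past highKey,
        // just as the deserializer descends to the first qualifying leaf entry
        this.it = leafEntries.subMap(lowKey, true, highKey, true).values().iterator();
    }

    public Integer getNextTuple() {
        return it.hasNext() ? it.next() : null; // null once the range is exhausted
    }
}
```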

Other

  • There is now only 1 init method in Catalog.java. We store all of our file paths and configuration information in a .txt config file and pass it in as a runtime argument. However, we can still set the plan-builder and indexing configs programmatically.
  • We now have 2 benchmarking tasks: comparing different join types and comparing different indexing strategies. Run the commands below to perform each task, respectively. Furthermore, we added parameters to the generate function in TupleGenerator.java in order to make our random data generation more flexible to work with.
./gradlew benchmarkJoin --args="[number-of-trials]"
./gradlew benchmarkIndexing --args="[number-of-trials]"
  • We completely removed the JUnit4 dependency from our codebase and fully converted to using JUnit5.
  • We created a catalog of all our dependencies and their respective versions.

v2 (10/20/2022)


Using Java NIO

We created 2 new classes to help with I/O operations using Java NIO. These are better than our old I/O operations because they read and write data a chunk at a time instead of one value at a time. Our data is now stored in binary files, which are easier to read programmatically and take up less disk space than human-readable strings. TupleReader.java obtains the data from the relations, and TupleWriter.java writes the results of queries into a binary file.
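A sketch of page-at-a-time reading with NIO, assuming a 4096-byte page whose header stores the attribute count and tuple count as ints; this is a common course format and an assumption here, not necessarily the repository's exact layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageReaderSketch {
    private static final int PAGE_SIZE = 4096; // assumed page size

    // reads one page with a single positioned bulk read, then decodes its tuples
    public static int[][] readPage(Path file, int pageIndex) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
            ch.read(page, (long) pageIndex * PAGE_SIZE);
            page.flip();
            int numAttrs = page.getInt();   // page header: attributes per tuple,
            int numTuples = page.getInt();  // then number of tuples on the page
            int[][] tuples = new int[numTuples][numAttrs];
            for (int t = 0; t < numTuples; t++)
                for (int a = 0; a < numAttrs; a++)
                    tuples[t][a] = page.getInt();
            return tuples;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // demo helper: writes a single page holding the given tuples to a temp file
    public static Path writePage(int[][] tuples) {
        try {
            Path f = Files.createTempFile("page", ".bin");
            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
            page.putInt(tuples[0].length).putInt(tuples.length);
            for (int[] t : tuples) for (int v : t) page.putInt(v);
            Files.write(f, page.array());
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The single `ch.read(page, position)` call is the point: one system call fetches an entire page of tuples, rather than one read per value.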

External Sort

For more efficient sorting, we implemented an external sorting algorithm that uses file storage as temporary buffer space. This allows us to sort data sets larger than available memory.
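The merge phase can be sketched as a k-way merge with a priority queue; here each sorted run is an in-memory list, standing in for a run file on disk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Merge phase of external sort: each sorted run is consumed through one
// "buffer" iterator, and a priority queue repeatedly emits the smallest
// head element across all runs.
public class MergeSketch {
    public static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<int[]> heads = new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        List<Iterator<Integer>> its = new ArrayList<>();
        for (int r = 0; r < sortedRuns.size(); r++) {
            Iterator<Integer> it = sortedRuns.get(r).iterator();
            its.add(it);
            if (it.hasNext()) heads.add(new int[] {it.next(), r}); // {value, run index}
        }
        List<Integer> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            out.add(head[0]);
            Iterator<Integer> it = its.get(head[1]);
            if (it.hasNext()) heads.add(new int[] {it.next(), head[1]}); // refill from same run
        }
        return out;
    }
}
```

Only one element per run sits in the queue at any time, which is why the memory footprint stays proportional to the number of runs, not the data size.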

New Join Types

In order to join tables more efficiently, we implemented 2 new join methods: Block Nested Loop Join and Sort Merge Join. These are an improvement over Tuple Nested Loop Join (formerly com.dbms.operators.JoinOperator.java) because they use memory and properties of the relations cleverly to perform faster joins.

Block Nested Loop Join (BNLJ)

This algorithm joins 2 relations by checking tuples 1 block at a time instead of a single tuple at a time. It is an enhanced version of Tuple Nested Loop Join: it does essentially the same work, but over chunks of outer tuples, so the inner relation is scanned once per block rather than once per outer tuple.
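A sketch over in-memory relations, with tuples as int arrays; in the real operator the block size would be derived from a buffer-page budget rather than passed in directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Block-Nested Loop Join: the outer relation is consumed a block of
// `blockSize` tuples at a time, and the inner relation is rescanned once
// per block instead of once per outer tuple.
public class BnljSketch {
    public static List<int[]> join(List<int[]> outer, List<int[]> inner,
                                   int blockSize, BiPredicate<int[], int[]> cond) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < outer.size(); start += blockSize) {
            List<int[]> block = outer.subList(start, Math.min(start + blockSize, outer.size()));
            for (int[] r : inner)          // one inner pass per outer block
                for (int[] l : block)
                    if (cond.test(l, r)) { // emit the concatenated tuple on a match
                        int[] joined = new int[l.length + r.length];
                        System.arraycopy(l, 0, joined, 0, l.length);
                        System.arraycopy(r, 0, joined, l.length, r.length);
                        out.add(joined);
                    }
        }
        return out;
    }
}
```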

Sort Merge Join (SMJ)

This algorithm specializes in performing an equijoin of 2 relations that are already sorted on the join attributes.
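A sketch on single-int tuples, showing the partition-rewind logic that makes duplicate keys pair up correctly; real tuples would carry more columns and a comparator over the join attributes.

```java
import java.util.ArrayList;
import java.util.List;

// Sort-Merge Join on two relations already sorted on the join key. The
// inner pointer rewinds to the start of the current matching partition so
// every duplicate on the left pairs with every duplicate on the right.
public class SmjSketch {
    public static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int l = 0, r = 0;
        while (l < left.length && r < right.length) {
            if (left[l] < right[r]) l++;
            else if (left[l] > right[r]) r++;
            else {
                int mark = r; // start of the matching partition on the right
                while (l < left.length && left[l] == right[mark]) {
                    for (r = mark; r < right.length && right[r] == left[l]; r++)
                        out.add(new int[] {left[l], right[r]});
                    l++;      // advance the left tuple, rewind the right to mark
                }
            }
        }
        return out;
    }
}
```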

Logical and Physical Operators

Because we have different implementations of the "same operation," we need to separate the abstract concept of an operation from its implementations. We split the operators package (formerly com.dbms.operators) into com.dbms.operators.logical and com.dbms.operators.physical to distinguish these roles. The Logical Plan Builder (formerly QueryPlanBuilder.java) decides which operation to perform at a high level, and the Physical Plan Builder uses the config file to determine which specific implementation of that operation to use.

Benchmarking

To compare the join methods with each other, we created a package called com.dbms.analytics that contained a class to randomly generate tuples, as well as another class to handle the experimentation. Attached is a PDF of our results. Check the README for instructions on how to run experiments.

Other Quality of Life Changes

  • com.dbms.utils.ColumnName: wraps the aliased table and column name into an object, so we no longer need to use an arbitrary delimiter and string-manipulation functions to store the combined table and column name
  • Parameterized Tests: created QueryTestSetBuilder.java in the end-to-end testing file to isolate each query into its own parameterized test, so we can figure out which query caused errors, if any. Furthermore, the entire test suite was refactored to use parameterized testing, so we can see the status of each individual unit test case.
  • Programmatic Configurations: when we want to isolate the testing of a specific join or sort type, we want to be able to dictate which method we want to use programmatically, so we don't need to manually modify an external file.

v1 (9/18/2022)


Execution

To execute our application, run the command below, passing the input folder (which must follow the structure shown under Inputs) and the destination folder. The destination folder will contain the results of all the queries, each in its own file corresponding to a line number in queries.sql.

java -jar [jar-file-name] [absolute-or-relative-path-of-inputs] [absolute-or-relative-path-of-outputs]

Inputs

The input folder must follow the structure shown below.

└── input folder
    ├── queries.sql (contains all the queries, each line corresponds to an output file)
    └── db
        ├── schema.txt (describes the structure of tables)
        └── data
            └── text files containing data (each file corresponds to a table)

SQL Support

  • Integer data ONLY
  • Scan Queries
  • Select Queries
  • Project Queries
  • Join Queries with WHERE clause only
  • Order By Queries
  • Distinct Queries
  • Aliasing

Methodology

We read the data files and store their contents in memory. While the data is in memory, our Java application performs the SQL operations listed above on it.