NDSparseData.jl

N-dimensional sparse array structure in Julia

Introduction

This package provides a data structure called NDSparse. This structure maps tuples of indices to data values. Hence, it is similar to a hash table mapping tuples to values, but with a few key differences. First, the index tuples are stored columnwise, with one vector per index position: there is a vector of first indices, a vector of second indices, and so on. The index vectors are expected to be homogeneous to allow more efficient storage. Second, the indices must have a total order, and are stored lexicographically sorted (first by the first index, then by the second index, and so on, left-to-right). While the indices must have totally-ordered types, the data values can be anything. Finally, for purposes of many operations an NDSparse acts like an N-dimensional array of its data values, where the number of dimensions is the number of index columns.
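
To make the layout concrete, here is a rough sketch of the idea in plain Julia (not the package's actual implementation): the index columns are separate vectors, and rows are kept in lexicographic order of the index tuples.

col1 = ["b", "a", "a"]                  # first index column
col2 = [1, 2, 1]                        # second index column
vals = [10, 20, 30]                     # data values
p = sortperm(collect(zip(col1, col2)))  # lexicographic order of the index tuples
col1, col2, vals = col1[p], col2[p], vals[p]
# col1 == ["a", "a", "b"], col2 == [1, 2, 1], vals == [30, 20, 10]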

Construction

The NDSparse constructor accepts a series of vectors. The last vector contains the data values, and the first N vectors contain the indices for each of the N dimensions. As an example, let's construct an array of the high temperatures for three days in two cities:

julia> using NDSparseData, Dates
julia> hitemps = NDSparse([fill("New York",3); fill("Boston",3)],
                          repmat(Date(2016,7,6):Date(2016,7,8), 2),
                          [91,89,91,95,83,76])
NDSparse Tuple{String,Date} => Int64:
───────────────────────┬───
"Boston"    2016-07-06 │ 95
"Boston"    2016-07-07 │ 83
"Boston"    2016-07-08 │ 76
"New York"  2016-07-06 │ 91
"New York"  2016-07-07 │ 89
"New York"  2016-07-08 │ 91

Notice that the data was sorted first by city name, then date, giving a different order than we initially provided. On construction, NDSparse takes ownership of the columns and sorts them in place (the original vectors are modified).
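
If the original vectors need to be preserved, one simple option (not specific to this package) is to pass copies to the constructor:

julia> cities = [fill("New York",3); fill("Boston",3)];
julia> dates = repmat(Date(2016,7,6):Date(2016,7,8), 2);
julia> NDSparse(copy(cities), copy(dates), [91,89,91,95,83,76]);  # cities and dates are left untouched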

Importing data

Importing data from column-based sources is straightforward. For example, csv files can be imported using CSV.jl with the following snippet:

NDSparse(CSV.read(filename).columns...)

Of course, this assumes the file already has the "data column" in the rightmost position. If not, the columns can be reordered first.
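
For example, if a source's columns arrive as plain vectors with the data values first (a hypothetical layout), just pass them to the constructor in the order it expects, index columns first and the data column last:

# Hypothetical source columns, with the data values in the first position:
temps  = [91, 89, 91]
cities = fill("New York", 3)
dates  = collect(Date(2016,7,6):Date(2016,7,8))
NDSparse(cities, dates, temps)   # index columns first, data column last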

See NDSparseIO for other data import utilities.

Indexing

Most lookup and filtering operations on NDSparse are done via indexing. Our hitemps array behaves like a 2-d array of integers, accepting two indices:

julia> hitemps["Boston", Date(2016,7,8)]
76

If the given indices exactly match the element types of the index columns, then the result is a scalar. In other cases, a new NDSparse is returned, giving data for all matching locations:

julia> hitemps["Boston", :]
NDSparse Tuple{String,Date} => Int64:
─────────────────────┬───
"Boston"  2016-07-06 │ 95
"Boston"  2016-07-07 │ 83
"Boston"  2016-07-08 │ 76

Like other arrays, NDSparse generates its data values when iterated. This allows the usual reduction functions (among others) in Base to work:

julia> maximum(hitemps["Boston", :])
95

Permuting dimensions

As with other multi-dimensional arrays, dimensions can be permuted to change the sort order. With NDSparse the interpretation of this operation is especially natural: simply imagine passing the index columns to the constructor in a different order, and repeating the sorting process:

julia> permutedims(hitemps, [2, 1])
NDSparse Tuple{Date,String} => Int64:
───────────────────────┬───
2016-07-06  "Boston"   │ 95
2016-07-06  "New York" │ 91
2016-07-07  "Boston"   │ 83
2016-07-07  "New York" │ 89
2016-07-08  "Boston"   │ 76
2016-07-08  "New York" │ 91

Now the data is sorted first by date. In some cases such dimension permutations are needed for performance. The leftmost column is essentially the primary key: indexing is fastest in this dimension.
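
For example, after permuting so that dates come first, lookups keyed on a single date are direct:

julia> bydate = permutedims(hitemps, [2, 1]);
julia> bydate[Date(2016,7,7), :]
NDSparse Tuple{Date,String} => Int64:
───────────────────────┬───
2016-07-07  "Boston"   │ 83
2016-07-07  "New York" │ 89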

Select and aggregate

In some cases one wants to consider a subset of dimensions, for example when producing a simplified summary of data. This can be done by passing dimension (column) numbers to select:

julia> select(hitemps, 2)
NDSparse Tuple{Date} => Int64:
───────────┬───
2016-07-06 │ 95
2016-07-06 │ 91
2016-07-07 │ 83
2016-07-07 │ 89
2016-07-08 │ 76
2016-07-08 │ 91

In this case, the result has multiple values for some indices, and so does not fully behave like a normal array anymore. Operations that might leave the array in such a state accept the keyword argument agg, a function to use to combine all values associated with the same indices:

julia> select(hitemps, 2, agg=max)
NDSparse Tuple{Date} => Int64:
───────────┬───
2016-07-06 │ 95
2016-07-07 │ 89
2016-07-08 │ 91

The NDSparse constructor also accepts the agg argument. The aggregation operation can also be done by itself, in-place, using the function aggregate!.
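
As a small sketch with made-up data, passing agg to the constructor combines duplicate index tuples as the array is built:

julia> dups = NDSparse(["a","a","b"], [1,1,2], [10,20,30], agg=+);
julia> dups["a", 1]   # 10 and 20 were combined on construction
30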

select also supports filtering columns with arbitrary predicates, by passing column=>predicate pairs. Here isfriday is a small helper defined for the example:

julia> isfriday(d) = Dates.dayofweek(d) == Dates.Friday;
julia> select(hitemps, 2=>isfriday)
NDSparse Tuple{String,Date} => Int64:
───────────────────────┬───
"Boston"    2016-07-08 │ 76
"New York"  2016-07-08 │ 91

Iterators

Indexing makes a copy of the selected data, and therefore can be expensive. As an alternative, it is possible to construct an iterator over a subset of an NDSparse. The where function accepts the same arguments as indexing, but instead returns an iterator that generates the data values at the selected locations:

julia> bos = where(hitemps, "Boston", :);
julia> first(bos)
95

The pairs function is similar, except that it yields index=>value pairs (where the index is a tuple).
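
For example (the exact printed form may differ slightly):

julia> first(pairs(hitemps, "Boston", :))
("Boston",2016-07-06) => 95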

Broadcasting

broadcast is used to combine data with slightly different dimensions. For example, say we have an array of low temperatures for Boston broken down by zip code:

julia> lotemps = NDSparse(fill("Boston",6),
                          repeat(Date(2016,7,6):Date(2016,7,8), inner=2),
                          repmat([02108,02134], 3),
                          [71,70,67,66,65,66])
NDSparse Tuple{String,Date,Int64} => Int64:
───────────────────────────┬───
"Boston"  2016-07-06  2108 │ 71
"Boston"  2016-07-06  2134 │ 70
"Boston"  2016-07-07  2108 │ 67
"Boston"  2016-07-07  2134 │ 66
"Boston"  2016-07-08  2108 │ 65
"Boston"  2016-07-08  2134 │ 66

We want to compute the daily temperature range (high minus low). Since we don't have high temperatures available per zip code, we will assume the high temperatures are city-wide averages applicable to every zip code. The broadcast function implements this interpretation of the data, automatically repeating data along missing dimensions:

julia> broadcast(-, hitemps, lotemps)
NDSparse Tuple{String,Date,Int64} => Int64:
───────────────────────────┬───
"Boston"  2016-07-06  2108 │ 24
"Boston"  2016-07-06  2134 │ 25
"Boston"  2016-07-07  2108 │ 16
"Boston"  2016-07-07  2134 │ 17
"Boston"  2016-07-08  2108 │ 11
"Boston"  2016-07-08  2134 │ 10

Notice that broadcast also automatically performs an inner join, selecting only rows that match.

broadcast currently matches dimensions based on element type. Specifying dimensions to match manually, or based on column names, is planned future functionality.

Converting dimensions

A location in the coordinate space of an array often has multiple possible descriptions. This is especially common when describing data at different levels of detail. For example, a point in time can be expressed at the level of seconds, minutes, or hours. In our toy temperature dataset, we might want to look at monthly instead of daily highs.

This can be accomplished using the convertdim function. It accepts an array, a dimension number to convert, a function or dictionary to apply to indices in that dimension, and an aggregation function (the aggregation function is needed in case the mapping is many-to-one). The following call therefore gives monthly high temperatures:

julia> convertdim(hitemps, 2, month, agg=max)
NDSparse Tuple{String,Int64} => Int64:
──────────────┬───
"Boston"    7 │ 95
"New York"  7 │ 91

Assignment

NDSparse supports indexed assignment just like other arrays, but there are caveats. Since data is stored in a compact, sorted representation, inserting a single element is potentially very inefficient (O(n), since on average half of the existing elements must be moved). Therefore single-element insertions are accumulated into a temporary buffer to amortize cost.

When the next whole-array operation (e.g. indexing or broadcast) is performed, the temporary buffer is merged into the main storage. This operation is called flush!, and can also be invoked explicitly. Its cost is O(n*log(n)) + O(m), where n is the number of inserted items and m is the number of existing items. This means the worst case occurs when alternating between inserting a small number of items and performing whole-array operations. To the extent possible, insertions should be batched, and in general done rarely.
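
For example, a single insertion followed by an explicit flush (with a hypothetical value) looks like this:

julia> hitemps["Boston", Date(2016,7,9)] = 82;   # goes into the temporary buffer
julia> flush!(hitemps);                          # merge the buffer into the sorted storage
julia> hitemps["Boston", Date(2016,7,9)]
82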

Named columns

NDSparse is built on a simpler data structure called Columns that groups a set of vectors together. This structure is used to store the index part of an NDSparse, and an NDSparse can be constructed by passing one of these objects directly. Columns allows names to be associated with its constituent vectors. Together, these features allow NDSparse arrays with named dimensions:

julia> hitemps = NDSparse(Columns(city = [fill("New York",3); fill("Boston",3)],
                                  date = repmat(Date(2016,7,6):Date(2016,7,8), 2)),
                          [91,89,91,95,83,76])
NDSparse Tuple{String,Date} => Int64:
city        date       │ 
───────────────────────┼───
"Boston"    2016-07-06 │ 95
"Boston"    2016-07-07 │ 83
"Boston"    2016-07-08 │ 76
"New York"  2016-07-06 │ 91
"New York"  2016-07-07 │ 89
"New York"  2016-07-08 │ 91

Now dimensions (e.g. in select operations) can be identified by symbol (e.g. :city) as well as integer index.
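
For example, the earlier daily-maximum selection can now be written with the column name instead of its number:

julia> select(hitemps, :date, agg=max)
NDSparse Tuple{Date} => Int64:
date       │
───────────┼───
2016-07-06 │ 95
2016-07-07 │ 89
2016-07-08 │ 91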

A Columns object itself behaves like a vector, and so can be used to represent the data part of an NDSparse. This provides one possible way to store multiple columns of data:

julia> NDSparse(Columns(x = rand(4), y = rand(4)),
                Columns(observation = rand(1:2,4), confidence = rand(4)))
NDSparse Tuple{Float64,Float64} => NamedTuples._NT_observationconfidence{Int64,Float64}:
x          y        │ observation  confidence
────────────────────┼────────────────────────
0.0400914  0.385859 │ 1            0.983784
0.165966   0.915532 │ 1            0.206534
0.532029   0.631039 │ 2            0.196016
0.932271   0.350075 │ 1            0.716692

In this case the data elements are structs with fields observation and confidence, and can be used as follows:

julia> filter(d->d.confidence > 0.90, ans)
NDSparse Tuple{Float64,Float64} => NamedTuples._NT_observationconfidence{Int64,Float64}:
x          y        │ observation  confidence
────────────────────┼────────────────────────
0.0400914  0.385859 │ 1            0.983784
