-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Obtained approval from Climate Corporation to incorporate
methods from their core.Data objects. Those methods are included here, but merging is not entirely complete. Renamed the package to polyglot. Added a bunch of tests, also not entirely complete.
- Loading branch information
Alex Kleeman
committed
Nov 18, 2013
1 parent
25c3ae4
commit 08f8f29
Showing
6 changed files
with
1,720 additions
and
195 deletions.
There are no files selected for viewing
Empty file.
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from data import Dataset, Variable |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
import numpy as np | ||
import unicodedata | ||
|
||
NULL = '\x00' | ||
NC_BYTE = '\x00\x00\x00\x01' | ||
NC_CHAR = '\x00\x00\x00\x02' | ||
NC_SHORT = '\x00\x00\x00\x03' | ||
# netCDF-3 only supports 32-bit integers | ||
NC_INT = '\x00\x00\x00\x04' | ||
NC_FLOAT = '\x00\x00\x00\x05' | ||
NC_DOUBLE = '\x00\x00\x00\x06' | ||
|
||
# Map between netCDF type and numpy dtype and vice versa. Due to a bug | ||
# in the __hash__() method of numpy dtype objects (fixed in development | ||
# release of numpy), we need to explicitly match byteorder for dict | ||
# lookups to succeed. Here we normalize to native byte order. | ||
# | ||
# NC_CHAR is a special case because netCDF represents strings as | ||
# character arrays. When NC_CHAR is encountered as the type of an | ||
# attribute value, this TYPEMAP is not consulted and the data is read | ||
# as a string. However, when NC_CHAR is encountered as the type of a | ||
# variable, then the data is read is a numpy array of 1-char elements | ||
# (equivalently, length-1 raw "strings"). There is no support for numpy | ||
# arrays of multi-character strings. | ||
TYPEMAP = { | ||
# we could use np.dtype's as key/values except __hash__ comparison of | ||
# numpy.dtype is broken in older versions of numpy. If you must compare | ||
# and cannot upgrade, use __eq__.This bug is | ||
# known to be fixed in numpy version 1.3 | ||
NC_BYTE: 'int8', | ||
NC_CHAR: '|S1', | ||
NC_SHORT: 'int16', | ||
NC_INT: 'int32', | ||
NC_FLOAT: 'float32', | ||
NC_DOUBLE: 'float64', | ||
} | ||
for k in TYPEMAP.keys(): | ||
TYPEMAP[TYPEMAP[k]] = k | ||
|
||
# Special characters that are permitted in netCDF names except in the | ||
# 0th position of the string | ||
_specialchars = '_.@+- !"#$%&\()*,:;<=>?[]^`{|}~' | ||
|
||
# The following are reserved names in CDL and may not be used as names of | ||
# variables, dimension, attributes | ||
_reserved_names = set([ | ||
'byte', | ||
'char', | ||
'short', | ||
'ushort', | ||
'int', | ||
'uint', | ||
'int64', | ||
'uint64', | ||
'float' | ||
'real', | ||
'double', | ||
'bool', | ||
'string', | ||
]) | ||
|
||
def coerce_type(arr): | ||
"""Coerce a numeric data type to a type that is compatible with | ||
netCDF-3 | ||
netCDF-3 can not handle 64-bit integers, but on most platforms | ||
Python integers are int64. To work around this discrepancy, this | ||
helper function coerces int64 arrays to int32. An exception is | ||
raised if this coercion is not safe. | ||
netCDF-3 can not handle booleans, but booleans can be trivially | ||
(albeit wastefully) represented as bytes. To work around this | ||
discrepancy, this helper function coerces bool arrays to int8. | ||
""" | ||
# Comparing the char attributes of numpy dtypes is inelegant, but this is | ||
# the fastest test of equivalence that is invariant to endianness | ||
if arr.dtype.char == 'l': # np.dtype('int64') | ||
cast_arr = arr.astype( | ||
np.dtype('int32').newbyteorder(arr.dtype.byteorder)) | ||
if not (cast_arr == arr).all(): | ||
raise ValueError("array contains integer values that " + | ||
"are not representable as 32-bit signed integers") | ||
return cast_arr | ||
elif arr.dtype.char == '?': # np.dtype('bool') | ||
# bool | ||
cast_arr = arr.astype( | ||
np.dtype('int8').newbyteorder(arr.dtype.byteorder)) | ||
return cast_arr | ||
else: | ||
return arr | ||
|
||
def _isalnumMUTF8(c): | ||
"""Return True if the given UTF-8 encoded character is alphanumeric | ||
or multibyte. | ||
Input is not checked! | ||
""" | ||
return (c.isalnum() or (len(c.encode('utf-8')) > 1)) | ||
|
||
def is_valid_name(s): | ||
"""Test whether an object can be validly converted to a netCDF | ||
dimension, variable or attribute name | ||
Earlier versions of the netCDF C-library reference implementation | ||
enforced a more restricted set of characters in creating new names, | ||
but permitted reading names containing arbitrary bytes. This | ||
specification extends the permitted characters in names to include | ||
multi-byte UTF-8 encoded Unicode and additional printing characters | ||
from the US-ASCII alphabet. The first character of a name must be | ||
alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for | ||
special names with meaning to implementations, such as the | ||
"_FillValue" attribute). Subsequent characters may also include | ||
printing special characters, except for '/' which is not allowed in | ||
names. Names that have trailing space characters are also not | ||
permitted. | ||
""" | ||
if not isinstance(s, basestring): | ||
return False | ||
if not isinstance(s, unicode): | ||
s = unicode(s, 'utf-8') | ||
num_bytes = len(s.encode('utf-8')) | ||
return ((unicodedata.normalize('NFC', s) == s) and | ||
(s not in _reserved_names) and | ||
(num_bytes >= 0) and | ||
('/' not in s) and | ||
(s[-1] != ' ') and | ||
(_isalnumMUTF8(s[0]) or (s[0] == '_')) and | ||
all((_isalnumMUTF8(c) or c in _specialchars for c in s)) | ||
) |
Oops, something went wrong.
08f8f29
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
08f8f29
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions:
08f8f29
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just updated the README which may answer some questions.
right now nc4 underlying data sets are loaded into memory, but thats not how I want it to behave. I'd like to see data from an object using NetCDF4 as a backed to lazy load the data (the TCC code current supports that). This way you can point to an openDAP data set, inspect it etc before any of the data is downloaded. Then only once you've asked for views on the data will it actually be loaded into memory.
I'm not a huge fan of the name scidata. One thought was polyglot (ie, speaking many languages) or maybe something that is more specific to the Common Data Model? cdm? Any other suggestions?
08f8f29
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we write cdm; then it seems like we may be on the hook to support the various formats unidata cdm provides, and the interface (direct hdf5, grib, etc). See:
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/
if that's the case, then we certainly have some work to do; alternatively if we're focusing spcifically on support for both NetCDF 3 and 4, then my suggestion is to give it a name that refers to NetCDF specifically.