turicreate.SArray

class turicreate.SArray(data=[], dtype=None, ignore_cast_failure=False, _proxy=None)

An immutable, homogeneously typed array object backed by persistent storage.

SArray is designed to scale to data that is much larger than the machine’s main memory. It fully supports missing values and random access. The data backing an SArray is located on the same machine as the Turi Server process. Each column in an SFrame is an SArray.

Parameters:
data : list | numpy.ndarray | pandas.Series | string | generator | map | range | filter

The input data. If this is a list, numpy.ndarray, or pandas.Series, or an iterable that can be turned into a list (generator, map, range, or filter), the data is converted and stored in an SArray. Alternatively, if this is a string, it is interpreted as a path (or URL) to a text file, and each line of the text file is loaded as a separate row. If data is a directory where an SArray was previously saved, the SArray is read directly out of that directory.

dtype : {None, int, float, str, list, array.array, dict, datetime.datetime, turicreate.Image}, optional

The data type of the SArray. If not specified (None), we attempt to infer it from the input. If it is a numpy array or a Pandas series, the dtype of the array/series is used. If it is a list, the dtype is inferred from the elements of the list. If it is a URL or path to a text file, we default the dtype to str.

ignore_cast_failure : bool, optional

If True, ignores casting failures but warns when elements cannot be cast into the specified dtype.
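
As a rough illustration of the ignore_cast_failure flag, the following plain-Python sketch models the behaviour without calling Turi Create. The helper name cast_elements is hypothetical, and replacing a failed element with None (a missing value) is an assumption about the result, not something stated above.

```python
import warnings


def cast_elements(values, dtype, ignore_cast_failure=False):
    """Hypothetical helper: cast each value to dtype, modelling the flag."""
    result = []
    for value in values:
        try:
            result.append(dtype(value))
        except (TypeError, ValueError):
            if not ignore_cast_failure:
                raise
            # Assumption: a failed element becomes a missing value (None).
            warnings.warn("could not cast %r to %s" % (value, dtype.__name__))
            result.append(None)
    return result


# Record the warnings so they can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    converted = cast_elements(["1", "2", "x"], int, ignore_cast_failure=True)
```

With the flag left at False, the same call raises on the first element that cannot be converted.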

Examples

SArray can be constructed in various ways:

Construct an SArray from list.

>>> from turicreate import SArray
>>> sa = SArray(data=[1,2,3,4,5], dtype=int)

Construct an SArray from numpy.ndarray.

>>> import numpy
>>> sa = SArray(data=numpy.asarray([1,2,3,4,5]), dtype=int)
or:
>>> sa = SArray(numpy.asarray([1,2,3,4,5]), int)

Construct an SArray from pandas.Series.

>>> import pandas as pd
>>> sa = SArray(data=pd.Series([1,2,3,4,5]), dtype=int)
or:
>>> sa = SArray(pd.Series([1,2,3,4,5]), int)

Construct an SArray from range (xrange for py2):

Warning: if no step is provided to range, SArray.from_sequence is preferred in terms of performance.

>>> sa = SArray(data=range(1, 100, 2), dtype=int)
or:
>>> sa = SArray(data=range(1, 100, 2))

Construct an SArray from map:

>>> sa = SArray(data=map(lambda x : x**2, [1, 2, 3]), dtype=int)
or:
>>> sa = SArray(data=map(lambda x : x**2, [1, 2, 3]))

Construct an SArray from filter:

>>> sa = SArray(data=filter(lambda x : x > 2, [1, 2, 3]), dtype=int)
or:
>>> sa = SArray(data=filter(lambda x : x > 2, [1, 2, 3]))

Construct an SArray from generator:

>>> def gen():
...     x = 0
...     while x < 10:
...         yield x
...         x += 1
>>> sa = SArray(data=gen(), dtype=int)
or:
>>> sa = SArray(data=gen())

If the type is not specified, automatic inference is attempted:

>>> SArray(data=[1,2,3,4,5]).dtype
int
>>> SArray(data=[1,2,3,4,5.0]).dtype
float
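
The promotion from int to float shown above can be modelled in plain Python. This is a simplified sketch, not the library's inference code: the helper name infer_dtype is hypothetical and it only handles lists of ints and floats.

```python
def infer_dtype(values):
    """Hypothetical helper: pick one type for a list of ints and floats."""
    kinds = {type(v) for v in values}
    if kinds == {int}:
        return int
    if kinds and kinds <= {int, float}:
        # A single float promotes the whole array to float.
        return float
    raise TypeError("this sketch only handles int and float values")


int_case = infer_dtype([1, 2, 3, 4, 5])
float_case = infer_dtype([1, 2, 3, 4, 5.0])
```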

The SArray supports standard datatypes such as: integer, float and string. It also supports three higher level datatypes: float arrays, dict and list (array of arbitrary types).

Create an SArray from a list of strings:

>>> sa = SArray(data=['a','b'])

Create an SArray from a list of float arrays:

>>> sa = SArray([[1,2,3], [3,4,5]])

Create an SArray from a list of lists:

>>> sa = SArray(data=[['a', 1, {'work': 3}], [2, 2.0]])

Create an SArray from a list of dictionaries:

>>> sa = SArray(data=[{'a':1, 'b': 2}, {'b':2, 'c': 1}])

Create an SArray from a list of datetime objects:

>>> import datetime
>>> sa = SArray(data=[datetime.datetime(2011, 10, 20, 9, 30, 10)])

Construct an SArray from a local text file. (Only works for a local server.)

>>> sa = SArray('/tmp/a_to_z.txt.gz')

Construct an SArray from a text file downloaded from a URL.

>>> sa = SArray('http://s3-us-west-2.amazonaws.com/testdatasets/a_to_z.txt.gz')

Numeric Operators

SArrays support a large number of vectorized operations on numeric types. For instance:

>>> sa = SArray([1,1,1,1,1])
>>> sb = SArray([2,2,2,2,2])
>>> sc = sa + sb
>>> sc
dtype: int
Rows: 5
[3, 3, 3, 3, 3]
>>> sc + 2
dtype: int
Rows: 5
[5, 5, 5, 5, 5]

Supported operators include all numeric operators (+, -, *, /), as well as comparison operators (>, >=, <, <=) and logical operators (&, |).

For instance:

>>> sa = SArray([1,2,3,4,5])
>>> (sa >= 2) & (sa <= 4)
dtype: int
Rows: 5
[0, 1, 1, 1, 0]

The numeric operators (+,-,*,/) also work on array types:

>>> sa = SArray(data=[[1.0,1.0], [2.0,2.0]])
>>> sa + 1
dtype: array
Rows: 2
[array('d', [2.0, 2.0]), array('d', [3.0, 3.0])]
>>> sa + sa
dtype: array
Rows: 2
[array('d', [2.0, 2.0]), array('d', [4.0, 4.0])]

The addition operator (+) can also be used for string concatenation:

>>> sa = SArray(data=['a','b'])
>>> sa + "x"
dtype: str
Rows: 2
['ax', 'bx']

This can be useful for performing type interpretation of lists or dictionaries stored as strings:

>>> sa = SArray(data=['a,b','c,d'])
>>> ("[" + sa + "]").astype(list) # adding brackets makes it look like a list
dtype: list
Rows: 2
[['a', 'b'], ['c', 'd']]

All comparison operations and boolean operators are supported and emit binary SArrays.

>>> sa = SArray([1,2,3,4,5])
>>> sa >= 2
dtype: int
Rows: 5
[0, 1, 1, 1, 1]
>>> (sa >= 2) & (sa <= 4)
dtype: int
Rows: 5
[0, 1, 1, 1, 0]

Element Access and Slicing

SArrays can be accessed by integer keys just like a regular python list. Such operations may not be fast on large datasets, so looping over an SArray should be avoided.

>>> sa = SArray([1,2,3,4,5])
>>> sa[0]
1
>>> sa[2]
3
>>> sa[5]
IndexError: SFrame index out of range

Negative indices can be used to access elements from the tail of the array:

>>> sa[-1] # returns the last element
5
>>> sa[-2] # returns the second to last element
4

The SArray also supports the full range of python slicing operators:

>>> sa[1000:] # Returns an SArray containing rows 1000 to the end
>>> sa[:1000] # Returns an SArray containing rows 0 to row 999 inclusive
>>> sa[0:1000:2] # Returns an SArray containing every second row from row 0 up to, but not including, row 1000
>>> sa[-100:] # Returns an SArray containing last 100 rows
>>> sa[-100:len(sa):2] # Returns an SArray containing last 100 rows in steps of 2
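
Because integer keys and slices are described as behaving like those of a regular python list, a plain list can be used to check which rows each slice selects. This sketch uses a list of 2000 integers as a stand-in for an SArray (the size is an arbitrary choice) and does not call Turi Create.

```python
rows = list(range(2000))  # stand-in for an SArray with 2000 rows

tail = rows[1000:]                      # rows 1000 to the end
head = rows[:1000]                      # rows 0 to 999
evens = rows[0:1000:2]                  # rows 0, 2, ..., 998 (row 1000 is excluded)
last_100 = rows[-100:]                  # the last 100 rows
last_100_by_2 = rows[-100:len(rows):2]  # every other row of the last 100
```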

Logical Filter

An SArray can be filtered using

>>> array[binary_filter]

where array and binary_filter are SArrays of the same length. The result is a new SArray which contains only the elements of array whose matching row in binary_filter is nonzero.

This permits the use of boolean operators that can be used to perform logical filtering operations. For instance:

>>> sa = SArray([1,2,3,4,5])
>>> sa[(sa >= 2) & (sa <= 4)]
dtype: int
Rows: 3
[2, 3, 4]
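
The two steps involved here, building a 0/1 mask and then keeping the rows where the mask is nonzero, can be modelled with plain Python lists. This is only a sketch of the semantics described above; it does not use Turi Create, and the variable names are illustrative.

```python
values = [1, 2, 3, 4, 5]

# Element-wise comparisons produce 0/1 integers, one per row.
ge_2 = [int(v >= 2) for v in values]
le_4 = [int(v <= 4) for v in values]

# "&" combines the two masks row by row.
mask = [a & b for a, b in zip(ge_2, le_4)]

# Keep only the rows whose mask entry is nonzero.
filtered = [v for v, keep in zip(values, mask) if keep]
```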

This can also be used more generally to provide filtering capability which is otherwise not expressible with simple boolean functions. For instance:

>>> import math
>>> sa = SArray([1,2,3,4,5])
>>> sa[sa.apply(lambda x: math.log(x) <= 1)]
dtype: int
Rows: 2
[1, 2]

This is equivalent to

>>> sa.filter(lambda x: math.log(x) <= 1)
dtype: int
Rows: 2
[1, 2]

Iteration

The SArray is also iterable, but iteration is inefficient because it involves streaming data from the server to the client. It should not be used for large data.

>>> sa = SArray([1,2,3,4,5])
>>> [i + 1 for i in sa]
[2, 3, 4, 5, 6]

This can be used to convert an SArray to a list:

>>> sa = SArray([1,2,3,4,5])
>>> l = list(sa)
>>> l
[1, 2, 3, 4, 5]

Methods

SArray.abs() Returns a new SArray containing the absolute value of each element.
SArray.all() Return True if every element of the SArray evaluates to True.
SArray.any() Return True if any element of the SArray evaluates to True.
SArray.append(other) Append an SArray to the current SArray.
SArray.apply(fn[, dtype, skip_na]) Transform each element of the SArray by a given function.
SArray.argmax() Get the index of the maximum numeric value in SArray.
SArray.argmin() Get the index of the minimum numeric value in SArray.
SArray.astype(dtype[, undefined_on_failure]) Create a new SArray with all values cast to the given type.
SArray.clip([lower, upper]) Create a new SArray with each value clipped to be within the given bounds.
SArray.clip_lower(threshold) Create new SArray with all values clipped to the given lower bound.
SArray.clip_upper(threshold) Create new SArray with all values clipped to the given upper bound.
SArray.contains(item) Performs an element-wise search of “item” in the SArray.
SArray.countna() Number of missing elements in the SArray.
SArray.cumulative_max() Return the cumulative maximum value of the elements in the SArray.
SArray.cumulative_mean() Return the cumulative mean of the elements in the SArray.
SArray.cumulative_min() Return the cumulative minimum value of the elements in the SArray.
SArray.cumulative_std() Return the cumulative standard deviation of the elements in the SArray.
SArray.cumulative_sum() Return the cumulative sum of the elements in the SArray.
SArray.cumulative_var() Return the cumulative variance of the elements in the SArray.
SArray.date_range(start_time, end_time, freq) Returns a new SArray that represents a fixed frequency datetime index.
SArray.datetime_to_str([format]) Create a new SArray with all the values cast to str.
SArray.dict_has_all_keys(keys) Create a boolean SArray by checking the keys of an SArray of dictionaries.
SArray.dict_has_any_keys(keys) Create a boolean SArray by checking the keys of an SArray of dictionaries.
SArray.dict_keys() Create an SArray that contains all the keys from each dictionary element as a list.
SArray.dict_trim_by_keys(keys[, exclude]) Filter an SArray of dictionary type by the given keys.
SArray.dict_trim_by_values([lower, upper]) Filter dictionary values to a given range (inclusive).
SArray.dict_values() Create an SArray that contains all the values from each dictionary element as a list.
SArray.dropna() Create new SArray containing only the non-missing values of the SArray.
SArray.element_slice([start, stop, step]) Returns an SArray with each element sliced according to the slice specified.
SArray.explore([title]) Explore the SArray in an interactive GUI.
SArray.fillna(value) Create new SArray with all missing values (None or NaN) filled in with the given value.
SArray.filter(fn[, skip_na, seed]) Filter this SArray by a function.
SArray.filter_by(values[, exclude]) Filter an SArray by values inside an iterable object.
SArray.from_const(value, size[, dtype]) Constructs an SArray of size with a const value.
SArray.from_sequence([start]) Create an SArray from a sequence.
SArray.hash([seed]) Returns an SArray with a hash of each element.
SArray.head([n]) Returns an SArray which contains the first n rows of this SArray.
SArray.is_in(other) Performs an element-wise search for each row in ‘other’.
SArray.is_materialized() Returns whether or not the SArray has been materialized.
SArray.is_topk([topk, reverse]) Create an SArray indicating which elements are in the top k.
SArray.item_length() Length of each element in the current SArray.
SArray.materialize() For an SArray that is lazily evaluated, force persisting this SArray to disk, committing all lazily evaluated operations.
SArray.max() Get maximum numeric value in SArray.
SArray.mean() Mean of all the values in the SArray, or mean image.
SArray.median([approximate]) Median of all the values in the SArray.
SArray.min() Get minimum numeric value in SArray.
SArray.nnz() Number of non-zero elements in the SArray.
SArray.pixel_array_to_image(width, height, …) Create a new SArray with all the values cast to turicreate.image.Image of uniform size.
SArray.plot([title, xlabel, ylabel]) Create a Plot object representing the SArray.
SArray.random_integers(size[, seed]) Returns an SArray with random integer values.
SArray.random_split(fraction[, seed]) Randomly split the rows of an SArray into two SArrays.
SArray.read_json(filename) Construct an SArray from a json file or glob of json files.
SArray.rolling_count(window_start, window_end) Count the number of non-NULL values of different subsets over this SArray.
SArray.rolling_max(window_start, window_end) Calculate a new SArray of the maximum value of different subsets over this SArray.
SArray.rolling_mean(window_start, window_end) Calculate a new SArray of the mean of different subsets over this SArray.
SArray.rolling_min(window_start, window_end) Calculate a new SArray of the minimum value of different subsets over this SArray.
SArray.rolling_stdv(window_start, window_end) Calculate a new SArray of the standard deviation of different subsets over this SArray.
SArray.rolling_sum(window_start, window_end) Calculate a new SArray of the sum of different subsets over this SArray.
SArray.rolling_var(window_start, window_end) Calculate a new SArray of the variance of different subsets over this SArray.
SArray.sample(fraction[, seed, exact]) Create an SArray which contains a subsample of the current SArray.
SArray.save(filename[, format]) Saves the SArray to file.
SArray.show([title, xlabel, ylabel]) Visualize the SArray.
SArray.shuffle() Randomly shuffles the elements of the SArray.
SArray.sort([ascending]) Sort all values in this SArray.
SArray.split_datetime([column_name_prefix, …]) Splits an SArray of datetime type to multiple columns, return a new SFrame that contains expanded columns.
SArray.stack([new_column_name, drop_na, …]) Convert a “wide” SArray to one or two “tall” columns in an SFrame by stacking all values.
SArray.std([ddof]) Standard deviation of all the values in the SArray.
SArray.str_to_datetime([format]) Create a new SArray with all the values cast to datetime.
SArray.sum() Sum of all values in this SArray.
SArray.summary([background, sub_sketch_keys]) Summary statistics that can be calculated with one pass over the SArray.
SArray.tail([n]) Get an SArray that contains the last n elements in the SArray.
SArray.to_numpy() Converts this SArray to a numpy array.
SArray.unique() Get all unique values in the current SArray.
SArray.unpack([column_name_prefix, …]) Convert an SArray of list, array, or dict type to an SFrame with multiple columns.
SArray.value_counts() Return an SFrame containing counts of unique values.
SArray.var([ddof]) Variance of all the values in the SArray.
SArray.vector_slice(start[, end]) If this SArray contains vectors or lists, this returns a new SArray containing each individual element sliced, between start and end (exclusive).
SArray.where(condition, istrue, isfalse[, dtype]) Selects elements from either istrue or isfalse depending on the value of the condition SArray.
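
To make a few of the one-line descriptions above concrete, here is a plain-Python sketch of what clip, cumulative_sum, and where compute according to their summaries. It models the semantics with lists and does not call Turi Create; the bounds, labels, and variable names are illustrative assumptions, and the real methods may accept other argument forms.

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5]

# clip(lower, upper): bound every value to the range [lower, upper].
clipped = [min(max(v, 2), 4) for v in values]

# cumulative_sum(): running total of the elements.
running = list(accumulate(values))

# where(condition, istrue, isfalse): choose per row based on the condition.
condition = [int(v > 2) for v in values]
istrue = ["big"] * len(values)
isfalse = ["small"] * len(values)
chosen = [t if c else f for c, t, f in zip(condition, istrue, isfalse)]
```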