Hatlab: a WIP-library inspired by Matlab and NumPy

Hume2 · June 15, 2022, 9:06pm

I wrote about this earlier to the mailing list haskel-cafe and I was advised to post it here. In the meantime, I added some more not so important functions and fixed a few bugs.

Hatlab is supposed to be a library which works with multi-dimensional arrays in the similar fashion as Matlab or NumPy. I like these frameworks a lot because it allows to concatenate matrices, replace parts of matrices, update matrices and even perform simple queries in one line without the need of writing any kind of loops. Given that, the execution can be accelerated a lot because the executor/compiler does not need to interpret any kind of complicated nested loops. And it is also more convenient for the programmer because the code is more clean and therefore less buggy. So I decided to write a similar framework for Haskell.

The matrices can have arbitrary dimensions, starting from 0D matrices, which are basically scalars. Operations like + - * / are element-wise. I also implemented broadcasting of these operations.

The core of this framework is indexing. It is a bit more complicated here because the syntax differs based on what we need the index for. If we want to retrieve a single element, we use the !# operator. If we want to retrieve a slice (submatrix), we use the !$ operator (remember $ like Slice). Examples:

a = matrix2D [[1,2,3],[4,5,6],[7,8,9]]
a !# [2,2] -- element at index [2,2]
a !$ [At 1, All] -- 1st row
a !$ [To (-1), All] -- all rows except the last one
a !$ [Where $ a ># 5] -- all elements bigger than 5 (The ># operator returns a matrix of bools, I will come to that later.)

The matrix cannot be changed but it is possible to make a new matrix with certain elements updated. It is done using the !: operator. This operator is followed by the slice index and then the way how the slice shall be updated. Some examples:

a = eye 5
b = matrix2D [[1,2,3],[4,5,6],[7,8,9]]
a !:[FromTo 1 4, FromTo 1 4]:= b -- replace the middle square by a different matrix
a !:[FromTo 1 4, FromTo 1 4]:= 1 -- set all values in the middle square to 1
a !:[FromTo 1 4, FromTo 1 4]:+= 1 -- add one to all values in the middle square
a !:[FromTo 1 4, FromTo 1 4]:$= cos -- replace all values in the middle square by their cosine

Operators are broadcasted the same way as in Matlab or NumPy. If one operand has size 1 and the other one has size N or vice-versa in given dimension, the operator acts as if the thinner matrix was repeated in the given dimension N times. Example:

a = matrix1D [10, 20, 30] -- row vector
b = matrix2D [[1], [2], [3]] -- column vector
a + b -- returns [[11, 21, 31], [12, 22, 32], [13, 23, 33]]

Matlab and NumPy also allow adding a scalar to a matrix, which is then interpreted as element-wise addition of that scalar. Something similar can be done in Hatlab too. There is a function scalar which creates a 0D matrix out of any value, so we can do:

a + (scalar 5) -- returns [15, 25, 35]
a * (scalar 2) -- returns [20, 40, 60]

If a type is in the class Num, Fractional or Floating, the matrix of the given type is also in the given class. Conversions are performed using the function scalar, so we can write:

a + 5
a * 2

because the 5 and 2 are then interpreted as 0D matrices. However, this does not work for example in the case a + b where a is a matrix and b is a variable of type Int, because addition of Matrix Int and Int is not defined. In such case, the syntax a + (scalar b) must be used.

Operator * is interpreted as element-wise multiplication. To do matrix multiplication, use the × operator. It can be also used to multiply matrices with vectors. Examples:

v = matrix1D [1,2,3]
m = matrix2D [[1,0,1],[2,-1,1],[0,1,-1]]
v × m -- multiply row vector v with matrix m
m × v -- multiply matrix m with column vector v
m × m -- multiply matrix m by itself

There are also functions for element-wise boolean operations. They work similarly and they have the # suffix. Examples:

a = matrix2D [[True, True], [False, False]]
b = matrix2D [[True, False], [True, False]]
a &# b -- element-wise and
a |# b -- element-wise or

There are also element-wise comparison functions in a similar fashion. Examples:

a = eye 3
b = ones [3,3]
a ==# b -- returns a matrix of bools saying where the element of matrix a is equal to the corresponding element in matrix b.
a <# b -- similar as above except that it tests for being less than instead of equality.
a ==# 1 -- Broadcasting works for these operators too, so this syntax is valid also.

This thread is not supposed to serve as a documentation, so I listed only some of the functions.

What I’m going to add:

linear equation solving (at least using Gaussian elimination)
inverse matrices, determinants
SVD decomposition, QR decomposition, eigenvalue decomposition, …
maybe more fancy functions to operate with matrices

Do you have any other idea what this library shall be able to do? Don’t hesitate to share it.

romes · June 16, 2022, 1:14am

Looks very cool! I’m looking forward to seeing more of it, matlab and numpy are good references

jodak · June 17, 2022, 9:55pm

Good job Hume. I’m happy to see that there are other people using Haskell for numerical computations. Since I wrote a similar library (nuha), I am interested in some aspects:

Have you done performance comparisons with other libraries from other languages like matlab or numpy?
I’m also using Data.Vector as the underlyling data structure, currently the unboxed version. Do you think a different underlying structure like bytestring could bring performance improvements?

I’ve also implemented QR decomposition which I use for a linear equation solver. Maybe there are some useful ideas for you. Pretty sure it works correct, but performance drops off noticeably for matrices larger than 50 x 50.

jaror · June 17, 2022, 10:16pm

I would recommend contiguous or just plain primitive arrays instead of bytestrings if you want a plain array type without the fusion mechanism that vector uses.

qbit · June 18, 2022, 4:57pm

Really nice work!!!

I think it would be nice to have the backend abstracted away, so users can easily switch between implementations (I think that would be the advantage over numpy). Haskell has quite a bit of array processing libs (accelerate, arrayfire, repa, massiv, etc) that can serve as backends.

turion · June 20, 2022, 7:22am

Time-series operations like moving averages, band filter, interpolation, …
Interoperability with standard formats like CSV, JSON
Higher dimensional tensors, and axis-polymorphic aggregation (sum, max, min, count, etc.)
Maybe streaming support? Basically, let one of the axes be a very long time axis and provide a way to manipulate that data without having all of it in memory at the same time

Hume2 · June 21, 2022, 2:25pm

I tried to implement Gaussian elimination. It works well, it can also work with rational numbers, but it is quite slow for matrices 50×50 or so. To achieve better performance, I have to transform the matrix to Data.Vector and back to force-evaluate the elements. But it is still slower than I expected.

Separating the backed from the framework is a good idea. I thought that it won’t be necessary but when I saw the performance of GEM, it would be great to have the ability to change the framework.

Adding moving averages, band filters, interpolation etc… is a good idea.
Interoperability with CSV and JSON seems to be useful also.
Aggregation is half-way implemented, there is a function reduce which can for example sum the array in a given axis.
Streaming is a good idea. That wouldn’t be hard to attain if the backed is separated from the framework. Currently, the matrix does not need to allocate memory for all its element if there is a way to calculate i-th element. For example, you can create a huge array of zeros with just a little memory allocated.

jodak · June 21, 2022, 5:25pm

After taking a closer look at your matrix data type, it appears to be purely algebraically defined with no underlying memory data structure. I think this approach is better suited for sparse matrices. There are also special libraries that take this approach. I’m not entirely sure, but I think general purpose matrix libraries like Matlab and Numpy define matrices in a dense way, either in row major or column major order.

Hume2 · June 23, 2022, 1:41pm

I decided for this structure to save memory. If I make a huge matrix and I do a local change, it does not copy whole the matrix. All elements except the changed ones point to the old matrix, only the changed elements point to a smaller data structure. When the matrix is created from a vector, the matrix points to the elements of the vector, so the vector becomes the underlying data structure. That is why transforming the matrix to vector and back improves performance. I hope, it’s understandable from my comment. Please, ask if I wasn’t clear enough.

jodak · June 23, 2022, 3:28pm

Sounds somehow reasonable to me. There is also MutableArray in Data.Primitive which may be used to avoid copying a whole matrix if one only wants to change a few elements. I find it difficult to decide which structure is most suitable for the different matrix types (dense/sparse) and operations you can have. My intuition tells me that the fastest way to do matrix operations is to go low level (think of Fortran wich is unbeaten in terms of speed). I think that in principle you could achieve similar speeds with Haskell, but you have to program Haskell in an uncommon low level way. On the other side, if you go low level, it seems to me you have to use internal GHC structures that in my opinion should be avoided.

The only way to really find out how a datastructure and implementation perform is measuring. There is a benchmark for existing matrix libraries here. Haven’t found the time yet to compare my library. Although I know my library is in some aspects slow, it’s fast enough for my purpose. At least until know.

jodak · June 23, 2022, 9:38pm

Thank you for your recommendations. What irritates me about the primitive package is that the Array datatype there just wraps the internal Array# type of GHC. It’s also not intuitive to me how to initialize an array from a list in primitive.

When I think of C and how arrays are defined there, I tend to use Foreign.Ptr and Foreign.Marshal.Array from base to built an own array type. Do you think this would be a viable option to achieve better performance? Or do you know of some existing array library that is based on Data.Foreign? I saw you did some speed optimized low level programing for the benchmarksgame page. So maybe you can give me some advice on this.

jaror · June 23, 2022, 10:18pm

Can you elaborate what you don’t like about this? I think it is nice because it is really as primitive as you can get. This has the least amount of overhead. Other libraries like array and vector add some extra layers on top of this but in the end they are also based on Array#.

For Array you can just use fromList because it is an instance of IsList (so you can also use normal list notation if you enable the OverloadedLists extension) and there is also a function arrayFromList. But for a numpy alternative you should probably be using PrimArray which is unboxed (like arrays in C). It has a function primArrayFromList.

I guess you could use ForeignPtr but I don’t see any clear advantages over PrimArray.

jodak · June 23, 2022, 11:03pm

I just thought that the types with an # at the end are internals of GHC and not part of the language standard. I only want to use types that are independent of the compiler, but maybe my fear is unfounded.

jaror · June 23, 2022, 11:15pm

Ah, that is a fair concern. GHC is so ubiquitous that it is very common for libraries to depend on GHC specific functionality. I think at this point such a choice is purely ideological because it doesn’t seem like there will be an alternative compiler that can rival GHC any time soon. So depending on GHC specific functionality doesn’t really have practical consequences, at least for the next few years.

But of course you are free to choose if you want to depend only on the standard or not.

jodak · June 23, 2022, 11:54pm

If Array# and PrimArray# are the basic types most other higher level array libraries depend on, would it be possible to decouple these types from GHC and move them to base? After all, the discussion about a missing standard array/vector type has been going on for a long time.

jaror · June 24, 2022, 1:15am

They are actually in base, but base is also for a large part GHC specific (certainly the GHC.* modules but also others).

There is ongoing work in splitting up base into GHC-specific and standard parts. And there are also plans to move some of primitive's wrapper types to base.

RCHG · February 24, 2023, 10:00pm

This is a nice projec, congratulations. Apologies that I am adding comments so late!

Here I suggest to have a look to the work done in OCaml for a complete scientific programming environment, maybe you can find there other ideas about what the library should do.

Here you have some links:

https://github.com/owlbarn/owl
https://ocaml.xyz/

Actually around this library there are 2 books that explains both the decisions taken to develop the tool, and the capabilities themselves. One of the books is open-source:

Architecture of Advanced Numerical Analysis Systems

In the case of owl, it relies on the bigarray library to handle multidimensional arrays. The choice of bigarray aparently allows a C-type representation which can be good for the interoperability with C external methods. An important aspect to achieve good speed (and done also in Julia and numpy is an interface to OpenBlas and Lapack), The documentation of owl has also a discussion of the implementation of sparse vs dense matrices.

Beyond the linear algebra itself. A good implementation of the multidimensional arrays together with the potential of haskell can be great for automatic differenciation and ODE solvers.