Roadmap
I've sketched out a four-step development plan which I believe will maximize the potential for adoption of this library. Careful attention has been paid to maximizing modularity, and therefore generality. Ultimately, I hope this solution could be dropped into any existing project in any language in a near-seamless manner.
The first step will build out the core functionality of the library in a pure Haskell context. The key deliverable will be a packaged Haskell library containing (at least a minimal, demonstrative set of) the target functions of the library. Secondary deliverables will include comprehensive tests, documentation, and benchmarks. My main learning focus will be on the Haskell language itself and on its software packaging and distribution ecosystem.
This implies an initial research phase which will begin to identify the functions the library will implement. I'll be looking largely to NumPy [4] and Scikit-learn [5] for inspiration and outlines here.
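As a rough illustration of the kind of core functions I have in mind, a first cut might look something like the sketch below; the module name and the particular functions are placeholders of my own, not a committed API.

```haskell
module HDSK.Stats
  ( mean
  , variance
  , euclidean
  ) where

-- Arithmetic mean of a sample.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population variance, defined in terms of the mean.
variance :: [Double] -> Double
variance xs = mean [ (x - m) ^ (2 :: Int) | x <- xs ]
  where
    m = mean xs

-- Euclidean distance between two equal-length vectors, the kind of
-- primitive a k-nearest-neighbours classifier would build on.
euclidean :: [Double] -> [Double] -> Double
euclidean xs ys = sqrt . sum $ zipWith (\x y -> (x - y) ^ (2 :: Int)) xs ys
```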
The second step concerns compatibility with non-Haskell projects, which is part of the value proposition of this project (and which I hope will improve adoption). Other than the (un)availability of libraries, the learning curve of the language itself is a hindrance to the adoption of Haskell in the data science community [2]. By exposing this library to projects in other languages, we can remove the language learning curve from the equation and offer a gentler introduction to the applications of Haskell within data science.
The ultimate goal (i.e. primary deliverable) would be a library of bindings in a target language that allows core library functions to be called as native functions while executing in the Haskell runtime. I don't know just yet how I'll go about this, but two options I've started investigating are outlined below:
- A combination of Haskell's native Foreign Function Interface [6] with nh2's call-haskell-from-anything tool [7]. Once the library in the target language is configured and/or generated, its functions appear to behave normally. However, this approach would require (in addition to regular maintenance of the core library) maintenance of the FFI layer (which may live in a separate module that simply re-exports core functions) and maintenance of serialization handling in the target language. A minimal sketch of such an FFI shim follows this list.
- Wrapping the library in an Apache Thrift [8] service. This option is attractive because of the robust existing ecosystem of projects (and online documentation) under this framework, and I also have a little personal experience with it. The problem I foresee from my initial research is that using Thrift would impose a heavy, opinionated structure on the project and make it more difficult to generate the client-language libraries. On the other hand, this RPC/service-oriented approach may lend itself better to a more diverse set of clients (e.g. being able to use HDSK from within browser-side JavaScript [9], though see the next section for more).
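To make the FFI option a little more concrete, here is a minimal sketch, with placeholder names of my own choosing, of re-exporting a core function through Haskell's foreign export mechanism. A tool like call-haskell-from-anything would then handle serialization of richer types on top of something like this.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

-- Illustrative FFI shim; the module and function names are placeholders.
module HDSK.FFI where

import Foreign.C.Types (CDouble (..))

-- A trivial stand-in for a core library function. Real bindings would
-- marshal whole vectors or serialized payloads rather than scalars.
scale :: CDouble -> CDouble -> CDouble
scale factor x = factor * x

-- Expose the function over the C calling convention so that a target
-- language can load the compiled library and call "hdsk_scale".
foreign export ccall "hdsk_scale"
  scale :: CDouble -> CDouble -> CDouble
```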
To evaluate these (and any other potential approaches), I will consider 1) performance (particularly of the serialization options) and 2) ease of use from the point of view of the target-language client.
At this point, it will be possible to begin testing for one of the secondary goals of the project: to provide more performant data science utilities than the current de facto solutions in the target languages.
In this stage, I hope to learn about how Haskell interoperates with other languages and to develop an understanding of remote procedure call frameworks and where they are most appropriately applied.
The third step is to provide the library as a web service: that is, to set the library down behind a web server and wire up each function to an endpoint. In my current conception, every top-level function in the library will have the same domain and range: data values [10]; specifically, data serializable to the JSON format (primarily numbers and lists of numbers).
This interface translates quite seamlessly to the Web. By providing the library as a service, we offer an alternative to the architectures above that makes no assumptions about, or requests of, the client; it simply provides the operations as a web service (you could see it as a Web-generalized version of the Thrift proposal above).
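As a sketch of what a single endpoint could look like (using Scotty purely as an example web framework; the route, port, and mean function are illustrative assumptions rather than decided parts of the design):

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Web.Scotty (json, jsonData, post, scotty)

-- Placeholder core function; the real library would export its own
-- statistics primitives.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = scotty 8080 $
  -- POST a JSON array of numbers to /mean; respond with a JSON number.
  post "/mean" $ do
    xs <- jsonData      -- decode the request body as [Double]
    json (mean xs)
```

Any client that can make an HTTP request and speak JSON could then call the function, which is exactly the "no assumptions about the client" property described above.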
I suggest this 1) because improving portability may improve adoption, and 2) because I'm interested in learning how to do this.
The fourth and final step is to publish the library in a container. This opens the door for developers to (for example) deploy the generated Thrift server with a single docker command for use in their projects. We really want to make the barrier to entry as low as possible, and I see this as a way of doing that.