diff --git a/posts/2024-06-30-clojurists-together-update-may-jun-2024.md b/posts/2024-06-30-clojurists-together-update-may-jun-2024.md
new file mode 100644
index 0000000..a93e16e
--- /dev/null
+++ b/posts/2024-06-30-clojurists-together-update-may-jun-2024.md
@@ -0,0 +1,57 @@
+Title: OSS Updates May and June 2024
+Date: 2024-06-30
+Tags: open source, clojure, clojurists together, oss updates
+
+This is a summary of the open source work I've spent my time on throughout May and June, 2024. There were lots of small bug fixes and reports, driven by work on the Clojure Data Cookbook. This work was also the impetus for my initial release of [`tcutils`](https://github.com/scicloj/tcutils), a library of utility functions for working with tablecloth datasets. I also had the wonderful opportunity to attend PyData London in June and found it really insightful and inspiring. Read on for more details.
+
+## Sponsors
+This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
+
+If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my [GitHub sponsors page](https://github.com/sponsors/kiramclean). On to the updates!
+
+## Ecosystem issue reports and bug fixes
+Working on the cookbook these last couple of months turned up a few small issues in ecosystem libraries. The other developers of Clojure's data science tools are such a pleasure to work with; it's so rare and nice to have a distributed team of people capable of getting cool things built asynchronously. Here are some details of a few particular issues that came up:
+- A small problem loading .xls/.xlsx files as datasets when they had a number as a column name (there's a short sketch of the scenario after this list): [discussed here](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/xlsx.20column.20parsing/near/437313810), [reported here](https://github.com/techascent/tech.ml.dataset/issues/408), and graciously [fixed by Chris Nuernberger](https://github.com/techascent/tech.ml.dataset/commit/24c0e646f289210aa95c1ac9998cb2ddd5c9f836).
+- Unexpected behaviour when comparing certain numeric types in `dtype-next`: [discussed here](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/numeric.20datatypes/near/438617694), [reported here](https://github.com/cnuernber/dtype-next/issues/99), and again [fixed by Chris](https://github.com/cnuernber/dtype-next/commit/563fe9c13797feb206391cd951655942e3e6cf0f). This one sadly had some unintended consequences that [generateme found and reported here](https://github.com/cnuernber/dtype-next/issues/103).
+- [Many improvements to Clay](https://github.com/scicloj/clay/blob/b299d060c3edbce789a55fee3efedce42fbd2ab4/CHANGELOG.md) by Daniel Slutsky, especially a couple that make the Quarto publications it produces much nicer: [fixing too-wide tables in Quarto pages](https://github.com/scicloj/clay/pull/102) and [supporting limiting the number of table rows that get displayed](https://clojurians.zulipchat.com/#narrow/stream/321125-noj-dev/topic/kindly.20options/near/440663980).
+- Some good discussions, led by Carsten Behring, about how best to incorporate the myriad dependencies required to use Java machine learning libraries in Clojure libs, including sorting out what to do about [transitive dependencies in our Tribuo wrapper](https://github.com/scicloj/scicloj.ml.tribuo/issues/1).
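+
+To give a sense of the xlsx issue above, here's roughly the kind of code that triggered it. This is only a sketch: it assumes tablecloth plus tech.ml.dataset's optional spreadsheet support (the Apache POI bindings) are on the classpath, and the file name and contents are made up.
+
+```clojure
+;; Sketch only -- assumes tablecloth and tech.ml.dataset's optional POI-based
+;; spreadsheet support are on the classpath; "sales.xlsx" is a made-up file.
+(require '[tablecloth.api :as tc])
+
+;; A spreadsheet whose header row contains a number (say, a year like 2024)
+;; is exactly the case that used to trip up xlsx parsing.
+(def sales (tc/dataset "sales.xlsx" {:key-fn str}))
+
+(tc/column-names sales)
+;; with the fix linked above, numeric headers come through as ordinary
+;; column names instead of causing an error
+```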
+
+## Initial release of tcutils
+In my explorations of other languages' tools for working with data I often come across nice utility functions that are super simple but have a big impact on the ergonomics of using the tools. I wanted to start bringing some of these convenience utilities to Clojure, so for now I'm putting them in [`tcutils`](https://github.com/scicloj/tcutils). So far only a handful of helpers are implemented (`lag`, `lead`, `cumsum`, and `clean-column-names`). The goal is to eventually fill out more utilities that save people from having to dig into the documentation of half a dozen different libraries to figure out how to implement things like these. The goal is not to achieve feature parity with or exactly copy similar libraries, like pandas or dplyr, but rather to take inspiration from them and make our tools easier to use for people who are used to these conveniences.
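+
+Here's a rough sketch of the kind of usage these helpers are aimed at. Treat the details as assumptions rather than documentation: the namespace, argument order, and generated column names may not match the released API exactly, so check the tcutils README for the real thing.
+
+```clojure
+;; Sketch only -- the namespace and signatures here are assumptions; see the
+;; tcutils README for the actual API.
+(require '[tablecloth.api :as tc]
+         '[scicloj.tcutils.api :as tcu])
+
+(def prices
+  (tc/dataset {"Closing Price" [10.0 11.5 13.0 12.25]
+               "Trading Day"   [1 2 3 4]}))
+
+(-> prices
+    tcu/clean-column-names       ;; e.g. "Closing Price" -> :closing-price
+    (tcu/lag :closing-price 1)   ;; previous row's value as a new column
+    (tcu/cumsum :closing-price)) ;; running total as a new column
+```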
+
+## Progress on Clojure Data Cookbook
+I spent a lot of time on the Clojure Data Cookbook over these last two months. Notable progress includes:
+- The introductory chapters now bear some resemblance to the final form they'll take.
+- The overall structure of the book is much clearer now.
+- I started the example analysis that will serve as the high-level introductory section of the book.
+- The publishing and deployment process is finally working.
+
+It's still very much in progress, but in the interest of transparency the work-in-progress version is [available online now](https://github.com/scicloj/clojure-data-cookbook). It will continue to evolve and change as I fill out more and more of the chapters, but there's enough of it available now to hopefully give a sense of the style and tone I'm going for. I also finally have the publishing workflow set up and it's generating a nice-looking Quarto book, thanks to all of Daniel Slutsky's amazing work on Clay and Quarto integration recently.
+
+## Progress on high-level goals
+The high-level goal of my work in general remains to steward Clojure's data science ecosystem to a state of maturity and flourishing so that data practitioners can use it to get real work done. Toward this end, I set up a [project board](https://github.com/users/kiramclean/projects/4) to track progress toward what I see as the main components of this project.
+
+Over the last couple of months, beginning with a prototype demoed at my [London Clojurians talk in April](https://www.youtube.com/watch?v=eUFf3-og_-Y), Daniel Slutsky has made tremendous progress on our goal of implementing a grammar of graphics in Clojure in the new [hanamicloth library](https://github.com/scicloj/hanamicloth). The near-term goal is to stabilize the API of this library enough that it can be used to provide a user-friendly way to accomplish all of the simple data visualization tasks that are currently possible with our other tools. The long-term goal is to take the lessons we learn from this library and build a JVM-only grammar of graphics library for doing data visualization "right" in Clojure.
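+
+For anyone who hasn't met the term, a grammar of graphics builds a plot by composing a dataset, aesthetic mappings (which columns drive x, y, colour, and so on), and layers, rather than calling one monolithic function per chart type. The sketch below only illustrates that style; the hanamicloth namespace and substitution keys shown are assumptions based on the demo and may differ from the current API.
+
+```clojure
+;; Illustration of the layered, grammar-of-graphics style -- the namespace and
+;; key names are assumptions, not the settled hanamicloth API.
+(require '[scicloj.hanamicloth.v1.api :as haclo]
+         '[tablecloth.api :as tc])
+
+(def measurements
+  (tc/dataset {:height [150 160 170 180]
+               :weight [52 61 70 84]
+               :group  [:a :a :b :b]}))
+
+(-> measurements
+    (haclo/base {:=x :height, :=y :weight, :=color :group}) ; data + mappings
+    haclo/layer-point                                        ; draw the points
+    haclo/layer-smooth)                                      ; add a fitted layer
+```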
+
+The development and surrounding discussions of hanamicloth have also made me realize it would be useful to write an overview of the current state of dataviz options for Clojure and why we're working on building something new. That's on my list for the coming months, but lower priority than actual development work.
+
+## Impressions from PyData London
+I got to attend PyData London this year thanks to a client of mine who was sponsoring the conference. I learned a lot and found the talks very interesting. My overall impression is that data science is maturing as a discipline, with more polished methods and robust theory backing up different approaches to data-related problems. With this maturation, though, comes higher expectations for production-ready, professional quality results. Most of the talks focused on high-level concerns like observability, scalability, and long-term stewardship of large open-source projects.
+
+There are a lot of reasons why Python is just not ideal for building highly available, high-performance systems, and I really believe this is a good time to be building alternative tools for data science. Python is obviously entrenched as the current default language for working with data, but it is difficult and slow to write Python code that can take full advantage of modern hardware (thanks to the infamous global interpreter lock, reference counting, and slow I/O, among other things). And to be fair, the Python community knows this. It's why virtually all of the libraries that do the heavy lifting for data science in Python are actually implemented in C (numpy, pandas) or Rust (Polars, Pydantic), or are wrappers around C++ (PyTorch, TensorFlow, matplotlib) or Java (PySpark, Pydoop, confluent-kafka) libraries.
+
+I think this provides a lot of insight into what data practitioners want. It's clear that users _want_ approachable, simple, human-readable interfaces for all of these tools, and that any new tool needs to interoperate with the rest of the ones currently in use. People are also [tired of churn](https://news.ycombinator.com/item?id=40815097) and are craving stability. I think Clojure has a lot to offer in all of these areas and is well placed to become more widely adopted for data science.
+
+## Ongoing work
+My focus over the next two months will remain on the cookbook. My main goals are to finish the introductory chapter with the housing price analysis and to continue putting together the data import section, with instructions and examples for all of the file formats that can reasonably be supported at this time.
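+
+As a taste of what the data import section is collecting, most of the examples look like the sketch below: one call per format, with the relevant options spelled out. The file name and options here are made up for illustration.
+
+```clojure
+;; Sketch of the data-import style the cookbook is collecting -- the file and
+;; options are made up for illustration.
+(require '[tablecloth.api :as tc])
+
+(def housing
+  (tc/dataset "data/housing.csv"
+              {:key-fn    keyword               ;; header strings -> keywords
+               :separator \,                    ;; explicit, though this is the default
+               :parser-fn {"price" :float64}})) ;; force the price column's type
+```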
+
+I'll continue to support and contribute to all of the ecosystem libraries I come across in my writings and analysis work in hopes of smoothing out all the rough edges I find.
+
+Thanks for reading. I always love hearing from people who are interested in any of the things I'm working on. If that's you, don't hesitate to be in touch :)
+
+
+
+
+
+
+
diff --git a/public/2024-06-30-clojurists-together-update-may-jun-2024.html b/public/2024-06-30-clojurists-together-update-may-jun-2024.html
new file mode 100644
index 0000000..8590d5f
--- /dev/null
+++ b/public/2024-06-30-clojurists-together-update-may-jun-2024.html
@@ -0,0 +1,102 @@
+
+
+
This is a summary of the open source work I've spent my time on throughout May and June, 2024. There were lots of small bug fixes and reports, driven by work on the Clojure Data Cookbook. This work was also the impetus for my initial release of tcutils, a library of utility functions for working with tablecloth datasets. I also had the wonderful opportunity to attend PyData London in June and found it really insightful and inspiring. Read on for more details.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Ecosystem issue reports and bug fixes
Working on the cookbook these last couple of months turned up a few small issues in ecosystem libraries. The other developers of Clojure's data science tools are such a pleasure to work with, it's so rare and nice to have a distributed team of people capable of getting cool things built asynchronously. Here are some details of a few particular issues that came up:
Some good discussions about how best to incorporate the myriad of dependencies required to use Java machine learning libraries in Clojure libs, including sorting out what to do about transitive dependencies in our tribuo wrapper, led by Carsten Behring.
Initial release of tcutils
In my explorations of other languages' tools for working data I often come across nice utility functions that are super simple but have a big impact on the ergonomics of using the tools. I wanted to start bringing some of these convenience utilities to Clojure, so for now I'm putting them in tcutils. So far only a handful of helpers are implemented (lag, lead, cumsum, and clean-column-names). The goal is to eventually fill out more utilities that save people from having to dig into the documentation of half a dozen different libraries to figure out how to implement things like these. The goal is not to achieve feature parity or to exactly copy similar libraries, like pandas or dplyr, but rather to take inspiration from them and make our tools easier to use for people who are used to these conveniences.
Progress on Clojure Data Cookbook
I spent a lot of time on the Clojure Data Cookbook over these last two months. Notable progress includes:
The introductory chapters bear some resemblance now to the final form they'll take.
The overall structure of the book is much more clear now.
I started the example analysis that will serve as the high-level introductory section of the book.
The publishing and deployment process is finally working.
It's still very much in progress, but in the interest of transparency the work-in-progress version is available online now. It will continue to evolve and change as I fill out more and more of the chapters, but there's enough of it available now to hopefully give a sense of the style and tone I'm going for. I also finally have the publishing workflow set up and it's generating a nice-looking Quarto book, thanks to all of Daniel Slutsky's amazing work on Clay and Quarto integration recently.
Progress on high-level goals
The high-level goal of my work in general remains to steward Clojure's data science ecosystem to a state of maturity and flourishing so that data practitioners can use it to get real work done. Toward this end, I set up a project board to track progress toward what I see as the main components of this project.
Over the last couple of months, beginning with a prototype demoed at my London Clojurians talk in April, Daniel Slutsky has made tremendous progress on our goal of implementing a grammar of graphics in Clojure in the new hanamicloth library. The near-term goal is to stabilize the API of this library enough that it can be used to provide a user-friendly way to accomplish all of the simple data visualization tasks that are currently possible with our other tools. The long term goal is to take the lessons we learn from this library and build a JVM-only grammar of graphics library for doing data visualization "right" in Clojure.
The development and surrounding discussions of hanamicloth have also made me realize it would be useful to write an overview of the current state of dataviz options for Clojure and why we're working on building something new. That's on my list for the coming months, but lower priority than actual development work.
Impressions from PyData London
I got to attend PyData London this year thanks to a client of mine who was sponsoring the conference. I learned a lot and found the talks very interesting. My overall impression is that data science is maturing as a discipline, with more polished methods and robust theory backing up different approaches to data-related problems. With this maturation, though, comes higher expectations for production-ready, professional quality results. Most of the talks focused on high-level concerns like observability, scalability, and long-term stewardship of large open-source projects.
There are a lot of reasons why Python is just not ideal for building highly available, high-performance systems, and I really believe this is a good time to be building alternative tools for data science. Python is obviously entrenched as the current default language for working with data, but it is difficult and slow to write code that can take full advantage of modern hardware (because of the infamous global interpreter lock, reference counting, slow I/O, among other reasons). And to be fair, the Python community knows this. It's why virtually all of the libraries that do the heavy lifting for data science in Python are actually implemented in C (numpy, pandas) or Rust (Polars, Pydantic), or are wrappers around C++ (PyTorch, TensorFlow, matplotlib) or Java (PySpark, Pydoop, confluent-kafka) libraries.
I think this provides a lot of insights into what data practitioners want. It's clear that users want approachable, simple, human-readable interfaces for all of these tools, and that any new tool needs to interoperate with the rest of the ones currently in use. People are also tired of churn and are craving stability. I think Clojure has a lot to offer in all of these areas and is well placed to become more widely adopted for data science.
Ongoing work
My focus over the next two months will remain on the cookbook. My main goal is to finish the introductory chapter with the housing price analysis and to continue putting together the data import section with instructions and examples for all file formats that can reasonably be supported easily at this time.
I'll continue to support and contribute to all of the ecosystem libraries I come across in my writings and analysis work in hopes of smoothing out all the rough edges I find.
Thanks for reading. I always love hearing from people who are interested in any of the things I'm working on. If that's you, don't hesitate to be in touch :)
diff --git a/public/atom.xml b/public/atom.xml
index 71f398c..0dad078 100644
--- a/public/atom.xml
+++ b/public/atom.xml
@@ -3,11 +3,18 @@
Code with Kira
- 2024-05-01T18:10:13+00:00
+ 2024-07-01T03:36:37+00:00https://codewithkira.comKira McLean
+
+ https://codewithkira.com/2024-06-30-clojurists-together-update-may-jun-2024.html
+
+ OSS Updates May and June 2024
+ 2024-06-30T23:59:59+00:00
+ This is a summary of the open source work I've spent my time on throughout May and June, 2024. There were lots of small bug fixes and reports, driven by work on the Clojure Data Cookbook. This work was also the impetus for my initial release of tcutils, a library of utility functions for working with tablecloth datasets. I also had the wonderful opportunity to attend PyData London in June and found it really insightful and inspiring. Read on for more details.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Ecosystem issue reports and bug fixes
Working on the cookbook these last couple of months turned up a few small issues in ecosystem libraries. The other developers of Clojure's data science tools are such a pleasure to work with, it's so rare and nice to have a distributed team of people capable of getting cool things built asynchronously. Here are some details of a few particular issues that came up:
Some good discussions about how best to incorporate the myriad of dependencies required to use Java machine learning libraries in Clojure libs, including sorting out what to do about transitive dependencies in our tribuo wrapper, led by Carsten Behring.
Initial release of tcutils
In my explorations of other languages' tools for working data I often come across nice utility functions that are super simple but have a big impact on the ergonomics of using the tools. I wanted to start bringing some of these convenience utilities to Clojure, so for now I'm putting them in tcutils. So far only a handful of helpers are implemented (lag, lead, cumsum, and clean-column-names). The goal is to eventually fill out more utilities that save people from having to dig into the documentation of half a dozen different libraries to figure out how to implement things like these. The goal is not to achieve feature parity or to exactly copy similar libraries, like pandas or dplyr, but rather to take inspiration from them and make our tools easier to use for people who are used to these conveniences.
Progress on Clojure Data Cookbook
I spent a lot of time on the Clojure Data Cookbook over these last two months. Notable progress includes:
The introductory chapters bear some resemblance now to the final form they'll take.
The overall structure of the book is much more clear now.
I started the example analysis that will serve as the high-level introductory section of the book.
The publishing and deployment process is finally working.
It's still very much in progress, but in the interest of transparency the work-in-progress version is available online now. It will continue to evolve and change as I fill out more and more of the chapters, but there's enough of it available now to hopefully give a sense of the style and tone I'm going for. I also finally have the publishing workflow set up and it's generating a nice-looking Quarto book, thanks to all of Daniel Slutsky's amazing work on Clay and Quarto integration recently.
Progress on high-level goals
The high-level goal of my work in general remains to steward Clojure's data science ecosystem to a state of maturity and flourishing so that data practitioners can use it to get real work done. Toward this end, I set up a project board to track progress toward what I see as the main components of this project.
Over the last couple of months, beginning with a prototype demoed at my London Clojurians talk in April, Daniel Slutsky has made tremendous progress on our goal of implementing a grammar of graphics in Clojure in the new hanamicloth library. The near-term goal is to stabilize the API of this library enough that it can be used to provide a user-friendly way to accomplish all of the simple data visualization tasks that are currently possible with our other tools. The long term goal is to take the lessons we learn from this library and build a JVM-only grammar of graphics library for doing data visualization "right" in Clojure.
The development and surrounding discussions of hanamicloth have also made me realize it would be useful to write an overview of the current state of dataviz options for Clojure and why we're working on building something new. That's on my list for the coming months, but lower priority than actual development work.
Impressions from PyData London
I got to attend PyData London this year thanks to a client of mine who was sponsoring the conference. I learned a lot and found the talks very interesting. My overall impression is that data science is maturing as a discipline, with more polished methods and robust theory backing up different approaches to data-related problems. With this maturation, though, comes higher expectations for production-ready, professional quality results. Most of the talks focused on high-level concerns like observability, scalability, and long-term stewardship of large open-source projects.
There are a lot of reasons why Python is just not ideal for building highly available, high-performance systems, and I really believe this is a good time to be building alternative tools for data science. Python is obviously entrenched as the current default language for working with data, but it is difficult and slow to write code that can take full advantage of modern hardware (because of the infamous global interpreter lock, reference counting, slow I/O, among other reasons). And to be fair, the Python community knows this. It's why virtually all of the libraries that do the heavy lifting for data science in Python are actually implemented in C (numpy, pandas) or Rust (Polars, Pydantic), or are wrappers around C++ (PyTorch, TensorFlow, matplotlib) or Java (PySpark, Pydoop, confluent-kafka) libraries.
I think this provides a lot of insights into what data practitioners want. It's clear that users want approachable, simple, human-readable interfaces for all of these tools, and that any new tool needs to interoperate with the rest of the ones currently in use. People are also tired of churn and are craving stability. I think Clojure has a lot to offer in all of these areas and is well placed to become more widely adopted for data science.
Ongoing work
My focus over the next two months will remain on the cookbook. My main goal is to finish the introductory chapter with the housing price analysis and to continue putting together the data import section with instructions and examples for all file formats that can reasonably be supported easily at this time.
I'll continue to support and contribute to all of the ecosystem libraries I come across in my writings and analysis work in hopes of smoothing out all the rough edges I find.
Thanks for reading. I always love hearing from people who are interested in any of the things I'm working on. If that's you, don't hesitate to be in touch :)
This is a summary of the open source work I've spent my time on throughout March and April, 2024. Overall it was a really insightful couple of months for me, with lots of productive discussions and meetings happening among key contributors to Clojure's data science ecosystem and great progress toward some of our most ambitious goals.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Grammar of graphics in Clojure
With help from Daniel Slutsky and others in the community, I started some concrete work on implementing a grammar of graphics in Clojure. I'm convinced this is the correct long-term solution for dataviz in Clojure, but it is a big project that will take time, including a lot of hammock time. It's still useful to play around with proofs of concept whilst thinking through problems, though, and in the interest of transparency I'm making all of those experiments public.
The discussions around this development are all also happening in public. There were two visual tools meetups focused on this over the last two months (link 1, link 2). And at the London Clojurians talk I just gave today I demonstrated an example of one proposed implementation of a grammar-of-graphics-like API on top of hanami implemented by Daniel.
There are more meetups planned for the coming months and work in this area for the foreseeable future will look like researching and understanding the fundamentals of the grammar of graphics in order to design a simple implementation in Clojure.
Clojure's ML and statistics tools
I spent a lot of time these last couple of months documenting and testing out Clojure's current ML tools, leading to many great conversations and one blog post that generated many more interesting discussions. The takeaway is that the tools themselves in this area are all quite mature and stable, but there are still ongoing discussions around how to best accommodate the different ways that people want to work with them. The overall goal in this area of my work is to stabilize the solutions so we can start advocating for specific ways of using them.
Below are some key takeaways from my research into all this stuff. Note none of these are my decisions to make alone, but represent my current opinions and what I will be advocating for within the community:
Smile will be slowly sunsetted from the ecosystem. The switch to GPL licensing was made in bad faith and many of the common models don't work on Apple chips. Given the abundance of suitable alternatives, the easiest option is to move away from depending on it.
A greater distinction between statistical modelling and machine learning workflows will be helpful. Right now there are many uses of the various models that are available in Clojure, and the wrappers and tools surrounding them are usually designed with a specific type of user in mind. For example machine learning people almost always have separate training and testing datasets, whereas statisticians "train" their models on an entire dataset. The highest-level APIs for these different usages (among others) look quite different, and we would benefit from having APIs that are ergonomic and familiar to our target users of various backgrounds.
We should agree on standards for accomplishing certain very common and basic tasks and propose a recommended usage for users. For example, there are almost a dozen ways to do linear regression in Clojure and it's not obvious which is "the best" way to someone not deeply familiar with the ecosystem.
Everything should work with tablecloth datasets and expect them as inputs. This is mostly the case already, but there is still some progress to be made.
Foundations of Clojure's data science stack
I continue to work on guides and tutorials for the parts of Clojure's data science stack that I feel are ready for prime time, mainly tablecloth and all of the amazing underlying libraries it leverages. Every once in a while this turns up surprises, for example this month I was surprised at how column header processing is handled for nippy files specifically. I also fixed one bug in tablecloth itself, which I discovered in the process of writing a tutorial earlier in March. I have a pile of in-progress guides focusing on some more in-depth topics from developing the London Clojurians talk that I'm going to tidy up and publish in the coming months.
The overarching goal in this area is to create a unified data science stack with libraries for processing, modelling, and visualization that all interoperate seamlessly and work with tablecloth datasets, like the tidyverse in R. Part of achieving that is making sure that tablecloth is rock solid, which just takes a lot of poking and prodding.
London Clojurians talk
This talk was a big inspiration for diving deep into Clojure's data science ecosystem. I experimented with a ton of different datasets for the workshop and discovered tons of potential areas for future development. Trying to put together a polished data workflow really exposed many of the key areas I think we should be focusing on and gave me a lot of inspiration for future work. I spent a ton of time exploring all of the possible ways to demonstrate a broad sample of data science tools and learned a lot along the way.
The resources from the talk are all available in this repo and the video will be posted soon.
Summary of future work
I mentioned a few areas of focus above, below is a summary of the ongoing work as I see it. A framework for organizing this work is starting to emerge, and I've been thinking about in terms of four key areas:
Visualisation
Priority here is to release a stable dataviz API using the tools and wrappers we currently have so that we can start releasing guides and tutorials that follow a consistent style.
The long-term goal is to develop a robust, flexible, and stable data visualization library in Clojure itself based on the grammar of graphics.
Machine learning
Priority is to decide which APIs we will commit to supporting in the long term and stabilize the "glue" libraries that provide the high-level APIs for data-first users.
Long term goal is to support the full spectrum of libraries and models that are in everyday use by data science professionals.
Statistics
Priority is to document the current options for accomplishing basic statistical modelling tasks, including Clojure libraries we do have, Java libs, and Python interop.
Long term goal is to have tablecloth-compatible stats libraries implemented in pure Clojure.
Foundations
Priority is to build a tidyverse for Clojure. This includes battle-testing tablecloth, fully documenting its capabilities, and fixing remaining, small, sharp edges.
Going forward
My overarching goal (personally) is still to write a canonical resource for working with Clojure's data science stack (the Clojure Data Cookbook), and I'm still working on finding the right balance of documenting "work-in-progress" tools and libraries vs. delaying progress until I feel they are more "ready". Until now I've let the absence of stable or ideal APIs in certain areas hinder development of this book, but I'm starting to feel very confident in my understanding of the current direction of the ecosystem, enough so that I would feel good about releasing something a little bit more formal than a tutorial or guide and recommending usages with the caveat that development is ongoing in some areas. And while it will take a while to get where we want to go, I feel like I can finally see the path to getting there. It just takes a lot of work and lot of collaboration, but with your support we'll make it happen! Thanks for reading.
This is a summary of the open source work I've spent my time on throughout May and June, 2024. There were lots of small bug fixes and reports, driven by work on the Clojure Data Cookbook. This work was also the impetus for my initial release of tcutils, a library of utility functions for working with tablecloth datasets. I also had the wonderful opportunity to attend PyData London in June and found it really insightful and inspiring. Read on for more details.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Ecosystem issue reports and bug fixes
Working on the cookbook these last couple of months turned up a few small issues in ecosystem libraries. The other developers of Clojure's data science tools are such a pleasure to work with, it's so rare and nice to have a distributed team of people capable of getting cool things built asynchronously. Here are some details of a few particular issues that came up:
Some good discussions about how best to incorporate the myriad of dependencies required to use Java machine learning libraries in Clojure libs, including sorting out what to do about transitive dependencies in our tribuo wrapper, led by Carsten Behring.
Initial release of tcutils
In my explorations of other languages' tools for working data I often come across nice utility functions that are super simple but have a big impact on the ergonomics of using the tools. I wanted to start bringing some of these convenience utilities to Clojure, so for now I'm putting them in tcutils. So far only a handful of helpers are implemented (lag, lead, cumsum, and clean-column-names). The goal is to eventually fill out more utilities that save people from having to dig into the documentation of half a dozen different libraries to figure out how to implement things like these. The goal is not to achieve feature parity or to exactly copy similar libraries, like pandas or dplyr, but rather to take inspiration from them and make our tools easier to use for people who are used to these conveniences.
Progress on Clojure Data Cookbook
I spent a lot of time on the Clojure Data Cookbook over these last two months. Notable progress includes:
The introductory chapters bear some resemblance now to the final form they'll take.
The overall structure of the book is much more clear now.
I started the example analysis that will serve as the high-level introductory section of the book.
The publishing and deployment process is finally working.
It's still very much in progress, but in the interest of transparency the work-in-progress version is available online now. It will continue to evolve and change as I fill out more and more of the chapters, but there's enough of it available now to hopefully give a sense of the style and tone I'm going for. I also finally have the publishing workflow set up and it's generating a nice-looking Quarto book, thanks to all of Daniel Slutsky's amazing work on Clay and Quarto integration recently.
Progress on high-level goals
The high-level goal of my work in general remains to steward Clojure's data science ecosystem to a state of maturity and flourishing so that data practitioners can use it to get real work done. Toward this end, I set up a project board to track progress toward what I see as the main components of this project.
Over the last couple of months, beginning with a prototype demoed at my London Clojurians talk in April, Daniel Slutsky has made tremendous progress on our goal of implementing a grammar of graphics in Clojure in the new hanamicloth library. The near-term goal is to stabilize the API of this library enough that it can be used to provide a user-friendly way to accomplish all of the simple data visualization tasks that are currently possible with our other tools. The long term goal is to take the lessons we learn from this library and build a JVM-only grammar of graphics library for doing data visualization "right" in Clojure.
The development and surrounding discussions of hanamicloth have also made me realize it would be useful to write an overview of the current state of dataviz options for Clojure and why we're working on building something new. That's on my list for the coming months, but lower priority than actual development work.
Impressions from PyData London
I got to attend PyData London this year thanks to a client of mine who was sponsoring the conference. I learned a lot and found the talks very interesting. My overall impression is that data science is maturing as a discipline, with more polished methods and robust theory backing up different approaches to data-related problems. With this maturation, though, comes higher expectations for production-ready, professional quality results. Most of the talks focused on high-level concerns like observability, scalability, and long-term stewardship of large open-source projects.
There are a lot of reasons why Python is just not ideal for building highly available, high-performance systems, and I really believe this is a good time to be building alternative tools for data science. Python is obviously entrenched as the current default language for working with data, but it is difficult and slow to write code that can take full advantage of modern hardware (because of the infamous global interpreter lock, reference counting, slow I/O, among other reasons). And to be fair, the Python community knows this. It's why virtually all of the libraries that do the heavy lifting for data science in Python are actually implemented in C (numpy, pandas) or Rust (Polars, Pydantic), or are wrappers around C++ (PyTorch, TensorFlow, matplotlib) or Java (PySpark, Pydoop, confluent-kafka) libraries.
I think this provides a lot of insights into what data practitioners want. It's clear that users want approachable, simple, human-readable interfaces for all of these tools, and that any new tool needs to interoperate with the rest of the ones currently in use. People are also tired of churn and are craving stability. I think Clojure has a lot to offer in all of these areas and is well placed to become more widely adopted for data science.
Ongoing work
My focus over the next two months will remain on the cookbook. My main goal is to finish the introductory chapter with the housing price analysis and to continue putting together the data import section with instructions and examples for all file formats that can reasonably be supported easily at this time.
I'll continue to support and contribute to all of the ecosystem libraries I come across in my writings and analysis work in hopes of smoothing out all the rough edges I find.
Thanks for reading. I always love hearing from people who are interested in any of the things I'm working on. If that's you, don't hesitate to be in touch :)
I had a really enlightening talk with Daniel Slutsky this week (who is an exceptional data scientist, software engineer, and community organizer I highly recommended meeting if you haven't already) about the current state of the machine learning landscape in Clojure. This post is my attempt to distill it into a summary for the community's benefit, so more people can understand where things are at and what the active developers in this space are working on.
It's no secret I love Clojure and especially working with data in Clojure, but it's fair to say that the Clojure for data science ecosystem is not anywhere near as easy to use or understand as reasonable potential users might expect. This is the main problem I'm focusing on this year, and there is significant effort being put into refining our tools to make them more accessible to a wider audience.
There are already people doing "real" machine learning work in Clojure, though, and below is an overview of what the current state of our libraries and tools are in that area, as of April 2024.
Update 2024-04-08: It's worth mentioning that deep learning and LLM libraries have been intentionally left out of this post in order to keep it a "reasonable" length. There is enough separate work happening in that space that it warrants its own, separate overview.
Summary
There are a lot of links in this post. This table is an attempt to aggregate and summarize them. There are more details worth reading below, but in case you don't have time, this is the gist of it. To make a very long story short, current efforts are heavily focused on consolidating all of these amazing libraries into one (or at least a small number of clearly delineated ones) that is/are easy-to-use, providing a comphrehensive toolkit for doing machine learning in Clojure.
Bridge from Clojure to R, less relevance for ML compared to Python interop
EPL-2.0
In addition to all of these libraries, the post mentions the Clojurians Zulip, the main Clojure-for-data-science community discussion forum, where main contributors to the ecosystem are active daily.
Java ML Libraries
There are two (sort of four) popular Java libraries that implement many of the main algorithms and tools used in machine learning today (e.g. classification, regression, clustering, model development, etc.): Tribuo (including XGBoost, more on that in a second) and Smile. We count Smile as two libraries because Smile 2.x is LGPL-licensed, and Smile 3.x is GPL-licensed, which poses some potential conflicts for some end users. The community consensus is converging around moving away from Smile due to the GPL-relicensing issue, focusing instead on Tribuo and hand-rolled solutions.
There is also XGBoost for the JVM, mentioned above, which is an implementation of gradient boosting. XGBoost is a collection of algorithms whereas Tribuo is a more comprehensive framework (including things like data management, model evaluation, and experiment tracking). XGBoost can be used from Tribuo, so I don't exactly count it as a standalone library, although it can also be used in that way.
Clojure wrappers
There are two main "families" of libraries that wrap these Java ML libraries in Clojure.
Fastmath includes statistical as well as machine learning tools for Clojure. Fastmath 2.4.0+ depends on Smile 2, and the forthcoming fastmath 3.x will have no Smile dependency at all. The clustering functionality in fastmath 2.x that depended on smile has been moved to the fastmath-clustering library, which will have a Smile 2.x dependency going forward. There is a strong preference in the community to avoid introducing GPL-licensed libraries into the ecosystem.
Clustering functionality will mostly be provided by scicloj.ml.tribuo going forward which, as you might expect, wraps the Tribuo Java library, and is likely to become the main source of ML algorithms for the ecosystem. This is one of a few libraries in the second family of libraries that wrap the Java libraries mentioned above. Other (self-explanatory) ones include scicloj.ml.smile, which wraps more of Smile than fastmath did (does), and scicloj.ml.xgboost.
It's also worth mentioning tech.ml.dataset (the core dataframe/dataset library underlying tablecloth), which incorporates some of the functionality of tribuo, with the API centred around individual datasets. There also used to be a library called tech.ml, which implements some machine learning tools, but has been deprecated in favour of the various libraries discussed above.
The concept of orienting an API around individual datasets vs something else leads me to the next group of libraries.
Clojure ML Pipelines
Metamorph is a library that implements a function composition mechanism for composing ML pipelines. It arises from the common ML practice of repeatedly running the same set of functions with varied parameters. You might, for example, try many different test/train splits to see how that affects your results, or fit the same data using many different algorithms, or try training your model using different sets of features. This leads to an explosion of pipeline permutations, so it's useful to have machinery to encapsulate the variable components of your ML pipeline into a single function. This is where metamorph.ml comes in.
Metamorph.ml is based on this concept of meta-functions and pipelines. It is currently the central library for orchestrating ML pipelines in Clojure. The API is stable, but there are currently many ways (10+) to achieve the same outcomes. This is great for power users who have complex needs and a clear understanding of the metamorph mental model, but it can be a bit daunting for newcomers, making it more challenging to pick a clear place to start. The community is actively discussing the best approach for consolidating and/or documenting these different approaches in the interest of making Clojure's ML stack more accessible.
Collections/Frameworks
The community is well aware that it is difficult to know where to get started and several efforts have been made in an attempt to make the path more clear for people who want tools that Just Work. scicloj.ml is one such project. It's a collection of libraries (mostly the ones mentioned above) with some lightweight wrappers and efforts in creating documentation.
The community is heading toward deprecating this library, though, in favour of noj, which we are hoping to stabilize in the near future. The goal is to have a single entry-point into the Clojure data science stack, gathering all the tools one would need to work with data consolidated in one place, with seamless interoperability akin to R's tidyverse of libraries.
Interop
It wouldn't be a complete roundup of the state of ML in Clojure without a mention of libpython-clj. This is a library that provides Python bindings for Clojure, so you can call Python code directly from Clojure if necessary. sklearn-clj makes use of this bridge to provide direct access to all of the estimators and models from Python scikit-learn in Clojure, so for cases where something is truly only available in Python, we can still access it.
It's worth also briefly mentioning clojisr here, which is a similar kind of bridge from Clojure to R (and there exist libraries for Julia and Wolframite, too), but these are all less relevant for the specific area of ML, where Python is the overwhelmingly most popular current tool of choice.
More updates
These discussions all happen in the open, on the Clojurian's Zulip instance, which has become the main gathering place of the Clojure-for-data-science community. The #data-science and #noj-dev streams are the most active on these topics at the time of this writing. You can follow along with developments in the trenches over there, or follow the key libraries on github for updates (scicloj.ml.tribuo, metamorph.ml, noj). I will also post periodic updates here and all the other corners of the internet where I lurk. Thank you for reading!
This is a summary of the open source work I've spent my time on throughout March and April, 2024. Overall it was a really insightful couple of months for me, with lots of productive discussions and meetings happening among key contributors to Clojure's data science ecosystem and great progress toward some of our most ambitious goals.
Sponsors
This work is made possible by the generous ongoing support of my sponsors. I appreciate all of the support the community has given to my work and would like to give a special thanks to Clojurists Together and Nubank for providing me with lucrative enough grants that I can reduce my client work significantly and afford to spend more time on these projects.
If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!
Grammar of graphics in Clojure
With help from Daniel Slutsky and others in the community, I started some concrete work on implementing a grammar of graphics in Clojure. I'm convinced this is the correct long-term solution for dataviz in Clojure, but it is a big project that will take time, including a lot of hammock time. It's still useful to play around with proofs of concept whilst thinking through problems, though, and in the interest of transparency I'm making all of those experiments public.
The discussions around this development are all also happening in public. There were two visual tools meetups focused on this over the last two months (link 1, link 2). And at the London Clojurians talk I just gave today I demonstrated an example of one proposed implementation of a grammar-of-graphics-like API on top of hanami implemented by Daniel.
There are more meetups planned for the coming months and work in this area for the foreseeable future will look like researching and understanding the fundamentals of the grammar of graphics in order to design a simple implementation in Clojure.
Clojure's ML and statistics tools
I spent a lot of time these last couple of months documenting and testing out Clojure's current ML tools, leading to many great conversations and one blog post that generated many more interesting discussions. The takeaway is that the tools themselves in this area are all quite mature and stable, but there are still ongoing discussions around how to best accommodate the different ways that people want to work with them. The overall goal in this area of my work is to stabilize the solutions so we can start advocating for specific ways of using them.
Below are some key takeaways from my research into all this stuff. Note none of these are my decisions to make alone, but represent my current opinions and what I will be advocating for within the community:
Smile will be slowly sunsetted from the ecosystem. The switch to GPL licensing was made in bad faith and many of the common models don't work on Apple chips. Given the abundance of suitable alternatives, the easiest option is to move away from depending on it.
A greater distinction between statistical modelling and machine learning workflows will be helpful. Right now there are many uses of the various models that are available in Clojure, and the wrappers and tools surrounding them are usually designed with a specific type of user in mind. For example machine learning people almost always have separate training and testing datasets, whereas statisticians "train" their models on an entire dataset. The highest-level APIs for these different usages (among others) look quite different, and we would benefit from having APIs that are ergonomic and familiar to our target users of various backgrounds.
We should agree on standards for accomplishing certain very common and basic tasks and propose a recommended usage for users. For example, there are almost a dozen ways to do linear regression in Clojure and it's not obvious which is "the best" way to someone not deeply familiar with the ecosystem.
Everything should work with tablecloth datasets and expect them as inputs. This is mostly the case already, but there is still some progress to be made.
Foundations of Clojure's data science stack
I continue to work on guides and tutorials for the parts of Clojure's data science stack that I feel are ready for prime time, mainly tablecloth and all of the amazing underlying libraries it leverages. Every once in a while this turns up surprises, for example this month I was surprised at how column header processing is handled for nippy files specifically. I also fixed one bug in tablecloth itself, which I discovered in the process of writing a tutorial earlier in March. I have a pile of in-progress guides focusing on some more in-depth topics from developing the London Clojurians talk that I'm going to tidy up and publish in the coming months.
The overarching goal in this area is to create a unified data science stack with libraries for processing, modelling, and visualization that all interoperate seamlessly and work with tablecloth datasets, like the tidyverse in R. Part of achieving that is making sure that tablecloth is rock solid, which just takes a lot of poking and prodding.
London Clojurians talk
This talk was a big inspiration for diving deep into Clojure's data science ecosystem. I experimented with a ton of different datasets for the workshop and discovered tons of potential areas for future development. Trying to put together a polished data workflow really exposed many of the key areas I think we should be focusing on and gave me a lot of inspiration for future work. I spent a ton of time exploring all of the possible ways to demonstrate a broad sample of data science tools and learned a lot along the way.
The resources from the talk are all available in this repo and the video will be posted soon.
Summary of future work
I mentioned a few areas of focus above, below is a summary of the ongoing work as I see it. A framework for organizing this work is starting to emerge, and I've been thinking about in terms of four key areas:
Visualisation
Priority here is to release a stable dataviz API using the tools and wrappers we currently have so that we can start releasing guides and tutorials that follow a consistent style.
The long-term goal is to develop a robust, flexible, and stable data visualization library in Clojure itself based on the grammar of graphics.
Machine learning
Priority is to decide which APIs we will commit to supporting in the long term and stabilize the "glue" libraries that provide the high-level APIs for data-first users.
Long term goal is to support the full spectrum of libraries and models that are in everyday use by data science professionals.
Statistics
Priority is to document the current options for accomplishing basic statistical modelling tasks, including Clojure libraries we do have, Java libs, and Python interop.
Long term goal is to have tablecloth-compatible stats libraries implemented in pure Clojure.
Foundations
Priority is to build a tidyverse for Clojure. This includes battle-testing tablecloth, fully documenting its capabilities, and fixing remaining, small, sharp edges.
Going forward
My overarching goal (personally) is still to write a canonical resource for working with Clojure's data science stack (the Clojure Data Cookbook), and I'm still working on finding the right balance of documenting "work-in-progress" tools and libraries vs. delaying progress until I feel they are more "ready". Until now I've let the absence of stable or ideal APIs in certain areas hinder development of this book, but I'm starting to feel very confident in my understanding of the current direction of the ecosystem, enough so that I would feel good about releasing something a little bit more formal than a tutorial or guide and recommending usages with the caveat that development is ongoing in some areas. And while it will take a while to get where we want to go, I feel like I can finally see the path to getting there. It just takes a lot of work and lot of collaboration, but with your support we'll make it happen! Thanks for reading.
I was lucky enough to get funding this year from Clojurists together to work on some open source projects for the Clojure community. It's been a really fun couple of months getting more involved in the ecosystem and having the time to work on some projects that I've long thought would be valuable for the community. This post is a summary of the things I've been working on over the past two months.
Sponsors
First of all, I want to thank the sponsors that make this work possible. We're living through the worst tech job market since I started working as a software engineer, and I'm lucky to have a little bit of time and runway to work on things I find interesting thanks to the generous sponsors who find my work worthwhile.
Right now my work is primarily funded by Clojurists Together and Cognitect/Nubank. Thank you to these major sponsors, and to everyone who contributes to my continued work in the Clojure open source ecosystem.
If you find the work I do valuable, please share it with others or consider supporting it financially. I would love to be able to turn working on this kind of stuff into a sustainable career in the long term.
## Clojure Tidy Tuesdays
The main thing I spent my time working on over the past couple of months was a collection of tutorials and guides for working with data in Clojure. The R for Data Science online learning community publishes toy datasets every week for "Tidy Tuesdays" with a question to answer or example article to reproduce. I've been going through them in Clojure, and it's proven a great tool for uncovering areas for future development in the Clojure data science ecosystem.
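In practice the setup for each week is mostly just pulling the week's CSV into a tablecloth dataset and poking at it; a rough sketch (the URL below is a placeholder, since every week's dataset lives at its own path in the tidytuesday repo):

```clojure
(require '[tablecloth.api :as tc])

;; placeholder path -- substitute whichever week's CSV you're working on
(def csv-url
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-09/example.csv")

(def ds
  (tc/dataset csv-url {:key-fn keyword}))

(tc/info ds)   ;; column names, types, and summary stats
(tc/head ds)   ;; first few rows
```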
## Other Work
The explorations with the Tidy Tuesday data have been revealing areas where I think we could benefit from more ergonomic ways to work with tablecloth datasets. I started two small projects, each with a few wrappers around existing functions that make them easier to use with tablecloth datasets. So far I'm calling them tcstats (for statistical operations on datasets) and tcutils (for miscellaneous dataset manipulation tools that aren't built into tablecloth directly).
I am also still working on the Clojure Data Cookbook. I nudged it forward ever so slightly these last couple of months, and I plan to finish it despite the remaining holes in Clojure's data science stack. I would love to also fill these in eventually, but the Cookbook will be a living document that can easily evolve and be updated as new tools and libraries are developed.
Lastly, one of the main missing pieces I'm discovering we really need to work on in Clojure's data science ecosystem is a robust yet flexible graphics library. There are a few great solutions that already exist, but they take different approaches to graphing that can make them a bit clumsy to work with when it comes time to build more complex visualisations. My dream is to implement a proper grammar of graphics in Clojure, giving the Clojure data ecosystem a "professional quality" graphics library, so to speak. Anyway, there is still a ton of work to do here, so I'm grateful for the ongoing funding that will allow me to continue to focus a large amount of time on it for the foreseeable future.
I had a really enlightening talk with Daniel Slutsky this week (an exceptional data scientist, software engineer, and community organizer whom I highly recommend meeting if you haven't already) about the current state of the machine learning landscape in Clojure. This post is my attempt to distill it into a summary for the community's benefit, so more people can understand where things are at and what the active developers in this space are working on.
It's no secret I love Clojure and especially working with data in Clojure, but it's fair to say that the Clojure for data science ecosystem is not anywhere near as easy to use or understand as reasonable potential users might expect. This is the main problem I'm focusing on this year, and there is significant effort being put into refining our tools to make them more accessible to a wider audience.
There are already people doing "real" machine learning work in Clojure, though, and below is an overview of the current state of our libraries and tools in that area, as of April 2024.
Update 2024-04-08: It's worth mentioning that deep learning and LLM libraries have been intentionally left out of this post in order to keep it a "reasonable" length. There is enough separate work happening in that space that it warrants its own, separate overview.
## Summary
There are a lot of links in this post. This table is an attempt to aggregate and summarize them. There are more details worth reading below, but in case you don't have time, this is the gist of it. To make a very long story short, current efforts are heavily focused on consolidating all of these amazing libraries into one easy-to-use library (or at least a small number of clearly delineated ones), providing a comprehensive toolkit for doing machine learning in Clojure.
[Summary table not reproduced here; it listed each library with its role and license, e.g. clojisr: a bridge from Clojure to R, EPL-2.0, less relevant for ML than the Python interop.]
In addition to all of these libraries, the post mentions the Clojurians Zulip, the main Clojure-for-data-science community discussion forum, where the main contributors to the ecosystem are active daily.
## Java ML Libraries
There are two (sort of four) popular Java libraries that implement many of the main algorithms and tools used in machine learning today (e.g. classification, regression, clustering, model development, etc.): Tribuo (including XGBoost, more on that in a second) and Smile. We count Smile as two libraries because Smile 2.x is LGPL-licensed, and Smile 3.x is GPL-licensed, which poses some potential conflicts for some end users. The community consensus is converging around moving away from Smile due to the GPL-relicensing issue, focusing instead on Tribuo and hand-rolled solutions.
There is also XGBoost for the JVM, mentioned above, which is an implementation of gradient boosting. XGBoost is a collection of algorithms whereas Tribuo is a more comprehensive framework (including things like data management, model evaluation, and experiment tracking). XGBoost can be used from Tribuo, so I don't exactly count it as a standalone library, although it can also be used in that way.
## Clojure wrappers
There are two main "families" of libraries that wrap these Java ML libraries in Clojure.
Fastmath includes statistical as well as machine learning tools for Clojure. Fastmath 2.4.0+ depends on Smile 2, and the forthcoming fastmath 3.x will have no Smile dependency at all. The clustering functionality in fastmath 2.x that depended on Smile has been moved to the fastmath-clustering library, which will keep a Smile 2.x dependency going forward. There is a strong preference in the community to avoid introducing GPL-licensed libraries into the ecosystem.
Clustering functionality will mostly be provided by scicloj.ml.tribuo going forward, which, as you might expect, wraps the Tribuo Java library and is likely to become the main source of ML algorithms for the ecosystem. This is one of a few libraries in the second family of libraries that wrap the Java libraries mentioned above. Other (self-explanatory) ones include scicloj.ml.smile, which wraps more of Smile than fastmath does, and scicloj.ml.xgboost.
It's also worth mentioning tech.ml.dataset (the core dataframe/dataset library underlying tablecloth), which incorporates some of the functionality of Tribuo, with an API centred around individual datasets. There also used to be a library called tech.ml, which implemented some machine learning tools but has been deprecated in favour of the various libraries discussed above.
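As a tiny illustration of that dataset-centred style, working directly with tech.ml.dataset looks roughly like this (tablecloth wraps this same layer):

```clojure
(require '[tech.v3.dataset :as ds])

(def data
  (ds/->dataset {:x [1 2 3 4]
                 :y [2.0 4.1 6.2 7.9]}))

;; every operation takes a dataset and returns a dataset
(ds/head data 2)
(ds/descriptive-stats data)
```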
The concept of orienting an API around individual datasets vs something else leads me to the next group of libraries.
## Clojure ML Pipelines
Metamorph is a library that implements a function composition mechanism for composing ML pipelines. It arises from the common ML practice of repeatedly running the same set of functions with varied parameters. You might, for example, try many different test/train splits to see how that affects your results, or fit the same data using many different algorithms, or try training your model using different sets of features. This leads to an explosion of pipeline permutations, so it's useful to have machinery to encapsulate the variable components of your ML pipeline into a single function. This is where metamorph.ml comes in.
Metamorph.ml is based on this concept of meta-functions and pipelines. It is currently the central library for orchestrating ML pipelines in Clojure. The API is stable, but there are currently many ways (10+) to achieve the same outcomes. This is great for power users who have complex needs and a clear understanding of the metamorph mental model, but it can be a bit daunting for newcomers, making it more challenging to pick a clear place to start. The community is actively discussing the best approach for consolidating and/or documenting these different approaches in the interest of making Clojure's ML stack more accessible.
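To illustrate the underlying idea without leaning on the real metamorph.ml API, here is a conceptual sketch in plain Clojure: each step is a function of a context map that carries the data and the current mode, and a pipeline is just the composition of those steps.

```clojure
;; Conceptual sketch only -- not the actual metamorph.ml API.
;; Each step takes and returns a context map holding the data and a mode.
(defn select-features [cols]
  (fn [{:metamorph/keys [data] :as ctx}]
    (assoc ctx :metamorph/data (select-keys data cols))))

(defn pipeline [& steps]
  (fn [ctx]
    (reduce (fn [c step] (step c)) ctx steps)))

(def pipe
  (pipeline (select-features [:x :y])))

;; the same pipeline definition can be run in :fit or :transform mode
(pipe {:metamorph/data {:x [1 2 3] :y [4 5 6] :z [7 8 9]}
       :metamorph/mode :fit})
```

The real library adds dataset-aware steps, model training, and bookkeeping between fit and transform runs, but this context-passing shape is the core of it.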
## Collections/Frameworks
The community is well aware that it is difficult to know where to get started, and several efforts have been made to make the path clearer for people who want tools that Just Work. scicloj.ml is one such project. It's a collection of libraries (mostly the ones mentioned above) with some lightweight wrappers and documentation efforts.
The community is heading toward deprecating this library, though, in favour of noj, which we are hoping to stabilize in the near future. The goal is to have a single entry point into the Clojure data science stack, gathering all the tools one would need to work with data in one place, with seamless interoperability akin to R's tidyverse of libraries.
## Interop
It wouldn't be a complete roundup of the state of ML in Clojure without a mention of libpython-clj. This is a library that provides Python bindings for Clojure, so you can call Python code directly from Clojure if necessary. sklearn-clj makes use of this bridge to provide direct access to all of the estimators and models from Python scikit-learn in Clojure, so for cases where something is truly only available in Python, we can still access it.
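As a rough sketch of what that direct bridge looks like (assuming Python and scikit-learn are installed where libpython-clj can find them):

```clojure
(require '[libpython-clj2.require :refer [require-python]]
         '[libpython-clj2.python :as py])

;; expose the Python module through a Clojure alias
(require-python '[sklearn.linear_model :as linear-model])

(def model (linear-model/LinearRegression))

;; fit y = 2x on a few points, then predict
(py/py. model fit [[1] [2] [3]] [2 4 6])
(py/py. model predict [[4]])   ;; => roughly [8.0]
```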
It's also worth briefly mentioning clojisr here, which is a similar kind of bridge from Clojure to R (similar bridges exist for Julia and, via Wolframite, for the Wolfram language), but these are all less relevant for ML specifically, where Python is overwhelmingly the most popular tool of choice.
## More updates
These discussions all happen in the open, on the Clojurians Zulip instance, which has become the main gathering place of the Clojure-for-data-science community. The #data-science and #noj-dev streams are the most active on these topics at the time of this writing. You can follow along with developments in the trenches over there, or follow the key libraries on GitHub for updates (scicloj.ml.tribuo, metamorph.ml, noj). I will also post periodic updates here and in all the other corners of the internet where I lurk. Thank you for reading!
diff --git a/public/planetclojure.xml b/public/planetclojure.xml
index 71f398c..0dad078 100644
--- a/public/planetclojure.xml
+++ b/public/planetclojure.xml
@@ -3,11 +3,18 @@
[Feed update for "Code with Kira": the feed's last-updated timestamp moves from 2024-05-01T18:10:13+00:00 to 2024-07-01T03:36:37+00:00, and a new entry is added for "OSS Updates May and June 2024" (2024-06-30, https://codewithkira.com/2024-06-30-clojurists-together-update-may-jun-2024.html). The entry's opening sections repeat the post above; the rest of its content follows.]
## Initial release of tcutils
In my explorations of other languages' tools for working with data I often come across nice utility functions that are super simple but have a big impact on the ergonomics of using the tools. I wanted to start bringing some of these convenience utilities to Clojure, so for now I'm putting them in tcutils. So far only a handful of helpers are implemented (lag, lead, cumsum, and clean-column-names). The goal is to eventually fill out more utilities that save people from having to dig into the documentation of half a dozen different libraries to figure out how to implement things like these. The goal is not to achieve feature parity with or exactly copy similar libraries like pandas or dplyr, but rather to take inspiration from them and make our tools easier to use for people who are used to these conveniences.
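To give a sense of what these helpers save you from, here is a rough sketch of a hand-rolled lag using plain tablecloth; tcutils wraps this kind of thing in a one-liner (the exact tcutils signatures may differ from what this sketch implies):

```clojure
(require '[tablecloth.api :as tc])

(defn lag-column
  "Conceptual sketch: add a copy of `col` shifted down by `n` rows, padded with nils."
  [ds col n]
  (let [vals   (vec (get ds col))
        lagged (concat (repeat n nil) (drop-last n vals))]
    (tc/add-column ds (keyword (str (name col) "-lag-" n)) lagged)))

(-> (tc/dataset {:price [10 11 13 12]})
    (lag-column :price 1))
;; adds :price-lag-1 => [nil 10 11 13]
```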
## Progress on Clojure Data Cookbook
I spent a lot of time on the Clojure Data Cookbook over these last two months. Notable progress includes:
- The introductory chapters now bear some resemblance to the final form they'll take.
- The overall structure of the book is much clearer now.
- I started the example analysis that will serve as the high-level introductory section of the book.
- The publishing and deployment process is finally working.
It's still very much a work in progress, but in the interest of transparency the current version is available online now. It will continue to evolve and change as I fill out more and more of the chapters, but there's enough of it available now to hopefully give a sense of the style and tone I'm going for. I also finally have the publishing workflow set up, and it's generating a nice-looking Quarto book, thanks to all of Daniel Slutsky's amazing recent work on Clay and Quarto integration.
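For anyone curious about that workflow, rendering a source namespace with Clay is roughly a one-liner; the paths below are hypothetical and the option map is only a sketch of the API as I understand it:

```clojure
(require '[scicloj.clay.v2.api :as clay])

;; render one chapter of the book to Quarto-flavoured output
(clay/make! {:source-path "notebooks/chapter_01_intro.clj"
             :format      [:quarto :html]})
```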
## Progress on high-level goals
The high-level goal of my work in general remains to steward Clojure's data science ecosystem to a state of maturity and flourishing so that data practitioners can use it to get real work done. Toward this end, I set up a project board to track progress toward what I see as the main components of this project.
Over the last couple of months, beginning with a prototype demoed at my London Clojurians talk in April, Daniel Slutsky has made tremendous progress on our goal of implementing a grammar of graphics in Clojure in the new hanamicloth library. The near-term goal is to stabilize this library's API enough that it can provide a user-friendly way to accomplish all of the simple data visualization tasks that are currently possible with our other tools. The long-term goal is to take the lessons we learn from this library and build a JVM-only grammar of graphics library for doing data visualization "right" in Clojure.
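To give a flavour of the direction, here is an approximate hanamicloth-style snippet; the namespace path and the :=x/:=y option keys are written from memory and may well have changed as the API evolves, so treat this as a sketch rather than a reference:

```clojure
;; Approximate sketch of a layered, grammar-of-graphics-style plot.
;; Namespace and option keys are assumptions, not a stable reference.
(require '[tablecloth.api :as tc]
         '[scicloj.hanamicloth.v1.api :as haclo])

(def ds
  (tc/dataset {:x (range 10)
               :y (map #(+ % (rand)) (range 10))}))

(-> ds
    (haclo/layer-point {:=x :x, :=y :y})
    (haclo/layer-line  {:=x :x, :=y :y}))
```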
The development and surrounding discussions of hanamicloth have also made me realize it would be useful to write an overview of the current state of dataviz options for Clojure and why we're working on building something new. That's on my list for the coming months, but lower priority than actual development work.
## Impressions from PyData London
I got to attend PyData London this year thanks to a client of mine who was sponsoring the conference. I learned a lot and found the talks very interesting. My overall impression is that data science is maturing as a discipline, with more polished methods and robust theory backing up different approaches to data-related problems. With this maturation, though, comes higher expectations for production-ready, professional quality results. Most of the talks focused on high-level concerns like observability, scalability, and long-term stewardship of large open-source projects.
There are a lot of reasons why Python is just not ideal for building highly available, high-performance systems, and I really believe this is a good time to be building alternative tools for data science. Python is obviously entrenched as the current default language for working with data, but it is difficult and slow to write Python code that can take full advantage of modern hardware (because of the infamous global interpreter lock, reference counting, and slow I/O, among other things). And to be fair, the Python community knows this. It's why virtually all of the libraries that do the heavy lifting for data science in Python are actually implemented in C (numpy, pandas) or Rust (Polars, Pydantic), or are wrappers around C++ (PyTorch, TensorFlow, matplotlib) or Java (PySpark, Pydoop, confluent-kafka) libraries.
I think this provides a lot of insight into what data practitioners want. It's clear that users want approachable, simple, human-readable interfaces for all of these tools, and that any new tool needs to interoperate with the rest of the ones currently in use. People are also tired of churn and are craving stability. I think Clojure has a lot to offer in all of these areas and is well placed to become more widely adopted for data science.
## Ongoing work
My focus over the next two months will remain on the cookbook. My main goal is to finish the introductory chapter with the housing price analysis and to continue putting together the data import section, with instructions and examples for all of the file formats that can reasonably be supported at this time.
I'll continue to support and contribute to all of the ecosystem libraries I come across in my writings and analysis work in hopes of smoothing out all the rough edges I find.
Thanks for reading. I always love hearing from people who are interested in any of the things I'm working on. If that's you, don't hesitate to get in touch :)