Author: Enyert Vinas, Senior Software Developer
At Flexiana we have already published a few posts about Clojure and data.
We did this because we really believe that Clojure is one of the most “data expressive” programming languages in the field: it provides the structures required for almost any use case.
On the other hand, Python has been around for a long time and is the lingua franca of the data science world. The simplicity of Python is a big draw, and its support for LLMs (Large Language Models) is extensive.
In this post we want to revisit our earlier journey with libpython-clj by creating a JSON data generator powered by Python and Dolly.
A few definitions before launch
We have already mentioned that we want to get our hands dirty with LLMs and libpython-clj, but what are these things? We will define them, along with the other tools we rely on, in the most basic way, because the detailed information falls outside the scope of this post.
LLM (Large Language Model)
An LLM (Large Language Model) is a statistical language model, trained on a huge amount of data, that can be used as a base for natural language processing (NLP) tasks.
LLMs are commonly based on deep learning architectures and are trained to generate output that resembles human language.
Libpython-clj
We covered libpython-clj in one of our past posts, but it is such an amazing tool that it is worth talking about again. libpython-clj is a library that works as a bridge between Python and Clojure.
This way we can use Python code inside our Clojure application.
Jsonformer
Another important ingredient for this implementation is Jsonformer. This is another Python library, used to generate JSON data that conforms to a given schema.
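To make this more concrete, here is a minimal sketch of what a schema and a Jsonformer call could look like. The call shape follows our reading of the Jsonformer README. The schema fields, the helper name generate_person and the model name databricks/dolly-v2-3b are assumptions chosen only for illustration, and the heavy imports live inside the function so the schema can be inspected without downloading a model.

```python
# A JSON-Schema-style description of the data we want back.
# The field names are illustrative assumptions, not part of our project.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {"type": "array", "items": {"type": "string"}},
    },
}


def generate_person(model_name="databricks/dolly-v2-3b"):
    """Hypothetical helper: ask the model for data shaped like person_schema.

    Requires the jsonformer and transformers packages and downloads the
    model weights on first use, so it is not called in this sketch.
    """
    # Imported here so that defining the schema needs no extra packages.
    from jsonformer import Jsonformer
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    prompt = "Generate a person's information based on the following schema:"
    jsonformer = Jsonformer(model, tokenizer, person_schema, prompt)
    # Calling the Jsonformer object runs the generation.
    return jsonformer()
```

The schema drives the structure of the result, while the model only fills in the values.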
Malli
On the Clojure side we want to use malli for two purposes: schema representation and schema transformation between Clojure and Python.
Dolly
In this particular case, we want to work with Dolly. This is an LLM trained on the Databricks machine learning platform and licensed for commercial use.
So, now that we know our main tools and concepts, let us present our idea.
The idea
We mentioned at the beginning of the post that we want to generate JSON data. This data is random, but constrained by a given schema and shaped by the power of the LLM. We can represent our idea using the following diagram:
In this image we can see that we receive our input in JSON format and then transform it into our malli schema (for validation and data consistency purposes). After this step, we can use a malli transformer to convert it into a Jsonformer schema. With this schema we can then send a prompt to our Dolly model and finally transform the Jsonformer result into a valid JSON object. Amazing, right?
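The flow described above can be sketched end to end in a few lines of Python. This is only an illustration of the data flow under stated assumptions, not our implementation: validate_schema stands in for the malli validation step, fake_generate stands in for the Dolly model behind Jsonformer, and every function name here is hypothetical.

```python
import json
import random

# Types we accept in the incoming schema (an assumption for this sketch).
SUPPORTED_TYPES = {"object", "array", "string", "number", "boolean"}


def parse_input(raw):
    """Step 1: read the JSON input that describes the desired shape."""
    return json.loads(raw)


def validate_schema(schema):
    """Step 2: stand-in for malli validation. Reject unknown shapes early."""
    kind = schema.get("type")
    if kind not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported type: {kind!r}")
    if kind == "object":
        properties = schema.get("properties")
        if not isinstance(properties, dict):
            raise ValueError("object schema needs a 'properties' mapping")
        for sub_schema in properties.values():
            validate_schema(sub_schema)
    elif kind == "array":
        if "items" not in schema:
            raise ValueError("array schema needs an 'items' schema")
        validate_schema(schema["items"])
    return schema


def fake_generate(schema, rng):
    """Step 3: stand-in for the model. Fills the schema with random values."""
    kind = schema["type"]
    if kind == "object":
        return {k: fake_generate(v, rng) for k, v in schema["properties"].items()}
    if kind == "array":
        return [fake_generate(schema["items"], rng) for _ in range(rng.randint(1, 3))]
    if kind == "string":
        return f"text-{rng.randint(0, 999)}"
    if kind == "number":
        return rng.randint(0, 100)
    return rng.choice([True, False])


def run_pipeline(raw, seed=0):
    """Step 4: tie the steps together and return a JSON string."""
    schema = validate_schema(parse_input(raw))
    data = fake_generate(schema, random.Random(seed))
    return json.dumps(data)


raw_input = '{"type": "object", "properties": {"name": {"type": "string"}, "tags": {"type": "array", "items": {"type": "number"}}}}'
print(run_pipeline(raw_input))
```

Replacing fake_generate with a real model call changes the values that come back, while the schema keeps the structure fixed.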
Summary
We know this is not the most innovative idea, but as always we like to show how wide the domain of Clojure is. On the other hand, we are developers, so our curiosity and our thirst for knowledge are always present.
If you liked this experimental idea, we invite you to stay tuned to our blog, because we will try to bring you the second part of this post ASAP. In the next post, we will implement this experiment and share our code and results with you.
Thank you for taking the time to read our blog.
Happy Hacking!