Wednesday, January 8, 2025

Introducing mall for R…and Python

The beginning

A few months ago, while working on the Databricks with R workshop, I came
across some of their custom SQL functions. These particular functions are
prefixed with “ai_”, and they run NLP with a simple SQL call:

With dbplyr, we can access SQL functions
in R, and it was great to see them work:

Recent developments, such as Llama from Meta
and cross-platform interaction engines like Ollama, have
made it feasible to deploy these models, offering a promising solution for
companies looking to integrate LLMs into their workflows.

The project

This project started as an exploration, driven by my interest in leveraging a
“general-purpose” LLM to produce results comparable to those from Databricks AI
functions. The primary challenge was determining how much setup and preparation
would be required for such a model to deliver reliable and consistent results.

Without access to a design document or open-source code, I relied solely on the
LLM’s output as a testing ground. This presented several obstacles, including
the numerous options available for fine-tuning the model. Even within prompt
engineering, the possibilities are vast. To ensure the model was not too
specialized or focused on a specific subject or outcome, I needed to strike a
delicate balance between accuracy and generality.

Fortunately, after conducting extensive testing, I discovered that a simple
“one-shot” prompt yielded the best results. By “best,” I mean that the answers
were both accurate for a given row and consistent across multiple rows.
Consistency was crucial, as it meant providing answers that were one of the
specified options (positive, negative, or neutral), without any additional
explanations.

The following is an example of a prompt that worked reliably against
Llama 3.2:

>>> You are a helpful sentiment engine. Return only one of the 
... following answers: positive, negative, neutral. No capitalization. 
... No explanations. The answer is based on the following text: 
... I am happy
positive
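
The prompt pattern above can be sketched in Python as a small helper that builds the instruction for a single row and checks that a reply is one of the allowed options. This is only an illustration under assumptions: the names `build_sentiment_prompt` and `parse_label` are hypothetical and are not part of mall, and no model is called here.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")


def build_sentiment_prompt(text, labels=ALLOWED_LABELS):
    """Build a single-row prompt in the style shown above (hypothetical helper)."""
    options = ", ".join(labels)
    return (
        "You are a helpful sentiment engine. Return only one of the "
        f"following answers: {options}. No capitalization. "
        "No explanations. The answer is based on the following text: "
        f"{text}"
    )


def parse_label(reply, labels=ALLOWED_LABELS):
    """Trim and lowercase the reply; return it only if it is an allowed label."""
    cleaned = reply.strip().lower()
    return cleaned if cleaned in labels else None
```

Checking the reply against the list of options is one way to measure the consistency described earlier: a reply that carries extra explanation does not match any option and can be flagged instead of silently stored.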

As a side note, my attempts to submit multiple rows at once proved unsuccessful.
I spent a significant amount of time exploring different approaches,
such as submitting 10 or 2 rows at a time, formatted as JSON or
CSV. The results were often inconsistent, and batching didn’t seem to speed up
the process enough to be worth the effort.
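
A minimal sketch of the row-by-row approach, with one request per row rather than a batch, might look like the following. The `classify_rows` helper and the `ask_model` callable are hypothetical, and `fake_model` is a stand-in so the example runs without an LLM; a real setup would replace it with a call to a locally served model.

```python
from typing import Callable, Iterable, List, Optional

LABELS = ("positive", "negative", "neutral")


def classify_rows(
    rows: Iterable[str], ask_model: Callable[[str], str]
) -> List[Optional[str]]:
    """Send one prompt per row and keep only replies that match a known label."""
    results = []
    for text in rows:
        prompt = (
            "Return only one of the following answers: "
            f"{', '.join(LABELS)}. No explanations. Text: {text}"
        )
        reply = ask_model(prompt).strip().lower()
        # Anything outside the allowed options is recorded as None
        results.append(reply if reply in LABELS else None)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): keyword lookup on the row text only."""
    text = prompt.rsplit("Text: ", 1)[1].lower()
    if "happy" in text:
        return "positive"
    if "angry" in text:
        return "negative"
    return "I am not sure"


print(classify_rows(["I am happy", "I am angry", "It is a table"], fake_model))
# prints ['positive', 'negative', None]
```

Because each row gets its own prompt, a malformed reply affects only that row, which makes inconsistent answers easy to spot.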

Once I became comfortable with the approach, the next step was wrapping the
functionality within an R package.

The approach

One of my goals was to make the mall package as “ergonomic” as possible. In
other words, I wanted to ensure that using the package in R and Python
integrates seamlessly with how data analysts use their preferred language on a
daily basis.

For R, this was relatively straightforward. I simply needed to verify that the
functions worked well with pipes (%>% and |>) and could be easily
incorporated into packages like those in the tidyverse:

https://mlverse.github.io/mall/
