None of Your Business
Freeman Dyson recounted a 1953 conversation in which Enrico Fermi asked him how many arbitrary parameters he had used in his calculations. When Dyson answered "four", Fermi famously replied, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."1
Those parsimonious old geezers and their Occam's razor!
GPT-3 has 175 billion parameters. No wonder it can fit the world and all the bad jokes in it.
When it comes to building software and modeling business domains, we hardly think about the number of parameters we are using. We never hear of tensions between business development and engineering boiling over the number of parameters used in our models. Probably a good thing if you are an engineer working at Twitter.
A notable exception here is the Machine Learning (ML) community. With the success of Large Language Models (LLMs) like GPT-3, the number of parameters has become a fascination and a fad. We are already way past the trillion-parameter mark, and there are always rumours of an even bigger model in the works. Apparently, as far as the future of AI is concerned, size matters!
Even if you perceive these advances with fearful trepidation about the rise of sentient machines, with a sense of disappointment that human intelligence might be reduced to mere statistical regularities, or with sheer skepticism about the technology hype cycle, you can't ignore the influence they are having on how we build our digital systems. It is important to explore how they play into the domain and programming model challenges that I alluded to in my previous article.
First and foremost, in a world where systems can teach themselves from trillions of tokens on the web, encoding the collective human knowledge into weights and biases in hidden layers of some artificial neural entanglement buried deep beneath a friendly prompt, why do we need domain models at all?
The weights and biases that define us
Richard Sutton2 points out the bitter lesson from 70 years of AI research: methods relying on scalable computation, such as search and learning, always outperform methods that attempt to hand-engineer human knowledge of the domain. In his words, "We have to learn the bitter lesson that building in how we think we think does not work in the long run."
In other words, letting go so they can grow up in the real world applies equally well to your kids and your systems.
To understand what it all means, we need to dig a little deeper into the world of modern ML systems. I will try to keep this at a high level, but will provide sufficient pointers to details for those who are interested. Another note of caution: this is a rapidly evolving field where innovations happen at a breakneck pace and our understanding of how things work shifts at an equal pace. It is almost impossible to stay current on all the advancements, let alone predict their impact on the future of systems engineering.
It is also a very diverse field. Our discussion is not intended to be a survey or a tutorial. Instead, we will try to understand the implications for systems engineering by exploring some of the most prominent trends in the field, specifically from the vantage point of Large Language Models (LLMs).
In the traditional world I discussed in the previous article, when we set out to build a digital system, we construct our model that captures the essence of the actual system we are trying to build. We model the entities, their relationships, states and behaviors. We then immortalize this model in some programming language. Here we are in charge of defining everything. Even though there are differences in the way a domain model is captured by a domain expert and a software engineer, the model is still a human construct that has close correlations in both worlds. From a representation perspective, they use the representational systems that we have developed for ourselves for the very same purpose.
For example, a domain specification might say something like, "A user has a name, email address, and a role. The role can either be regular or admin." A pretty dry domain description that we are all used to. This can be translated into a programming language as follows:
/// User with name, email, and role
struct User {
    name: String,
    email: String,
    role: Role,
}

/// Role definitions for a user
enum Role {
    Regular,
    Admin,
}
The above code is written in Rust, a modern systems programming language. Even if you don't know Rust, you can probably recognize the representation of the domain knowledge captured by the code and its relation to its corresponding statement form.
We can create a user, print the information about that user and make sure that it looks exactly like what we expected. You can run all the Rust code right here by clicking on the "Run" button [▶︎] inside the code block.3
/// User with name, email, and role
#[derive(Debug)]
struct User {
    name: String,
    email: String,
    role: Role,
}

/// Role definitions for a user
#[derive(Debug)]
enum Role {
    Regular,
    Admin,
}

fn main() {
    // Create a user "John Doe"
    let user = User {
        name: "John Doe".to_string(),
        email: "john@doe.com".to_string(),
        role: Role::Regular,
    };

    // Print the user information
    println!("User: {:?}", user);
}
Here is the same code in Python:
from dataclasses import dataclass
from enum import Enum

@dataclass
class User:
    name: str
    email: str
    role: "Role"

class Role(Enum):
    REGULAR = 1
    ADMIN = 2

user: User = User("John Doe", "john@doe.com", Role.REGULAR)
print(user)
You can't run the Python code directly here, but you can run it in the companion notebook on Google Colab.
While specific details are irrelevant to our current discussion, one point to note is that the model of the world reflects our way of understanding and representing it and the code follows closely. After all, that is why we invented higher level programming languages in the first place.
The most important point here is that the model contains information that we have explicitly captured through our understanding of the world. We demand nothing more and expect nothing less from the digital system that uses this model. If we want our Users to have an address, for example, we have to explicitly add it to the model.
Similarly, the structural relationships are hardwired into the model. A user has only one role because we specified and coded it that way. We have handcrafted the features to be used in the model. This is what Rich Sutton is alluding to when he talks about humans building knowledge into agents.
Essentially, the system has no learned representations or emergent behaviors.
This is quite different in the land of the LLMs. Let us try to understand what happens in that world4.
Token like you mean it
Whether you are using an existing LLM or building a new one, the first step is to find a way to tokenize your inputs. Tokenization is the process of breaking down the input into smaller pieces, encoding them as numbers, and creating a dictionary of these tokens. It is like learning to break down sentences into words in a new language and then creating a numeric index for each word. There are many ways to do this, and most of these tools are readily available for use5.
If we feed our domain specification to the GPT-3 tokenizer, we will get the following sequence of 22 tokens:
Input Text:
A user has a name, email address, and a role.
The role can either be regular or admin.
Output Tokens:
[32, 2836, 468, 257, 1438, 11, 3053, 2209, 11, 290, 257,
2597, 13, 383, 2597, 460, 2035, 307, 3218, 393, 13169, 13]
* I have split the text and tokens into lines for easy reading. They are just a sequence of characters and a sequence of numbers respectively.
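If you want to reproduce these tokens yourself, here is a minimal sketch using tiktoken, OpenAI's open source tokenizer library (an assumption on my part, not the tool used for the output above; "r50k_base" is the byte-pair encoding used by GPT-2 and GPT-3):

import tiktoken

# Load the byte-pair encoding used by GPT-2/GPT-3
enc = tiktoken.get_encoding("r50k_base")

text = ("A user has a name, email address, and a role. "
        "The role can either be regular or admin.")

tokens = enc.encode(text)
print(len(tokens))   # expect 22, as shown above
print(tokens)

# Decoding the tokens gives back the original text
print(enc.decode(tokens))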
Different tokenizers give different tokens. I have included some sample code in the companion notebook for you to see these tokenizers in action. Depending on the volume and type of data the tokenizers were trained on, they will have different vocabulary sizes. The GPT tokenizer has a vocabulary size of 50,257 and BERT (another popular open source LLM from Google) has a vocabulary size of 30,522. The vocabulary size influences the quality and performance of learning as well as the length of the token sequence generated for a given input. A large vocabulary increases the memory and time complexity of training while reducing the number of tokens in the sequence.
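As a quick sketch of these differences, assuming the tiktoken and Hugging Face transformers packages (bert-base-uncased is one common BERT checkpoint; the companion notebook may use different ones):

import tiktoken
from transformers import AutoTokenizer

gpt_enc = tiktoken.get_encoding("r50k_base")   # GPT-2/GPT-3 encoding
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Vocabulary sizes
print(gpt_enc.n_vocab)        # 50257
print(bert_tok.vocab_size)    # 30522

# The same text becomes different token sequences of different lengths
text = "A user has a name, email address, and a role."
print(gpt_enc.encode(text))
print(bert_tok.encode(text))  # includes BERT's special [CLS] and [SEP] tokens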
When we are dealing with modalities other than text, such as images, audio, or video, we will have other schemes to split the input into encoded sequences6.
As you can see, the tokens are important because we need them to get anything in and out of LLMs and any systems that rely on them. Different LLMs use different tokenizers, and each tokenizer generates different token sequences and has a different vocabulary size7. There are cases when you may want to modify existing tokenizers or even create new ones altogether. There is so much more to tokens than what my short description and code snippets convey.
In case any of you are wondering if this is any of your business to know, let me give you one more perspective.
Since companies like OpenAI charge by the number of tokens, it is important for us to have a basic understanding of them. At launch, the ChatGPT API charged $0.002 per 1,000 tokens. As per their rule of thumb, 1,000 tokens translate to approximately 750 words of common English. Brings a new perspective to the phrase "use your words carefully", doesn't it?
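As a back-of-the-envelope sketch using those launch numbers (the document size here is a made-up example):

# ChatGPT API launch price: $0.002 per 1,000 tokens
price_per_1k_tokens = 0.002

# Rule of thumb: 1,000 tokens is roughly 750 words of common English
words = 75_000                # a hypothetical book-length document
tokens = words * 1000 / 750   # roughly 100,000 tokens
cost = (tokens / 1000) * price_per_1k_tokens

print(f"~{tokens:,.0f} tokens, ~${cost:.2f} per pass")
# ~100,000 tokens, ~$0.20 per pass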
To make a long story short, inputs to LLMs are long sequences of tokens. These tokens are precious because they are not mere tokens of information exchange, but real dollars and cents. Soon enough, they are going to show up in your cost of services and OpEx, at the center of your buy vs. build decisions, and as the target of your operational efficiency initiatives.
It will soon be everybody's business!
Tell me I forget, Embed me I learn
We have created tokens and given them identities. But they lack any meaning in their lives. They are just like us. A social security number or student ID card gives us identity. But we find meaning in our lives through our interactions with the people and places around us. It is our shared experiences that enrich us. We just need to let our tokens do the same by creating a space where they can find each other, learn from one another, build their character and in the process discover their own meaning and place in the world.
In the machine learning world, this process is called embedding. And the space where they are embedded is simply known as the embedding space. Obviously, there was no marketing department involved when they were naming these things!
Embeddings are numerical representations of concepts converted to number sequences. That sounds awfully similar to the tokens we just talked about, except that they are much more.
When we talk about people, we are all used to phrases like "the depth of their character", "the breadth of their knowledge", or "being plain or a square". In these metaphors, we are alluding to some dimensionality in an abstract coordinate system; more concretely, we describe a point in space with its x, y, and z coordinates. What is good for the goose might work for the gander. What if we represented each token's meaning using values along a number of dimensions? Since we are not sure what these tokens might learn and how many dimensions are actually needed to represent their learning, we just choose an arbitrarily large number of dimensions that we can afford (computationally). In mathematical terms, this is a vector, which you might remember from your linear algebra or physics class. With some proper training, we can hope that each of these tokens will learn its coordinates in this high dimensional space. And that is, in essence, the art and science of embedding: transforming identifiable tokens into semantically rich representations.
In case you are worried about the way we are mixing metaphors and mathematics, I will let you in on a secret.
Machine learning is an alchemy where we mix precise mathematical equations with intuitive metaphors poured over a large cauldron of data subjected to extreme computational power in the hope that it all turns into a potion of immortal knowledge.
If that statement makes you uncomfortable, you should stick to something more definite like stock market speculation.
It might be helpful, at this stage, to have an idea of what these embeddings look like. If we send our original domain specification to the OpenAI API and ask for its embedding, we will get a long list of numbers.
Input Text:
A user has a name, email address, and a role.
The role can either be regular or admin.
Number of tokens: 22
First 10 numbers of the embedding:
[-0.0041143037, -0.01027253, -0.0048981383, -0.037174255, -0.03511049,
0.016774718, -0.0476783, -0.022079656, -0.015676688, -0.019962972]
Size of the embedding: 1536
We get a vector of size 1536. That is the dimensionality of OpenAI embeddings8. Please note that this is the combined embedding for the entire text that is formed from the individual embeddings of each token in it.
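For reference, here is a minimal sketch of such an API call, assuming the openai Python package (v1+ client style) and the text-embedding-ada-002 model, which produces the 1536-dimensional embeddings shown here; the companion notebook may use a different client version or model:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

text = ("A user has a name, email address, and a role. "
        "The role can either be regular or admin.")

response = client.embeddings.create(model="text-embedding-ada-002", input=text)
embedding = response.data[0].embedding

print(len(embedding))   # 1536
print(embedding[:10])   # the first 10 numbers, as shown above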
If we send just one token, say "user", here is what it looks like:
Input Text: user
Number of tokens: 1
First 10 numbers of the embedding:
[-0.004338521, -0.015048797, -0.01383888, -0.018127285, -0.008569653,
0.010810506, -0.011619505, -0.01788387, -0.02983986, -0.013996384]
Size of the embedding: 1536
It has the same dimensions, but different values for the embedding. You can see them in their full glory in the Embedding section of the Python notebook.
Looking at those numbers, we can only hope those tokens were really enriched. Because all I can say at this stage is that I am poorer by $0.046.
It is very difficult to visualize such high dimensional data. But we can apply some dimensionality reduction techniques and visualize them in 2D or 3D spaces. Here is what we are going to do:
- We use a sample dataset containing 200 text descriptions, each with a category definition of what the text is describing. These samples are taken from a curated dataset9. Here are some examples:

  Text: " Oxmoor Center is a Louisville Kentucky shopping mall located at 7900 Shelbyville Road in eastern Louisville."
  Category: "Building"

  Text: " Sten Stjernqvist is a Swedish former footballer who played as a forward."
  Category: "Athlete"

- The sample dataset has 14 categories.

- We then set up a language model that was pretrained for sentence embedding, similar to GPT, but one that can fit our wallet (free) and our compute (we can actually run it from a notebook on Google Colab)10.

- We ask the model to generate embeddings for each text description from our sample. We get embedding vectors of size 384, about one quarter the size of the OpenAI embedding.

- We reduce the dimensionality to 3 and plot it (see the code sketch after this list). The first plot shows all 200 text descriptions in 14 categories and the second one shows data for just 5 categories.
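Here is a minimal sketch of the embedding and dimensionality reduction steps, assuming the sentence-transformers and scikit-learn packages; all-MiniLM-L6-v2 is one free model that produces 384-dimensional sentence embeddings, and PCA stands in for whatever reduction technique the notebook actually uses:

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Hypothetical stand-ins for the 200 text descriptions in the sample dataset
texts = [
    "Oxmoor Center is a Louisville Kentucky shopping mall.",
    "Sten Stjernqvist is a Swedish former footballer who played as a forward.",
    "A third sample text description.",
    "A fourth sample text description.",
]

# A small, free model that outputs 384-dimensional sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)             # shape: (4, 384)

# Reduce the 384 dimensions to 3 so the points can be plotted
pca = PCA(n_components=3)
points_3d = pca.fit_transform(embeddings)    # shape: (4, 3)
print(points_3d)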
Fig 2 is interactive, so feel free to explore it closely. The companion Python notebook has interactive 3D plots for both.
Fig 1. Embeddings of 200 text descriptions belonging to 14 categories