About this site

Welcome! This is a site where I share my take on subjects that interest me and ideas that excite me.

My hope is to reach as many of my friends and like-minded people around the world as possible, engage them in stimulating conversations, and learn and improve from our shared experiences.

While I have worn many hats in my life, I am a programmer by trade. By that I mean someone who uses computers and programming tools to solve problems. While I will not try to hide this bent in my explorations on this site, I will pursue every avenue to make the ideas reach as far and wide as possible. I sincerely hope that I am able to engage you irrespective of your professional background or technical depth.

I have kept this site minimal and functional, serving the purpose of sharing information and code with as few distractions as possible, while discussing topics that are important to both technical and non-technical audiences. It is organized like a book with chapters. These are documents edited in Visual Studio Code and published directly from a GitHub repository. Using the tools of the trade, so to speak.

This site respects your individuality and right to privacy.

It doesn't serve you cookies,
Neither will it pry or spy.
It will not track you, trace you,
Categorize you, profile you or target you. 

You can always share your feedback and comments by mail.


Systems Thinking

No, this is not another article about ChatGPT and how it is going to take over everyone's job. Don't get me wrong! I am as excited about the revolutionary potential of machine learning (ML) technology as the next person. But here and now, we are just going to think in terms of good old systems.

Thinking in Systems by Donella H. Meadows is an excellent introduction to systems thinking. It roughly defines a system as a set of interconnected elements with a purpose, where the whole is more than the sum of its parts. The central thesis of the book is that a system's behavior is intrinsic to the system itself, and that to produce better results from a system, we need to understand the relationship between its structure and behavior.

I will not indulge in the details of systems theory here, but rather focus only on aspects relevant to building digital systems, specifically software systems. With the digitization of every worthwhile human pursuit, we are all part of a global machinery that turns physical reality into a mindless stream of bits, willingly or not. It is a never-ending quest for efficiency and speed that challenges us to conquer the ever-growing complexity of digital systems and infrastructure, a journey that forces us to confront the limits of our own tools and techniques far too often, and to seek new ways to build better systems.

A byte out of systems

How do we go about building programs that mimic physical systems?

We develop models of physical systems. As pointed out in Structure and Interpretation of Computer Programs1, "One powerful design strategy, which is particularly appropriate to the construction of programs for modeling physical systems, is to base the structure of our programs on the structure of the system being modeled."

This turned out to be a very effective strategy, as evidenced by the huge popularity enjoyed by Object-Oriented Design (OOD) and Object-Oriented Programming (OOP) in the industry. While there are genuine criticisms that can be leveled against OO approaches, they are a natural fit when modeling systems in many practical domains.

The role and importance of domain models in software development can't be overstated. They are the foundation of any software system. From eliciting the structure and behaviors of systems and establishing the architectural blueprints, to verifying and validating working software, to deploying and operating it on the distributed, virtual infrastructures of today, domain models play a critical role. They may sit majestically at the center of our software development arena or hide in the shadows of its dark alleyways. They lurk behind software systems of all sizes and shapes, whether we formally acknowledge them or not.

Domain-Driven Design (DDD), based on the 2003 book by Eric Evans, is an approach to software development that centers the development on programming a domain model that has a rich understanding of the processes and rules of a domain (Martin Fowler, 2020). It has been widely adopted and applied in the industry. A core concept of DDD is the adoption of a Ubiquitous Language (UL), a common, rigorous language of communication between developers, domain experts, and domain users. Its bias towards developing models in software and its evolutionary approach to model development have made it attractive to the development community, especially to Extreme Programming (XP) and Agile practitioners.

The focus of DDD is on building better domain models whereas the related approaches of Model-Driven Engineering (MDE) and Model-Driven Architecture (MDA) are more concerned with translating the models into code. But taken together, they provide the essential tools for building better software systems faster, a way to transform strings of conscious reality into streams of flashing bits.

The great divide

As Albert Einstein2 stated, "... the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." And therein lies the relentless quest for foundational modeling elements that connect the objective reality of a domain user to the abstract constructs in some programming language.

How do I know there is a gap? I asked ChatGPT, seriously.

Me: Is there a gap between domain models and programming models?

ChatGPT: Yes, there is often a gap between domain models and programming models. Domain models represent the concepts and relationships within a specific field or domain, while programming models are concerned with representing these concepts in code. Bridging this gap can be challenging as the concepts in the domain model may not have a straightforward mapping to code, and vice versa. It requires a good understanding of both the domain and the programming language to effectively translate between the two.

The most irrefutable proof of truthiness one can hope for in 2023!

This kind of gap is fundamental and deep, pervading all fields of study. Philip Anderson, a Nobel Laureate in Physics, eloquently argued in his 1972 article that "more is different" and that the "ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe". When dealing with the dual difficulties of scale and complexity, he cautions that new properties emerge at each level of complexity, needing different abstractions (fundamental laws) to explain the new behaviors.

As we tackle complex domains, attempting to model emergent behaviors at each level of complexity, shoehorning them into existing programming models, most of which were originally designed to deal with memory representations and machine code generation, leads to cognitive dissonance and compromised models. Wrong models lead to wrong systems, period! The saying "You can't fix stupid" is never truer than in this context.

Not all programming languages are the same. Each one was designed with a different purpose in mind. Over time, they evolve to support multiple paradigms, and developer communities build better abstraction layers on top of them to address the aforementioned gap. From a systems perspective, what is important is how we tame domain complexity and handle emergent behaviors.

The time-honored way to solve complex problems is to decompose them into smaller, simpler ones and solve them independently. The way to build complex systems, then, would be to compose them from simpler ones. This is a fact we might be willing to accept as self-evident. But as Anderson pointed out, "a reductionist hypothesis does not imply a constructionist one". Since modularity, coupling, cohesion, and information hiding have been part of the software engineering vocabulary for ages, and modular organization of code is nearly universal practice, we naively assume that we are composing software systems when, in fact, we merely decompose them. We will explore this subtle but crucial difference in more detail in another chapter. But for now, I will simply state that the composition models across programming paradigms require a different set of abstractions.

All programming languages provide means to model structure in terms of values (entities, value objects, classes, etc.) and relations (inheritance, aggregation, etc.), typically encoded in a type system. They also support modeling the interactions using interface definitions (functions, protocols, interfaces, traits, instance methods, etc.). The behaviors emerge from dynamic states of the system that change as a consequence of computations initiated through interfaces. We can flex the powers of type systems and modern compilers to go a long way toward modeling and validating systems. If you are curious, there is a delightful series on Designing with types by Scott Wlaschin that might be of interest to you. I will continue with my deliberations on dynamic states and encapsulating emergent behaviors here.
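To make that concrete, here is a minimal sketch in Python of the "make illegal states unrepresentable" idea that Wlaschin's series develops. The order states and transitions here are hypothetical, purely for illustration:

from dataclasses import dataclass
from typing import Union

# Each lifecycle state of an order is its own type, so an invalid
# transition (e.g., shipping an unpaid order) cannot even be expressed.
@dataclass
class Draft:
    items: list[str]

@dataclass
class Paid:
    items: list[str]
    payment_id: str

@dataclass
class Shipped:
    items: list[str]
    payment_id: str
    tracking_number: str

Order = Union[Draft, Paid, Shipped]

def pay(order: Draft, payment_id: str) -> Paid:
    return Paid(order.items, payment_id)

def ship(order: Paid, tracking_number: str) -> Shipped:
    return Shipped(order.items, order.payment_id, tracking_number)

# A type checker will reject ship(Draft([...]), ...) outright.
print(ship(pay(Draft(["book"]), "pay-42"), "TRK-7"))

A type checker such as mypy turns these state definitions into static guarantees: the dynamic states are still there, but only the legal ones can be constructed.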

Data modeling has a long and storied history. It is a rich and well-established field supported by a thriving database community. Databases and their schemas have been powering most of the systems out there. When we refer to a model in an application context (the Model-View-Controller (MVC) pattern, for example), we are often referring to the data, usually stored in some database. Yet there seems to be a divide between the modeling aspects of data pertaining to the data layer and those pertaining to the computation (or application logic) layer. One focuses more on data at rest while the other is concerned with data in motion (transition). But both have to deal with the dynamics of the system (the changing states), or emergent behaviors.

The data community does this by shoving more status fields into their tables and documents, while the application community deals with them by writing truckloads of code in the name of controllers and logical blocks. The very essence of our system lives in the wild west of boilerplate code that is built to tie these disparate worlds together.

Can we have a unified model of a domain that crosses client, server, middleware, database, and the other artificial boundaries we have created to organize our teams, software artifacts, and infrastructure? Can we do this without all the ceremony and fanfare?

In a much simpler past, many of us could standardize on a single programming language and move on. Not any more. Between our web and mobile applications, the multiple public APIs, and the language-specific SDKs that we provide to our customers for accelerated adoption, we end up shoehorning our domain models many times over. It is not just the domain complexity we are up against, but the complexities presented by the realities of the global, distributed, hybrid, virtual, polyglot environments of today. Our domain models have to transcend the very confines of the programming languages whose programming models we want them to be part of!

If you have led your professional life oblivious to the above challenges, don't feel left out. In any growing business, you will soon be forced to confront them. Sleep well tonight!

It is not just about programmer productivity. We fill the gap between the domain user and the programmer with different roles: domain experts, domain consultants, business managers, business analysts, data analysts, and product managers. We equip them with even more automation tools, each churning out its own digital artifacts. Meanwhile, the real developer spends all her time writing boilerplate code to bring all of them together. Official estimates of boilerplate code range from 20 to 30%, but in sufficiently complex projects, it easily exceeds 50%.

A domain model, first and foremost, is a communication tool. It is where we collect, organize, analyze, and refine the tiny, shiny granules of domain wisdom. Domain models help build a common understanding and align goals among the stakeholders. They inform and guide the design, implementation, and validation of the digital systems we build. The simplicity and expressiveness needed for effective stakeholder communication often stand at odds with the implementation details that creep into the programming models. For all the allegiance we pledge to working software over comprehensive documentation, we end up doing both, by different people.

As we digitize domain knowledge, we are at risk of burying more and more of our organizational knowledge in code. In any modernization project in any organization with some history, there is always a spreadsheet or a piece of software that nobody wants to touch. Everyone knows it is important, but no one knows how it works. Marvels of modern architecture built around leaking legacy sewage! The modern systems we build today are the dark abyss of the organizational know-how of tomorrow. Just think about the amount of knowledge that is buried in the data models, database schemas, spreadsheets, and code repositories of our organizations.

In biology, we study changes to organisms (biological systems) from developmental and evolutionary perspectives. So far, we have been looking at systems and their models from a developmental perspective, exploring their changes over a single lifespan. But all systems evolve over generations. Building models resilient to change under the evolutionary pressures of the systems they model is key to their success. Just as we avoid under- or over-fitting our models to data in our machine learning (ML) systems, we have to be careful about how well we fit our models to the requirements of the systems. Build for change is a mantra that we profess with passion, but pursue with extreme prejudice. Just as we concluded that emergent behaviors need different kinds of abstractions, we have to explore abstractions that enable us to deal with the evolution of systems as well.

Crossing the chasm

It is fair at this point to ask, "What do we want, really?".

  • We want to apply a systems lens to building software.3
  • We want powerful domain models that can capture the structure and emergent behaviors of complex systems.
  • The systems and their models should be composable.
  • The domain models should be simple, expressive and enable effective communication between stakeholders, especially bridging the chasm between the domain user and the programmer worlds.
  • We want better abstractions built on our programming models to reduce the degrees of separation between the specifications and working software.
  • Our models should transcend the artificial boundaries of programming languages, implementation details, software and organization structures, and deployment environments.
  • Models capture organizational learning and knowledge. They should not be buried in code.
  • We want to build models that are resilient to change.

We will explore how we can achieve these goals in the upcoming chapters. A journey that will take us through models, programming languages, type systems, knowledge representation, state machines and even polynomials!

Why, I wonder?

We have been building software systems for ages now. Do we really need to bother?

I would like to ask a counter question. Do we really need to spend millions of dollars to build a new chat application? Can't we just Google things?

Apparently, even Google does not think so.

Keeping with my stated intention of focusing the discussions here on technology and solutions, I will not attempt an elaborate business case based on lost productivity, time to value, or any number of other flavor-of-the-day metrics. If this is an impediment to your appreciation of the subject matter, please do reach out to me.

We are a species that progressed by building better tools and systems. It is our survivalist instinct. It is what makes us who we are.

Prof. Robert Sapolsky, in his lecture Uniqueness of Humans, said it best. "The more clearly, absolutely, utterly, irrevocably, unchangeably clear it is that it is impossible for you to make a difference and make the world better, the more you must."

So, we must!

Request for feedback

I would like to hear from you.

Please share your comments, questions, and suggestions for improvement with me!


1. This book is a classic and a must read for any self-respecting programmer.

2. It is an unwritten rule that one can't discuss models without an obligatory quote from Albert Einstein.

3. Systems analysis and design methodologies have been used in software development for a long time. I hope the practitioners of these methodologies can appreciate the differences in perspectives here.


None of Your Business

Freeman Dyson recounted a conversation with Enrico Fermi in 1953, when he was asked how many arbitrary parameters he had used for his calculations. When he answered "four", Fermi famously replied, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."1

Those parsimonious old geezers and their Occam's razor!

GPT-3 has 175 billion parameters. No wonder it can fit the world and all the bad jokes in it.

When it comes to building software and modeling business domains, we hardly think about the number of parameters we are using. We never hear about tensions boiling over between business development and engineering about the number of parameters used in our models. Probably a good thing if you are an engineer working at Twitter.

A notable exception here is the Machine Learning (ML) community. With the success of Large Language Models (LLMs) like GPT-3, the number of parameters has become a fascination and a fad. We are already way past the trillion-parameter mark, and there are always rumours of an even bigger model in the works. Apparently, as far as the future of AI is concerned, size matters!

Even if you perceive these advances with fearful trepidation about the rise of sentient machines, or with a sense of disappointment that human intelligence might be reduced to mere statistical regularities, or with sheer skepticism about the technology hype cycle, you can't ignore the influence these advances are having on how we build our digital systems. It is important to explore how they play into the domain and programming model challenges that I alluded to in my previous article.

First and foremost, in a world where systems can learn by themselves from trillions of tokens on the web, encoding the collective human knowledge into weights and biases in hidden layers of some artificial neural entanglement buried deep beneath a friendly prompt, why do we need domain models at all?

The weights and biases that define us

Richard Sutton2 points out the bitter lesson from 70 years of AI research: methods relying on scalable computation, such as search and learning, always outperform methods attempting to hand-engineer human knowledge of the domain. In his words, "We have to learn the bitter lesson that building in how we think we think does not work in the long run."

In other words, letting go to grow up in the real world applies equally well to your kids and systems.

To understand what it all means, we need to dig a little deeper into the world of modern ML systems. I will try to keep this at a high level, but shall provide sufficient pointers to details for those who are interested. Another note of caution: this is a rapidly evolving field where innovations are happening at a breakneck pace, and our understanding of how things work is shifting at an equal pace. It is almost impossible to be current on all the advancements, let alone predict their impact on the future of systems engineering.

It is also a very diverse field. Our discussion here is not intended to be a survey or tutorial of the field. Our approach will be to understand the implications for systems engineering by exploring some of the most prominent trends in this field, specifically from the vantage point of Large Language Models (LLMs).

In the traditional world I discussed in the previous article, when we set out to build a digital system, we construct a model that captures the essence of the actual system we are trying to build. We model the entities, their relationships, states, and behaviors. We then immortalize this model in some programming language. Here we are in charge of defining everything. Even though there are differences in the way a domain model is captured by a domain expert and by a software engineer, the model is still a human construct that has close correlations in both worlds. From a representation perspective, both use the representational systems that we have developed for ourselves for the very same purpose.

For example, a domain specification might say something like, "A user has a name, email address, and a role. The role can either be regular or admin." A pretty dry domain description that we are all used to. This can be translated into a programming language as follows:

/// User with name, email, and role
struct User {
    name: String,
    email: String,
    role: Role,
}

/// Role definitions for a user
enum Role {
    Regular,
    Admin,
}

The above code is written in Rust, a modern systems programming language. Even if you don't know Rust, you can probably recognize the representation of the domain knowledge captured by the code and its relation to its corresponding statement form.

We can create a user, print the information about that user and make sure that it looks exactly like what we expected. You can run all the Rust code right here by clicking on the "Run" button [▶︎] inside the code block.3

/// User with name, email, and role
#[derive(Debug)]
struct User {
    name: String,
    email: String,
    role: Role,
}

/// Role definitions for a user
#[derive(Debug)]
enum Role {
    Regular,
    Admin,
}

fn main() {
    // Create a user "John Doe"
    let user = User {
        name: "John Doe".to_string(),
        email: "john@doe.com".to_string(),
        role: Role::Regular,
    };

    // Print the user information
    println!("User: {:?}", user);
}

Here is the same code in Python:

from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    REGULAR = 1
    ADMIN = 2

@dataclass
class User:
    name: str
    email: str
    role: Role

user: User = User("John Doe", "john@doe.com", Role.REGULAR)

print(user)

You can't run the Python code directly here, but you can run it at Google Colab by clicking this badge Open In Colab.

While the specific details are irrelevant to our current discussion, one point to note is that the model of the world reflects our way of understanding and representing it, and the code follows closely. After all, that is why we invented higher-level programming languages in the first place.

The most important point here is that the model contains information that we have explicitly captured through our understanding of the world. We demand nothing more and expect nothing less from the digital system that uses this model. If we want users to have an address, for example, we have to explicitly add it to the model.
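For instance, here is what that explicit change looks like, repeating the Python model with one hand-added field (the sample address is, of course, just an illustration):

from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    REGULAR = 1
    ADMIN = 2

@dataclass
class User:
    name: str
    email: str
    role: Role
    address: str  # new field, added by hand; the model learns nothing on its own

print(User("John Doe", "john@doe.com", Role.REGULAR, "221B Baker Street"))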

Similarly, the structural relationships are hardwired into the model. A user has only one role because we specified and coded it that way. We have handcrafted the features to be used in the model. This is what Rich Sutton was alluding to when he talks about humans building knowledge into agents.

Essentially, the system has no learned representations or emergent behaviors.

This is quite different in the land of the LLMs. Let us try to understand what happens in the world of LLMs4.

Token like you mean it

Whether you are using an existing LLM or building a new one, the first step is to find a way to tokenize your inputs. Tokenization is the process of breaking down the input into smaller pieces, encoding them as numbers, and creating a dictionary of these tokens. It is like learning to break down sentences into words in a new language and then creating a numeric index for each word. There are many ways to do this, and most of these tools are readily available for use5.

If we feed our domain specification to the GPT-3 tokenizer, we will get the following sequence of 22 tokens:

Input Text: 
A user has a name, email address, and a role. 
The role can either be regular or admin.

Output Tokens: 
[32, 2836, 468, 257, 1438, 11, 3053, 2209, 11, 290, 257, 
 2597, 13, 383, 2597, 460, 2035, 307, 3218, 393, 13169, 13]

* I have split the text and tokens into lines for easy reading. They are just a sequence of characters and a sequence of numbers respectively.

Different tokenizers give different tokens. I have included some sample code in the companion notebook for you to see these tokenizers in action Open In Colab. Depending on the volume and type of data the tokenizers were trained on, they will have different vocabulary sizes. The GPT tokenizer has a vocabulary size of 50,257 and BERT (another popular open source LLM from Google) has a vocabulary size of 30,522. The vocabulary size influences the quality and performance of learning as well as the size of the token sequence generated for a given input. A large vocabulary increases the memory and time complexity of training while reducing the number of tokens in a sequence.
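If you want a taste of this outside the notebook, here is a minimal sketch using the Hugging Face transformers library (my choice here, not necessarily what the notebook uses; any tokenizer library will do). GPT-3 shares its tokenizer with GPT-2, so the "gpt2" tokenizer reproduces the token IDs shown above:

# Compare two tokenizers on our domain specification.
from transformers import AutoTokenizer

text = ("A user has a name, email address, and a role. "
        "The role can either be regular or admin.")

for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: vocabulary size = {tok.vocab_size}, tokens = {len(ids)}")
    print(ids)

Note how the same sentence becomes a different number of tokens under each vocabulary, which is exactly the trade-off described above.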

When we are dealing with modalities other than text, such as images, audio, or video, we will have other schemes to split the input into encoded sequences6.

As you can see, tokens are important because we need them to get anything in and out of LLMs and any systems that rely on them. Different LLMs use different tokenizers, and each tokenizer generates different token sequences and has a different vocabulary size7. There are cases when you may want to modify existing tokenizers or even create new ones altogether. There is so much more to tokens than what my short description and code snippets convey.

In case any of you are wondering if this is any of your business to know, let me give you one more perspective.

Since companies like OpenAI charge by the number of tokens, it is important for us to have a basic understanding of tokens. At the time of launch, the ChatGPT API charged $0.002 per 1000 tokens. As per their rule of thumb, 1000 tokens translate to approximately 750 words in common English. Brings a new perspective to the phrase "use your words carefully", doesn't it?
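To put that in perspective: at 750 words per 1000 tokens, a 75,000-word book comes to roughly 100,000 tokens, or about $0.20 to push through the API once.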

To make a long story short, inputs to LLMs are long sequences of tokens. These tokens are precious because they are not mere tokens of information exchange, but real dollars and cents. Soon enough, they are going to show up in your cost of services and OpEx, at the center of your buy vs. build decisions, and as the target of your operational efficiency initiatives.

It will soon be everybody's business!

Tell me I forget, Embed me I learn

We have created tokens and given them identities. But they lack any meaning in their lives. They are just like us. A social security number or student ID card gives us identity. But we find meaning in our lives through our interactions with the people and places around us. It is our shared experiences that enrich us. We just need to let our tokens do the same by creating a space where they can find each other, learn from one another, build their character and in the process discover their own meaning and place in the world.

In the machine learning world, this process is called embedding. And the space where they are embedded is simply known as the embedding space. Obviously, there was no marketing department involved when they were naming these things!

Embeddings are numerical representations of concepts converted to number sequences. That sounds awfully similar to the tokens we just talked about, except that they are much more.

When we talk about people, we are all used to phrases like "the depth of their character", "the breadth of their knowledge", or "being plain or a square". We are alluding to some dimensionality, using some abstract coordinate system, in these metaphors. Or, more concretely, when we describe a point in space with its x, y, and z coordinates. What is good for the goose might work for the gander. What if we represented the tokens' meanings using values in a number of dimensions? Since we are not sure what these tokens might learn and how many dimensions are actually needed to represent their learning, we just choose an arbitrarily large number of dimensions that we can afford (computationally). In mathematical terms, this is a vector, which you might remember from your linear algebra or physics class. With some proper training, we can hope that each of these tokens will learn its coordinates in this high-dimensional space. And that is, in essence, the art and science of embedding: transforming identifiable tokens into semantically rich representations.

In case you are worried about the way we are mixing metaphors and mathematics, I will let you in on a secret.

Machine learning is an alchemy where we mix precise mathematical equations with intuitive metaphors poured over a large cauldron of data subjected to extreme computational power in the hope that it all turns into a potion of immortal knowledge.

If that statement makes you uncomfortable, you should stick to something more definite like stock market speculation.

It might be helpful, at this stage, to have an idea of what these embeddings look like. If we send our original domain specification to the OpenAI API and ask for its embedding, we will get a long list of numbers.

Input Text: 
A user has a name, email address, and a role. 
The role can either be regular or admin.

Number of tokens: 22
First 10 numbers of the embedding:
[-0.0041143037, -0.01027253, -0.0048981383, -0.037174255, -0.03511049, 
 0.016774718, -0.0476783, -0.022079656, -0.015676688, -0.019962972]
Size of the embedding: 1536

We get a vector of size 1536. That is the dimensionality of OpenAI embeddings8. Please note that this is the combined embedding for the entire text that is formed from the individual embeddings of each token in it.
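For reference, here is a sketch of the call that produces such a vector. The client interface has changed across versions of the openai Python library; this follows the v1 style and assumes an OPENAI_API_KEY is set in your environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = ("A user has a name, email address, and a role. "
        "The role can either be regular or admin.")

response = client.embeddings.create(model="text-embedding-ada-002", input=text)
embedding = response.data[0].embedding

print("Size of the embedding:", len(embedding))  # 1536 for ada-002
print("First 10 numbers:", embedding[:10])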

If we send just one token, say "user", here is what it will look like:

Input Text: user

Number of tokens: 1
First 10 numbers of the embedding:
[-0.004338521, -0.015048797, -0.01383888, -0.018127285, -0.008569653, 
 0.010810506, -0.011619505, -0.01788387, -0.02983986, -0.013996384]
Size of the embedding: 1536

It has the same dimensions, but different values for the embedding. You can see them in their full glory in the Embedding section of the Python notebook Open In Colab.

Looking at those numbers, we can only hope those tokens were really enriched. Because, all I can say, at this stage, is that I am poorer by $0.046.

It is very difficult to visualize such high-dimensional data. But we can apply some dimensionality reduction techniques and visualize the embeddings in 2D or 3D spaces. Here is what we are going to do:

  • We use a sample dataset containing 200 text descriptions, each with a category describing what the text is about. These samples are taken from a curated dataset9. Here are some examples:

    Text: " Oxmoor Center is a Louisville Kentucky shopping mall located at 7900 Shelbyville Road in eastern Louisville."
    Category: "Building"

    Text: " Sten Stjernqvist is a Swedish former footballer who played as a forward."
    Category: "Athlete"

  • The sample dataset has 14 categories.

  • We then set up a language model that was pretrained for sentence embedding, similar to GPT, but one that fits our wallet (free) and compute (we can actually run it from a notebook on Google Colab)10.

  • We ask the model to generate embeddings for each text description from our sample. We get embedding vectors of size 384, about one-third the size of the OpenAI embedding.

  • We reduce the dimensionality to 3 and plot it (a code sketch of this pipeline follows the list). The first plot shows all 200 text descriptions in 14 categories, and the second one shows data for just 5 categories.
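Here is a minimal sketch of that pipeline, assuming the sentence-transformers and scikit-learn packages are installed; the model name is an illustrative choice of a free, 384-dimensional sentence embedder, and PCA stands in for whichever reduction technique the plots actually used:

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Two descriptions quoted above, plus two invented placeholders for variety.
texts = [
    "Oxmoor Center is a Louisville Kentucky shopping mall located at 7900 Shelbyville Road in eastern Louisville.",
    "Sten Stjernqvist is a Swedish former footballer who played as a forward.",
    "A placeholder description of an athlete, for illustration only.",
    "A placeholder description of a building, for illustration only.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings
embeddings = model.encode(texts)                 # shape: (4, 384)

pca = PCA(n_components=3)                        # squeeze 384 dimensions down to 3
points_3d = pca.fit_transform(embeddings)
print(points_3d)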

Fig 2 is interactive, so feel free to explore it closely. The companion Python notebook Open In Colab has interactive 3D plots for both.


Fig 1. Embeddings of 200 text descriptions belonging to 14 categories