Protecting our environment is one of the most urgent issues that we as a society need to address.
IT companies in particular should therefore not ignore this issue but actively seek solutions. Ensuring ecological well-being is one of the seven guidelines of our Ethic Policy, to which we at Leftshift One have committed.
The importance of this point is undeniable. In the AI field especially, there are trends and developments that raise questions about the ecological balance and need to be addressed.
“By 2025, AI processes could claim ten percent of the global energy consumption of data centers.” – Gary Dickerson, Applied Materials
Energy requirements become a pressing concern once AI is widely deployed by enterprises, and even more so once enterprises train AI models themselves. For this reason, we need approaches and solutions now. There is also a widespread opinion that more data and more model parameters are the way to better AI models, a controversial view on which AI experts disagree.
The energy required to build AI models is a question we urgently need to address. AI model training is already a significant driver of CO2 emissions. If the trend towards more data and more model parameters continues, these emissions will likely increase. In a pattern reminiscent of Moore’s Law but much faster, the computational cost of AI applications has so far doubled approximately every 3.5 months.
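To get a feeling for what a doubling period of 3.5 months means, the growth can be worked out directly. The following is a small sketch; the function name growth_factor is an assumption for illustration, and the 24-month period used for the comparison is a common reading of Moore's Law, not a figure from this article.

```python
def growth_factor(months: float, doubling_period_months: float = 3.5) -> float:
    """Factor by which a quantity grows in `months` if it doubles every `doubling_period_months`."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Growth of AI compute cost over one year at a 3.5-month doubling period.
    print(f"AI compute, 12 months: {growth_factor(12):.1f}x")
    # Comparison: one year under an assumed 24-month doubling period.
    print(f"24-month doubling, 12 months: {growth_factor(12, 24):.2f}x")
```

Under these assumptions, the cost grows by roughly an order of magnitude per year, which is why the countermeasures below matter.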
To counter this trend, three countermeasures come to mind:
- Explore new AI learning methods that circumvent the need for immense amounts of data and computing capacity. Zero-shot, one-shot, and few-shot models are already moving in this direction.
- Reuse pre-trained models and fine-tune them for a specific use case if needed (transfer learning).
- Optimize machine learning models with respect to training and inference.
The first countermeasure is one of the main research areas of Leftshift One. The second and third are already enabled by our AIOS. The Skill Store offers a large number of pre-trained AI models, which can be adapted to specific requirements if needed. The effort for basic training is thus avoided, and the model can be adapted via transfer learning.
Within the third countermeasure, we need to distinguish between training time optimization and inference time optimization. For the former, the AIOS provides dedicated training mechanisms such as Early Stopping, Gradient Accumulation, Gradient Clipping, and Auto Scaling of Batch Size.
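Of the training mechanisms mentioned, Early Stopping is the easiest to illustrate. The following is a minimal sketch in plain Python and not the AIOS implementation; the class EarlyStopping, the helper epochs_run, and the parameters patience and min_delta are assumptions chosen for illustration.

```python
class EarlyStopping:
    """Signals a stop once the validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # New best value: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs are executed before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

Every epoch that is skipped because the validation loss has stalled is compute, and therefore energy, that is not spent.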
The AIOS supports inference time optimization by automatically optimizing fully trained models. There are several approaches for this. In this article, we focus on optimization using ONNX, a standard format for neural network models that can also be used to enable Edge AI.
What is Edge AI?
Basically, this term implies that data should be processed where it is generated. This means that the processing of data should take place entirely or partially on an edge device (e.g. a smartphone). A large part of the data that is processed using AI is currently processed in large data centers. However, with Edge AI, a trend is emerging that aims to break this pattern of thinking.
According to Gartner, only 10% of enterprise data was processed outside centralized data centers in 2018, while this figure is expected to rise to 75% by 2022. This is not to say that Big Data is being processed on edge devices. Rather, personal data is processed or pre-processed directly on the edge device. Edge AI thus offers the following advantages:
- Privacy & Security: Sensitive data does not leave the edge device.
- Reduced latency: Data does not have to be sent to the backend.
- Load balancing: The AI model is not executed on a single backend system but is distributed across multiple edge devices.
- The AI models can be used in a wide variety of environments.
- Offline processing: Data can be processed offline and is sent to the backend as soon as the device is back online.
For AI models to run smoothly on edge devices, they must be optimized. Since edge devices usually have different hardware characteristics than server systems, AI models must be adapted accordingly. The aforementioned Open Neural Network Exchange standard, or ONNX for short, addresses this point: it allows AI models to be run as Edge AI models.
An important point here is that if you can optimize AI models for an edge device, you can also optimize them for the backend. This can be used to create more efficient backend models. In the following, I present two use cases that demonstrate the benefits of Edge AI using ONNX:
Use Case 1: Image Similarity
A Java application needs functionality to compare images by the similarity of their content. The Skill Store already contains a machine learning model that fulfills this task, but it is implemented in Python. Since the solution must run in Java, this model must be made executable from Java.
Before we discuss the conversion, I will briefly describe the model. For this example, we use an image similarity detection model. There are several possible neural architectures for such a model; I chose a Convolutional Autoencoder.
Autoencoders are neural architectures that are trained to reconstruct their input as accurately as possible. The model is divided into two components, the encoder and the decoder. The encoder converts the input into a latent representation of the image.
The decoder uses this latent representation to reconstruct the input as well as possible. For the visual comparison of images, the encoder output is suitable, since it can be assumed to contain all relevant information about the image in tensor form. These image embeddings can subsequently be compared using distance functions (e.g. cosine distance). The smaller the distance between the embeddings, the more similar the images are.
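The comparison step can be sketched in a few lines. This is a plain Python illustration of the cosine distance, assuming the embeddings are available as flat lists of floats; the function name cosine_distance is chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two embeddings: 0 for the same direction, 1 for orthogonal vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on the direction of the vectors, two embeddings that differ merely in magnitude are treated as identical.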
The structure of this architecture is beyond the scope of this article. It is a standard autoencoder built with PyTorch. Here we focus on converting this model, which was created with a Python framework, into the ONNX format and then executing it with the ONNX Java Runtime.
ML models are converted into the ONNX format by performing an example inference. It is irrelevant whether the data flowing through the model is meaningful or not. The only important thing is that the structure (shape) of the synthetic data corresponds to that of the real data. For the Image Similarity model, we know that the following shape is expected as tensor input:
4 Dimensions (Batch x Color x Height x Width)
It is also relevant to know that the dimensions 0 (Batch), 2 (Height) and 3 (Width) are dynamic and can therefore be declared as dynamic axes, so that the ONNX model accepts inputs of varying size along them. The ONNX framework analyzes the model to be converted and assigns default names to its inputs and outputs. For debugging purposes, however, we set these names manually.
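A conversion along these lines can be sketched with PyTorch's torch.onnx.export. This is a hedged sketch, not the original script: the function names build_export_options and export_to_onnx, the axis labels batch, height and width, the 224 x 224 size of the synthetic input, and the assumption that only the batch axis of the output is dynamic are all chosen for illustration.

```python
def build_export_options():
    """Names and dynamic axes handed to the exporter (axis labels are illustrative)."""
    return {
        "input_names": ["input"],
        "output_names": ["output"],
        "dynamic_axes": {
            # Batch, Height and Width may vary; Color (axis 1) stays fixed.
            "input": {0: "batch", 2: "height", 3: "width"},
            # Assumption: only the batch axis of the embedding output varies.
            "output": {0: "batch"},
        },
    }


def export_to_onnx(model, path="image-similarity.onnx"):
    """Trace `model` with a synthetic input and write it to `path` in ONNX format."""
    import torch  # imported lazily; requires the PyTorch package

    model.eval()
    # Synthetic data: only the shape (Batch x Color x Height x Width) matters.
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy_input, path, **build_export_options())
```

Calling export_to_onnx(encoder) with the trained encoder part of the autoencoder would then write the ONNX file.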
Running the conversion produces a file named image-similarity.onnx. This file contains all the information needed to perform the inference of the Image Similarity model. It can be run with any ONNX runtime, which also allows us to run the model as an Edge AI model directly in the browser or on a smartphone.
In the exported model, the first component is named Input and the last component is named Output. Furthermore, the dynamic dimensions 0, 2 and 3 have been given names.
Next, we execute the model from Java code. For this, we first need to create the ONNX environment in Java.
We then need code that converts an image into an OnnxTensor. The tensor must have dimensions that are compatible with the ONNX model. As mentioned before, these are (Batch, Color, Height, Width), where only the Color dimension is fixed at three. The conversion takes a BufferedImage as input and produces an OnnxTensor.
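The layout logic behind this conversion is independent of the programming language. As an illustration, here is a minimal sketch in plain Python rather than Java: it rearranges row-wise RGB pixels into the (Batch, Color, Height, Width) order. The function name to_nchw and the scaling of pixel values to the range 0 to 1 are assumptions; the real model may expect a different normalization.

```python
def to_nchw(pixels):
    """Convert rows of (r, g, b) tuples (values 0..255) into a nested list
    of shape [1][3][height][width] with values scaled to 0..1."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # One height x width plane per color channel.
    channels = [[[0.0] * width for _ in range(height)] for _ in range(3)]
    for y, row in enumerate(pixels):
        if len(row) != width:
            raise ValueError("all pixel rows must have the same width")
        for x, rgb in enumerate(row):
            for c in range(3):
                channels[c][y][x] = rgb[c] / 255.0
    # Wrap the planes in a batch dimension of size one.
    return [channels]
```

In Java, the same loop would fill a float buffer that is then wrapped in an OnnxTensor with the shape {1, 3, height, width}.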
The last step is to perform the actual inference. Here we pass the input to the model under the name provided for it (input) and get a result under the predefined name output. Furthermore, we know that the output of the model is a flattened vector with the dimensions batch and embedding. Since we expect only one batch, we can read the embeddings directly.
We now have a model running in Java that was originally created in Python. ONNX allows us to translate ML models into a standard format that can be executed using different runtimes. These models can also be optimized for the target runtime during the conversion. We are therefore able not only to run ML models as Edge AI models, but also to optimize them for more performant execution in the backend. This use case shows two benefits:
- The existing AI model can be reused
- Data does not need to be transmitted to the backend system, saving energy
Use Case 2: Semantic Search
The second example deals with a semantic search business case, where the specific task is to find those phrases from a large pool of documents that best match a text query.
In order to compare text phrases with respect to their semantic similarity, the individual sentences of the document are converted into a vector form. These vectors can be checked for similarity using a cosine distance function. The smaller the distance between the individual vectors, the more similar the texts are.
This task is performed by an NLU skill based on BERT, which transforms the text into a vector form. This process is called sentence embedding.
In the next step, all sentences of all documents are converted into sentence embeddings and persisted together with the document ID in a document database. Elasticsearch is a good choice for this purpose because it scales well.
The standard procedure would be that a user sends a query to the backend system, which forwards the query to the skill, which returns the sentence embedding. This sentence embedding is then compared with all embeddings within the document database to find the semantically most similar results.
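The comparison against all stored embeddings can be sketched as a brute-force ranking. This is a plain Python illustration, not the way a document database performs the search; the function names normalize and top_k and the in-memory index of (document ID, sentence, embedding) tuples are assumptions for illustration.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals the cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k(query_embedding, index, k=3):
    """Return the k entries of `index` with the smallest cosine distance to the query.

    `index` is an iterable of (doc_id, sentence, embedding) tuples.
    The result is a list of (distance, doc_id, sentence), closest first.
    """
    query = normalize(query_embedding)
    scored = []
    for doc_id, sentence, embedding in index:
        candidate = normalize(embedding)
        distance = 1.0 - sum(q * c for q, c in zip(query, candidate))
        scored.append((distance, doc_id, sentence))
    return heapq.nsmallest(k, scored)
```

A production system would delegate this ranking to the document database instead of scanning every embedding in application code.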
However, the specific use case has a requirement that, for data privacy reasons, the text query should not be sent to the backend system as plain text. To meet this requirement, we move the skill call to the edge device so that only the sentence embedding is subsequently sent to the backend.
Since we want to reduce the model size for running on an edge device while also improving the inference time, we will optimize the model for edge AI use. In this regard, we use the quantization feature provided by ONNX.
The transformer models used in natural language processing (NLP) are mostly large AI models:
- BERT-base-uncased: ~110 million parameters
- RoBERTa-base: ~125 million parameters
- GPT-2: ~117 million parameters.
Each parameter is a floating point number that requires 32 bits (FP32). Furthermore, these models also have a fairly high demand on the hardware that performs the inference computations. These challenges make it difficult to run transformer models on client devices with limited memory and computational resources.
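The memory needed for the weights alone follows directly from these numbers. The following sketch does the arithmetic; the function name weight_size_mb is an assumption, and the result ignores any overhead of the file format or runtime.

```python
# Approximate parameter counts as listed above.
PARAMETER_COUNTS = {
    "BERT-base-uncased": 110_000_000,
    "RoBERTa-base": 125_000_000,
    "GPT-2": 117_000_000,
}


def weight_size_mb(parameter_count, bits_per_parameter=32):
    """Size of the raw weights in megabytes (1 MB = 1,000,000 bytes)."""
    return parameter_count * bits_per_parameter / 8 / 1_000_000


if __name__ == "__main__":
    for name, count in PARAMETER_COUNTS.items():
        print(f"{name}: {weight_size_mb(count):.0f} MB in FP32")
```

For BERT-base-uncased this comes to roughly 440 MB of FP32 weights, which is a considerable download and memory footprint for a smartphone or browser.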
At the same time, however, growing awareness of data privacy and the cost of data transmission make running inference on the edge device attractive. Latency and cost are also very important in the cloud, and any large-scale application must be optimized for both.
Quantization and distillation are two techniques commonly used to address these size and performance challenges. These techniques are complementary and can be used together.
We take advantage of distillation by using an already distilled BERT model (bert-distilled). Hugging Face provides an excellent description of distillation. As mentioned before, we use the existing ONNX feature for the quantization step.
Quantization approximates floating point numbers with numbers of smaller bit width, drastically reducing memory requirements and improving performance. Quantization can result in accuracy losses because fewer bits limit the precision and range of values. However, researchers have shown extensively that weights and activations can be represented with 8-bit integers (INT8) without significant loss of accuracy.
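To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a simplified scheme chosen for illustration and not necessarily the one ONNX Runtime applies internally; the function names quantize_int8 and dequantize are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # The largest magnitude is mapped to 127; an all-zero tensor keeps scale 1.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each value is stored in 8 instead of 32 bits, and the rounding error per value is at most half the scale factor, which is the precision loss mentioned above.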
Compared to FP32, the INT8 representation reduces data storage and bandwidth by a factor of four, which also reduces energy consumption. In terms of inference performance, integer math is more efficient than floating point math. By using the quantization of ONNX Runtime, we can achieve significant performance gains compared to the original model.
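With the onnxruntime Python package, dynamic quantization of an exported model can be sketched as follows. This is an illustration under assumptions and not the AIOS pipeline: the function names quantize_onnx_model and size_reduction_factor and any file paths passed to them are hypothetical.

```python
import os


def quantize_onnx_model(fp32_path, int8_path):
    """Write an INT8-quantized copy of the ONNX model at `fp32_path` to `int8_path`."""
    # Imported lazily; requires the onnxruntime package.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(fp32_path, int8_path, weight_type=QuantType.QInt8)


def size_reduction_factor(original_path, quantized_path):
    """How many times smaller the quantized file is compared to the original."""
    return os.path.getsize(original_path) / os.path.getsize(quantized_path)
```

Comparing the two files with size_reduction_factor is a quick way to check how close the result comes to the theoretical factor of four.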
The speedup over the original PyTorch model is due to both the quantization and the speedup provided by ONNX Runtime. After converting the original PyTorch FP32 model to the ONNX FP32 format, the model size was almost the same, as expected. After quantization, the model size was reduced by a factor of four. This use case shows two benefits:
- Real data does not leave the edge device (only sentence embedding is sent to the backend).
- Reduced load on the backend system, as the creation of the embedding is performed locally on the edge device.
The AIOS offers the possibility to automatically convert existing models into ONNX models and to perform quantization. Furthermore, the AIOS can be used to manage and distribute these ONNX models. If needed, the preprocessing of the data can also be created together with the ONNX model as an atomic unit.
As we have seen from the two use cases, we can already use Edge AI productively in a simple way. With this approach, we can create highly efficient hybrid systems where we can split the processing of data between edge devices and backend devices to achieve benefits such as increased throughput or data protection.
More important, however, is creating awareness that a balance can and must be struck between technical progress and energy efficiency. I expect that this topic will continue to grow in importance. Our environment will thank us for it.