# TinyTTA: Efficient Test-time Adaptation via Early-exit Ensembles on Edge Devices

Hong Jia, Young D. Kwon, Alessio Orsino, Ting Dang, Domenico Talia and Cecilia Mascolo



# SAMSUNG Research





## AI/Deep Learning on Edge Devices

Deploying ML on edge devices is becoming popular: it enables real-time data analysis and low-latency responses,
e.g., real-time human health monitoring and robotics.



### **Realistic Scenarios**

- Adaptive ML is essential
- Test-time adaptation (TTA) is a practical solution but challenging



# Test-Time Adaptation



## **Unique Challenges of TTA on Edge Devices**

1. No batch normalization layers are supported on MCUs

2. Adjusting model parameters is expensive in terms of memory and computation

3. Performance is poor with small batch sizes, which are unavoidable when computational resources are limited
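One commonly cited reason behind the third challenge is that normalization based on batch statistics degenerates when the batch holds a single sample. The sketch below illustrates this for fully connected features only. The `batch_norm` helper is a hypothetical, simplified stand-in (no learned scale or shift) and is not taken from the TinyTTA code.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch dimension (no learned scale/shift)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for f in range(num_features):
        column = [sample[f] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        for i, v in enumerate(column):
            out[i][f] = (v - mean) / math.sqrt(var + eps)
    return out

# Two different inputs, each processed as a batch of one.
out_a = batch_norm([[0.5, -2.0, 3.0]])
out_b = batch_norm([[9.0, 4.0, -1.0]])
# With one sample the mean equals the sample and the variance is zero,
# so both inputs collapse to the same all-zero vector.

# Processing the two samples together keeps them distinguishable.
out_pair = batch_norm([[0.5, -2.0, 3.0], [9.0, 4.0, -1.0]])
```

In this toy setting every batch-of-one input maps to the same output, so the layers that follow receive no information about the input.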





#### **Finetune-based**

- Update the entire model
- Suffer from intensive memory usage

#### **Modulating-based**

- Update normalization layers only and freeze other layers
- Suffer from intensive memory usage
- Model collapse with a batch size of one
- Normalization layers are unavailable on MCUs

#### **Memory-efficient TTA**

- Update enabled with low memory on GPUs
- Remain memory intensive on CPUs

### TinyTTA

Efficient, batch-agnostic, and robust TTA on edge devices



- TinyTTA Engine to enable TTA on MCUs
- Early-exit ensemble to co-optimize memory footprint and accuracy

#### Co-optimizes memory footprint and accuracy

Align submodule output

$$\boldsymbol{p}_i^k = \frac{\exp\left(\boldsymbol{z}_i^k\right)}{\sum_{j=1}^C \exp\left(\boldsymbol{z}_j^k\right)} \qquad \mathcal{L}_1 = \sum_{i=1}^C CE\left(\boldsymbol{p}_i, y\right)$$

Align latent representations

$$\mathcal{L}_2 = \|\tilde{\boldsymbol{z}}_k - \boldsymbol{z}_k\|_1$$
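The two alignment losses can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: it assumes the sum in $\mathcal{L}_1$ runs over the early exits (the slide's indices leave this ambiguous), and all function names and example values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy of a probability vector against an integer class label."""
    return -math.log(probs[label])

def output_alignment_loss(exit_logits, label):
    """L_1 sketch: cross-entropy summed over each exit's softmax output (assumed reading)."""
    return sum(cross_entropy(softmax(z), label) for z in exit_logits)

def latent_alignment_loss(z_tilde, z):
    """L_2 sketch: L1 norm of the difference between two latent vectors."""
    return sum(abs(a - b) for a, b in zip(z_tilde, z))

# Hypothetical logits from three exits for a 4-class problem, true class 2.
exit_logits = [
    [0.1, 0.2, 1.5, -0.3],
    [0.0, 0.1, 2.2, -0.5],
    [-0.2, 0.0, 3.0, -1.0],
]
l1 = output_alignment_loss(exit_logits, label=2)

# Hypothetical latent vectors standing in for z-tilde and z.
l2 = latent_alignment_loss([0.5, -1.0, 2.0], [0.4, -0.7, 2.5])
```

Subtracting the maximum logit before exponentiating leaves the softmax value unchanged while avoiding overflow for large logits.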

Weight standardization exits

$$\widetilde{\boldsymbol{W}} = \frac{\boldsymbol{W} - \boldsymbol{\mu}_w}{\boldsymbol{\sigma}_w + \boldsymbol{\epsilon}}$$
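A minimal sketch of the weight standardization formula, assuming the statistics $\boldsymbol{\mu}_w$ and $\boldsymbol{\sigma}_w$ are computed per output channel (the slide does not state the axis). The function names and example weights are hypothetical.

```python
import math

def standardize_weights(weights, eps=1e-5):
    """Return (W - mean) / (std + eps) for one flat list of weights."""
    n = len(weights)
    mean = sum(weights) / n
    std = math.sqrt(sum((w - mean) ** 2 for w in weights) / n)
    return [(w - mean) / (std + eps) for w in weights]

def standardize_per_output_channel(weight_matrix, eps=1e-5):
    """Standardize each row independently (assumption: one row per output channel)."""
    return [standardize_weights(row, eps) for row in weight_matrix]

# Hypothetical 2x4 weight matrix: two output channels, four inputs each.
w = [[0.2, -0.4, 1.0, 0.6], [3.0, 3.5, 2.5, 3.0]]
w_std = standardize_per_output_channel(w)
```

The statistics here come from the weights alone, not from the input batch, so the result is the same for any batch size, including one.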

# **TinyTTA Engine**

- First-of-its-kind TTA engine on MCUs
- Optimized to mitigate resource limitations during TTA



- Backpropagation (**BP**) operator support for TensorFlow Lite Micro
- Layer-wise update strategy to optimize memory efficiency
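One common way to realize a layer-wise update is to apply each layer's weight update as soon as its gradient is available during the backward pass, so that only one layer's gradient buffer is alive at a time. The sketch below illustrates that idea on a tiny linear network. The slides do not describe the engine's internals, so this is an assumption about the general technique, and all names are hypothetical.

```python
def matvec(w, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def backward_with_layerwise_update(weights, activations, grad_out, lr):
    """Walk layers from last to first, updating each one immediately.

    weights[l] is a row-major matrix; activations[l] is the input layer l saw.
    Returns the largest weight-gradient buffer (in elements) that was alive.
    """
    peak_grad_elems = 0
    for l in reversed(range(len(weights))):
        w, x = weights[l], activations[l]
        # Gradient w.r.t. this layer's input, using the pre-update weights.
        grad_in = [sum(w[o][i] * grad_out[o] for o in range(len(w)))
                   for i in range(len(x))]
        # Weight gradient for this layer only (outer product).
        grad_w = [[g * xi for xi in x] for g in grad_out]
        peak_grad_elems = max(peak_grad_elems, sum(len(r) for r in grad_w))
        for o in range(len(w)):
            for i in range(len(x)):
                w[o][i] -= lr * grad_w[o][i]
        del grad_w  # the buffer can be reused by the next layer
        grad_out = grad_in
    return peak_grad_elems

# Tiny hypothetical linear network: 4 -> 3 -> 3 -> 2.
weights = [
    [[0.1] * 4 for _ in range(3)],
    [[0.2] * 3 for _ in range(3)],
    [[0.3] * 3 for _ in range(2)],
]
x = [1.0, 2.0, 3.0, 4.0]
target = [0.0, 1.0]

# Forward pass, recording the input seen by each layer.
activations = []
h = x
for w in weights:
    activations.append(h)
    h = matvec(w, h)

grad_out = [hi - ti for hi, ti in zip(h, target)]  # gradient of 0.5 * squared error
total_params = sum(len(r) for w in weights for r in w)
peak = backward_with_layerwise_update(weights, activations, grad_out, lr=0.01)
```

In this toy network the peak gradient buffer equals the largest single layer (12 elements), not the 27 elements needed to hold all gradients at once.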

### **Experimental Setup**

- Datasets
  - (1) CIFAR10C
  - (2) CIFAR100C
  - (3) OfficeHome
  - (4) PACS

- Architectures
  - (1) MCUNet
  - (2) MobileNetV2\_x05
  - (3) EfficientNet\_b1
  - (4) RegNet-200m

- Baselines
  - (1) Tent (Modulating)
  - (2) Tent (Finetune)
  - (3) EATA
  - (4) CoTTA
  - (5) EcoTTA

- Hardware
  - (1) MCU: STM32H747
  - (2) MPU: Raspberry Pi Zero 2 W





#### Results

TinyTTA achieves up to 57.6% higher accuracy than Tent (Modulating) with a batch size of one.

TinyTTA achieves up to 6x lower memory usage than CoTTA with a batch size of one.

TinyTTA achieves on average 4.3% higher accuracy than a model without updates, with a batch size of one.

TinyTTA is the only framework capable of performing TTA under an MCU's 512 KB memory constraint.

Table 2: MCU deployment of the baseline and TinyTTA on STM32H747 using MCUNet and CIFAR10C.

| System           | Accuracy | SRAM   | Flash | Latency | Energy |
|------------------|----------|--------|-------|---------|--------|
| Inference Only   | 60.2%    | 82.8KB | 290KB | 55.8ms  | 12.7mJ |
| TinyTTA (update) | 64.3%    | 123KB  | 375KB | 50.7ms  | 11.5mJ |
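
The overheads implied by Table 2 can be worked out directly from its values. The short calculation below assumes that the 512 KB constraint quoted on the results slide refers to SRAM; the variable names are illustrative.

```python
# Values copied from Table 2 (STM32H747, MCUNet, CIFAR10C).
inference = {"accuracy_pct": 60.2, "sram_kb": 82.8, "flash_kb": 290.0}
tinytta = {"accuracy_pct": 64.3, "sram_kb": 123.0, "flash_kb": 375.0}

# Assumption: the 512 KB memory constraint applies to SRAM.
SRAM_BUDGET_KB = 512.0

accuracy_gain = tinytta["accuracy_pct"] - inference["accuracy_pct"]
sram_overhead_kb = tinytta["sram_kb"] - inference["sram_kb"]
sram_ratio = tinytta["sram_kb"] / inference["sram_kb"]
flash_ratio = tinytta["flash_kb"] / inference["flash_kb"]
fits_budget = tinytta["sram_kb"] <= SRAM_BUDGET_KB
budget_used = tinytta["sram_kb"] / SRAM_BUDGET_KB

print(f"accuracy gain: {accuracy_gain:.1f} points")
print(f"SRAM overhead: {sram_overhead_kb:.1f} KB ({sram_ratio:.2f}x)")
print(f"flash ratio:   {flash_ratio:.2f}x")
print(f"fits budget:   {fits_budget} ({budget_used:.0%} of {SRAM_BUDGET_KB:.0f} KB)")
```

With the table's numbers, the update path adds about 40 KB of SRAM (roughly 1.5x inference-only) and stays under a quarter of a 512 KB budget.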

### Summary & Take-away Messages

**S1.** TinyTTA enables efficient, batch-agnostic, and robust on-device TTA for the first time

**T1.** The self-ensemble framework and early-exit policy are effective in ensuring high TTA accuracy

**T2.** TinyTTA Engine enables TTA for diverse MCU applications



Any questions? You can find me at: hong.jia@unimelb.edu.au h-jia.github.io




# Thank you!



