GenConViT ED (Encoder-Decoder) - ONNX
ONNX conversion of the Encoder-Decoder (ED) network from GenConViT: Generative Convolutional Vision Transformer for Deepfake Video Detection.
Converted from the official PyTorch weights released by the authors at erprogs/GenConViT.
Model Description
GenConViT is a hybrid architecture for deepfake video detection that combines:
- A CNN Encoder-Decoder that learns to reconstruct the input face image
- A ConvNeXt-Tiny + Swin Transformer-Tiny backbone (via hybrid patch embedding) that extracts features from both the reconstructed and original images
- A classification head that concatenates both feature vectors and outputs a binary REAL/FAKE prediction
The ED variant is one of two independent networks in the full GenConViT framework (the other being a VAE variant). It processes an input face image through two parallel paths using shared backbone weights, producing a 2-class logit output.
Architecture
Input (B, 3, 224, 224)
  |
  +---> Encoder (5x Conv2d+ReLU+MaxPool) ---> (B, 256, 7, 7)
  |       |
  |       v
  |     Decoder (5x ConvTranspose2d+ReLU) ---> Reconstructed Image (B, 3, 224, 224)
  |       |
  |       v
  |     ConvNeXt+Swin Backbone ---> 1000-dim features (from reconstruction)
  |       |
  +---> ConvNeXt+Swin Backbone ---> 1000-dim features (from original)
          |
          v
        Concatenate ---> 2000-dim
          |
          v
        FC(2000, 500) + GELU + FC(500, 2) ---> Output logits (B, 2)
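To make the shapes in the diagram concrete, here is a minimal NumPy sketch of the encoder's spatial downsampling and of the classification head. It is an illustration only: it assumes each of the five encoder stages halves the spatial size, uses random stand-in weights rather than the released ones, and uses the tanh approximation of GELU. The names `encoder_spatial_size`, `gelu`, and `head_forward` are hypothetical and not part of the model's API.

```python
import numpy as np


def encoder_spatial_size(size: int = 224, stages: int = 5) -> int:
    """Spatial size after `stages` poolings, assuming each one halves the size."""
    for _ in range(stages):
        size //= 2
    return size


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU (assumption: the real model may use the exact form)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def head_forward(feat_recon, feat_orig, w1, b1, w2, b2):
    """Concatenate two 1000-dim feature vectors, then apply FC -> GELU -> FC."""
    x = np.concatenate([feat_recon, feat_orig], axis=1)  # (B, 2000)
    return gelu(x @ w1 + b1) @ w2 + b2                   # (B, 2)


rng = np.random.default_rng(0)
B = 4
feat_recon = rng.standard_normal((B, 1000)).astype(np.float32)
feat_orig = rng.standard_normal((B, 1000)).astype(np.float32)

# Random stand-in weights, used only to check shapes
w1 = (rng.standard_normal((2000, 500)) * 0.01).astype(np.float32)
b1 = np.zeros(500, dtype=np.float32)
w2 = (rng.standard_normal((500, 2)) * 0.01).astype(np.float32)
b2 = np.zeros(2, dtype=np.float32)

logits = head_forward(feat_recon, feat_orig, w1, b1, w2, b2)
print(encoder_spatial_size())  # 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(logits.shape)            # (4, 2)
```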
Key Details
| Property | Value |
|---|---|
| Input | RGB image, 224x224, ImageNet-normalized |
| Output | 2 logits: [fake_score, real_score] (index 0 = FAKE, index 1 = REAL) |
| Parameters | ~59.5M unique (FP32) |
| ONNX Opset | 18 |
| File Size | ~117 MB |
| Dynamic Batch | Yes |
| Backbone | ConvNeXt-Tiny |
| Embedder | Swin Transformer-Tiny (as hybrid patch embedding) |
Conversion Fidelity
Numerical comparison against the original PyTorch model across 100 random inputs:
| Metric | Value |
|---|---|
| Max Absolute Error | 1.15e-05 |
| Mean Absolute Error | 4.31e-06 |
| Mean Relative Error | 0.015% |
| Classification Agreement | 100% |
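With errors on the order of 1e-5, consistent with FP32 rounding, and identical predicted classes on every test input, the ONNX model reproduces the outputs of the original PyTorch model.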
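The exact metric definitions behind the table are not stated, so the following is one plausible way to compute them, sketched with NumPy. The function name `fidelity_metrics` and the `eps` guard are assumptions, and the arrays are synthetic stand-ins for the PyTorch and ONNX Runtime outputs.

```python
import numpy as np


def fidelity_metrics(reference: np.ndarray, candidate: np.ndarray, eps: float = 1e-12) -> dict:
    """Compare two (N, 2) logit arrays, e.g. PyTorch outputs vs. ONNX Runtime outputs."""
    abs_err = np.abs(reference - candidate)
    # eps avoids division by zero when a reference logit is exactly 0
    rel_err = abs_err / (np.abs(reference) + eps)
    agree = np.argmax(reference, axis=1) == np.argmax(candidate, axis=1)
    return {
        "max_abs_err": float(abs_err.max()),
        "mean_abs_err": float(abs_err.mean()),
        "mean_rel_err_pct": float(rel_err.mean() * 100.0),
        "agreement_pct": float(agree.mean() * 100.0),
    }


# Synthetic stand-ins: a reference array plus a small perturbation
rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 2))
cand = ref + rng.uniform(-1e-5, 1e-5, size=ref.shape)

metrics = fidelity_metrics(ref, cand)
for name, value in metrics.items():
    print(f"{name}: {value:.3e}")
```

In a real check, `reference` would hold the PyTorch logits and `candidate` the ONNX Runtime logits for the same inputs.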
Usage
Installation
pip install onnxruntime numpy pillow
Inference on a Single Image
import numpy as np
import onnxruntime as ort
from PIL import Image

# ImageNet normalization constants
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)


def preprocess(image_path: str) -> np.ndarray:
    """Load and preprocess a face image for GenConViT."""
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    # HWC uint8 -> CHW float32 [0, 1] -> ImageNet normalized
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = np.transpose(arr, (2, 0, 1))[np.newaxis]  # (1, 3, 224, 224)
    return (arr - MEAN) / STD


def predict(session: ort.InferenceSession, image_path: str) -> tuple[str, float]:
    """Run prediction on a single face image. Returns (label, confidence)."""
    input_tensor = preprocess(image_path)
    logits = session.run(None, {"input": input_tensor})[0]  # (1, 2)
    scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    mean_scores = scores.mean(axis=0)
    pred_class = int(np.argmax(mean_scores))
    label = "FAKE" if pred_class == 0 else "REAL"
    confidence = float(mean_scores[pred_class])
    return label, confidence


# Load model
session = ort.InferenceSession("genconvit_ed_inference.onnx")

# Predict
label, confidence = predict(session, "face.jpg")
print(f"{label} (confidence: {confidence:.4f})")
Inference on Video Frames
import numpy as np
import onnxruntime as ort
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)


def preprocess_frames(face_images: list[np.ndarray]) -> np.ndarray:
    """Preprocess a list of cropped face images (HWC uint8 numpy arrays)."""
    batch = np.stack([
        np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))
        for img in face_images
    ])  # (N, 3, 224, 224)
    return (batch - MEAN) / STD


def predict_video(session: ort.InferenceSession, face_frames: list[np.ndarray]) -> tuple[str, float]:
    """
    Predict on a list of face crops extracted from video frames.
    Each face_frame should be a 224x224 RGB uint8 numpy array.
    """
    input_tensor = preprocess_frames(face_frames)

    # Run inference frame by frame (or batched if memory allows)
    all_scores = []
    for i in range(len(input_tensor)):
        logits = session.run(None, {"input": input_tensor[i:i+1]})[0]
        scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        all_scores.append(scores[0])

    all_scores = np.stack(all_scores)  # (N, 2)
    mean_scores = all_scores.mean(axis=0)
    pred_class = int(np.argmax(mean_scores))
    label = "FAKE" if pred_class == 0 else "REAL"
    confidence = float(mean_scores[pred_class])
    return label, confidence


# Example usage:
# session = ort.InferenceSession("genconvit_ed_inference.onnx")
# face_crops = [...]  # list of 224x224 RGB numpy arrays from face detection
# label, confidence = predict_video(session, face_crops)
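Because the model supports a dynamic batch dimension, the per-frame loop can be replaced by chunked batches when memory allows. The helper below is a minimal sketch of that idea, not part of the released code: `run_in_batches` and its `batch_size` default are hypothetical, and `fake_run` only stands in for the ONNX Runtime call so the example runs without the model file.

```python
from typing import Callable

import numpy as np


def run_in_batches(
    run_fn: Callable[[np.ndarray], np.ndarray],
    inputs: np.ndarray,
    batch_size: int = 16,
) -> np.ndarray:
    """Apply run_fn to consecutive chunks of `inputs` and concatenate the results."""
    if len(inputs) == 0:
        raise ValueError("inputs must contain at least one frame")
    outputs = [
        run_fn(inputs[start:start + batch_size])
        for start in range(0, len(inputs), batch_size)
    ]
    return np.concatenate(outputs, axis=0)


# With ONNX Runtime, run_fn would wrap the session, for example:
#   run_fn = lambda x: session.run(None, {"input": x})[0]
# Stand-in that returns an (n, 2) array so the sketch runs without the model:
def fake_run(x: np.ndarray) -> np.ndarray:
    return x.reshape(len(x), -1)[:, :2]


frames = np.zeros((37, 3, 224, 224), dtype=np.float32)
logits = run_in_batches(fake_run, frames, batch_size=16)
print(logits.shape)  # (37, 2): chunks of 16, 16, and 5 frames
```

A smaller `batch_size` trades throughput for lower peak memory; the best value depends on the hardware and execution provider.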
Preprocessing Requirements
The model expects cropped face images, not raw frames. You must run face detection before inference:
- Extract frames from video
- Detect and crop faces (e.g., using face_recognition, dlib, mediapipe, or any face detector)
- Resize each face crop to 224x224 RGB
- Normalize with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
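The crop-and-resize step can be sketched as follows, assuming the face detector returns a pixel bounding box as `(left, top, right, bottom)`. The function `crop_face` and the 20% margin are illustrative assumptions, not the cropping used by the original authors; a synthetic frame stands in for a real video frame.

```python
import numpy as np
from PIL import Image


def crop_face(
    frame: Image.Image,
    box: tuple[int, int, int, int],
    margin: float = 0.2,
    size: int = 224,
) -> np.ndarray:
    """Crop a face box (left, top, right, bottom) with a margin and resize it.

    Returns an HWC uint8 array, the format expected by preprocess_frames.
    """
    left, top, right, bottom = box
    pad_w = int((right - left) * margin)
    pad_h = int((bottom - top) * margin)
    # Expand the box by the margin, clamped to the frame boundaries
    left = max(0, left - pad_w)
    top = max(0, top - pad_h)
    right = min(frame.width, right + pad_w)
    bottom = min(frame.height, bottom + pad_h)
    face = frame.convert("RGB").crop((left, top, right, bottom)).resize((size, size))
    return np.asarray(face, dtype=np.uint8)


# Synthetic frame and a made-up bounding box, in place of a real detector output
frame = Image.new("RGB", (640, 480), color=(128, 64, 32))
face = crop_face(frame, (200, 100, 400, 300))
print(face.shape, face.dtype)  # (224, 224, 3) uint8
```

Normalization is left to `preprocess_frames`, so the crop stays in plain uint8 RGB.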
Output Interpretation
The model outputs 2 raw logits: [score_0, score_1].
After applying sigmoid and averaging across frames:
- argmax == 0 (score_0 > score_1) -> FAKE
- argmax == 1 (score_1 > score_0) -> REAL
This matches the original GenConViT code, where output index 0 corresponds to FAKE and index 1 to REAL. The original code reaches this mapping by flipping the argmax with prediction ^ 1 before looking up the label; the logic above already accounts for that flip.
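A small worked example of this rule, using made-up logits for three frames, is sketched below. The `aggregate` helper and the `LABELS` mapping are illustrative names that mirror the logic in the usage code above.

```python
import numpy as np

LABELS = {0: "FAKE", 1: "REAL"}  # index -> label, as described above


def aggregate(frame_logits: np.ndarray) -> tuple[str, float]:
    """Sigmoid each logit, average over frames, and pick the larger mean score."""
    scores = 1.0 / (1.0 + np.exp(-frame_logits))  # (N, 2)
    mean_scores = scores.mean(axis=0)             # (2,)
    pred = int(np.argmax(mean_scores))
    return LABELS[pred], float(mean_scores[pred])


# Made-up logits for three frames: [score_0, score_1] per row
frame_logits = np.array([
    [2.0, -1.0],
    [1.5, 0.5],
    [-0.5, 0.0],
])

label, confidence = aggregate(frame_logits)
print(label, round(confidence, 3))  # FAKE 0.692
```

Note that the two sigmoid scores are computed independently and need not sum to 1, so the returned confidence is a mean score rather than a calibrated probability.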
Training Data and Performance
The original model was trained and evaluated on:
| Dataset | Accuracy | AUC |
|---|---|---|
| DFDC | - | - |
| FaceForensics++ | - | - |
| Celeb-DF v2 | - | - |
| DeepfakeTIMIT | - | - |
Average across datasets: 95.8% accuracy, 99.3% AUC (as reported in the paper for the full GenConViT ensemble). Individual ED network results may differ.
Citation
@article{wodajo2023genconvit,
title={Deepfake Video Detection Using Generative Convolutional Vision Transformer},
author={Wodajo, Deressa and Mareen, Hannes and Lambert, Peter and Atnafu, Solomon and Akhtar, Zahid and Van Wallendael, Glenn},
journal={Applied Sciences},
volume={15},
number={12},
pages={6622},
year={2025},
publisher={MDPI},
doi={10.3390/app15126622}
}
Acknowledgements
- Original model and training by Deressa Wodajo et al.
- Conversion to ONNX by Pranjal Pravesh