GenConViT ED (Encoder-Decoder) - ONNX

ONNX conversion of the Encoder-Decoder (ED) network from GenConViT: Generative Convolutional Vision Transformer for Deepfake Video Detection.

Converted from the official PyTorch weights released by the authors at erprogs/GenConViT.

Model Description

GenConViT is a hybrid architecture for deepfake video detection that combines:

  • A CNN Encoder-Decoder that learns to reconstruct the input face image
  • A ConvNeXt-Tiny + Swin Transformer-Tiny backbone (via hybrid patch embedding) that extracts features from both the reconstructed and original images
  • A classification head that concatenates both feature vectors and outputs a binary REAL/FAKE prediction

The ED variant is one of two independent networks in the full GenConViT framework (the other being a VAE variant). It processes an input face image through two parallel paths using shared backbone weights, producing a 2-class logit output.

Architecture

Input (B, 3, 224, 224)
  |
  +---> Encoder (5x Conv2d+ReLU+MaxPool) ---> (B, 256, 7, 7)
  |       |
  |       v
  |     Decoder (5x ConvTranspose2d+ReLU) ---> Reconstructed Image (B, 3, 224, 224)
  |       |
  |       v
  |     ConvNeXt+Swin Backbone ---> 1000-dim features (from reconstruction)
  |
  +---> ConvNeXt+Swin Backbone ---> 1000-dim features (from original)
          |
          v
        Concatenate ---> 2000-dim
          |
          v
        FC(2000, 500) + GELU + FC(500, 2) ---> Output logits (B, 2)
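
The spatial sizes in the diagram can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each encoder stage halves the side length (stride-2 max pooling) and each decoder stage doubles it (stride-2 transposed convolution). The 224 -> 7 -> 224 shapes are consistent with that assumption, but the diagram does not state the strides, and `encoder_decoder_sizes` is an illustrative helper, not part of the model code.

```python
def encoder_decoder_sizes(size: int = 224, stages: int = 5) -> tuple[list[int], list[int]]:
    """Trace the spatial side length through the encoder and the decoder.

    Assumption: every encoder stage halves the side length and every
    decoder stage doubles it (stride 2 in both directions).
    """
    down = [size]
    for _ in range(stages):
        down.append(down[-1] // 2)
    up = [down[-1]]
    for _ in range(stages):
        up.append(up[-1] * 2)
    return down, up


down, up = encoder_decoder_sizes()
print(down)  # [224, 112, 56, 28, 14, 7]
print(up)    # [7, 14, 28, 56, 112, 224]

# The classifier sees two 1000-dim feature vectors concatenated.
head_in = 1000 + 1000
print(head_in)  # 2000
```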

Key Details

Property        Value
Input           RGB image, 224x224, ImageNet-normalized
Output          2 logits: [score_0, score_1] (index 0 = FAKE, index 1 = REAL)
Parameters      ~59.5M unique (FP32)
ONNX Opset      18
File Size       ~117 MB
Dynamic Batch   Yes
Backbone        ConvNeXt-Tiny
Embedder        Swin Transformer-Tiny (as hybrid patch embedding)
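
To confirm these properties on a downloaded file, the input and output metadata can be read from an ONNX Runtime session. The helper below is a small sketch. `describe_io` is a hypothetical name, and it relies only on `get_inputs()` / `get_outputs()` and the `name`, `shape` and `type` attributes of the returned entries.

```python
def describe_io(session) -> dict:
    """Collect name, shape and dtype for every input and output of a session."""
    def as_dict(arg):
        return {"name": arg.name, "shape": list(arg.shape), "type": arg.type}

    return {
        "inputs": [as_dict(a) for a in session.get_inputs()],
        "outputs": [as_dict(a) for a in session.get_outputs()],
    }


# Example (requires the model file):
# import onnxruntime as ort
# session = ort.InferenceSession("genconvit_ed_inference.onnx")
# print(describe_io(session))
```

With a dynamic batch axis, the first entry of each shape is typically reported as a symbolic name or left unset rather than as a fixed integer.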

Conversion Fidelity

Numerical comparison against the original PyTorch model across 100 random inputs:

Metric                     Value
Max Absolute Error         1.15e-05
Mean Absolute Error        4.31e-06
Mean Relative Error        0.015%
Classification Agreement   100%

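At these error levels (maximum absolute difference on the order of 1e-05 in FP32), the ONNX model reproduces the outputs of the original PyTorch model, and the predicted class matched on all 100 inputs.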
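
For readers who want to repeat a comparison like this, the sketch below computes the same four quantities from two sets of logits. It is written in plain Python so it runs without the model. The exact definitions used for the numbers above are not stated, so the formulas here (in particular the relative error and the small `eps` guard) are assumptions, and `fidelity_metrics` is a hypothetical helper.

```python
def fidelity_metrics(reference, converted, eps=1e-12):
    """Compare two equally shaped lists of logit rows (e.g. PyTorch vs ONNX)."""
    abs_errs, rel_errs, agree = [], [], 0
    for ref_row, conv_row in zip(reference, converted):
        for r, c in zip(ref_row, conv_row):
            abs_errs.append(abs(r - c))
            # Assumed definition: error relative to the reference magnitude
            rel_errs.append(abs(r - c) / (abs(r) + eps))
        # Rows agree when the index of the largest logit is the same
        ref_cls = max(range(len(ref_row)), key=ref_row.__getitem__)
        conv_cls = max(range(len(conv_row)), key=conv_row.__getitem__)
        agree += ref_cls == conv_cls
    return {
        "max_abs": max(abs_errs),
        "mean_abs": sum(abs_errs) / len(abs_errs),
        "mean_rel": sum(rel_errs) / len(rel_errs),
        "agreement": agree / len(reference),
    }


# Toy values, not real model outputs
reference = [[1.0, -2.0], [0.5, 0.25]]
converted = [[1.00001, -2.0], [0.5, 0.25001]]
metrics = fidelity_metrics(reference, converted)
print(metrics["agreement"])  # 1.0
```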

Usage

Installation

pip install onnxruntime numpy pillow

Inference on a Single Image

import numpy as np
import onnxruntime as ort
from PIL import Image

# ImageNet normalization constants
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)

def preprocess(image_path: str) -> np.ndarray:
    """Load and preprocess a face image for GenConViT."""
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    # HWC uint8 -> CHW float32 [0, 1] -> ImageNet normalized
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = np.transpose(arr, (2, 0, 1))[np.newaxis]  # (1, 3, 224, 224)
    return (arr - MEAN) / STD

def predict(session: ort.InferenceSession, image_path: str) -> tuple[str, float]:
    """Run prediction on a single face image. Returns (label, confidence)."""
    input_tensor = preprocess(image_path)
    logits = session.run(None, {"input": input_tensor})[0]  # (1, 2)

    scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    mean_scores = scores.mean(axis=0)

    pred_class = int(np.argmax(mean_scores))
    label = "FAKE" if pred_class == 0 else "REAL"
    confidence = float(mean_scores[pred_class])
    return label, confidence

# Load model
session = ort.InferenceSession("genconvit_ed_inference.onnx")

# Predict
label, confidence = predict(session, "face.jpg")
print(f"{label} (confidence: {confidence:.4f})")

Inference on Video Frames

import numpy as np
import onnxruntime as ort
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)

def preprocess_frames(face_images: list[np.ndarray]) -> np.ndarray:
    """Preprocess a list of cropped face images (HWC uint8 numpy arrays)."""
    batch = np.stack([
        np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))
        for img in face_images
    ])  # (N, 3, 224, 224)
    return (batch - MEAN) / STD

def predict_video(session: ort.InferenceSession, face_frames: list[np.ndarray]) -> tuple[str, float]:
    """
    Predict on a list of face crops extracted from video frames.
    Each face_frame should be a 224x224 RGB uint8 numpy array.
    """
    input_tensor = preprocess_frames(face_frames)

    # Run inference frame by frame (or batched if memory allows)
    all_scores = []
    for i in range(len(input_tensor)):
        logits = session.run(None, {"input": input_tensor[i:i+1]})[0]
        scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        all_scores.append(scores[0])

    all_scores = np.stack(all_scores)  # (N, 2)
    mean_scores = all_scores.mean(axis=0)

    pred_class = int(np.argmax(mean_scores))
    label = "FAKE" if pred_class == 0 else "REAL"
    confidence = float(mean_scores[pred_class])
    return label, confidence

# Example usage:
# session = ort.InferenceSession("genconvit_ed_inference.onnx")
# face_crops = [...]  # list of 224x224 RGB numpy arrays from face detection
# label, confidence = predict_video(session, face_crops)
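
Because the model has a dynamic batch axis, the per-frame loop can be replaced by fixed-size chunks when memory allows. The helper below is one possible sketch. `run_in_batches` and `run_fn` are hypothetical names, and the default batch size of 8 is an arbitrary choice rather than a recommendation from the authors.

```python
def run_in_batches(run_fn, inputs, batch_size=8):
    """Call run_fn on consecutive slices of inputs and collect the output rows.

    run_fn is any callable that maps a slice of N inputs to N output rows.
    inputs only needs to support len() and slicing (list or NumPy array).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    rows = []
    for start in range(0, len(inputs), batch_size):
        chunk = inputs[start:start + batch_size]
        rows.extend(run_fn(chunk))
    return rows


# With ONNX Runtime, run_fn could wrap session.run:
# logits = run_in_batches(
#     lambda x: session.run(None, {"input": x})[0],
#     input_tensor,
#     batch_size=16,
# )
# all_scores = 1.0 / (1.0 + np.exp(-np.stack(logits)))  # (N, 2)
```

The last chunk may be smaller than `batch_size`; slicing handles that case without padding.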

Preprocessing Requirements

The model expects cropped face images, not raw frames. You must run face detection before inference:

  1. Extract frames from video
  2. Detect and crop faces (e.g., using face_recognition, dlib, mediapipe, or any face detector)
  3. Resize each face crop to 224x224 RGB
  4. Normalize with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
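
Steps 1 and 2 depend on the video reader and face detector you choose, but two detector-independent pieces can be sketched in plain Python: picking evenly spaced frames and padding a detected face box before cropping. Both helpers are hypothetical, and the 20% margin is an illustrative assumption, not a value taken from the original GenConViT pipeline.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def expand_box(box, frame_w, frame_h, margin=0.2):
    """Grow a (left, top, right, bottom) face box by a margin, clamped to the frame."""
    left, top, right, bottom = box
    dw = int((right - left) * margin)
    dh = int((bottom - top) * margin)
    return (
        max(0, left - dw),
        max(0, top - dh),
        min(frame_w, right + dw),
        min(frame_h, bottom + dh),
    )


print(sample_frame_indices(100, 5))                # [0, 20, 40, 60, 80]
print(expand_box((100, 50, 200, 150), 640, 480))   # (80, 30, 220, 170)
```

The resulting box can be passed to the crop call of your image library, after which steps 3 and 4 apply unchanged.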

Output Interpretation

The model outputs 2 raw logits: [score_0, score_1].

After applying sigmoid and averaging across frames:

  • argmax == 0 (score_0 > score_1) -> FAKE
  • argmax == 1 (score_1 > score_0) -> REAL

This matches the original GenConViT code, which takes the argmax of the averaged sigmoid scores and then flips it via prediction ^ 1 before looking up the label. The mapping above already includes that flip, so index 0 means FAKE and index 1 means REAL; do not flip the prediction a second time.
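
The decision rule can be traced by hand with a small dependency-free sketch. The logits below are made-up numbers, and `decide` is a hypothetical helper that mirrors the sigmoid, average, argmax sequence described above. The two sigmoid scores are computed independently and do not sum to 1, so the returned confidence is a score rather than a calibrated probability.

```python
import math

LABELS = {0: "FAKE", 1: "REAL"}  # index-to-label mapping stated above


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decide(frame_logits: list[list[float]]) -> tuple[str, float]:
    """Sigmoid each logit, average per class across frames, then take the argmax."""
    n = len(frame_logits)
    mean_scores = [
        sum(sigmoid(row[k]) for row in frame_logits) / n
        for k in range(2)
    ]
    # Ties resolve to index 0, the same as np.argmax
    pred = 0 if mean_scores[0] >= mean_scores[1] else 1
    return LABELS[pred], mean_scores[pred]


# Two frames with made-up logits [score_0, score_1]
label, confidence = decide([[2.0, -1.0], [0.5, 0.0]])
print(label, round(confidence, 2))  # FAKE 0.75
```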

Training Data and Performance

The original model was trained and evaluated on:

Dataset           Accuracy   AUC
DFDC              -          -
FaceForensics++   -          -
Celeb-DF v2       -          -
DeepfakeTIMIT     -          -

Average across datasets: 95.8% accuracy, 99.3% AUC (as reported in the paper for the full GenConViT ensemble). Individual ED network results may differ.

Citation

@article{wodajo2023genconvit,
    title={Deepfake Video Detection Using Generative Convolutional Vision Transformer},
    author={Wodajo, Deressa and Mareen, Hannes and Lambert, Peter and Atnafu, Solomon and Akhtar, Zahid and Van Wallendael, Glenn},
    journal={Applied Sciences},
    volume={15},
    number={12},
    pages={6622},
    year={2025},
    publisher={MDPI},
    doi={10.3390/app15126622}
}

Acknowledgements
