KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

- Through the `from_preset()` constructor, which instantiates an object with a pre-trained configuration and (optionally) weights. Available preset names are listed on this page.

```python
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```

- Through custom configuration controlled by the user, by passing the desired configuration parameters to the default constructor of the model in question.

```python
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)
```
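A custom-configured backbone like the one above is an ordinary Keras model, so it can be sanity-checked by passing a dummy batch through it. The sketch below is illustrative: the input size and variable names are assumptions, not part of the API.

```python
import numpy as np
import keras_cv

# Same custom configuration as above.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)

# A dummy batch of four 224x224 RGB images. Values are already in [0, 1],
# as required when include_rescaling=False.
images = np.random.uniform(size=(4, 224, 224, 3)).astype("float32")

# Calling the backbone returns its final feature map.
features = backbone(images)
print(features.shape)
```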
Each of the following preset names corresponds to a configuration and weights for a backbone model. The names below can be used with the `from_preset()` constructor for the corresponding backbone model.
```python
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
```
For brevity, we do not include the presets without pretrained weights in the following table.
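If you want the full list programmatically, KerasCV backbone classes expose `presets` and `presets_with_weights` class properties; a minimal sketch (verify the exact attribute names against your installed version):

```python
import keras_cv

# All presets known to ResNetBackbone, including those without weights.
print(keras_cv.models.ResNetBackbone.presets.keys())

# Only the presets that ship with pretrained weights.
print(keras_cv.models.ResNetBackbone.presets_with_weights.keys())
```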
Note: All pretrained weights should be used with unnormalized pixel intensities in the range `[0, 255]` if `include_rescaling=True`, or with pixel intensities in the range `[0, 1]` if `include_rescaling=False`.
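In practice the only preprocessing decision is whether to divide by 255 yourself. The helper below is hypothetical, included only to make the two cases concrete:

```python
import numpy as np

def prepare_pixels(images, include_rescaling):
    """Hypothetical helper: match the pixel range to the include_rescaling flag.

    images: float array with raw intensities in [0, 255].
    """
    # include_rescaling=True: the model rescales internally, so pass [0, 255].
    # include_rescaling=False: the model expects inputs already in [0, 1].
    return images if include_rescaling else images / 255.0

images = np.random.randint(0, 256, size=(4, 224, 224, 3)).astype("float32")
inputs = prepare_pixels(images, include_rescaling=False)  # now in [0, 1]
```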
Preset name | Model | Parameters | Description |
---|---|---|---|
csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
csp_darknet_tiny | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
csp_darknet_s | CSPDarkNet | 4.22M | CSPDarkNet model with [64, 128, 256, 512] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
csp_darknet_m | CSPDarkNet | 12.37M | CSPDarkNet model with [96, 192, 384, 768] channels and [2, 6, 6, 2] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
csp_darknet_l | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
csp_darknet_xl | CSPDarkNet | 56.84M | CSPDarkNet model with [170, 340, 680, 1360] channels and [4, 12, 12, 4] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
densenet121_imagenet | Unknown | Unknown | DenseNet model with 121 layers. Trained on Imagenet 2012 classification task. |
densenet169_imagenet | Unknown | Unknown | DenseNet model with 169 layers. Trained on Imagenet 2012 classification task. |
densenet201_imagenet | Unknown | Unknown | DenseNet model with 201 layers. Trained on Imagenet 2012 classification task. |
densenet121 | Unknown | Unknown | DenseNet model with 121 layers. |
densenet169 | Unknown | Unknown | DenseNet model with 169 layers. |
densenet201 | Unknown | Unknown | DenseNet model with 201 layers. |
efficientnetlite_b0 | EfficientNetLite | 3.41M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
efficientnetlite_b1 | EfficientNetLite | 4.19M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
efficientnetlite_b2 | EfficientNetLite | 4.87M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
efficientnetlite_b3 | EfficientNetLite | 6.99M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
efficientnetlite_b4 | EfficientNetLite | 11.84M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
efficientnetv1_b0 | EfficientNetV1 | 4.05M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
efficientnetv1_b1 | EfficientNetV1 | 6.58M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
efficientnetv1_b2 | EfficientNetV1 | 7.77M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
efficientnetv1_b3 | EfficientNetV1 | 10.79M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
efficientnetv1_b4 | EfficientNetV1 | 17.68M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
efficientnetv1_b5 | EfficientNetV1 | 28.52M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.6 and depth_coefficient=2.2. |
efficientnetv1_b6 | EfficientNetV1 | 40.97M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.8 and depth_coefficient=2.6. |
efficientnetv1_b7 | EfficientNetV1 | 64.11M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=2.0 and depth_coefficient=3.1. |
efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet. |
efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet. |
efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet. |
efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet. |
efficientnetv2_s | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. |
efficientnetv2_m | EfficientNetV2 | 53.15M | EfficientNet architecture with 7 convolutional blocks. |
efficientnetv2_l | EfficientNetV2 | 117.75M | EfficientNet architecture with 7 convolutional blocks, but more filters than in efficientnetv2_m. |
efficientnetv2_b0 | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
efficientnetv2_b1 | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
efficientnetv2_b2 | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
efficientnetv2_b3 | EfficientNetV2 | 12.93M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set. |
mit_b0 | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. |
mit_b1 | MiT | 13.16M | MiT (MixTransformer) model with 8 transformer blocks. |
mit_b2 | MiT | 24.20M | MiT (MixTransformer) model with 16 transformer blocks. |
mit_b3 | MiT | 44.08M | MiT (MixTransformer) model with 28 transformer blocks. |
mit_b4 | MiT | 60.85M | MiT (MixTransformer) model with 41 transformer blocks. |
mit_b5 | MiT | 81.45M | MiT (MixTransformer) model with 52 transformer blocks. |
mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
mobilenet_v3_small | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
mobilenet_v3_large | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on Imagenet 2012 classification task. |
resnet18 | ResNetV1 | 11.19M | ResNet model with 18 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
resnet34 | ResNetV1 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
resnet50 | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
resnet101 | ResNetV1 | 42.61M | ResNet model with 101 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
resnet152 | ResNetV1 | 58.30M | ResNet model with 152 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task. |
resnet18_v2 | ResNetV2 | 11.18M | ResNet model with 18 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
resnet34_v2 | ResNetV2 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
resnet50_v2 | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
resnet101_v2 | ResNetV2 | 42.63M | ResNet model with 101 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
resnet152_v2 | ResNetV2 | 58.33M | ResNet model with 152 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and trained on the Kinetics 400 dataset. |
videoswin_tiny | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. |
videoswin_small | VideoSwinS | 49.51M | A small Video Swin backbone architecture. |
videoswin_base | VideoSwinB | 87.64M | A base Video Swin backbone architecture. |
videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 82.7% top-1 and 95.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and trained on the Kinetics 600 dataset. Published weights are capable of scoring 84.0% top-1 and 96.5% top-5 accuracy on the Kinetics 600 dataset. |
videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the Kinetics 400 dataset and trained on the Something Something V2 dataset. Published weights are capable of scoring 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_base | VitDet | 89.67M | Detectron2 ViT backbone with 12 transformer encoders, embed dim 768, 12-head attention layers, and global attention on encoders 2, 5, 8, and 11. |
vitdet_large | VitDet | 308.28M | Detectron2 ViT backbone with 24 transformer encoders, embed dim 1024, 16-head attention layers, and global attention on encoders 5, 11, 17, and 23. |
vitdet_huge | VitDet | 637.03M | Detectron2 ViT backbone with 32 transformer encoders, embed dim 1280, 16-head attention layers, and global attention on encoders 7, 15, 23, and 31. |
yolo_v8_xs_backbone | YOLOV8 | 1.28M | An extra small YOLOV8 backbone. |
yolo_v8_s_backbone | YOLOV8 | 5.09M | A small YOLOV8 backbone. |
yolo_v8_m_backbone | YOLOV8 | 11.87M | A medium YOLOV8 backbone. |
yolo_v8_l_backbone | YOLOV8 | 19.83M | A large YOLOV8 backbone. |
yolo_v8_xl_backbone | YOLOV8 | 30.97M | An extra large YOLOV8 backbone. |
yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
center_pillar_waymo_open_dataset | Unknown | 1.28M | An example CenterPillar backbone for WOD. |
Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired. The names below can be used with the `from_preset()` constructor for the corresponding task models.
```python
object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
```
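As an illustration of what "application-ready" means, the detector above can be run directly on a batch of images. In the sketch below, the input size and the exact structure of the prediction dictionary are assumptions to verify against your KerasCV version:

```python
import numpy as np

# A dummy batch of two 512x512 RGB images with raw [0, 255] intensities
# (the spatial size here is an illustrative assumption).
images = np.random.randint(0, 256, size=(2, 512, 512, 3)).astype("float32")

# Returns decoded detections; in recent KerasCV versions this is a dict
# with keys such as "boxes", "confidence", "classes", and "num_detections".
predictions = object_detector.predict(images)
```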
Note that all backbone presets are also applicable to the tasks. For example, you can directly use a ResNetBackbone preset with the RetinaNet task. In this case, fine-tuning is necessary since task-specific layers will be randomly initialized.
```python
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```
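Since the detection head starts from random weights, a training pass is required before the model is useful. Below is a minimal fine-tuning sketch; the dummy data, class count, and loss identifiers follow the KerasCV object detection guides but should be verified against your installed version:

```python
import numpy as np
import keras_cv

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)

# "focal" and "smoothl1" are the built-in loss identifiers used in the
# KerasCV object detection guides.
model.compile(
    optimizer="adam",
    classification_loss="focal",
    box_loss="smoothl1",
)

# Dummy fine-tuning data (illustrative): raw images plus bounding boxes
# in "xywh" format with integer class IDs.
images = np.random.uniform(0, 255, size=(4, 512, 512, 3)).astype("float32")
labels = {
    "boxes": np.array([[[10.0, 10.0, 100.0, 100.0]]] * 4, dtype="float32"),
    "classes": np.array([[0]] * 4, dtype="float32"),
}
model.fit(images, labels, epochs=1)
```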
For brevity, we do not include the backbone presets in the following table.
Note: All pretrained weights should be used with unnormalized pixel intensities in the range `[0, 255]` if `include_rescaling=True`, or with pixel intensities in the range `[0, 1]` if `include_rescaling=False`.
{{task_presets_table}}