• Vision transformers (ViT), Google 2020
    • Uses learned 1D positional embeddings (not sinusoidal). The paper found 2D-aware position embeddings don't add much.
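A minimal sketch of how those 1D positional embeddings enter the model, assuming a toy 32x32 image, 4x4 patches, and a 64-dim embedding — all sizes and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy input: one 32x32 RGB image
img = rng.standard_normal((32, 32, 3))
patch, d_model = 4, 64
n = (32 // patch) ** 2                       # 64 patches

# cut the image into non-overlapping patches and flatten each one
patches = img.reshape(32 // patch, patch, 32 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, -1)   # (64, 48)

W = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # linear projection
pos = rng.standard_normal((n, d_model)) * 0.02               # learned 1D table,
                                                             # one row per patch index
tokens = patches @ W + pos                   # what the transformer encoder sees
print(tokens.shape)                          # (64, 64)
```

The positional table is just a trainable `(n_patches, d_model)` matrix added to the patch embeddings; nothing in it encodes the 2D grid explicitly, which is the point of the note above.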
  • Foundations
    • localization - not just classification; also predict the object's bounding box
      • how: the network additionally outputs x, y, h, w alongside the class scores; the loss can use MSE on these box components
    • landmark detection - annotate a fixed number of (x, y) points on an image
      • e.g. points outlining the eyes/facial features, or the joints of a pose - these must be laboriously hand-labeled
    • detection - any number of objects
      • naive approach: slide windows of different sizes across the image - inefficient, since overlapping crops repeat computation
      • better: implement the sliding window convolutionally, reusing the CNN's structure and shared computation
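To make the localization recipe above concrete, here is a hedged numpy sketch of the combined loss: cross-entropy on the class scores plus MSE on the (x, y, h, w) box, as the notes suggest. The output-vector layout and the function name are assumptions for illustration, not a standard API:

```python
import numpy as np

def localization_loss(pred, target, n_classes):
    """pred/target: 1D arrays laid out as [class scores..., x, y, h, w] (assumed layout)."""
    cls_pred, box_pred = pred[:n_classes], pred[n_classes:]
    cls_true, box_true = target[:n_classes], target[n_classes:]
    # softmax cross-entropy on the class part
    z = cls_pred - cls_pred.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -(cls_true * log_probs).sum()
    # plain MSE on the box coordinates, per the notes
    mse = ((box_pred - box_true) ** 2).mean()
    return ce + mse

# one prediction: 3 class scores followed by a box
pred   = np.array([2.0, 0.1, -1.0, 0.5, 0.5, 0.2, 0.3])
target = np.array([1.0, 0.0,  0.0, 0.4, 0.6, 0.2, 0.3])
loss = localization_loss(pred, target, n_classes=3)
print(loss)
```

Real detectors usually replace the MSE term with IoU-based or smooth-L1 losses, but MSE is the simplest version of the idea in the bullet above.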
  • Seminal models
    • lenet: earliest practical cnn (LeCun, 1998; digit recognition)
    • alexnet: much deeper cnn; won ImageNet 2012 and popularized ReLU and dropout
    • vgg-16: bigger but uniform architecture of stacked 3x3 convolutions
    • resnet: residual (skip) connections make very deep networks trainable
    • inception: modules with parallel convolution branches of several filter sizes
    • mobilenet: depthwise separable convolutions for efficient mobile inference
    • efficientnet: compound scaling of depth, width, and input resolution
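As a concrete illustration of the resnet idea above, a toy residual block in numpy (fully connected rather than convolutional, purely to keep the sketch short): the input is added back to the block's output, so the stacked layers only need to learn a residual correction, and the identity path keeps gradients flowing through deep stacks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # two-layer transform F(x), then the skip connection adds x back
    fx = relu(x @ w1) @ w2
    return relu(x + fx)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1

out = residual_block(x, w1, w2)

# with zero weights the block collapses to the identity (plus relu),
# which is why deep stacks of these blocks remain easy to optimize
same = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
print(np.allclose(same, relu(x)))          # True
```

The real resnet block uses convolutions plus batch norm, but the skip-connection structure is exactly this `x + F(x)` pattern.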