• Vision transformers (ViT), Google 2020
    • Uses learned 1D positional embeddings (not sinusoidal). The paper found 2D-aware position embeddings don't add much.
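A minimal sketch of how those 1D positional embeddings enter the model, assuming a toy 32x32 image, 4x4 patches, and a 64-dim embedding — all sizes and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy input: one 32x32 RGB image
img = rng.standard_normal((32, 32, 3))
patch, d_model = 4, 64
n = (32 // patch) ** 2                       # 64 patches

# cut the image into non-overlapping patches and flatten each one
patches = img.reshape(32 // patch, patch, 32 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, -1)   # (64, 48)

W = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # linear projection
pos = rng.standard_normal((n, d_model)) * 0.02               # learned 1D table,
                                                             # one row per patch index
tokens = patches @ W + pos                   # what the transformer encoder sees
print(tokens.shape)                          # (64, 64)
```

The positional table is just a trainable `(n_patches, d_model)` matrix added to the patch embeddings; nothing in it encodes the 2D grid explicitly, which is the point of the note above.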
  • Foundations
    • localization - not just classification; also predict the object's bounding box
      • how: the network additionally outputs x, y, h, w alongside the class scores; the loss can use MSE on these box components
    • landmark detection - annotate a fixed number of (x, y) points on an image
      • e.g. points outlining the eyes/facial features, or the joints of a pose - these must be laboriously hand-labeled
    • detection - any number of objects
      • naive approach: slide windows of different sizes across the image - inefficient, since overlapping crops repeat computation
      • better: implement the sliding window convolutionally, reusing the CNN's structure and shared computation
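To make the localization recipe above concrete, here is a hedged numpy sketch of the combined loss: cross-entropy on the class scores plus MSE on the (x, y, h, w) box, as the notes suggest. The output-vector layout and the function name are assumptions for illustration, not a standard API:

```python
import numpy as np

def localization_loss(pred, target, n_classes):
    """pred/target: 1D arrays laid out as [class scores..., x, y, h, w] (assumed layout)."""
    cls_pred, box_pred = pred[:n_classes], pred[n_classes:]
    cls_true, box_true = target[:n_classes], target[n_classes:]
    # softmax cross-entropy on the class part
    z = cls_pred - cls_pred.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -(cls_true * log_probs).sum()
    # plain MSE on the box coordinates, per the notes
    mse = ((box_pred - box_true) ** 2).mean()
    return ce + mse

# one prediction: 3 class scores followed by a box
pred   = np.array([2.0, 0.1, -1.0, 0.5, 0.5, 0.2, 0.3])
target = np.array([1.0, 0.0,  0.0, 0.4, 0.6, 0.2, 0.3])
loss = localization_loss(pred, target, n_classes=3)
print(loss)
```

Real detectors usually replace the MSE term with IoU-based or smooth-L1 losses, but MSE is the simplest version of the idea in the bullet above.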
  • Seminal models
    • lenet: earliest practical cnn (LeCun, 1998; digit recognition)
    • alexnet: much deeper cnn; won ImageNet 2012 and popularized ReLU and dropout
    • vgg-16: bigger but uniform architecture of stacked 3x3 convolutions
    • resnet: residual (skip) connections make very deep networks trainable
    • inception: modules with parallel convolution branches of several filter sizes
    • mobilenet: depthwise separable convolutions for efficient mobile inference
    • efficientnet: compound scaling of depth, width, and input resolution
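As a concrete illustration of the resnet idea above, a toy residual block in numpy (fully connected rather than convolutional, purely to keep the sketch short): the input is added back to the block's output, so the stacked layers only need to learn a residual correction, and the identity path keeps gradients flowing through deep stacks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # two-layer transform F(x), then the skip connection adds x back
    fx = relu(x @ w1) @ w2
    return relu(x + fx)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1

out = residual_block(x, w1, w2)

# with zero weights the block collapses to the identity (plus relu),
# which is why deep stacks of these blocks remain easy to optimize
same = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
print(np.allclose(same, relu(x)))          # True
```

The real resnet block uses convolutions plus batch norm, but the skip-connection structure is exactly this `x + F(x)` pattern.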