Distribution | Notation | GLM Type | Link Function | MLE Loss (mean negative log-likelihood, constants dropped) |
---|---|---|---|---|
Gaussian | $N(\mu, \sigma^2)$ | Linear Regression | $g(\mu) = \mu$ | $L = \frac{1}{2n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$ |
Binomial | $B(n, p)$ | Logistic Regression | $g(p) = \log(\frac{p}{1-p})$ | $L = -\frac{1}{n}\sum_{i=1}^n [y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i)]$ |
Poisson | $Pois(\lambda)$ | Poisson Regression | $g(\lambda) = \log(\lambda)$ | $L = \frac{1}{n}\sum_{i=1}^n [\hat{\lambda}_i - y_i \log(\hat{\lambda}_i)]$ |
Multinomial | $Mult(n, p_1, ..., p_k)$ | Multinomial Logistic Regression | $g(p_j) = \log(\frac{p_j}{p_k})$ for j = 1, ..., k-1 | $L = -\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k y_{ij} \log(\hat{p}_{ij})$ |
Gamma | $Gamma(k, \theta)$ | Gamma Regression | $g(\mu) = \frac{1}{\mu}$ | $L = \frac{1}{n}\sum_{i=1}^n [\frac{y_i}{\hat{\mu}_i} + \log(\hat{\mu}_i)]$ |
Inverse Gaussian | $IG(\mu, \lambda)$ | Inverse Gaussian Regression | $g(\mu) = \frac{1}{\mu^2}$ | $L = \frac{1}{n}\sum_{i=1}^n [\frac{(y_i - \hat{\mu}_i)^2}{y_i \hat{\mu}_i^2}]$ |
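As a concrete check on two rows of the table, a minimal numpy sketch of the binomial and Poisson losses, assuming the fitted values come from applying the inverse of the canonical link to a linear predictor $X\beta$; the function and variable names are illustrative, not part of the notes:

```python
import numpy as np

def binomial_loss(y, X, beta):
    # logistic regression: inverse of the logit link is the sigmoid
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def poisson_loss(y, X, beta):
    # Poisson regression: inverse of the log link is exp; the log(y!) constant is dropped
    lam_hat = np.exp(X @ beta)
    return np.mean(lam_hat - y * np.log(lam_hat))
```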
various similarity measures can be used to compare items or users (cosine similarity and Pearson correlation are common)
item-item similarity (first published by Linden et al.)
predict a user's rating of an item as the weighted average of the user's ratings of other items, where the weights are item-item similarities:
score[usr,itm] = sum(sim(itm,itm2) * score[usr,itm2] for itm2 in itms) / sum(abs(sim(itm,itm2)) for itm2 in itms)
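A minimal sketch of this item-item prediction, assuming a dense user x item rating matrix R with 0 for unrated entries and cosine similarity between item columns (the similarity choice, the 0-means-unrated convention, and the function name are assumptions):

```python
import numpy as np

def item_item_predict(R, user, item):
    # cosine similarity between `item`'s column and every other item's column
    norms = np.linalg.norm(R, axis=0) + 1e-12
    sims = (R[:, item] @ R) / (norms[item] * norms)
    # weight the user's ratings of the items they have actually rated
    rated = (R[user] > 0) & (np.arange(R.shape[1]) != item)
    weights = sims[rated]
    return weights @ R[user, rated] / (np.abs(weights).sum() + 1e-12)
```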
user-user similarity
predict a user's rating of an item as the weighted average of other users' ratings of that item, where the weights are user-user similarities:
score[usr,itm] = sum(sim(usr,usr2) * score[usr2,itm] for usr2 in usrs) / sum(abs(sim(usr,usr2)) for usr2 in usrs)
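The user-user analogue, under the same assumptions (dense R, cosine similarity, 0 = unrated):

```python
import numpy as np

def user_user_predict(R, user, item):
    # cosine similarity between `user`'s row and every other user's row
    norms = np.linalg.norm(R, axis=1) + 1e-12
    sims = (R @ R[user]) / (norms * norms[user])
    # weight the ratings of users who actually rated `item`
    raters = (R[:, item] > 0) & (np.arange(R.shape[0]) != user)
    weights = sims[raters]
    return weights @ R[raters, item] / (np.abs(weights).sum() + 1e-12)
```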
evaluation metrics: RMSE is typical
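i.e., over a held-out test set $T$ of (user, item) pairs: $\text{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}$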
forward selection: add features one (or a few) at a time, keeping the additions that improve accuracy (sketch below)
backward elimination, aka recursive feature elimination, is the opposite of forward selection; e.g., this faster/sloppier variant of naive backward elimination (sketch below):
start with all features in the model; candidates (for removal) = all features
each iteration:
remove (from the model and from candidates) any candidate whose exclusion yields no acc drop
remove from candidates (but keep in the model) the candidate whose exclusion yields the biggest acc drop, since it is clearly needed
stop when no candidates remain
combination: add and remove features, e.g., forward selection while optionally removing a feature at each step
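A minimal sketch of greedy forward selection, referenced above; the estimator, cross-validated accuracy as the score, and the stopping rule are assumptions, not prescribed by the notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, feats):
    # cross-validated accuracy of a model restricted to the chosen feature indices
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, feats], y, cv=5).mean()

def forward_select(X, y):
    selected, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining:
        scores = {j: cv_score(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:          # stop when no candidate improves accuracy
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```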
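A sketch of the faster/sloppier backward-elimination variant described above: each pass drops, in one batch, every candidate whose exclusion costs no accuracy, and permanently stops re-testing the candidate whose exclusion hurts most. The estimator, scorer, and tolerance are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, feats):
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, sorted(feats)], y, cv=5).mean()

def sloppy_backward_eliminate(X, y, tol=0.0):
    model = set(range(X.shape[1]))    # features currently in the model
    candidates = set(model)           # features still eligible for removal
    while candidates and len(model) > 1:
        base = cv_score(X, y, model)
        drops = {j: base - cv_score(X, y, model - {j}) for j in candidates}
        free = {j for j, d in drops.items() if d <= tol}   # exclusion costs nothing: drop them all
        model -= free
        candidates -= free
        if candidates:
            worst = max(candidates, key=lambda j: drops[j])  # exclusion hurts most:
            candidates.discard(worst)                        # keep it and stop re-testing it
    return sorted(model)
```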
increase the weight of the most-correlated variable by a small step $\epsilon$ at each iteration:
let vector r = y
let vector beta = 0
iterate:
find x[j] most correlated with r
let delta = epsilon * sign(corr(r, x[j]))
set beta[j] += delta
set r -= delta * x[j]
with small $\epsilon$, the resulting coefficient path is identical to the LASSO path for orthogonal predictors, and similar in the general case
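A minimal numpy sketch of the iteration above (incremental forward stagewise); the standardized-columns assumption, the value of epsilon, and the fixed step count are assumptions, not from the notes:

```python
import numpy as np

def forward_stagewise(X, y, epsilon=0.01, n_steps=1000):
    # assumes the columns of X are standardized, so X.T @ r is proportional to correlation
    r = y.astype(float)                 # residual starts as y
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ r                  # correlation (up to scale) of each predictor with r
        j = np.argmax(np.abs(corr))     # most-correlated predictor
        delta = epsilon * np.sign(corr[j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```

In practice the number of steps (or the size of the residual) acts as the regularization knob, playing the role that the penalty weight plays in LASSO.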