Natural language processing

[Edit]

Probability

Experiments & Sample Spaces

Events

Probability

Estimating probability

p(A)=c1/T1.\mathrm{p}(\mathrm{A})=\mathrm{c}_{1} / \mathrm{T}_{1}.

Example

p(A)=.379 (weighted average) or simply 3032/8000\mathrm{p}(\mathrm{A})=.379 \text{ (weighted average) or simply } 3032 / 8000

Basic Properties

p()=0p(Aˉ)=1p(A),ABp(A)p(B)aΩp(a)=1\begin{aligned} p(\varnothing) &= 0 \\ p(\bar{A}) &= 1-p(A), \quad A \subseteq B \Rightarrow p(A) \leq p(B) \\ \sum_{a \in \Omega} p(a) &= 1 \\ \end{aligned}

Joint and Conditional Probability

p(A,B)=p(AB)\mathrm{p}(\mathrm{A}, \mathrm{B})=\mathrm{p}(\mathrm{A} \cap \mathrm{B}) p(AB)=p(A,B)/p(B)\mathrm{p}(\mathrm{A} \mid \mathrm{B})=\mathrm{p}(\mathrm{A}, \mathrm{B}) / \mathrm{p}(\mathrm{B})

Estimating from counts:

p(AB)=p(A,B)/p(B)=(c(AB)/T)/(c(B)/T)=c(AB)/c(B)\mathbf{p}(\mathbf{A} \mid \mathrm{B})= \mathbf{p}(\mathbf{A}, \mathrm{B}) / \mathbf{p}(\mathrm{B})= (\mathbf{c}(\mathbf{A} \cap B) / T) /(\mathbf{c}(\mathrm{B}) / T)= \mathbf{c}(\mathbf{A} \cap \mathbf{B}) / \mathbf{c}(\mathbf{B})

Bayes Rule

p(A,B)=p(B,A)p(A, B)=p(B, A)

since

p(AB)=p(BA)p(A \cap B)=p(B \cap A)

therefore

p(AB)p(B)=p(BA)p(A)\mathrm{p}(\mathrm{A} \mid \mathrm{B}) \mathrm{p}(\mathrm{B})=\mathrm{p}(\mathrm{B} \mid \mathrm{A}) \mathrm{p}(\mathrm{A})

and therefore

p(AB)=p(BA)p(A)/p(B)\mathrm{p}(\mathrm{A} \mid \mathrm{B})=\mathrm{p}(\mathrm{B} \mid \mathrm{A}) \mathrm{p}(\mathrm{A}) / \mathrm{p}(\mathrm{B})

Independence

p(AB)=p(BA)p(A)/p(B)p(AB)p(B)=p(BA)p(A)p(A,B)=p(BA)p(A)\begin{aligned} \mathrm{p}(\mathrm{A} \mid \mathrm{B}) & =\mathrm{p}(\mathrm{B} \mid \mathrm{A}) \mathrm{p}(\mathrm{A}) / \mathrm{p}(\mathrm{B}) \\ \mathrm{p}(\mathrm{A} \mid \mathrm{B}) \mathrm{p}(\mathrm{B}) & =\mathrm{p}(\mathrm{B} \mid \mathrm{A}) \mathrm{p}(\mathrm{A}) \\ \mathrm{p}(\mathrm{A}, \mathrm{B}) & =\mathrm{p}(\mathrm{B} \mid \mathrm{A}) \mathrm{p}(\mathrm{A}) \end{aligned}

… we’re almost there: how p(BA)\mathrm{p}(\mathrm{B} \mid \mathrm{A}) relates to p(B)\mathrm{p}(\mathrm{B})?

p(BA)=p(B)\mathrm{p}(\mathrm{B} \mid \mathrm{A})=\mathrm{p}(\mathrm{B}) (iff A\mathrm{A} and B\mathrm{B} are independent)

Chain Rule

p(A1,A2,A3,A4,,An)=\mathrm{p}\left(\mathrm{A}_{1}, \mathrm{A}_{2}, \mathrm{A}_{3}, \mathrm{A}_{4}, \ldots, \mathrm{A}_{\mathrm{n}}\right)=

p(A1A2,A3,A4,,An)×p(A2A3,A4,,An)×\mathrm{p}\left(\mathrm{A}_{1} \mid \mathrm{A}_{2}, \mathrm{A}_{3}, \mathrm{A}_{4}, \ldots, \mathrm{A}_{\mathrm{n}}\right) \times \mathrm{p}\left(\mathrm{A}_{2} \mid \mathrm{A}_{3}, \mathrm{A}_{4}, \ldots, \mathrm{A}_{\mathrm{n}}\right) \times ×p(A3A4,,An)××p(An1An)×p(An)\times \mathrm{p}\left(\mathrm{A}_{3} \mid \mathrm{A}_{4}, \ldots, \mathrm{A}_{\mathrm{n}}\right) \times \ldots \times \mathrm{p}\left(\mathrm{A}_{\mathrm{n}-1} \mid \mathrm{A}_{\mathrm{n}}\right) \times \mathrm{p}\left(\mathrm{A}_{\mathrm{n}}\right)

Bayesian Network

Total Probability Law

p(A)=ip(A,Bi)=ip(ABi)p(Bi)\mathrm{p}(\mathrm{A})=\sum_{\mathrm{i}} \mathrm{p}(\mathrm{A}, \mathrm{B}_{\mathrm{i}})=\sum_{\mathrm{i}} \mathrm{p}(\mathrm{A} \mid \mathrm{B}_{\mathrm{i}}) \mathrm{p}(\mathrm{B}_{\mathrm{i}})

Normalization

p(Ω)=1ωp(ω)=1ωp(ωB)p(B)=1 for any B ωp(ωB)=1 for any B \begin{aligned} \mathrm{p}(\Omega) & =1 \\ \sum_{\omega} \mathrm{p}(\omega) & =1 \\ \sum_{\omega} \mathrm{p}(\omega \mid \mathrm{B}) \mathrm{p}(\mathrm{B}) & =1 \quad \text { for any B } \\ \sum_{\omega} \mathrm{p}(\omega \mid \mathrm{B}) & =1 \quad \text { for any B } \end{aligned}

Markov Assumption

p(x1,x2,,xn)=p(x1)t=2np(xtxt1)\mathrm{p}\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{\mathbf{n}}\right)=\mathrm{p}\left(\mathbf{x}_{1}\right) \prod_{\mathrm{t}=2}^{\mathrm{n}} \mathrm{p}\left(\mathbf{x}_{\mathrm{t}} \mid \mathbf{x}_{\mathrm{t}-1}\right) p(x1,x2,,xn)=p(x1,x2,,xn)t=n+1Tp(xtxt1,xt2,,xtn)\mathrm{p}\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{\mathbf{n}}\right)=\mathrm{p}\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{\mathbf{n}}\right) \prod_{\mathrm{t}=\mathrm{n}+1}^{\mathrm{T}} \mathrm{p}\left(\mathbf{x}_{\mathrm{t}} \mid \mathbf{x}_{\mathrm{t}-1}, \mathbf{x}_{\mathrm{t}-2}, \ldots, \mathrm{x}_{\mathrm{t}-\mathrm{n}}\right)