My Study Cabin

Created 2020-04-20 | Updated 2021-04-07 | Statistics

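This post follows the textbook "Applied Linear Statistical Models", Fifth Edition, pp. 1-508. The book starts from simple linear regression with one predictor, then introduces matrices and extends to multiple linear regression analysis.
I. Basic concepts of linear regression
1. The true model and linear regression
We have $n$ samples $(X_{i1},\cdots ,X_{i,p-1},Y_i)$, and each sample follows the true relationship $Y=f_{true}(X)+\varepsilon_{true}$.
We approximate $f_{true}$ with a linear relationship; that is, the response variable $Y_i$ and the predictor variables $X_{i1},\cdots ,X_{i,p-1}$ satisfy $Y_{i}=\beta_{0}+\beta_{1} X_{i 1}+\beta_{2} X_{i 2}+\cdots+\beta_{p-1} X_{i, p-1}+\varepsilon_{i}, \quad i=1,2, \ldots, n$.
In matrix form: $Y=X \beta+\varepsilon$, where $Y$ and $\varepsilon$ are both $n \times 1$ random vectors. \boldsymbol{Y}=\left[\begin{array}{c}{Y_{1}} \\ {Y_{ ...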
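To make the matrix form $Y=X\beta+\varepsilon$ concrete, here is a minimal sketch that estimates $\beta$ by least squares using only the Python standard library. The excerpt stops before estimation, so solving the normal equations $(X'X)b=X'Y$ is my assumption about the next step; the data, the coefficients $(1, 2, -3)$, and the helper names `solve_linear_system` and `least_squares` are made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(predictors, y):
    """Solve the normal equations (X'X) b = X'Y, adding an intercept column."""
    design = [[1.0] + list(row) for row in predictors]  # column of ones for beta_0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from beta = (1, 2, -3).
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in predictors]
beta_hat = least_squares(predictors, y)
print([round(b, 6) for b in beta_hat])
```

Because the toy data contain no noise, the estimate should recover the generating coefficients up to floating-point error; with real data the residuals $Y-Xb$ would be nonzero.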

Regular Expressions

Created 2020-04-20 | Updated 2020-12-09

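I. The R package stringr
I find this package simpler and easier to use than the built-in string functions.
Character operations: working with individual characters in a character vector
    str_length('abcdefgh') # returns the string length
    [1] 8
    str_sub('abcdefgh',2,4) # returns the extracted substring
    [1] "bcd"
    str_dup('abcdefgh',2) # repeats and concatenates the string
    [1] "abcdefghabcdefgh"
Adding, removing, and manipulating whitespace
    str_pad(string, width, side = c("left", "right", "both"), pad = " ") # pads with whitespace
    str_pad("string", 10, side = "left", pad = " ")
    [1] "    string"
    str_trim(string, sid ...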
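For comparison, the same operations can be sketched with plain Python string methods. This is only an analogy to the stringr calls above, not a port of the package; note that R's `str_sub` is 1-based and inclusive, while Python slices are 0-based and exclude the end index.

```python
s = "abcdefgh"

length = len(s)                  # like str_length('abcdefgh')
sub = s[1:4]                     # like str_sub('abcdefgh', 2, 4)
dup = s * 2                      # like str_dup('abcdefgh', 2)
padded = "string".rjust(10)      # like str_pad("string", 10, side = "left", pad = " ")
trimmed = "  string  ".strip()   # like str_trim, which removes surrounding whitespace

print(length, sub, dup, repr(padded), repr(trimmed))
```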

Web Scraping

Created 2020-04-20 | Updated 2020-12-09

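Notes on what I have learned about web scraping. A scraper extracts content from web pages.
Learning resources: http://c.biancheng.net/view/2011.html https://www.jianshu.com/p/0c0cb9867b44
Web page basics
A web page generally consists of three parts: HTML (HyperText Markup Language), CSS (Cascading Style Sheets), and JavaScript (a scripting language).
With a page open, Ctrl+U quickly shows its source. Sometimes Ctrl+U does not help because the content is produced by a <script>; in that case, either work out what the JavaScript does or use a webdriver. Another option is to press F12.
HTML
Mainly see the 菜鸟教程 (Runoob) tutorial. Pay special attention to whether the page uses <iframe>; this one cost me dearly.
    <html>..</html> the enclosed elements make up the web page
    <body>..</body> the content visible to the user
    <div>..</div> a generic container
    <p>..</p> a paragraph
    <li>..</li> a list item
    <img>..</im ...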
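As a small illustration of the tag list and the <iframe> warning, the sketch below uses Python's standard `html.parser` module to count tags, collect paragraph text, and flag whether a page contains an <iframe>. The `TagCounter` class and the `page` string are hypothetical; a real scraper would fetch the page source over HTTP first.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and collect the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


# Hypothetical page source, written inline for the example.
page = """
<html><body>
  <div><p>Hello</p><p>World</p></div>
  <iframe src="inner.html"></iframe>
</body></html>
"""

parser = TagCounter()
parser.feed(page)
has_iframe = parser.tag_counts.get("iframe", 0) > 0
print(parser.paragraphs, has_iframe)
```

If `has_iframe` is true, the content of interest may live in the framed document (here the made-up `inner.html`) rather than in the outer page, which is the trap the note warns about.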

Setting Up the Blog

Created 2020-04-20 | Updated 2020-12-09

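This first post records how I set up the blog and the LaTeX environment, mainly following a post on 博客园 (cnblogs).
I. Software
Download and install node.js and git.
node.js website; to check the installation, enter node -v and npm -v in cmd.
A China mirror for git; git is used to upload to GitHub.
II. The blog
Create a folder dedicated to the blog content, for example a new folder named blog on drive D.
Open CMD and enter d: and then cd blog to move into the blog folder.
Next install hexo. The Taobao mirror can be used here, with cnpm standing in for npm: npm install -g cnpm --registry=https://registry.npm.taobao.org
Then enter npm i -g hexo, or cnpm i -g hexo, which is faster.
Enter hexo init to finish setting up hexo.
Enter npm install to install the dependencies (though I am not sure what they are for).
To check on your own computer, enter hexo g and then hexo s; the site can then be opened in a browser at localhost:4000.
III. GitHub account
Create a repository on GitHub to host the blog; its name is XXX. ...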

Survival Analysis

Created 2020-04-04 | Updated 2021-04-07 | Survival Analysis

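I needed survival analysis today, but I learned it a long time ago and have forgotten some of it, so I am going through it again and writing down the basic concepts.
Survival analysis does not have to be about death; any occurrence of interest can be treated as the event. HIV is used as the example here.
    library(speff2trial)
    library(survival) # R package for survival analysis
    data(ACTG175)
Censoring: the event cannot be observed. For example, (1) the observation period has ended but the event has not occurred yet; (2) the individual dropped out during the observation period and can no longer be followed; (3) the event occurred within some interval, but the exact time is unknown.
Truncation: the start of the process (the time origin) lies before the observation period, so the true survival time is unknown.
Survival function: $T^{*}$ is the survival time (e.g. the time to HIV infection). Suppose $T^{*} \sim f(t), F(t)$ (density and distribution function); then the survival function is $S(t) = 1- F(t)$:
$$\begin{aligned} S(t) &=P\left(T^{*}>t\right) \\ &=1-P\left(T^{*} \leqslant t\right) \end{aligned}$$
Hazard function: some textbooks use $h(t)$ in place of $\l ...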
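To connect censoring with the survival function $S(t)=P(T^{*}>t)$, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The excerpt does not name this estimator, so treating it as the next step is my assumption; the function name `kaplan_meier` and the toy data are hypothetical and are not taken from ACTG175.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  -- observed time for each subject (event time or censoring time)
    events -- 1 if the event was observed, 0 if the subject was censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events (not censorings) that happen exactly at time t.
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical toy data: five subjects, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never trigger a drop in the curve, but they still count in the risk set up to their censoring time, which is how the estimator uses the partial information they carry.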

ARIMA

Created 2020-01-17 | Updated 2021-04-07

Notes on what I have learned about ARIMA: https://blog.csdn.net/dingming001/article/details/73743950 https://blog.csdn.net/c1z2w3456789/article/details/80752541 https://blog.csdn.net/gdyflxw/article/details/55509656

CHAID Decision Tree

Created 2020-01-17 | Updated 2021-04-07 | Random Forest

This is another kind of decision tree. I used it when working on DRGs grouping, so I am writing it down.
Wiki introduction
CHAID: Chi-square automatic interaction detection.
The algorithm is based on adjusted significance testing (Bonferroni testing). It is also an extension of AID and THAID, and was proposed by G. V. Kass in 1980 in "An Exploratory Technique for Investigating Large Quantities of Categorical Data".
Select the predictor that has the smallest adjusted p-value (i.e., most significant). If this adjusted p-value is less than or equal to a user-specified alpha-level alpha4, split the node using this predictor. Else, do not split and ...
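The selection rule quoted above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the raw chi-square p-values are taken as given, the Bonferroni adjustment is modeled as multiplying each p-value by a per-predictor multiplier (capped at 1), and the function name `choose_split`, the predictor names, and all numbers are hypothetical.

```python
def choose_split(p_values, n_tests, alpha=0.05):
    """Pick the predictor with the smallest Bonferroni-adjusted p-value.

    p_values -- dict mapping predictor name to its raw p-value
    n_tests  -- dict mapping predictor name to its Bonferroni multiplier
    Returns the chosen predictor, or None if the node should not be split.
    """
    adjusted = {name: min(1.0, p * n_tests[name]) for name, p in p_values.items()}
    best = min(adjusted, key=adjusted.get)
    return best if adjusted[best] <= alpha else None


# Hypothetical raw p-values and multipliers for three predictors.
p_values = {"age_group": 0.004, "sex": 0.03, "ward": 0.0009}
n_tests = {"age_group": 10, "sex": 1, "ward": 15}
chosen = choose_split(p_values, n_tests)
not_split = choose_split({"sex": 0.2}, {"sex": 1})
print(chosen, not_split)
```

The adjustment matters here: a predictor with many categories gets a larger multiplier, so a small raw p-value alone does not guarantee that it wins the split.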

Convex Analysis II

Created 2019-12-09 | Updated 2021-04-07 | Convex Analysis

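Inequality constraints
We try to minimize a function $f: C \rightarrow \mathbf{R}$ on a set $C$ in $\mathbf{E}$.
Local minimizer $\bar{x}$: $f(x) \geq f(\bar{x})$ for all points $x$ in $C$ close to $\bar{x}$.
Directional derivative: $f^{\prime}(\bar{x} ; d)=\lim _{t \downarrow 0} \frac{f(\bar{x}+t d)-f(\bar{x})}{t}$, for $d \in \mathbf{E}$.
First-order necessary condition: if $C$ is a convex set in $\mathbf{E}$ and $\bar{x}$ is a local minimizer of $f$, then $\forall x \in C \ f^\prime (\bar{x}; x- \bar{x}) \ge 0$. In particular, if $f$ is differentiable at $\bar{x}$, then $-\nabla f(\bar{x}) \in N_{C}(\bar{x})$.
First-order sufficient condition: if $C$ is a convex set in $\mathbf{E}$ and $f$ is a convex function, then $\forall x \in C \ f^\prime (\bar{x}; x- \bar{x}) ...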
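The directional derivative and the necessary condition can be checked numerically. The sketch below approximates $f^{\prime}(\bar{x}; d)$ with a one-sided difference quotient at a small fixed $t$ and tests $f^{\prime}(\bar{x}; x-\bar{x}) \ge 0$ on a grid of points in $C$. The function $f(x)=|x_1|+x_2^2$, the set $C=[-1,1]^2$, and the step size are my own choices for illustration.

```python
def directional_derivative(f, x, d, t=1e-6):
    """One-sided difference quotient approximating f'(x; d) as t decreases to 0."""
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (f(shifted) - f(x)) / t


def f(x):
    # Hypothetical example: convex, but not differentiable where x[0] == 0.
    return abs(x[0]) + x[1] ** 2


x_bar = [0.0, 0.0]
# Sample points x in C = [-1, 1]^2 and evaluate f'(x_bar; x - x_bar).
grid = [(-1 + 0.5 * i, -1 + 0.5 * j) for i in range(5) for j in range(5)]
values = [
    directional_derivative(f, x_bar, [x0 - x_bar[0], x1 - x_bar[1]])
    for x0, x1 in grid
]
condition_holds = min(values) >= 0
print(condition_holds)
```

At $\bar{x}=0$ this $f$ has no gradient, yet every directional derivative exists and equals $|d_1|$, so the condition stated with $f^{\prime}(\bar{x}; \cdot)$ still applies where the gradient form does not.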

Convex Analysis I

Created 2019-12-09 | Updated 2021-04-07 | Convex Analysis

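Background
Euclidean space
The norm in $\mathbf{E}$ is defined as $$\| x \|=\sqrt{\langle x, x\rangle}$$
Unit ball: $$B=\{x \in \mathbf{E} \mid \|x\| \leq 1\}$$
The sum of two sets is defined as $$C+D=\{x+y \mid x \in C, y \in D\}$$
The product of a set of scalars $\Lambda$ and a set $C$: $$\Lambda C=\{\lambda x \mid \lambda \in \Lambda, x \in C\}$$
$C$ is called a cone (containing the origin) if $$\mathbf{R_+} C =C$$
Convex set: $$\lambda x+(1-\lambda) y \in C \text { whenever } 0 \leq \lambda \leq 1$$
Linear span, $\operatorname{span}(D)$: the smallest linear subspace containing $D$ (what is this used for??)
Convex hull, $\operatorname{conv}(D)$: the smallest convex set containing $D$
Theorem: Suppose that the set $C \subset E$ is ...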
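The set operations defined above are easy to try out on finite sets of points. The sketch below is only an illustration on small hypothetical sets in $\mathbf{R}^2$ (the definitions themselves apply to arbitrary subsets of $\mathbf{E}$); the helper names `set_sum`, `set_scale`, and `norm` are my own.

```python
def set_sum(c, d):
    """C + D = {x + y | x in C, y in D} for finite sets of points (tuples)."""
    return {tuple(xi + yi for xi, yi in zip(x, y)) for x in c for y in d}


def set_scale(scalars, c):
    """Lambda C = {lam * x | lam in Lambda, x in C} for finite sets."""
    return {tuple(lam * xi for xi in x) for lam in scalars for x in c}


def norm(x):
    """Euclidean norm ||x|| = sqrt(<x, x>)."""
    return sum(xi * xi for xi in x) ** 0.5


# Hypothetical finite sets in R^2.
C = {(0, 0), (1, 0)}
D = {(0, 1), (2, 2)}
sum_cd = set_sum(C, D)
scaled_c = set_scale({0, 2}, C)
in_unit_ball = norm((3, 4)) <= 1  # is (3, 4) in the unit ball B?
print(sorted(sum_cd), sorted(scaled_c), in_unit_ball)
```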

LSTM

Created 2019-12-09 | Updated 2021-04-07 | Neural Networks

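Notes on what I have learned about LSTM.
Learning resources: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ https://blog.csdn.net/wangyangzhizhou/article/details/76651116
The concept of LSTM
LSTM (Long Short-Term Memory network) is a kind of neural network based on the RNN.
The RNN has a weakness: sometimes a very short sequence (e.g. $X_1\ X_2\ X_3$) is enough to predict $X_4$, but sometimes a very long sequence (e.g. $X_1 \cdots X_{100}$) is needed for the prediction, and a plain RNN struggles to learn such long-range dependencies.
The structure of an LSTM is as follows: each line carries an entire vector, e.g. $X_t$; yellow boxes are neural network layers; pink circles are pointwise operations, such as vector addition and elementwise multiplication.
The key to the LSTM is the cell state.
Steps of an LSTM
Decide which information from the previous cell state is kept or discarded: the forget gate $$f_{t}=\sigma\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]+b_{f}\right)$$
Decide which information will be stored in the cell state: the input gate ...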
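The forget-gate formula $f_{t}=\sigma(W_{f} \cdot [h_{t-1}, x_{t}]+b_{f})$ can be written out directly in plain Python. This is a minimal sketch, not a full LSTM cell: the weights, bias, and vector sizes below are hypothetical values chosen so the result is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(w_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f), computed row by row."""
    concat = h_prev + x_t  # [h_{t-1}, x_t] as one flat list
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(w_f, b_f)
    ]


# Hypothetical sizes: hidden state of length 2, input of length 1.
w_f = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0]]
b_f = [0.0, 0.5]
h_prev = [0.5, 0.5]
x_t = [1.0]
f_t = forget_gate(w_f, b_f, h_prev, x_t)
print(f_t)
```

Each entry of `f_t` lies between 0 and 1 and multiplies the matching entry of the previous cell state elementwise, so values near 0 discard that component and values near 1 keep it.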