文章目录

Step 1: Importing the libraries 导入库
Step 2: Importing dataset 导入数据集
Step 3: Handling the missing data 处理丢失数据
Step 4: Encoding categorical data 编码分类的数据
Creating a dummy variable 穿件中间变量
Step 5: Splitting the datasets into training sets and Test sets
数据集分割为训练集和测试集
Step 6: Feature Scaling 特征缩放

这是github一个开源项目，作者是Avik Jain，内容是从机器学习的基础概念起步，逐层递进，很适合初学者。截至到现在，已经有近6000的star。我不自量力的翻译一下。本人英语很渣，借助了翻译工具，但还是不能很好的切合文意。写博客仅仅是为了记录自己的学习历程，翻译不到位的话还请多多见谅，不足之处请指出，不喜勿喷。

这里附上原项目地址：https://github.com/Avik-Jain/100-Days-Of-ML-Code，这里可以下载到项目所需数据集。

如有侵权请告知，我将第一时间删除。

转发请注明出处

第一天数据处理

这里是作者给的知识图谱，下面给出的基本是直译，有很多专业词汇翻译不到位，知识所限。

Step 1 : importing the required libraries

第一步：导入必须的包

these two are essential libraries which we will import every time.

我们每次都要导入这两个库 //这里是指NumPyh和Pandas

Numpy is a libray which contains Mathematical function

Numpy库包含了数学处理函数

pandas is the library used to import and manage the data sets

pandas 是用来导入和管理数据集的库

step 2 :importing the data set

第二步：导入数据集

data sets are generally available in .csv format

数据集通常以.csv的格式提供

a CSV file stores tabular data in plaintext

csv 文件以明文存储表格数据

each line of the file is a data record

文件的每一行都是一个数据记录

we use the read_csv method of the pandas library to read a local csv file as a dataframe

我们使用pandas库的read_csv方法读取本地CSV文件作为数据文件

then we make separate Matrix and Vector of independent and dependent variables from the dataframe

然后从数据文件中分别生成自变量和因变量的矩阵和向量

step3 :handing the missing data

第三步：处理丢失数据

the data we get is rarely homogeneous

我们得到的数据很少是完整的

data can be missing due to various reasons and needs to be handled so that it does not reduce the performance of our machine learning model

由于各种原因，数据可能会丢失，需要进行处理，从而不会降低我们的机器学习模型的性能。

We can replace the missing data by the Median of the entire column

我们可以用整列的中值代替缺失数据

we use imputer class of sklearn.preprocessing for this taske

我们用 sklearn.preprocessing 下的inputer 类来完成这个任务

step 4:encoding categorical data

第四步：编码分类的数据

categorical data are variables that contain lable values rather than numeric values.

分类数据包含标签值的变量而不是数值

the number of possible values is often limited to a fixed set.

可能值的数目通常限于固定集。

Example values such as “yes” and “No” cannot be used in mathematical equations of the model so we need to encode these variables into numbers

诸如“是”和“否”等示例值不能用于模型的数学方程，所以我们需要将这些变量编码成数字。

To achieve this we import LableEncoder class from sklearn.preprocessing library.

为了实现这一点，我们从sklearn.preprocessing库导入LableEncoder 类。

Step 5:splotting the dataset into test set and training set

第五步：把数据集分为测试集和训练集

we make two partitions of dataset one for training the model called training set and other for testing the performance of the trained model called test set

我们把数据集分为两部分，一部分用于训练模型，被成为训练集，另一部分用于检测模型的表现，被称为测试集。

the split is generally 80/20

经常以8:2划分

we importing train_test_split() method of sklearn.crossvalidation library

我们导入 sklearn.crossvalidation库的train_test_split()方法。

Step 6 :feature scaling

第六步：特征缩放

most of machine learning algorithms use the Euclidean distance between two data points in their computations,features highly varying in magnitudes, units and range pose problems.

大多数机器学习算法在计算中使用两个数据点之间的欧氏距离，特征在幅度、单位和范围上很大的变化,这引起了问题。

High magnitudes features will weigh more in the distance calculations than features with low magnitudes.

高数值特征在距离计算中的权重将大于低数值特征。

Done by feature standardization or Z-score normalization

通过特征标准化或Z分数归一化来完成

StandardScalar of sklearn.preprocessing is imported

导入sklearn.preprocessing 库中的StandardScala

代码如下

Step 1: Importing the libraries 导入库

import numpy as np
import pandas as pd

Step 2: Importing dataset 导入数据集

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values

Step 3: Handling the missing data 处理丢失数据

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])

Step 4: Encoding categorical data 编码分类的数据

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])

Creating a dummy variable 穿件中间变量

onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)

Step 5: Splitting the datasets into training sets and Test sets

数据集分割为训练集和测试集

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)

Step 6: Feature Scaling 特征缩放

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)

杂谈：

NumPy的介绍：https://baike.baidu.com/item/numpy/5678437?fr=aladdin

Pandas的介绍：https://baike.baidu.com/item/pandas/17209606

sklearn是机器学习中一个常用的python第三方模块，网址：http://scikit-learn.org/stable/index.html ，里面对一些常用的机器学习方法进行了封装，在进行机器学习任务时，并不需要每个人都实现所有的算法，只需要简单的调用sklearn里的模块就可以实现大多数机器学习任务。

如果你是新手，可以去廖雪峰老师的学习网站学习Python,在这里学习李宏毅教授的机器学习课程。

转载自原文链接, 如需删除请联系管理员。

原文链接：100-days-ML----100天搞定机器学习 (1)，转载请注明来源！

第一天 数据处理