[DAY26] Building Pipelines with the Azure Machine Learning SDK

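In Azure Machine Learning (AML), a Pipeline is a workflow of machine learning tasks, and each task in the workflow is a step. This Pipeline is not the same thing as a Scikit-Learn Pipeline. A Scikit-Learn Pipeline chains data transformation and processing, whereas an AML Pipeline chains the steps of an experiment run. You can of course treat a Scikit-Learn Pipeline as a single step and wrap it inside an AML Pipeline.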

Common step types in an AML Pipeline include:

PythonScriptStep: runs a specified Python script.
DataTransferStep: uses Azure Data Factory to copy data between datastores.
DatabricksStep: runs code on a Databricks cluster.
AdlaStep: runs a U-SQL job in Azure Data Lake Analytics.
ParallelRunStep: runs Python code as a distributed job across multiple compute nodes.

All of these step types can be mixed in a single Pipeline. For example, a Pipeline could run one DatabricksStep first and then three PythonScriptSteps.
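One detail worth sketching here: as far as I understand, AML works out the run order from the data dependencies between steps rather than from their position in the list. The toy below is not AML code. It uses Python's standard `graphlib` module and made-up step names to show how an order falls out of "which step needs whose output" for the one-Databricks-then-three-Python example above.

```python
from graphlib import TopologicalSorter

# Hypothetical step names: each key maps to the set of steps whose output it needs.
dependencies = {
    "databricks_ingest": set(),
    "preprocess": {"databricks_ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
}


def execution_order(deps):
    """Return one valid run order in which every step follows the steps it depends on."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Steps that share no dependency would have no fixed relative order in such a graph, which is why passing data between steps (covered further down) matters for sequencing.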

Building a Pipeline

In practice, the step we use most often is PythonScriptStep. It simply wraps the script you want to run as a step. Usage looks like this:

from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment, Workspace

ws = Workspace.from_config() 

# Step 1: preprocess the data
step1 = PythonScriptStep(name = 'preprocess data',
                         source_directory = 'pipeline_script_folder',
                         script_name = 'preprocess_data.py',
                         compute_target = 'ironmancpu')

# Step 2: train the model
step2 = PythonScriptStep(name = 'train model',
                         source_directory = 'pipeline_script_folder',
                         script_name = 'train_model.py',
                         compute_target = 'ironmancpu')

# Then create a pipeline that contains steps 1 and 2
train_pipeline = Pipeline(workspace = ws, steps = [step1,step2])

# Then submit it as an experiment run
experiment = Experiment(workspace = ws, name = 'pipeline-sdk')
pipeline_run = experiment.submit(train_pipeline)

# A Pipeline can also be published
published_pipeline = train_pipeline.publish(name='Pipeline_sdk',
                                          description='sdk build pipeline',
                                          version='1.0')

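AML automatically caches the output of steps that have already run and reuses it on later runs instead of executing them again, which saves time. But sometimes, for example after changing parameters or a script, you want to be sure the step really reruns instead of serving a cached result. There are two ways to do that: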

# Option 1: set allow_reuse = False on the PythonScriptStep
step1 = PythonScriptStep(name = 'preprocess data',
                         source_directory = '.',
                         script_name = 'preprocess_data.py',
                         compute_target = 'aml-cluster',
                         allow_reuse = False)

# Option 2: pass regenerate_outputs=True when submitting the experiment to force every step to run
pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
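To make the two switches easier to picture, here is a toy model of step reuse. It is not how AML implements caching internally; the fingerprint scheme, the `run_step` helper, and the in-memory `_cache` are all assumptions made up for illustration. The only point is the decision rule: reuse a previous output unless reuse is disabled for the step or regeneration is forced for the run.

```python
import hashlib

_cache = {}  # fingerprint -> cached output (stands in for stored step outputs)


def fingerprint(script_text, arguments):
    """Hash the script contents together with its arguments."""
    payload = script_text + "\x00" + "\x00".join(arguments)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(script_text, arguments, execute,
             allow_reuse=True, regenerate_outputs=False):
    """Return (output, reused).

    The cached output is used only when reuse is allowed, regeneration is not
    forced, and an identical script/argument combination has run before.
    """
    key = fingerprint(script_text, arguments)
    if allow_reuse and not regenerate_outputs and key in _cache:
        return _cache[key], True
    output = execute()
    _cache[key] = output
    return output, False


calls = []


def fake_preprocess():
    """Hypothetical step body that records each real execution."""
    calls.append(1)
    return "prepared_data.csv"


script = "print('v1')"
args = ["--length", "100"]
first = run_step(script, args, fake_preprocess)
second = run_step(script, args, fake_preprocess)
forced = run_step(script, args, fake_preprocess, regenerate_outputs=True)
print(first, second, forced, len(calls))
```

In this sketch the second call is served from the cache, while the forced call executes again even though nothing changed.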

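Steps often need to pass data to one another. For example, once the first step has finished preprocessing the data, it hands the result to the second step for model training. This is what OutputFileDatasetConfig is for: it stores the data temporarily and passes it on to the next step.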

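There are two key points when using OutputFileDatasetConfig, each with sample code below:

The Python script must be parameterized, just as we did on the day we covered ScriptRunConfig.
The OutputFileDatasetConfig is passed as an argument, either as an output or as an input.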

Sample code for the parameterized Python script:

from azureml.core import Run
import argparse
import os

run = Run.get_context()

parser = argparse.ArgumentParser()
parser.add_argument('--raw-ds', type=str, dest='raw_dataset_id')

# Parameterize the output folder
parser.add_argument('--out_folder', type=str, dest='folder')
args = parser.parse_args()
output_folder = args.folder

raw_df = run.input_datasets['raw_data'].to_pandas_dataframe()

prepped_df = raw_df[['col1', 'col2', 'col3']]

# Save the processed data into the output folder
os.makedirs(output_folder, exist_ok=True)
output_path = os.path.join(output_folder, 'prepared_data.csv')
prepped_df.to_csv(output_path)

Sample code for OutputFileDatasetConfig:

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
from azureml.core import Dataset, Experiment, Workspace

ws = Workspace.from_config() 

raw_ds = Dataset.get_by_name(ws, 'raw_dataset')

# Create an OutputFileDatasetConfig to pass data between steps
prepped_data = OutputFileDatasetConfig('prepared_data')

step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'pipeline_script_folder',
                         script_name = 'preprocess_data.py',
                         compute_target = 'aml-cluster',
                         # The preprocessed data is written out here
                         arguments = ['--raw-ds', raw_ds.as_named_input('raw_data'),
                                      '--out_folder', prepped_data])

step2 = PythonScriptStep(name = 'train model',
                         source_directory = 'pipeline_script_folder',
                         script_name = 'train_model.py',
                         compute_target = 'aml-cluster',
                         # The preprocessed data is read in here
                         arguments=['--training-data', prepped_data.as_input()])
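The post shows preprocess_data.py but not the consuming side, so here is a minimal sketch of what train_model.py could start with. It assumes that `--training-data` arrives as a folder path at run time and that the file is named prepared_data.csv, as written by the preprocessing script. The standard `csv` module is used instead of pandas only to keep the sketch dependency-free, and the temporary folder at the end is a local stand-in for the step output, not something AML provides.

```python
import argparse
import csv
import os
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Same flag name that step2 passes in its `arguments` list.
    parser.add_argument('--training-data', type=str, dest='training_data')
    return parser.parse_args(argv)


def load_rows(folder, filename='prepared_data.csv'):
    """Read the CSV that the preprocessing step wrote into the shared folder."""
    path = os.path.join(folder, filename)
    with open(path, newline='') as f:
        return list(csv.DictReader(f))


def main(argv=None):
    args = parse_args(argv)
    rows = load_rows(args.training_data)
    print(f'Loaded {len(rows)} rows for training')
    return rows


# Local dry run: a temporary folder stands in for the output of step1.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'prepared_data.csv'), 'w', newline='') as f:
        f.write('col1,col2,col3\n1,2,3\n4,5,6\n')
    rows = main(['--training-data', folder])
```

In the real script you would drop the dry-run block, call `main()` with no arguments so the values come from the command line, and hand the rows to whatever training code follows.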

A Pipeline can also be designed to accept parameters and hand them to a PythonScriptStep. Sample code:

from azureml.pipeline.core.graph import PipelineParameter

length_param = PipelineParameter(name='data_length', default_value=100)

step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'pipeline_script_folder',
                         script_name = 'preprocess_data.py',
                         compute_target = 'aml-cluster',
                         # Pass the pipeline-level parameter in here
                         arguments = ['--raw-ds', raw_ds.as_named_input('raw_data'),
                                      '--length', length_param,
                                      '--out_folder', prepped_data])
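On the script side, the parameter shows up as an ordinary command-line argument. The sketch below shows one way preprocess_data.py could pick up `--length`. The `truncate` helper and the idea that the value limits the number of rows are assumptions for illustration, since the post does not say what data_length is used for.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--raw-ds', type=str, dest='raw_dataset_id')
    # Receives the value of the 'data_length' PipelineParameter.
    # The default of 100 mirrors the default_value set on the parameter.
    parser.add_argument('--length', type=int, dest='length', default=100)
    parser.add_argument('--out_folder', type=str, dest='folder')
    return parser.parse_args(argv)


def truncate(rows, length):
    """Keep only the first `length` rows (hypothetical use of the parameter)."""
    return rows[:length]


# Simulate the arguments the step would pass to the script.
args = parse_args(['--length', '3', '--out_folder', 'outputs'])
sample = truncate(list(range(10)), args.length)
print(sample)
```

If I remember the SDK correctly, a different value can be supplied per run with `experiment.submit(pipeline, pipeline_parameters={'data_length': 50})`; treat that as something to verify against the SDK reference rather than as confirmed here.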

A Pipeline can also be put on a schedule so that it triggers automatically at a regular interval. Sample code:

from azureml.pipeline.core import ScheduleRecurrence, Schedule

# frequency can be "Minute", "Hour", "Day", "Week" or "Month". interval is the number of frequency units to wait before the schedule fires again. Here that is one day.
daily = ScheduleRecurrence(frequency='Day', interval=1)
schedule = Schedule.create( ws, name='Everyday',
                                description='Runs every day',
                                pipeline_id='your pipeline id',
                                experiment_name='Training_Pipeline',
                                recurrence=daily)
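To see what a frequency and interval pair means in terms of actual trigger times, here is a small stand-alone sketch. It does not call AML and may not match the service's exact timing rules; the `next_runs` helper, the unit table, and the start date are made up for illustration.

```python
from datetime import datetime, timedelta

# Fixed-length units only; "Month" is left out because months vary in length.
_UNIT = {
    "Minute": timedelta(minutes=1),
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}


def next_runs(start, frequency, interval, count):
    """List the first `count` trigger times after `start`."""
    step = _UNIT[frequency] * interval
    return [start + step * i for i in range(1, count + 1)]


# Arbitrary example start time, with frequency='Day' and interval=1 as in the schedule above.
runs = next_runs(datetime(2021, 10, 1, 9, 0), "Day", 1, 3)
for run_time in runs:
    print(run_time.isoformat())
```

Changing the call to `next_runs(start, "Hour", 6, 4)` would list four trigger times six hours apart.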

A Pipeline can also be triggered when data changes. Sample code:

from azureml.core import Datastore
from azureml.pipeline.core import Schedule

datastore = Datastore(workspace=ws, name='titanic')
pipeline_schedule = Schedule.create(ws, name='Reactive Training',
                                    description='Runs whenever the data changes',
                                    pipeline_id='your pipeline id',
                                    experiment_name='Training_Pipeline',
                                    datastore=datastore,
                                    path_on_datastore='data/training')
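Conceptually, a data-triggered schedule boils down to "look at a path periodically and fire when it looks different from last time". The sketch below illustrates that idea on a local folder. It is not how AML watches a datastore; the `snapshot` and `has_changed` helpers and the file name are assumptions for illustration only.

```python
import os
import tempfile


def snapshot(folder):
    """Map each file under `folder` to its (size, modified-time) pair."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            state[os.path.relpath(path, folder)] = (info.st_size, info.st_mtime_ns)
    return state


def has_changed(before, after):
    """True when the two snapshots differ in any file, size, or timestamp."""
    return before != after


# A temporary folder stands in for data/training on the datastore.
with tempfile.TemporaryDirectory() as folder:
    before = snapshot(folder)
    with open(os.path.join(folder, 'new_batch.csv'), 'w') as f:
        f.write('col1,col2\n1,2\n')
    after = snapshot(folder)
    triggered = has_changed(before, after)

print(triggered)
```

I believe `Schedule.create` also accepts a `polling_interval` argument (in minutes) that controls how often the datastore is checked, but verify that against the SDK reference before relying on it.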

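That's it for Pipelines today! Without my noticing, the word count has blown past six thousand again, which really is a lot. Tomorrow we'll look at how to do AutoML with the AML SDK.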
