B.19 Python Basics

Bayesian Optimization: a Python implementation of global optimization with Gaussian processes.

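Implement a statistical algorithm with NumPy, such as linear regression, robust linear regression, or a generalized linear model, using a dataset that ships with Python.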
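As a starting point for the regression exercise described above, here is a minimal sketch of ordinary least squares using only NumPy. The helper name `fit_ols` and the synthetic data are assumptions made for illustration; they are not taken from the book.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, slope_1, ...]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the first coefficient is the intercept
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Synthetic data (assumed for illustration): y = 1 + 2 x + small noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

beta = fit_ols(x, y)
print(beta)  # estimates of the intercept and the slope
```

`np.linalg.lstsq` solves the least-squares problem directly, which is usually numerically safer than forming and inverting the normal equations by hand.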

import numpy as np
np.zeros(3) # vector
## array([0., 0., 0.])
np.ones(3) # vector
## array([1., 1., 1.])
np.diag([1, 1, 1]) # identity matrix
## array([[1, 0, 0],
##        [0, 1, 0],
##        [0, 0, 1]])
np.cumsum([1,1,1])
## array([1, 2, 3])

Take the iris dataset bundled with the Python module scikit-learn (Pedregosa et al. 2011) as an example; see https://scikit-learn.org/stable/datasets/index.html
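A short sketch of loading that dataset, assuming scikit-learn is installed. `load_iris` is the loader provided in `sklearn.datasets`; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # feature matrix and integer class labels

print(X.shape)                  # number of samples and features
print(iris.feature_names)       # column names of X
print(iris.target_names)        # class names indexed by y
```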

Import the regular expression library:

import re
m = re.search('(?<=abc)def', 'abcdef')
m.group(0) # the chunk echoes the repr 'def'; call print() for the bare string
## 'def'
print(m.group(0))
## def
import sys
print(sys.path)
## ['', '/usr/bin', '/usr/lib/python38.zip', '/usr/lib/python3.8', '/usr/lib/python3.8/lib-dynload', '/opt/.virtualenvs/r-tensorflow/lib/python3.8/site-packages', '/home/runner/work/_temp/Library/reticulate/python', '/opt/.virtualenvs/r-tensorflow/lib/python38.zip', '/opt/.virtualenvs/r-tensorflow/lib/python3.8', '/opt/.virtualenvs/r-tensorflow/lib/python3.8/lib-dynload']
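The pattern passed to re.search() above uses a positive lookbehind. The sketch below, which uses only the standard library, contrasts it with a capturing group and shows the guard needed when nothing matches.

```python
import re

text = "abcdef"

# Positive lookbehind: match 'def' only when 'abc' comes immediately before it;
# 'abc' itself is not part of the match
m = re.search(r"(?<=abc)def", text)
print(m.group(0), m.span())

# Capturing group: 'abc' is consumed by the match, 'def' is captured as group 1
g = re.search(r"abc(def)", text)
print(g.group(0), g.group(1))

# re.search returns None when nothing matches, so check before calling .group()
missing = re.search(r"(?<=xyz)def", text)
print(missing is None)
```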

Basic string operations, such as splitting:
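To make the splitting operation concrete, here is a small sketch comparing the built-in str.split() method with re.split(). The sample string is made up for illustration.

```python
import re

line = "numpy, pandas;scipy  matplotlib"

# With no argument, str.split() splits on runs of whitespace only
by_space = line.split()
print(by_space)

# With an argument, str.split() splits on that exact separator
by_comma = "a,b,c".split(",")
print(by_comma)

# re.split() takes a pattern: here commas, semicolons, or whitespace runs
parts = re.split(r"[,;\s]+", line)
print(parts)
```

str.split() is enough for a single fixed delimiter; re.split() is the usual choice when several delimiters are mixed.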

dir(str)
## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
print(dir(str.split))
## ['__call__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__name__', '__ne__', '__new__', '__objclass__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__text_signature__']
import re
print(dir(re.split))
## ['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']
import sys
# module search path
print(sys.path)
# modules loaded in the current session
sys.modules.keys()
## dict_keys(['sys', 'builtins', '_frozen_importlib', '_imp', '_warnings', '_frozen_importlib_external',
## '_io', 'marshal', 'posix', '_thread', '_weakref', 'time', 'zipimport', '_codecs', 'codecs',
## 'encodings.aliases', 'encodings.cp437', 'encodings', 'encodings.utf_8', '_signal', '__main__',
## 'encodings.latin_1', '_abc', 'abc', 'io', '_stat', 'stat', '_collections_abc', 'genericpath',
## 'posixpath', 'os.path', 'os', '_sitebuiltins', 'site', 'readline', 'atexit', 'rlcompleter'])
pip3 install virtualenv
virtualenv -p python3 <desired-path>
source <desired-path>/bin/activate
source /opt/virtualenv/tensorflow/bin/activate
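After activating an environment with the commands above, it can be worth confirming from inside Python that the interpreter really belongs to it. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # venv and recent virtualenv point sys.prefix at the environment and keep
    # the system installation in sys.base_prefix; legacy virtualenv releases
    # set sys.real_prefix instead
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base or hasattr(sys, "real_prefix")

print(sys.executable)     # path of the running interpreter
print(in_virtualenv())
```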

The Python 3 modules used to build this book are:

pip3 list --format=columns
Package Version
absl-py 0.12.0
astunparse 1.6.3
cachetools 4.2.2
certifi 2021.5.30
chardet 4.0.0
cycler 0.10.0
flatbuffers 1.12
gast 0.4.0
google-auth 1.30.1
google-auth-oauthlib 0.4.4
google-pasta 0.2.0
graphviz 0.8.4
grpcio 1.34.1
h5py 3.1.0
idna 2.10
joblib 1.0.1
keras-nightly 2.5.0.dev2021032900
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
Markdown 3.3.4
matplotlib 3.4.2
mpmath 1.2.1
mxnet 1.8.0.post0
numpy 1.20.3
oauthlib 3.1.1
opt-einsum 3.3.0
pandas 1.2.4
Pillow 8.2.0
pip 20.0.2
pkg-resources 0.0.0
plotly 4.14.3
protobuf 3.17.2
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2021.1
requests 2.25.1
requests-oauthlib 1.3.0
retrying 1.3.3
rsa 4.7.2
scikit-learn 0.24.2
scipy 1.6.3
setuptools 44.0.0
six 1.15.0
sympy 1.8
tensorboard 2.5.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.5.0
tensorflow-estimator 2.5.0
termcolor 1.1.0
threadpoolctl 2.1.0
typing-extensions 3.7.4.3
urllib3 1.26.5
Werkzeug 2.0.1
wheel 0.36.2
wrapt 1.12.1
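The same version information can be queried from within Python. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the helper `installed_versions` is a hypothetical name, not part of any package.

```python
from importlib import metadata  # standard library since Python 3.8

def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print(installed_versions(["numpy", "pandas", "scikit-learn"]))
```

Returning None for a missing distribution, rather than raising, makes it easy to check a whole list of requirements in one call.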
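# Install pip and the Python virtual environment manager virtualenv
sudo dnf install -y python3-pip python3-virtualenv
# Record the virtual environment location in the shell configuration file
echo "export RETICULATE_PYTHON_ENV=$HOME/.virtualenvs/r-tensorflow" >> ~/.bashrc
source ~/.bashrc
# Create the virtual environment
virtualenv -p /usr/bin/python3 $RETICULATE_PYTHON_ENV
# Activate the virtual environment
source $RETICULATE_PYTHON_ENV/bin/activate
# Install numpy, matplotlib, and the other modules
pip install -r requirements.txt
# Export the installed module versions (overwrites requirements.txt)
pip freeze > requirements.txt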
import os
os.listdir('.git')
## ['HEAD', 'hooks', 'refs', 'branches', 'config', 'shallow', 'description', 'info', 'FETCH_HEAD', 'logs', 'index', 'objects']

Multiple code chunks share the same Python process, so the os module imported in the previous chunk is still available:

os.path
## <module 'posixpath' from '/usr/lib/python3.8/posixpath.py'>

Figures drawn with matplotlib support cross-references [1], as shown in Figure B.13.

import matplotlib.pyplot as plt
from matplotlib import rcParams
# for other configurable options, see rcParams.keys()
rcParams.update({'font.size': 10, 'text.usetex': True}) 
# rcParams.update({'font.family':     ['sans-serif'], 
#                  'font.monospace':  ['DejaVu Sans Mono'], 
#                  'font.sans-serif': ['DejaVu Sans'], 
#                  'font.serif':      ['DejaVu Serif']})
plt.switch_backend('agg')
plt.plot([0, 2, 1, 4])
## [<matplotlib.lines.Line2D object at 0x7f38d5d56c10>]
plt.xlabel(r'Coord $x$')
## Text(0.5, 0, 'Coord $x$')
plt.ylabel(r'Coord $y$')
## Text(0, 0.5, 'Coord $y$')
plt.tight_layout()
plt.show()
Figure B.13: A reproduced matplotlib example
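With the non-interactive Agg backend, a figure is usually written to a file or buffer rather than shown in a window. The sketch below is a variant of the plot above that assumes matplotlib is installed but no LaTeX toolchain, so it uses plain-text labels instead of `text.usetex`; the in-memory buffer is only for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 2, 1, 4])
ax.set_xlabel("Coord x")  # plain text, so no LaTeX installation is needed
ax.set_ylabel("Coord y")
fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
plt.close(fig)

png_bytes = buf.getvalue()
print(len(png_bytes) > 0)
```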

With the reticulate package, we can import any Python module we need into the R environment, enabling data exchange and function calls between R and Python [2].

os <- reticulate::import("os") # import a Python module
x <- os$listdir(".git") # call os.listdir()
x # the Python list returned is converted to an R character vector
##  [1] "HEAD"        "hooks"       "refs"        "branches"    "config"     
##  [6] "shallow"     "description" "info"        "FETCH_HEAD"  "logs"       
## [11] "index"       "objects"
# https://docs.bokeh.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart
from bokeh.plotting import figure, output_file, show
# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
# save the interactive plot as a standalone HTML file
output_file("lines.html")
# create a simple figure with a title and x- and y-axis labels
p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
# add a line with a legend label and a line width
p.line(x, y, legend_label="Temp.", line_width=2)
# show the result
show(p)

Embed the saved HTML file in R Markdown:

htmltools::includeHTML("lines.html")

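When R and Python work together, Python handles data processing and modeling, and R handles plotting. Some complex machine learning models and the related data manipulation have to be done in Python. Once the data have been cleaned into data frame form, they are imported into R to draw static or interactive graphics. This requires loading the reticulate package; setting python.reticulate = TRUE alone is not enough.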

Read the data with pandas, tidy it, pass it through the reticulate package to a data.frame object in the R environment, and then load ggplot2 to plot it.
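The Python half of that workflow might look like the following sketch. The records and column names are made up for illustration and stand in for data that would normally be read with `pd.read_csv()`.

```python
import pandas as pd

# Hypothetical raw records standing in for data read from a file
raw = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", None],
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 6.0],
})

# Tidy: drop rows without a species label, then average within each species
tidy = (
    raw.dropna(subset=["species"])
       .groupby("species", as_index=False)["sepal_length"]
       .mean()
)
print(tidy)
```

In an R Markdown document that uses reticulate, an object created in a Python chunk is reachable from R through the `py` object, so `py$tidy` would arrive as a data.frame ready for ggplot2.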

Take NumPy as an example:

import numpy as np
a = np.arange(15).reshape(3, 5)
a
## array([[ 0,  1,  2,  3,  4],
##        [ 5,  6,  7,  8,  9],
##        [10, 11, 12, 13, 14]])
a.shape
## (3, 5)
a.ndim
## 2
a.dtype.name
## 'int64'
a.itemsize
## 8
a.size
## 15
type(a)
## <class 'numpy.ndarray'>
b = np.array([6, 7, 8])
b
## array([6, 7, 8])
type(b)
## <class 'numpy.ndarray'>
a.transpose() @ b
## array([115, 136, 157, 178, 199])

The dot (.) used for attribute access in Python corresponds to $ in R:

library(reticulate)
np <- import("numpy", convert=FALSE) # import a Python module without automatic conversion to R objects
a <- np$arange(0, 15)$reshape(3L, 5L)
a
## [[ 0.  1.  2.  3.  4.]
##  [ 5.  6.  7.  8.  9.]
##  [10. 11. 12. 13. 14.]]
a$shape
## (3, 5)
a$ndim
## 2
a$dtype$name
## float64
a$itemsize
## 8
a$size
## 15
a$ctypes
## <numpy.core._internal._ctypes>
a$dtype # data type
## float64
a$astype
## <built-in method astype of numpy.ndarray>
builtins <- import_builtins() # Python built-in functions; no third-party module is needed
builtins$type(a)
## <class 'numpy.ndarray'>

Basic linear algebra operations:

a$transpose() # transpose
## [[ 0.  5. 10.]
##  [ 1.  6. 11.]
##  [ 2.  7. 12.]
##  [ 3.  8. 13.]
##  [ 4.  9. 14.]]
a$trace() # trace
## 18.0
np$eye(2L) # identity matrix
## [[1. 0.]
##  [0. 1.]]
a$diagonal() # main diagonal
## [ 0.  6. 12.]
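# elementwise multiplication of two arrays (not a matrix product)
b <- np$array(c(6, 7, 8, 9, 10))$reshape(5L, 1L)
b
## [[ 6.]
##  [ 7.]
##  [ 8.]
##  [ 9.]
##  [10.]]
b$shape
## (5, 1)
np$multiply(b$transpose(), a) # the transposed b is broadcast across the rows of a and multiplied elementwise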
## [[  0.   7.  16.  27.  40.]
##  [ 30.  42.  56.  72.  90.]
##  [ 60.  77.  96. 117. 140.]]
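The np$multiply() call above is an elementwise product with broadcasting, which is easy to confuse with a matrix product. The sketch below restates the same arrays in plain Python to show the difference; it assumes only that NumPy is installed.

```python
import numpy as np

a = np.arange(15, dtype=float).reshape(3, 5)
b = np.array([6, 7, 8, 9, 10], dtype=float).reshape(5, 1)

# Elementwise product: b.T has shape (1, 5) and is broadcast across the 3 rows of a
elementwise = np.multiply(b.T, a)
print(elementwise.shape)

# Matrix product: (3, 5) @ (5, 1) gives a (3, 1) column
product = a @ b
print(product)
```

Summing each row of the elementwise result reproduces the matrix product, which is a quick way to check which operation a given call performs.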

Convert a Python object to an R object:

py_to_r(b)
##      [,1]
## [1,]    6
## [2,]    7
## [3,]    8
## [4,]    9
## [5,]   10

References

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Python Machine Learning. n.d.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2021. Reticulate: Interface to Python. https://github.com/rstudio/reticulate.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.

  1. Earlier on, setting python.reticulate = TRUE in R Markdown to call the reticulate package had the side effect that cross-references were not supported; see https://d.cosx.org/d/420680-python-reticulate-true. RStudio 1.2 integrates reticulate well and supports Python much better; see https://blog.rstudio.com/2018/10/09/rstudio-1-2-preview-reticulated-python/. As of this writing (June 14, 2021), reticulate version 1.20 is used; earlier versions have not been tested.

  2. Zhu Junhui's (朱俊辉) post on using gluon in R: https://d.cosx.org/d/419785-r-gluon