Reading notes: The University of Tokyo Data Scientist Training Course (東京大学のデータサイエンティスト育成講座)

2023-03-11 08:47

2023-03-14 10:48

A collection of personal notes and code snippets on data science processing.

Notes on Jupyter Notebook

Magic commands

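# List all magic commands
%lsmagic

# Show the current directory
%pwd

# Show past execution history
%history

# Measure the average execution time
%timeit <statement>

# Set the number of decimal places shown for results
%precision 3
1 + 0.1 #=> 1.100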
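
Outside a notebook, a measurement roughly similar to %timeit can be sketched with the standard library's timeit module. The function sum_of_squares below is a hypothetical workload made up for this illustration, and the repeat counts are arbitrary assumptions.

```python
import timeit


def sum_of_squares(n):
    """Hypothetical workload used only for this illustration."""
    return sum(i * i for i in range(n))


# Run the call 100 times per measurement and take 5 measurements.
loops = 100
times = timeit.repeat(lambda: sum_of_squares(1000), number=loops, repeat=5)

# Each entry in times is the total for 100 calls, so divide to get a per-call time.
best_per_loop = min(times) / loops
print(f"best of 5: {best_per_loop * 1e6:.1f} microseconds per loop")
```

Note that %timeit picks the loop count automatically and reports a mean and standard deviation, so the numbers are not directly comparable.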

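Notes on NumPy

NumPy is a useful library for array processing and numerical computation. The official tutorial covers its basic usage.

Basic array operations

The array is the central data structure in NumPy. Rather than Python's built-in list, you create and operate on array objects (ndarray) defined by NumPy itself. The element type of an array is held in its dtype. Available dtypes include int/uint at 8 to 64 bits, float at 16 to 128 bits (float128 is platform-dependent), and bool.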

import numpy as np

ns = np.array([5,1,4,9,4,8,0])
ns.dtype #=> dtype('int64') (default integer type; platform-dependent)

ns.ndim #=> 1 (number of dimensions)
ns.size #=> 7 (number of elements)

ns.sum() #=> 31 (sum of elements)
ns.min() #=> 0 (minimum)
ns.max() #=> 9 (maximum)
np.average(ns) #=> 4.428571428571429 (mean)
np.median(ns) #=> 4.0 (median)

np.sort(ns) #=> array([0, 1, 4, 4, 5, 8, 9]) (sorted copy)
np.sort(ns)[::-1] #=> array([9, 8, 5, 4, 4, 1, 0]) (descending sort)

np.arange(1, 5) #=> array([1, 2, 3, 4]) (sequence from 1 to 4)
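
To make the dtype description above more concrete, here is a minimal sketch of choosing a dtype explicitly and converting between dtypes. The variable names and sample values are assumptions chosen for illustration only.

```python
import numpy as np

# Explicitly request a small integer type instead of the platform default.
small = np.array([100, 120, 127], dtype=np.int8)
print(small.dtype)     # int8
print(small.itemsize)  # 1 (bytes per element)

# astype returns a converted copy; the original array keeps its dtype.
as_float = small.astype(np.float64)
print(as_float.dtype)  # float64

# Fixed-width integer arrays wrap around silently when a result does not fit.
wrapped = small + np.int8(1)
print(wrapped.tolist())  # [101, 121, -128]
```

The wrap-around in the last step is one reason to pick a dtype wide enough for the expected value range.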

Basic matrix operations

# Create a 2x2 zero matrix
np.zeros((2, 2), dtype='int8')

# Create a 2x2 identity matrix
np.eye(2, dtype='int8')

# Matrix arithmetic
z = np.zeros((2, 2), dtype='int8')
u = np.eye(2, dtype='int8')

np.dot(z, u) #=> matrix product
np.add(z, u) #=> matrix sum
np.subtract(z, u) #=> matrix difference
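
One point that is easy to mix up with the operations above: the * operator on NumPy arrays multiplies element by element, while np.dot (or the @ operator) computes the matrix product. A small sketch with made-up sample matrices:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b     # multiplies matching entries
matrix_product = a @ b  # same result as np.dot(a, b) for 2-D arrays

print(elementwise.tolist())     # [[0, 2], [3, 0]]
print(matrix_product.tolist())  # [[2, 1], [4, 3]]
```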

Notes on SciPy

SciPy is a useful library for scientific and technical computing.

Linear algebra computations

import numpy as np
import scipy.linalg as linalg

# Inverse matrix
matrix = np.array([[2, -1], [4, -3]])
linalg.inv(matrix) #=> array([[ 1.5, -0.5], [ 2. , -1. ]])
np.dot(matrix, linalg.inv(matrix)) #=> array([[1., 0.], [0., 1.]])

# Eigenvalues and eigenvectors
val, vec = linalg.eig(matrix)
val #=> array([ 1.+0.j, -2.+0.j])
vec #=> array([[0.70710678, 0.24253563], [0.70710678, 0.9701425 ]])

# Determinant
linalg.det(matrix) #=> -2.000

# Trace (sum of the diagonal)
np.trace(matrix) #=> -1
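
As a sanity check on the results above, the following sketch verifies that each eigenpair satisfies A v = λ v, and solves a linear system with linalg.solve instead of forming the inverse. The right-hand side b is an arbitrary value assumed for illustration.

```python
import numpy as np
import scipy.linalg as linalg

a = np.array([[2, -1], [4, -3]])
vals, vecs = linalg.eig(a)

# Each column of vecs is the eigenvector paired with the eigenvalue at the same index.
for i in range(len(vals)):
    lhs = a @ vecs[:, i]
    rhs = vals[i] * vecs[:, i]
    print(np.allclose(lhs, rhs))  # True

# Solving A x = b directly is generally preferred over computing inv(A) @ b.
b = np.array([1, 1])
x = linalg.solve(a, b)
print(np.allclose(a @ x, b))  # True
```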

Notes on pandas

pandas is a useful library for data wrangling.

Basic objects

Series

An object that behaves like a one-dimensional array with an index.

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s[1] #=> 2

sb = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
sb['b'] #=> 2
sb.values #=> array([1, 2, 3, 4, 5])
sb.index #=> Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
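
Because a Series carries an index, lookups can be made either by label or by position, and arithmetic between two Series aligns on the labels. A minimal sketch follows; the second Series, other, is a made-up example.

```python
import pandas as pd

sb = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = sb.loc['b']    # label-based lookup
by_position = sb.iloc[1]  # position-based lookup
print(by_label, by_position)  # 2 2

# Arithmetic aligns on index labels, not on position.
# Labels missing from one side are treated as 0 because of fill_value.
other = pd.Series([10, 20], index=['b', 'a'])
total = sb.add(other, fill_value=0)
print(total)  # a: 1 + 20, b: 2 + 10, the rest unchanged
```

Using .loc and .iloc explicitly avoids ambiguity when the index itself consists of integers.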

DataFrame

A DataFrame is an object with a spreadsheet-like structure, with cells laid out in rows and columns.

from io import StringIO
import pandas as pd

csv = '''ID,Country,GivenName,FamilyName,Score
1,JP,Hiyori,Kono,77
2,JP,Yoshino,Aoyama,85
3,JP,Aika,Kobayashi,100'''
csvIO = StringIO(csv)

table = pd.read_csv(csvIO)
table

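Running the above produces a DataFrame that holds the data in the following structure:

   ID Country GivenName FamilyName  Score
0   1      JP    Hiyori       Kono     77
1   2      JP   Yoshino     Aoyama     85
2   3      JP      Aika  Kobayashi    100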

# Transpose the table (swap rows and columns)
table.T

# Filter
table[table['Score'] > 80] #=> extract rows with Score greater than 80

# Add data
newDF = pd.DataFrame({
    'ID': [4],
    'Country': ['JP'],
    'GivenName': ['Asuka'],
    'FamilyName': ['Shioiri'],
    'Score': [90]
}, index=[4])
res = pd.concat([table, newDF])
res

# Delete a row (by index label)
res.drop(2)

# Delete a column
res.drop(['Country'], axis=1)

# Sort data
res.Score.sort_values() #=> Score column sorted in ascending order

# Aggregate data
table['Score'].mean() #=> mean score
table['Score'].median() #=> median score
table['Score'].sum() #=> total score
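
Two details of the operations above are worth a quick illustration: sort_values can also be applied to the whole DataFrame so that entire rows are reordered, and drop returns a new object rather than modifying the original. The sketch below rebuilds the same sample table; the names ranked and without_first are assumptions for this example.

```python
from io import StringIO
import pandas as pd

csv = '''ID,Country,GivenName,FamilyName,Score
1,JP,Hiyori,Kono,77
2,JP,Yoshino,Aoyama,85
3,JP,Aika,Kobayashi,100'''
table = pd.read_csv(StringIO(csv))

# Sort whole rows by Score, highest first.
ranked = table.sort_values(by='Score', ascending=False)
print(ranked['GivenName'].tolist())  # ['Aika', 'Yoshino', 'Hiyori']

# drop returns a new DataFrame; table itself still has all three rows.
without_first = table.drop(0)
print(len(table), len(without_first))  # 3 2
```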

Notes on Matplotlib

Notes on using Matplotlib, which is useful for visualizing data.

Setup

Install scipy, numpy, matplotlib, and japanize_matplotlib beforehand.

$ pip install scipy numpy matplotlib japanize_matplotlib

Plotting function graphs

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from scipy.special import gamma
import japanize_matplotlib

x = np.arange(2, 30, 0.01)
plt.figure(figsize=(15, 10))
plt.title('増加率の比較')  # "Comparison of growth rates"; Japanese text is rendered via japanize_matplotlib
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(2, 30)
plt.ylim(0, 35)

plt.plot(x, x * 0 + 1, label='1')
plt.plot(x, np.log(np.log(x)), label=r'$\log \log x$')
plt.plot(x, np.log(x), label=r'$\log x$')
plt.plot(x, x, label='$x$')
plt.plot(x, x * np.log(x), label=r'$x \log x$')
plt.plot(x, x ** 2, label='$x^2$')
plt.plot(x, gamma(x + 1), label='$x!$')  # gamma(x + 1) equals x! for non-negative integers

plt.grid(True)
_ = plt.legend()
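
The factorial curve relies on the gamma function, which extends the factorial to non-integer arguments through the identity Γ(n + 1) = n!. A short check of that identity for a few small integers (the range 1 to 5 is an arbitrary choice for illustration):

```python
import math

import numpy as np
from scipy.special import gamma

n = np.arange(1, 6)

# gamma(n + 1) should match n! for non-negative integers.
from_gamma = gamma(n + 1)
from_factorial = np.array([math.factorial(int(k)) for k in n], dtype=float)

print(np.allclose(from_gamma, from_factorial))  # True
```

Note that gamma(n) alone gives (n - 1)!, so the argument has to be shifted by one to plot x!.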

Plotting a histogram

Pyplot Text

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 100, 15
np.random.seed(0)
x = mu + sigma * np.random.randn(10000)
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
plt.xlim(40, 160)
plt.ylim(0, 0.03)
plt.grid(True)
plt.show()
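
The density=True argument normalizes the histogram so that the bar areas sum to 1, which is why the y-axis tops out near 0.03 rather than showing raw counts. This can be checked without plotting by using np.histogram, as sketched below. Using np.random.default_rng here is a substitution made for this example, so the random sample differs from the one above.

```python
import numpy as np

mu, sigma = 100, 15
rng = np.random.default_rng(0)
x = mu + sigma * rng.standard_normal(10000)

# density holds the bar heights; edges holds the 51 bin boundaries.
density, edges = np.histogram(x, bins=50, density=True)

# Area of each bar is height times width; the total should be 1.
area = np.sum(density * np.diff(edges))
print(np.isclose(area, 1.0))  # True
```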

Plotting a scatter plot

%matplotlib inline

import matplotlib.pyplot as plt

data = {
    -2: -4.5,
    -1.5: 3.2,
    -1: -0.5,
    0: 0.5,
    1: 2,
    2: -1.8,
}

plt.figure(figsize=(15, 10))
plt.scatter(data.keys(), data.values())
plt.grid(True)