import matplotlib.pyplot as plt
import numpy as np

# make data
np.random.seed(1)
x = 4 + np.random.normal(0, 1.5, 200)
# Draw the histogram
plt.hist(x)
plt.show()

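2. Seaborn

Seaborn is a visualization library built on top of matplotlib. Its main strength is that it can draw complex, good-looking charts with concise code.

3. Plotly

Plotly is an open-source, interactive, browser-based graphing library for Python. Its main strength is interactive charts: it offers more than 30 chart types, including some that most libraries lack, such as contour plots, dendrograms, and 3D charts.

Commonly used visualization charts

An effective chart should look like this:

Selva Prabhakaran

The rest of this article systematically summarizes the most useful charts in data visualization. Grouped by the purpose of the visualization, they fall into 7 groups: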

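I. Correlation

  1. Scatter plot
  2. Bubble plot
  3. Scatter plot with line of best fit
  4. Jittering with stripplot
  5. Counts plot
  6. Marginal histogram
  7. Marginal boxplot
  8. Correlation heatmap
  9. Pairwise plot

II. Deviation

  1. Diverging bars
  2. Diverging texts
  3. Diverging dot plot
  4. Diverging lollipop chart with markers
  5. Area chart

III. Ranking

  1. Ordered bar chart
  2. Lollipop chart
  3. Dot plot
  4. Slope chart
  5. Dumbbell plot

IV. Distribution

  1. Histogram for a continuous variable
  2. Histogram for a categorical variable
  3. Density plot
  4. Density curves with histogram
  5. Joy plot (overlapping density curves)
  6. Distributed dot plot
  7. Box plot
  8. Dot + box plot
  9. Violin plot
  10. Population pyramid
  11. Categorical plots

V. Composition

  1. Waffle chart
  2. Pie chart
  3. Treemap
  4. Bar chart

VI. Change

  1. Time series plot
  2. Time series with annotated peaks and troughs
  3. Autocorrelation plot
  4. Cross-correlation plot
  5. Time series decomposition plot
  6. Multiple time series
  7. Dual-axis plot
  8. Time series with error bands
  9. Stacked area chart
  10. Unstacked area chart
  11. Calendar heat map
  12. Seasonal plot

VII. Groups

  1. Dendrogram
  2. Cluster plot
  3. Andrews curves
  4. Parallel coordinates

The code in this section uses matplotlib for its examples. You can also pick any other visualization library, such as seaborn or plotly, to produce the same charts. The datasets used can be downloaded at the end of the article.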

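I. Correlation

Correlation plots are used to visualize the relationship between two or more variables, that is, how one variable changes with respect to another.

1. Scatter plot

The scatter plot is the classic, fundamental plot for studying the relationship between two variables. If the data contains multiple groups, you may want to draw each group in a different color. In matplotlib, you can do this conveniently with plt.scatter().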

import pandas as pd

# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i / float(len(categories) - 1)) for i in range(len(categories))]

# Draw Plot for Each Category
plt.figure(figsize=(16,10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    # Wrap the RGBA tuple in a list so it is read as one color, not four values
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category == category, :],
                s=20, c=[colors[i]], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0,0.1), ylim=(0,90000),
xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()

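2. Bubble plot

Sometimes you want to show a group of points inside a boundary to emphasize their importance. In this example, you take the records to be encircled from the dataframe and pass them to the encircle() function defined in the code below.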

from matplotlib import patches
from scipy.spatial import ConvexHull
import seaborn as sns
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")

# Step 1: Prepare Data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors =[plt.cm.tab10(i/float(len(categories)-1))for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16,10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category == category, :], s='dot_size', c=[colors[i]], label=str(category), edgecolors='black', linewidths=.5)

# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x, y, ax=None, **kw):
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    # The hull vertices are the outermost points, listed in order around the boundary
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN',:]

# Draw polygon surrounding vertices
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0,0.1), ylim=(0,90000),
xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()
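
To make the encircling step less of a black box, here is a minimal pure-Python sketch of what a convex hull is. It uses Andrew's monotone chain algorithm, which is not the method scipy's ConvexHull relies on; it is swapped in only because it is short enough to read in one go. The function name convex_hull and the sample points are made up for illustration.

```python
def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the path o -> a -> b turns counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        # Drop points that would make the lower boundary bend inward
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        # Same idea for the upper boundary, walking right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]


# Hypothetical points: four corners of a square plus two interior points
points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(points)
print(hull)
```

Only the corner points survive, which is why the polygon drawn by encircle() wraps the whole group without passing through its interior points.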

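3. Scatter plot with line of best fit

If you want to understand how two variables change with respect to each other, the line of best fit is the way to go. The plot below shows how the line of best fit differs among the groups in the data. To disable the grouping and draw a single line of best fit for the whole dataset, remove the hue='cyl' parameter from the sns.lmplot() call below.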

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]),:]

# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
height=7, aspect=1.6, robust=True, palette='tab10',
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5,7.5), ylim=(0,50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()

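Each regression line in its own column

Alternatively, you can show the line of best fit for each group in its own column. To do this, set the col=groupingcolumn parameter in sns.lmplot().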

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]),:]

# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
data=df_select,
height=7,
robust=True,
palette='Set1',
col="cyl",
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5,7.5), ylim=(0,50))
plt.show()

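4. Jittering with stripplot

Often several data points share exactly the same X and Y values. As a result, they are plotted on top of each other and some of them are hidden. To avoid this, jitter the points slightly so that you can see all of them. Seaborn's stripplot() makes this convenient.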

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
# Recent seaborn releases require x and y to be passed as keywords
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
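
The idea behind jittering can be sketched in a few lines of plain Python. This is an illustration only, assuming uniform noise of at most ±amount added to each position; seaborn's exact noise model may differ. The helper name jitter and the sample values are hypothetical.

```python
import random


def jitter(values, amount=0.25, seed=1):
    """Shift each value by a small random offset in [-amount, +amount]."""
    rng = random.Random(seed)  # a fixed seed keeps the plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical city-mileage readings with repeated values
cty = [18, 18, 18, 21, 21]
jittered = jitter(cty)

for original, moved in zip(cty, jittered):
    print(original, round(moved, 3))
```

The three identical readings of 18 now land at three slightly different positions, so each one gets its own visible marker.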

5. Counts plot

Another way to avoid overlapping points is to scale each dot by the number of points that fall on that spot. The larger the dot, the more points are concentrated there.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy','cty']).size().reset_index(name='counts')

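# Draw the points, sizing each marker by its count
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
# Recent seaborn releases expect a scalar `size` in stripplot, so per-point sizes go through ax.scatter
ax.scatter(df_counts.cty, df_counts.hwy, s=df_counts.counts * 20)
ax.set(xlabel='cty', ylabel='hwy')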

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
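
The groupby(...).size() step is the heart of this chart: it counts how many rows share each (cty, hwy) pair. A minimal sketch of the same counting with the standard library, using made-up pairs, might look like this:

```python
from collections import Counter

# Hypothetical (cty, hwy) readings; several cars share the same pair
pairs = [(18, 29), (18, 29), (21, 29), (18, 29), (21, 29), (16, 26)]

# Count how many points fall on each exact position
counts = Counter(pairs)

# Scale the marker size with the count, as the plot above does
sizes = {pair: n * 2 for pair, n in counts.items()}

for pair, n in counts.most_common():
    print(pair, "count:", n, "marker size:", sizes[pair])
```

Each distinct position is drawn once, and its marker size carries the information that overplotting would otherwise hide.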

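6. Marginal histogram

A marginal histogram adds histograms along the X and Y axes. It is used to visualize the relationship between X and Y together with the univariate distribution of each. This plot is often used in exploratory data analysis (EDA).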

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16,10), dpi=80)
grid = plt.GridSpec(4,4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1,:-1])
ax_right = fig.add_subplot(grid[:-1,-1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1,0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ','hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# histogram in the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()

7. Marginal boxplot

The marginal boxplot serves a similar purpose to the marginal histogram. However, the boxplot helps pinpoint the median and the 25th and 75th percentiles of X and Y.

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16,10), dpi=80)
grid = plt.GridSpec(4,4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1,:-1])
ax_right = fig.add_subplot(grid[:-1,-1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1,0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ','hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a boxplot to each margin
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")

# Decorations ------------------
# Remove x axis name for the boxplot
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()

8. Correlation heatmap

A correlogram lets you see at a glance the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).

# Import Dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

# Plot
plt.figure(figsize=(12,10), dpi=80)
# numeric_only=True skips text columns such as the car names (pandas 1.5+)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
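
Each cell of the heatmap is a Pearson correlation coefficient between two columns. As a rough sketch of what that number means, here is the formula written out in plain Python for three tiny made-up columns (the names a, b, c and the helper pearson are hypothetical):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns: b rises with a, c falls as a rises
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}

# Build the full matrix, one value per pair of columns
matrix = {r: {c: round(pearson(columns[r], columns[c]), 3) for c in columns}
          for r in columns}

for name, row in matrix.items():
    print(name, row)
```

Values near +1 map to one end of the diverging colormap, values near -1 to the other, and center=0 keeps uncorrelated pairs in the neutral middle.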

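9. Pairwise plot

The pairwise plot is a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables. It is a must-have tool for bivariate analysis.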

# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10,8), dpi=80)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

The same pairwise plot with a regression line fitted to each pair:

# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10,8), dpi=80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()

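II. Deviation

10. Diverging bars

If you want to see how items vary on a single metric and visualize the order and size of that variation, diverging bars are a great tool. They help you quickly tell apart the performance of groups in your data, and they are intuitive enough to convey the point immediately.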

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
# Standardize mileage: distance from the mean in units of standard deviation
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
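
The data preparation in this chart is a z-score: each value is expressed as its distance from the mean in units of standard deviation, and the sign decides the color. A minimal sketch with three made-up cars (the names and mileages are hypothetical) shows the arithmetic:

```python
from statistics import mean, stdev

# Hypothetical mileage figures
mpg = {"car_a": 10.0, "car_b": 20.0, "car_c": 30.0}

values = list(mpg.values())
mu = mean(values)      # 20.0
sigma = stdev(values)  # sample standard deviation, like pandas' default .std()

# z-score per car, then sort from most below average to most above
z_scores = {name: (v - mu) / sigma for name, v in mpg.items()}
ranked = sorted(z_scores.items(), key=lambda item: item[1])

# Below-average bars are red, the rest green
colors = ["red" if z < 0 else "green" for _, z in ranked]

for (name, z), color in zip(ranked, colors):
    print(name, round(z, 2), color)
```

Because the scores are centered on zero, the bars naturally diverge to the left and right of the axis.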

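11. Diverging texts

Diverging texts are similar to diverging bars. They are the better choice when you want to show the value of each item inside the chart in a clean, presentable way.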

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,14), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    # Put the label outside the end of each bar, colored by sign
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color': 'red' if x < 0 else 'green', 'size': 14})

# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5,2.5)
plt.show()

12. Diverging dot plot

The diverging dot plot is also similar to diverging bars. However, without the bars, the contrast and disparity between the groups are less pronounced.

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if z < 0 else 'darkgreen' for z in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,16), dpi=80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    # Print the rounded z-score inside each dot
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color': 'white'})

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5,2.5)
plt.show()

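13. Diverging lollipop chart with markers

A lollipop chart with markers is a flexible way to visualize divergence: you can put the emphasis on any significant data points you want to draw attention to and explain the reasoning directly in the chart.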

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:,['mpg']]
df['mpg_z']=(x - x.mean())/x.std()
df['colors']='black'

# color fiat differently
df.loc[df.cars =='Fiat X1-9','colors']='darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)


# Draw plot
import matplotlib.patches as patches

plt.figure(figsize=(14,16), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)

# Annotate
plt.annotate('Mercedes Models', xy=(0.0,11.0), xytext=(1.0,11), xycoords='data',
fontsize=15, ha='center', va='center',
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')

# Add Patches
p1 = patches.Rectangle((-2.0,-1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5,27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)

# Decorate
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()

14. Area chart

By coloring the area between the axis and the line, the area chart emphasizes not only the peaks and troughs but also how long the highs and lows last. The longer a high lasts, the larger the area under the line.

import numpy as np
import pandas as pd

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
y_returns =(df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0)*100

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.fill_between(x[1:], y_returns[1:],0, where=y_returns[1:]>=0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:],0, where=y_returns[1:]<=0, facecolor='red', interpolate=True, alpha=0.7)

# Annotate
plt.annotate('Peak \n1975', xy=(94.0,21.0), xytext=(88.0,28),
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')


# Decorations
xtickvals =[str(m)[:3].upper()+"-"+str(y)for y,m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment':'center','verticalalignment':'center_baseline'})
plt.ylim(-35,35)
plt.xlim(1,100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
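
The y_returns series above is a month-over-month percentage change, and the two fill_between calls split it by sign. A plain-Python sketch of the same steps, using a made-up savings-rate series (the helper name pct_change is hypothetical), might look like this:

```python
def pct_change(values):
    """Percentage change from the previous value; the first entry has no predecessor."""
    changes = [0.0]
    for prev, cur in zip(values, values[1:]):
        changes.append((cur - prev) / prev * 100)
    return changes


# Hypothetical monthly savings rates
rates = [10.0, 11.0, 9.9, 9.9]
returns = pct_change(rates)

# Split by sign, mirroring the green (>= 0) and red (<= 0) fills
gains = [r if r >= 0 else 0.0 for r in returns]
losses = [r if r <= 0 else 0.0 for r in returns]

print([round(r, 2) for r in returns])
print([round(g, 2) for g in gains])
print([round(l, 2) for l in losses])
```

Months with no change contribute zero height to both fills, so flat stretches show up as gaps between the colored areas.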

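III. Ranking

15. Ordered bar chart

An ordered bar chart conveys the rank order of the items effectively. Adding the value of the metric above each bar also lets the reader get precise figures from the chart itself. It is a classic way to visualize items by count or by any given metric.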

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Mean city mileage per manufacturer (selecting 'cty' first keeps text columns out of the mean)
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate Text
for i, cty in enumerate(df.cty):
    ax.text(i, cty + 0.5, round(cty, 1), horizontalalignment='center')


# Title, Label, Ticks and Ylim
ax.set_title('Bar Chart for City Mileage', fontdict={'size':22})
ax.set(ylabel='Miles Per Gallon', ylim=(0,30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the X axis labels
p1 = patches.Rectangle((.57,-0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124,-0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
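
The ranking comes entirely from the data preparation: average the metric per group, then sort. A small standard-library sketch of that step, with made-up manufacturers and mileages (the helper name ranked_means is hypothetical), could be:

```python
from collections import defaultdict


def ranked_means(rows):
    """Average the value per group and return (group, mean) pairs in ascending order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return sorted(means.items(), key=lambda item: item[1])


# Hypothetical (manufacturer, city mileage) rows
rows = [("audi", 18), ("audi", 20), ("honda", 24), ("honda", 28), ("jeep", 13)]
ranked = ranked_means(rows)

for position, (make, avg) in enumerate(ranked):
    print(position, make, avg)
```

The position in the sorted list becomes the x coordinate of each bar, which is what reset_index() provides in the pandas version.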

16. Lollipop chart

The lollipop chart serves a similar purpose to the ordered bar chart, in a visually pleasing way.

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Lollipop Chart for City Mileage', fontdict={'size':22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment':'right','size':12})
ax.set_ylim(0,30)

# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty + .5, s=round(row.cty, 2), horizontalalignment='center', verticalalignment='bottom', fontsize=14)

plt.show()

17. Dot plot

The dot plot conveys the rank order of the items. Because the dots are aligned along the horizontal axis, it is also easier to see how far apart the points are.

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment':'right'})
ax.set_xlim(10,27)
plt.show()

18. Slope chart

The slope chart is best suited to comparing the "before" and "after" positions of a given person or item.

import matplotlib.lines as mlines
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")

left_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1 - y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# draw line
# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941
def newline(p1, p2, color='black'):
    ax = plt.gca()
    # Red when the value fell between the two years, green otherwise
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='red' if p1[1] - p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1,1,figsize=(14,14), dpi=80)

# Vertical Lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1 - 0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3 + 0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})

# 'Before' and 'After' Annotations
ax.text(1-0.05,13000,'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18,'weight':700})
ax.text(3+0.05,13000,'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18,'weight':700})

# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})
ax.set(xlim=(0,4), ylim=(0,14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1,3])
ax.set_xticklabels(["1952","1957"])
plt.yticks(np.arange(500,13000,2000), fontsize=12)

# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()

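19. Dumbbell plot

The dumbbell plot conveys the "before" and "after" positions of various items along with their rank order. It is very useful when you want to visualize the effect of a particular project or initiative on different objects.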

import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)

# Func to draw line segment
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and Axes
fig, ax = plt.subplots(1,1,figsize=(14,14), facecolor='#f7f7f7', dpi=80)

# Vertical Lines
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line Segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decoration
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})
ax.set(xlim=(0,.25), ylim=(-1,27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05,.1,.15,.20])
# Labels match the tick positions set on the line above
ax.set_xticklabels(['5%','10%','15%','20%'])
plt.show()

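IV. Distribution

20. Histogram for a continuous variable

A histogram shows the frequency distribution of a given variable. The version below groups the frequency bars by a categorical variable, which gives more insight into the continuous variable and the categorical variable together.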

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var ='displ'
groupby_var ='class'
df_agg = df.loc[:,[x_var, groupby_var]].groupby(groupby_var)
vals =[df[x_var].values.tolist()for i, df in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors =[plt.cm.Spectral(i/float(len(vals)-1))for i in range(len(vals))]
n, bins, patches = plt.hist(vals,30, stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0,25)
plt.xticks(ticks=bins[::3], labels=[round(b,1)for b in bins[::3]])
plt.show()

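21. Histogram for a categorical variable

The histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can visualize the distribution in relation to another categorical variable represented by the colors.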

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var ='manufacturer'
groupby_var ='class'
df_agg = df.loc[:,[x_var, groupby_var]].groupby(groupby_var)
vals =[df[x_var].values.tolist()for i, df in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors =[plt.cm.Spectral(i/float(len(vals)-1))for i in range(len(vals))]
n, bins, patches = plt.hist(vals, df[x_var].unique().__len__(), stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0,40)
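# bins holds one more edge than there are categories, so drop the last edge to match the labels
plt.xticks(ticks=bins[:-1], labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')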
plt.show()

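22. Density plot

Density plots are a commonly used tool for visualizing the distribution of a continuous variable. By grouping them by the "response" variable, you can inspect the relationship between X and Y. The case below is for illustration only; it shows how the distribution of city mileage varies with the number of cylinders.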

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
# fill=True replaces the deprecated shade=True (seaborn 0.11+)
sns.kdeplot(df.loc[df['cyl']==4,"cty"], fill=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==5,"cty"], fill=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==6,"cty"], fill=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl']==8,"cty"], fill=True, color="orange", label="Cyl=8", alpha=.7)

# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
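
A density curve of this kind is a kernel density estimate: a small bell curve is placed on every observation and the curves are averaged. The sketch below writes that out by hand with a Gaussian kernel and a hand-picked bandwidth of 1.5. Both the sample values and the bandwidth are assumptions for illustration; seaborn chooses a bandwidth automatically.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates the kernel density estimate at x."""
    n = len(samples)

    def density(x):
        # Sum one Gaussian bump per observation, then normalize
        total = sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return total / (n * bandwidth * sqrt(2 * pi))

    return density


# Hypothetical city-mileage readings
samples = [18, 21, 20, 21, 16, 18, 18]
density = gaussian_kde(samples, bandwidth=1.5)

# Evaluate on a grid from 10 to 30 and check the curve integrates to about 1
step = 0.1
grid = [10 + step * i for i in range(201)]
area = sum(density(x) for x in grid) * step

print(round(density(19), 4), round(density(12), 4), round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths the groups into broad hills.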

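23. Density curves with histogram

A density curve with a histogram brings together the information conveyed by both plots, so you can show them in a single figure instead of two.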

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
# histplot with kde=True replaces the deprecated sns.distplot (seaborn 0.11+)
sns.histplot(df.loc[df['class']=='compact',"cty"], color="dodgerblue", label="Compact", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
sns.histplot(df.loc[df['class']=='suv',"cty"], color="orange", label="SUV", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
sns.histplot(df.loc[df['class']=='minivan',"cty"], color="g", label="minivan", kde=True, stat="density", alpha=.7, line_kws={'linewidth':3})
plt.ylim(0,0.35)

# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()

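24. Joy plot

A joy plot lets the density curves of different groups overlap, which is a great way to visualize the distribution of a large number of groups relative to each other. It is pleasing to the eye and conveys just the right information clearly. It can be built with the joypy package, which is based on matplotlib.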

# !pip install joypy
import joypy

# Import Data
mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
fig, axes = joypy.joyplot(mpg, column=['hwy','cty'], by="class", ylim='own', figsize=(14,10))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()

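25. Distributed dot plot

The distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the more data points are concentrated in that region. Coloring the median differently makes the real position of each group apparent at once.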

import matplotlib.patches as mpatches

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
cyl_colors ={4:'tab:red',5:'tab:green',6:'tab:blue',8:'tab:orange'}
df_raw['cyl_color']= df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw.groupby('manufacturer')[['cty']].median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make,:]
    # Semi-transparent dots: overlapping observations look darker
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make,:], s=75, c='firebrick')

# Annotate
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size':12}, color='firebrick')

# Decorations
red_patch = plt.plot([],[], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment':'right'}, alpha=0.7)
ax.set_xlim(1,40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()

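26. Box plot

Box plots are a great way to visualize a distribution while keeping the median, the 25th and 75th percentiles, and the outliers in view. However, be careful when interpreting the size of the boxes, because it can give a misleading impression of how many points the group contains. Printing the number of observations in each box manually helps overcome this drawback.

For example, the first two boxes on the left are the same size even though they hold 5 and 47 observations respectively. That is why it is necessary to write down the number of observations in each group.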

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, notch=False)

# Add N Obs inside boxplot (optional)
def add_n_obs(df, group_col, y):
    grouped = df.groupby(group_col)[y]
    medians, n_obs = grouped.median(), grouped.size()
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    # Look up each count by its tick label so labels and counts cannot fall out of order
    for x, xticklabel in enumerate(xticklabels):
        plt.text(x, medians[xticklabel]*1.01, "#obs : "+str(n_obs[xticklabel]), horizontalalignment='center', fontdict={'size':14}, color='white')

add_n_obs(df,group_col='class',y='hwy')

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.ylim(10,40)
plt.show()
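
To see why the box size says nothing about the number of observations, it helps to compute the box statistics by hand. The sketch below uses the standard library's quantiles() on two made-up groups of different sizes; the group names, the values, and the helper box_stats are hypothetical.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles that define the box, plus the number of observations."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"n": len(values), "q1": q1, "median": median, "q3": q3}


# Hypothetical highway mileage for two vehicle classes of different sizes
groups = {
    "2seater": [23, 24, 25, 26, 26],
    "compact": [23, 25, 26, 27, 27, 28, 29, 29, 33],
}
stats = {name: box_stats(values) for name, values in groups.items()}

for name, s in stats.items():
    print(name, s, "box height:", s["q3"] - s["q1"])
```

The box height depends only on the quartiles, so a group of 5 and a group of 9 (or 47) can produce boxes of similar size; only the printed count reveals the difference.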

27. Dot + box plot

The dot + box plot conveys similar information to a grouped boxplot. In addition, the dots give a sense of how many data points lie within each group.

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

for i in range(len(df['class'].unique()) - 1):
    plt.vlines(i + .5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.legend(title='Cylinders')
plt.show()

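28. Violin plot

The violin plot is a visually pleasing alternative to the box plot. The shape or area of the violin depends on the number of observations it holds. However, violin plots can be harder to read and are not commonly used in professional settings.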

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

# Decoration
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.show()

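29. Population pyramid

A population pyramid can be used to show the distribution of groups ordered by volume. It can also show the stage-by-stage filtering of a population, as below, where it shows how many people pass through each stage of a marketing funnel.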

# Read data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
group_col ='Gender'
order_of_bars = df.Stage.unique()[::-1]
colors =[plt.cm.Spectral(i/float(len(df[group_col].unique())-1))for i in range(len(df[group_col].unique()))]

for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group,:], order=order_of_bars, color=c, label=group)

# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()

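30. Categorical plots

The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.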

# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8,
                palette='tab20')

# catplot returns a FacetGrid; the title goes on its figure
g.figure.suptitle('sf')
plt.show()

# Load Dataset
titanic = sns.load_dataset("titanic")

# Plot
sns.catplot(x="age", y="embark_town",
hue="sex", col="class",
data=titanic[titanic.embark_town.notnull()],
orient="h", height=5, aspect=1, palette="tab10",
kind="violin", dodge=True, cut=0, bw=.2)

V. Composition

31. Waffle chart

A waffle chart can be created with the pywaffle package. It is used to show the composition of groups within a larger population.

#! pip install pywaffle
# Reference: https://stackoverflow.com/questions/41400136/how-to-do-waffle-charts-in-python-square-piechart
from pywaffle import Waffle

# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')
n_categories = df.shape[0]
colors =[plt.cm.inferno_r(i/float(n_categories))for i in range(n_categories)]

# Draw Plot and Decorate
fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '111': {
            'values': df['counts'],
            # index=False so that n[0] is the class name and n[1] is the count
            'labels': ["{0} ({1})".format(n[0], n[1]) for n in df[['class', 'counts']].itertuples(index=False)],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12},
            'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize': 18}
        },
    },
    rows=7,
    colors=colors,
    figsize=(16, 9)
)
#! pip install pywaffle
from pywaffle import Waffle

# Import
# df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
# By Class Data
df_class = df_raw.groupby('class').size().reset_index(name='counts_class')
n_categories = df_class.shape[0]
colors_class =[plt.cm.Set3(i/float(n_categories))for i in range(n_categories)]

# By Cylinders Data
df_cyl = df_raw.groupby('cyl').size().reset_index(name='counts_cyl')
n_categories = df_cyl.shape[0]
colors_cyl =[plt.cm.Spectral(i/float(n_categories))for i in range(n_categories)]

# By Make Data
df_make = df_raw.groupby('manufacturer').size().reset_index(name='counts_make')
n_categories = df_make.shape[0]
colors_make =[plt.cm.tab20b(i/float(n_categories))for i in range(n_categories)]


# Draw Plot and Decorate
fig = plt.figure(
FigureClass=Waffle,
plots={
'311':{
'values': df_class['counts_class'],
'labels':["{1}".format(n[0], n[1])for n in df_class[['class','counts_class']].itertuples()],
'legend':{'loc':'upper left','bbox_to_anchor':(1.05,1),'fontsize':12,'title':'Class'},
'title':{'label':'# Vehicles by Class','loc':'center','fontsize':18},
'colors': colors_class
},
'312':{
'values': df_cyl['counts_cyl'],
'labels':["{1}".format(n[0], n[1])for n in df_cyl[['cyl','counts_cyl']].itertuples()],
'legend':{'loc':'upper left','bbox_to_anchor':(1.05,1),'fontsize':12,'title':'Cyl'},
'title':{'label':'# Vehicles by Cyl','loc':'center','fontsize':18},
'colors': colors_cyl
},
'313':{
'values': df_make['counts_make'],
'labels':["{1}".format(n[0], n[1])for n in df_make[['manufacturer','counts_make']].itertuples()],
'legend':{'loc':'upper left','bbox_to_anchor':(1.05,1),'fontsize':12,'title':'Manufacturer'},
'title':{'label':'# Vehicles by Make','loc':'center','fontsize':18},
'colors': colors_make
}
},
rows=9,
figsize=(16,14)
)
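
A waffle chart has a fixed number of squares, so each category's count has to be converted into a whole number of squares. One common way to do that is the largest-remainder method, sketched below. The `allocate_cells` helper and the counts are hypothetical; this is not a description of how pywaffle allocates blocks internally.

```python
import math

def allocate_cells(counts, total_cells):
    """Split `total_cells` squares across categories in proportion to `counts`
    using the largest-remainder method (hypothetical helper)."""
    total = sum(counts.values())
    exact = {k: v * total_cells / total for k, v in counts.items()}
    cells = {k: math.floor(x) for k, x in exact.items()}
    leftover = total_cells - sum(cells.values())
    # Hand the remaining squares to the categories with the largest fractional parts
    by_remainder = sorted(exact, key=lambda k: exact[k] - cells[k], reverse=True)
    for k in by_remainder[:leftover]:
        cells[k] += 1
    return cells

# Made-up class counts (assumption)
counts = {"compact": 47, "midsize": 41, "suv": 62, "pickup": 33}
cells = allocate_cells(counts, total_cells=70)   # e.g. a 7 x 10 grid
```

Rounding each share independently can leave the grid one square short or one over; allocating the leftover squares by remainder keeps the total exact.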

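32. Pie Chart

A pie chart is a classic way to show the composition of groups. However, it is generally not recommended nowadays because the areas of the pie slices can be misleading. So if you do use a pie chart, it is strongly recommended to write down the percentage or count for each slice explicitly.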

# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size()

# Make the plot with pandas
df.plot(kind='pie', subplots=True, figsize=(8,8), dpi=80)
plt.title("Pie Chart of Vehicle Class - Bad")
plt.ylabel("")
plt.show()
# Import
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')

# Draw Plot
fig, ax = plt.subplots(figsize=(12,7), subplot_kw=dict(aspect="equal"), dpi=80)

data = df['counts']
categories = df['class']
explode =[0,0,0,0,0,0.1,0]

def func(pct, allvals):
    # round() rather than truncation, so float round-off cannot drop a count by one
    absolute = int(round(pct/100.*np.sum(allvals)))
    return "{:.1f}% ({:d})".format(pct, absolute)

wedges, texts, autotexts = ax.pie(data,
autopct=lambda pct: func(pct, data),
textprops=dict(color="w"),
colors=plt.cm.Dark2.colors,
startangle=140,
explode=explode)

# Decoration
ax.legend(wedges, categories, title="Vehicle Class", loc="center left", bbox_to_anchor=(1,0,0.5,1))
plt.setp(autotexts, size=10, weight=700)
ax.set_title("Class of Vehicles: Pie Chart")
plt.show()
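
The autopct callback receives each wedge's percentage and has to turn it back into a count for the label. The sketch below illustrates that conversion outside of matplotlib. It rounds the product, which guards against floating-point round-off leaving the value slightly below the true count. The counts and the `label` helper are made up for illustration.

```python
import numpy as np

counts = np.array([5, 47, 41, 11, 33, 35, 62])   # made-up class counts (assumption)
total = counts.sum()

def label(pct, total):
    """Format a wedge label from its percentage (hypothetical helper)."""
    absolute = int(round(pct / 100.0 * total))
    return "{:.1f}% ({:d})".format(pct, absolute)

# Recreate the percentages that a pie chart would pass to the callback
percentages = 100.0 * counts / total
labels = [label(p, total) for p in percentages]
recovered = [int(round(p / 100.0 * total)) for p in percentages]
```

Here `recovered` matches the original counts, which is a cheap sanity check to run before trusting the labels on a chart.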

33. Treemap

A treemap is similar to a pie chart, but it does the job better without misleading the reader about each group's contribution.

# pip install squarify
import squarify

# Import Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')
labels = df.apply(lambda x: str(x['class']) + "\n (" + str(x['counts']) + ")", axis=1)
sizes = df['counts'].values.tolist()
colors =[plt.cm.Spectral(i/float(len(labels)))for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(12,8), dpi=80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Treemap of Vehicle Class')
plt.axis('off')
plt.show()

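34. Bar Chart

A bar chart is a classic way of visualizing items based on counts or any given metric. In the chart below a different color is used for each item, but you would typically pick one color for all items unless you are coloring them by group. The color names are stored in all_colors in the code below. You can change the bar colors by setting the color parameter of plt.bar().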

import random

# Import Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('manufacturer').size().reset_index(name='counts')
n = df['manufacturer'].unique().__len__()+1
all_colors = list(plt.cm.colors.cnames.keys())
random.seed(100)
c = random.choices(all_colors, k=n)

# Plot Bars
plt.figure(figsize=(16,10), dpi=80)
plt.bar(df['manufacturer'], df['counts'], color=c, width=.5)
for i, val in enumerate(df['counts'].values):
    plt.text(i, val, float(val), horizontalalignment='center', verticalalignment='bottom', fontdict={'fontweight':500, 'size':12})

# Decoration
plt.xticks(rotation=60, horizontalalignment='right')  # rotate the existing manufacturer labels
plt.title("Number of Vehicles by Manufacturers", fontsize=22)
plt.ylabel('# Vehicles')
plt.ylim(0,45)
plt.show()

VI. Change

35. Time Series Plot

A time series plot visualizes how a given metric changes over time. Here you can see how air passenger traffic changed between 1949 and 1960.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
plt.plot('date','traffic', data=df, color='tab:red')

# Decoration
plt.ylim(50,750)
xtick_location = df.index.tolist()[::12]
xtick_labels =[x[-4:]for x in df.date.tolist()[::12]]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=12, horizontalalignment='center', alpha=.7)
plt.yticks(fontsize=12, alpha=.7)
plt.title("Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='both', alpha=.3)

# Remove borders
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.3)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.3)
plt.show()

36. Time Series with Peaks and Troughs Annotated

The time series below plots all the peaks and troughs and annotates the occurrence of selected special events.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Get the Peaks and Troughs
data = df['traffic'].values
doublediff = np.diff(np.sign(np.diff(data)))
peak_locations = np.where(doublediff ==-2)[0]+1

doublediff2 = np.diff(np.sign(np.diff(-1*data)))
trough_locations = np.where(doublediff2 ==-2)[0]+1

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
plt.plot('date','traffic', data=df, color='tab:blue', label='Air Traffic')
plt.scatter(df.date[peak_locations], df.traffic[peak_locations], marker=mpl.markers.CARETUPBASE, color='tab:green', s=100, label='Peaks')
plt.scatter(df.date[trough_locations], df.traffic[trough_locations], marker=mpl.markers.CARETDOWNBASE, color='tab:red', s=100, label='Troughs')

# Annotate
for t, p in zip(trough_locations[1::5], peak_locations[::3]):
    plt.text(df.date[p], df.traffic[p]+15, df.date[p], horizontalalignment='center', color='darkgreen')
    plt.text(df.date[t], df.traffic[t]-35, df.date[t], horizontalalignment='center', color='darkred')

# Decoration
plt.ylim(50,750)
xtick_location = df.index.tolist()[::6]
xtick_labels = df.date.tolist()[::6]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=90, fontsize=12, alpha=.7)
plt.title("Peak and Troughs of Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.yticks(fontsize=12, alpha=.7)

# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.3)

plt.legend(loc='upper left')
plt.grid(axis='y', alpha=.3)
plt.show()
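
The peak detection used above relies on the sign of the first difference changing direction. The toy example below walks through the same idea on a seven-point series; the data values are made up for illustration.

```python
import numpy as np

data = np.array([1, 3, 2, 5, 4, 6, 1])   # toy series (assumption)

# Sign of the first difference: +1 while rising, -1 while falling
direction = np.sign(np.diff(data))
# A change from +1 to -1 (a difference of -2) marks a peak;
# a change from -1 to +1 (a difference of +2) marks a trough
turning = np.diff(direction)
peak_locations = np.where(turning == -2)[0] + 1
trough_locations = np.where(turning == 2)[0] + 1
```

Looking for +2 directly is equivalent to negating the series and searching for -2 again. Note that flat stretches produce a sign of 0, so plateaus are not detected by this test.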

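37. Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plot

The ACF plot shows the correlation of the time series with its own lags. Each vertical line (on the autocorrelation plot) represents the correlation between the series and its lag, starting from lag 0. The blue shaded region in the plot is the significance level. Lags that extend beyond the blue region are significant.

So how should this be interpreted?

For AirPassengers, more than 14 lags extend beyond the blue region and are therefore significant. Because the series is monthly, this means that traffic from up to 14 months back has an influence on the traffic seen today.

The PACF, on the other hand, shows the autocorrelation of any given lag (of the time series) against the current series, with the contributions of the lags in between removed.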

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Draw Plot
fig,(ax1, ax2)= plt.subplots(1,2,figsize=(16,6), dpi=80)
plot_acf(df.traffic.tolist(), ax=ax1, lags=50)
plot_pacf(df.traffic.tolist(), ax=ax2, lags=20)

# Decorate
# lighten the borders
ax1.spines["top"].set_alpha(.3); ax2.spines["top"].set_alpha(.3)
ax1.spines["bottom"].set_alpha(.3); ax2.spines["bottom"].set_alpha(.3)
ax1.spines["right"].set_alpha(.3); ax2.spines["right"].set_alpha(.3)
ax1.spines["left"].set_alpha(.3); ax2.spines["left"].set_alpha(.3)

# font size of tick labels
ax1.tick_params(axis='both', labelsize=12)
ax2.tick_params(axis='both', labelsize=12)
plt.show()
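
To see what each bar in an ACF plot measures, the sketch below computes the sample autocorrelation directly with NumPy on a synthetic 12-month cycle. The `autocorrelation` helper, the synthetic series, and the constant 1.96/sqrt(n) band are assumptions for illustration; the shaded region drawn by plot_acf may be computed differently.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / denom for k in range(max_lag + 1)])

# Synthetic monthly series with a 12-month cycle (assumption, not AirPassengers)
t = np.arange(144)
series = np.sin(2 * np.pi * t / 12)

acf = autocorrelation(series, max_lag=24)
band = 1.96 / np.sqrt(len(series))            # rough 95% band under white noise
significant_lags = np.where(np.abs(acf[1:]) > band)[0] + 1
```

For this series the correlation is strongly positive at lag 12 and strongly negative at lag 6, which is the signature of a yearly cycle in monthly data.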

38. Cross Correlation Plot

A cross correlation plot shows how strongly two time series are correlated with each other at different lags.

import statsmodels.tsa.stattools as stattools

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')
x = df['mdeaths']
y = df['fdeaths']

# Compute Cross Correlations
ccs = stattools.ccf(x, y)[:100]
nlags = len(ccs)

# Compute the Significance level
# ref: https://stats.stackexchange.com/questions/3115/cross-correlation-significance-in-r/3128#3128
conf_level =2/ np.sqrt(nlags)

# Draw Plot
plt.figure(figsize=(12,7), dpi=80)

plt.hlines(0, xmin=0, xmax=100, color='gray')# 0 axis
plt.hlines(conf_level, xmin=0, xmax=100, color='gray')
plt.hlines(-conf_level, xmin=0, xmax=100, color='gray')

plt.bar(x=np.arange(len(ccs)), height=ccs, width=.3)

# Decoration
plt.title(r'$Cross\; Correlation\; Plot:\; mdeaths\; vs\; fdeaths$', fontsize=22)  # raw string keeps the \; escapes intact
plt.xlim(0,len(ccs))
plt.show()
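
A small synthetic case helps when reading such a plot: if one series is a delayed copy of the other, the cross correlation peaks at the delay. The `cross_correlation` helper below is a hypothetical NumPy sketch and is not the statsmodels implementation; the data are random numbers.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Correlation between x[t] and y[t + k] for k = 0..max_lag (hypothetical helper)."""
    x = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, float) - np.mean(y)) / np.std(y)
    n = len(x)
    return np.array([np.sum(x[: n - k] * y[k:]) / n for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.roll(x, 3)                    # y repeats x three steps later (synthetic)

ccs = cross_correlation(x, y, max_lag=20)
conf_level = 2 / np.sqrt(len(x))     # 2 / sqrt(n) rule of thumb
best_lag = int(np.argmax(ccs))
```

Only the bar at lag 3 stands far outside the plus or minus `conf_level` band here; the remaining bars are noise of roughly that size or smaller.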

39. Time Series Decomposition Plot

A time series decomposition plot shows the time series broken down into trend, seasonal, and residual components.

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
dates = pd.DatetimeIndex([parse(d).strftime('%Y-%m-01')for d in df['date']])
df.set_index(dates, inplace=True)

# Decompose
result = seasonal_decompose(df['traffic'], model='multiplicative')

# Plot
plt.rcParams.update({'figure.figsize':(10,10)})
result.plot().suptitle('Time Series Decomposition of Air Passengers')
plt.show()
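
The three components can be approximated by hand, which makes the multiplicative model easier to interpret. The sketch below uses the classical recipe (centered moving average for the trend, month-wise averages for the seasonal index) on a synthetic series. It is an illustration under those assumptions and is not claimed to match seasonal_decompose step for step.

```python
import numpy as np

period = 12
t = np.arange(120)
true_seasonal = 1 + 0.2 * np.sin(2 * np.pi * np.arange(period) / period)
series = (100 + t) * true_seasonal[t % period]    # synthetic trend x seasonality

# Trend: centered 2x12 moving average (half weight on the two end points)
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(series, weights, mode="valid")
half = period // 2
detrended = series[half:-half] / trend            # multiplicative model: divide

# Seasonal index: average the detrended values month by month, rescale to mean 1
months = t[half:-half] % period
seasonal = np.array([detrended[months == m].mean() for m in range(period)])
seasonal /= seasonal.mean()

# Residual: what trend x seasonal does not explain (close to 1 here)
residual = detrended / seasonal[months]
```

With an additive model the divisions would become subtractions and the seasonal index would be centered on 0 instead of 1.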

40. Multiple Time Series

You can plot multiple time series that measure the same quantity on the same chart, as shown below.

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/mortality.csv')

# Define the upper limit, lower limit, interval of Y axis and colors
y_LL =100
y_UL = int(df.iloc[:,1:].max().max()*1.1)
y_interval =400
mycolors =['tab:red','tab:blue','tab:green','tab:orange']

# Draw Plot and Annotate
fig, ax = plt.subplots(1,1,figsize=(16,9), dpi=80)

columns = df.columns[1:]
for i, column in enumerate(columns):
    plt.plot(df.date.values, df[column].values, lw=1.5, color=mycolors[i])
    plt.text(df.shape[0]+1, df[column].values[-1], column, fontsize=14, color=mycolors[i])

# Draw Tick lines
for y in range(y_LL, y_UL, y_interval):
    plt.hlines(y, xmin=0, xmax=71, colors='black', alpha=0.3, linestyles="--", lw=0.5)

# Decorations
plt.tick_params(axis="both", which="both", bottom=False, top=False,
labelbottom=True, left=False, right=False, labelleft=True)

# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Number of Deaths from Lung Diseases in the UK (1974-1979)', fontsize=22)
plt.yticks(range(y_LL, y_UL, y_interval),[str(y)for y in range(y_LL, y_UL, y_interval)], fontsize=12)
plt.xticks(range(0, df.shape[0],12), df.date.values[::12], horizontalalignment='left', fontsize=12)
plt.ylim(y_LL, y_UL)
plt.xlim(-2,80)
plt.show()

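41. Plotting with Different Scales Using a Secondary Y Axis

If you want to show two time series that measure two different quantities at the same points in time, you can plot the second series against a secondary Y axis on the right.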

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

x = df['date']
y1 = df['psavert']
y2 = df['unemploy']

# Plot Line1 (Left Y Axis)
fig, ax1 = plt.subplots(1,1,figsize=(16,9), dpi=80)
ax1.plot(x, y1, color='tab:red')

# Plot Line2 (Right Y Axis)
ax2 = ax1.twinx()# instantiate a second axes that shares the same x-axis
ax2.plot(x, y2, color='tab:blue')

# Decorations
# ax1 (left Y axis)
ax1.set_xlabel('Year', fontsize=20)
ax1.tick_params(axis='x', rotation=0, labelsize=12)
ax1.set_ylabel('Personal Savings Rate', color='tab:red', fontsize=20)
ax1.tick_params(axis='y', rotation=0, labelcolor='tab:red')
ax1.grid(alpha=.4)

# ax2 (right Y axis)
ax2.set_ylabel("# Unemployed (1000's)", color='tab:blue', fontsize=20)
ax2.tick_params(axis='y', labelcolor='tab:blue')
ax2.set_xticks(np.arange(0, len(x),60))
ax2.set_xticklabels(x[::60], rotation=90, fontdict={'fontsize':10})
ax2.set_title("Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", fontsize=22)
fig.tight_layout()
plt.show()

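42. Time Series with Error Bands

A time series with error bands can be constructed if you have a time series dataset with multiple observations for each time point (date / timestamp). Below are two examples: one based on the orders received at different times of the day, and another on the number of orders arriving over a period of 45 days.

In this approach, the mean number of orders is drawn as a white line, and a 95% confidence band is computed and drawn around the mean.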

from scipy.stats import sem

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/user_orders_hourofday.csv")
df_mean = df.groupby('order_hour_of_day').quantity.mean()
df_se = df.groupby('order_hour_of_day').quantity.apply(sem).mul(1.96)

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.ylabel("# Orders", fontsize=16)
x = df_mean.index
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)
plt.xticks(x[::2],[str(d)for d in x[::2]], fontsize=12)
plt.title("User Orders by Hour of Day (95% confidence)", fontsize=22)
plt.xlabel("Hour of Day")

s, e = plt.gca().get_xlim()
plt.xlim(s, e)

# Draw Horizontal Tick lines
for y in range(8, 20, 2):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()

# Data Source: https://www.kaggle.com/olistbr/brazilian-ecommerce#olist_orders_dataset.csv
from dateutil.parser import parse
from scipy.stats import sem

# Import Data
df_raw = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/orders_45d.csv',
parse_dates=['purchase_time','purchase_date'])

# Prepare Data: Daily Mean and SE Bands
df_mean = df_raw.groupby('purchase_date').quantity.mean()
df_se = df_raw.groupby('purchase_date').quantity.apply(sem).mul(1.96)

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.ylabel("# Daily Orders", fontsize=16)
x =[d.date().strftime('%Y-%m-%d')for d in df_mean.index]
plt.plot(x, df_mean, color="white", lw=2)
plt.fill_between(x, df_mean - df_se, df_mean + df_se, color="#3F5D7D")

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)
plt.xticks(x[::6],[str(d)for d in x[::6]], fontsize=12)
plt.title("Daily Order Quantity of Brazilian Retail with Error Bands (95% confidence)", fontsize=20)

# Axis limits
s, e = plt.gca().get_xlim()
plt.xlim(s, e-2)
plt.ylim(4,10)

# Draw Horizontal Tick lines
for y in range(5, 10, 1):
    plt.hlines(y, xmin=s, xmax=e, colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()
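
The band in both examples is the mean plus or minus 1.96 standard errors. The sketch below reproduces that arithmetic with NumPy on a few invented observations per hour, using the sample standard deviation (ddof=1) divided by sqrt(n) as the standard error.

```python
import numpy as np

# Hypothetical order quantities observed at each hour (assumption)
observations = {
    9: [2, 4, 6],
    10: [5, 7, 9, 11],
    11: [8, 8, 10, 14],
}

hours = sorted(observations)
means = np.array([np.mean(observations[h]) for h in hours])
# Standard error of the mean: sample standard deviation / sqrt(n)
sems = np.array([np.std(observations[h], ddof=1) / np.sqrt(len(observations[h])) for h in hours])

half_width = 1.96 * sems             # normal approximation for a 95% interval
lower, upper = means - half_width, means + half_width
```

Passing `lower` and `upper` to plt.fill_between and `means` to plt.plot would give the same kind of band as in the charts above. With only a handful of observations per time point, the normal multiplier 1.96 understates the uncertainty.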

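43. Stacked Area Chart

A stacked area chart gives a visual representation of the extent of contribution from multiple time series so that they are easy to compare against each other.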

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/nightvisitors.csv')

# Decide Colors
mycolors =['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive']

# Draw Plot and Annotate
fig, ax = plt.subplots(1,1,figsize=(16,9), dpi=80)
columns = df.columns[1:]

# Prepare data
x = df['yearmon'].values.tolist()
# Stack the regions in a custom order and reorder the labels to match,
# so that each legend entry refers to the band it actually describes
order = [0, 2, 4, 6, 7, 5, 1, 3]
y = np.vstack([df[columns[i]].values for i in order])
labs = [columns[i] for i in order]

# Plot for each column
ax = plt.gca()
ax.stackplot(x, y, labels=labs, colors=mycolors, alpha=0.8)

# Decorations
ax.set_title('Night Visitors in Australian Regions', fontsize=18)
ax.set(ylim=[0,100000])
ax.legend(fontsize=10, ncol=4)
plt.xticks(x[::5], fontsize=10, horizontalalignment='center')
plt.yticks(np.arange(10000,100000,20000), fontsize=10)
plt.xlim(x[0], x[-1])

# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)

plt.show()
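
In a stacked area chart, each band sits on top of the running total of the series below it. The sketch below shows that bookkeeping with a cumulative sum on a small invented array; the region names in the comments are placeholders.

```python
import numpy as np

# Three made-up series over four time points (assumption)
y = np.array([
    [10, 12, 11, 13],   # region A
    [5, 6, 8, 7],       # region B
    [2, 3, 3, 4],       # region C
])

tops = np.cumsum(y, axis=0)                               # upper edge of each band
bottoms = np.vstack([np.zeros(y.shape[1]), tops[:-1]])    # lower edge of each band
shares = y / tops[-1]                                     # share of the total per series
```

Only the bottom band can be read against the axis directly; for the others the reader has to judge thickness, which is why the stacking order matters.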

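44. Unstacked Area Chart

An unstacked area chart is used to visualize the progress (ups and downs) of two or more series with respect to each other. In the chart below, you can clearly see how the personal savings rate comes down as the median duration of unemployment increases. The unstacked area chart brings out this phenomenon nicely.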

# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

# Prepare Data
x = df['date'].values.tolist()
y1 = df['psavert'].values.tolist()
y2 = df['uempmed'].values.tolist()
mycolors =['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive']
columns =['psavert','uempmed']

# Draw Plot
fig, ax = plt.subplots(1,1, figsize=(16,9), dpi=80)
# y1 holds 'psavert' (columns[0]) and y2 holds 'uempmed' (columns[1]); label them accordingly
ax.fill_between(x, y1=y1, y2=0, label=columns[0], alpha=0.5, color=mycolors[1], linewidth=2)
ax.fill_between(x, y1=y2, y2=0, label=columns[1], alpha=0.5, color=mycolors[0], linewidth=2)

# Decorations
ax.set_title('Personal Savings Rate vs Median Duration of Unemployment', fontsize=18)
ax.set(ylim=[0,30])
ax.legend(loc='best', fontsize=12)
plt.xticks(x[::50], fontsize=10, horizontalalignment='center')
plt.yticks(np.arange(2.5,30.0,2.5), fontsize=10)
plt.xlim(-10, x[-1])

# Draw Tick lines
for y in np.arange(2.5, 30.0, 2.5):
    plt.hlines(y, xmin=0, xmax=len(x), colors='black', alpha=0.3, linestyles="--", lw=0.5)

# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)
plt.show()

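45. Calendar Heat Map

A calendar heat map is an alternative, and less preferred, way to visualize time-based data compared with a time series plot. Though visually appealing, the numeric values are not easy to read off. It is, however, effective at picturing extreme values and holiday effects.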

import matplotlib as mpl
import calmap

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv", parse_dates=['date'])
df.set_index('date', inplace=True)

# Plot
plt.figure(figsize=(16,10), dpi=80)
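# Select the 2014 rows with .loc; indexing rows by a year string via df['2014'] fails on recent pandas
calmap.calendarplot(df.loc['2014', 'VIX.Close'], fig_kws={'figsize':(16,10)}, yearlabel_kws={'color':'black','fontsize':14}, subplot_kws={'title':'Yahoo Stock Prices'})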
plt.show()
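
Underneath, a calendar heat map places every date on a grid of week index by weekday. The sketch below derives those grid coordinates with the standard library only. The `calendar_cell` helper is hypothetical and is not taken from calmap.

```python
from datetime import date, timedelta

def calendar_cell(day):
    """Return (week_index, weekday_index) for a date within its year (hypothetical helper)."""
    jan1 = date(day.year, 1, 1)
    offset = jan1.weekday()                 # Monday = 0 ... Sunday = 6
    day_of_year = (day - jan1).days
    return (day_of_year + offset) // 7, day.weekday()

# Place every day of 2014 on the grid
start = date(2014, 1, 1)
cells = {start + timedelta(days=i): calendar_cell(start + timedelta(days=i)) for i in range(365)}
n_weeks = max(week for week, _ in cells.values()) + 1
```

Coloring each cell by the value observed on that date gives the heat map; dates with no observation simply stay empty.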

46. Seasonal Plot

A seasonal plot can be used to compare how the time series performed on the same day in the previous season (year / month / week, etc.).

from dateutil.parser import parse 

# Import Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

# Prepare data
df['year']=[parse(d).year for d in df.date]
df['month']=[parse(d).strftime('%b')for d in df.date]
years = df['year'].unique()

# Draw Plot
mycolors =['tab:red','tab:blue','tab:green','tab:orange','tab:brown','tab:grey','tab:pink','tab:olive','deeppink','steelblue','firebrick','mediumseagreen']
plt.figure(figsize=(16,10), dpi=80)

for i, y in enumerate(years):
    plt.plot('month', 'traffic', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
    plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'traffic'][-1:].values[0], y, fontsize=12, color=mycolors[i])

# Decoration
plt.ylim(50,750)
plt.xlim(-0.3,11)
plt.ylabel('$Air Traffic$')
plt.yticks(fontsize=12, alpha=.7)
plt.title("Monthly Seasonal Plot: Air Passengers Traffic (1949 - 1960)", fontsize=22)
plt.grid(axis='y', alpha=.3)

# Remove borders
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.5)
# plt.legend(loc='upper right', ncol=2, fontsize=12)
plt.show()

VII. Groups

47. Dendrogram

A dendrogram groups similar points together based on a given distance metric and organizes them in tree-like links based on the points' similarity.

import scipy.cluster.hierarchy as shc

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.title("USArrests Dendrogram", fontsize=22)
dend = shc.dendrogram(shc.linkage(df[['Murder','Assault','UrbanPop','Rape']], method='ward'), labels=df.State.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
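
The method='ward' option merges, at every step, the two clusters whose union increases the within-cluster sum of squares the least. The sketch below evaluates that merge cost for a few invented points. The `ward_merge_cost` helper is hypothetical, and the exact scaling SciPy applies to the heights in the dendrogram is not reproduced here.

```python
import numpy as np

def ward_merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return len(a) * len(b) / (len(a) + len(b)) * float(gap @ gap)

# Three tiny made-up clusters in (Murder, Assault) space (assumption)
low = [[2.0, 50.0], [3.0, 60.0]]
mid = [[7.0, 150.0], [8.0, 160.0]]
high = [[15.0, 300.0]]

cost_low_mid = ward_merge_cost(low, mid)
cost_mid_high = ward_merge_cost(mid, high)
```

Here merging `low` with `mid` is cheaper than merging `mid` with `high`, so those two would be joined first. Because the cost is driven by squared distances, columns with large values such as Assault dominate unless the data are scaled.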

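48. Cluster Plot

A cluster plot can be used to demarcate points that belong to the same cluster. Below is a representative example that groups the US states into 5 groups based on the USArrests dataset. This cluster plot uses the 'Murder' and 'Assault' columns as the X and Y axes. Alternatively, you can use the first two principal components as the X and Y axes.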

from sklearn.cluster import AgglomerativeClustering
from scipy.spatial import ConvexHull

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Agglomerative Clustering
# Ward linkage always uses Euclidean distance, so the `affinity` argument (removed in recent scikit-learn) is dropped
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(df[['Murder','Assault','UrbanPop','Rape']])

# Plot
plt.figure(figsize=(14,10), dpi=80)
plt.scatter(df.iloc[:,0], df.iloc[:,1], c=cluster.labels_, cmap='tab10')

# Encircle
def encircle(x, y, ax=None, **kw):
    """Draw a convex-hull polygon around the points (x, y)."""
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Draw polygon surrounding vertices
encircle(df.loc[cluster.labels_ ==0,'Murder'], df.loc[cluster.labels_ ==0,'Assault'], ec="k", fc="gold", alpha=0.2, linewidth=0)
encircle(df.loc[cluster.labels_ ==1,'Murder'], df.loc[cluster.labels_ ==1,'Assault'], ec="k", fc="tab:blue", alpha=0.2, linewidth=0)
encircle(df.loc[cluster.labels_ ==2,'Murder'], df.loc[cluster.labels_ ==2,'Assault'], ec="k", fc="tab:red", alpha=0.2, linewidth=0)
encircle(df.loc[cluster.labels_ ==3,'Murder'], df.loc[cluster.labels_ ==3,'Assault'], ec="k", fc="tab:green", alpha=0.2, linewidth=0)
encircle(df.loc[cluster.labels_ ==4,'Murder'], df.loc[cluster.labels_ ==4,'Assault'], ec="k", fc="tab:orange", alpha=0.2, linewidth=0)

# Decorations
plt.xlabel('Murder'); plt.xticks(fontsize=12)
plt.ylabel('Assault'); plt.yticks(fontsize=12)
plt.title('Agglomerative Clustering of USArrests (5 Groups)', fontsize=22)
plt.show()
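
For the alternative mentioned above, the first two principal components can be obtained from a singular value decomposition of the standardized data. The sketch below uses random numbers as a stand-in for the four numeric USArrests columns, so the values are an assumption; only the procedure is meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the Murder, Assault, UrbanPop and Rape columns (random data)
X = rng.normal(size=(50, 4)) * np.array([4.0, 80.0, 14.0, 9.0])

# Standardize so that no single column dominates, then take the SVD
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

components = Vt[:2]                  # directions of the first two principal components
scores = Z @ components.T            # 50 x 2 coordinates to use as the X and Y axes
explained = (S**2 / np.sum(S**2))[:2]
```

The two columns of `scores` could replace the 'Murder' and 'Assault' columns in the scatter and encircle calls, while the cluster labels stay the same.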

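49. Andrews Curve

Andrews curves help visualize whether there is an inherent grouping of the numerical features based on a given grouping. If the features (columns in the dataset) do not help discriminate the group (cyl), then the lines will not be well segregated, as you see below.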

from pandas.plotting import andrews_curves

# Import
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
df.drop(['cars','carname'], axis=1, inplace=True)

# Plot
plt.figure(figsize=(12,9), dpi=80)
andrews_curves(df,'cyl', colormap='Set1')

# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Andrews Curves of mtcars', fontsize=22)
plt.xlim(-3,3)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
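
Each line in an Andrews plot is one row of the data turned into a finite Fourier series, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ..., evaluated for t between -pi and pi. The sketch below evaluates that formula with NumPy. The `andrews_curve` helper and the feature rows are invented for illustration.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..."""
    row = np.asarray(row, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2), dtype=float)
    for i, value in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2                  # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if i % 2 == 1 else np.cos
        result += value * wave(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
curve_a = andrews_curve([1.0, 2.0, 3.0, 4.0], t)    # made-up feature rows
curve_b = andrews_curve([1.0, 2.0, 3.0, 4.5], t)
at_zero = andrews_curve([1.0, 2.0, 3.0, 4.0], np.array([0.0]))[0]
```

Rows with similar feature values give curves that stay close together, which is why well-separated groups show up as distinct bundles of lines. The result depends on column order and on the scale of each column.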

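50. Parallel Coordinates

Parallel coordinates help visualize whether a feature helps to segregate the groups effectively. If segregation is achieved, that feature is likely to be very useful in predicting that group.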

from pandas.plotting import parallel_coordinates

# Import Data
df_final = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/diamonds_filter.csv")

# Plot
plt.figure(figsize=(12,9), dpi=80)
parallel_coordinates(df_final,'cut', colormap='Dark2')

# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)

plt.title('Parallel Coordinates of Diamonds', fontsize=22)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Acknowledgement: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

Reposted from the WeChat official account @算法进阶
