Data cleaning principles, data mining, predefined cleaning rules, and how pandas works

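Principles of data cleaning

Data cleaning uses techniques such as mathematical statistics, data mining, or predefined cleaning rules to turn dirty data into data that meets data-quality requirements.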
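As a rough illustration of what "predefined cleaning rules" can look like in practice, here is a minimal sketch. The column names, the sample rows, and the two rules (a name must be present, a price must not be negative) are assumptions made up for this example, not rules taken from a real dataset.

```python
import pandas as pd

# Hypothetical raw records; the column names are illustrative only.
raw = pd.DataFrame({
    "product_name": ["  apple ", "pear", None, "kiwi"],
    "price": [3.5, -1.0, 2.0, 4.0],
})

# Assumed rule 1: a product name must be present.
has_name = raw["product_name"].notna()
# Assumed rule 2: a price must not be negative.
valid_price = raw["price"] >= 0

# Keep only the rows that satisfy every rule.
clean = raw[has_name & valid_price].copy()
# Normalise the surviving names by trimming surrounding whitespace.
clean["product_name"] = clean["product_name"].str.strip()
```

Expressing each rule as a boolean mask keeps the rules independent, so it is easy to count how many rows each rule rejects before filtering.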

Pandas

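The name pandas comes from "panel data" and "Python data analysis". It is a data analysis package for Python.

It was originally developed at AQR Capital Management starting in April 2008 as a tool for financial data analysis, with good support for time series analysis, and was open-sourced at the end of 2009.

pandas is a library commonly used for data cleaning. It is built on NumPy and was created to solve data analysis tasks. pandas incorporates a number of libraries and some standard data models, and it provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently. You will soon find that it is one of the main reasons Python is a powerful and productive environment for data analysis.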

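1. Installing pandas

Press Win+R, open the cmd command line, and run pip install pandas. I already have it installed, so there is no screenshot here.

2. Importing the package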

import pandas as pd

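Reading data

Use pd.read_sql to read the data.

import pymysql
import pandas as pd

# host, user, password and db are assumed to be defined elsewhere;
# self.conn implies this code runs inside a method of a class
self.conn = pymysql.connect(host=host, user=user,
                            password=password, db=db, charset='utf8')
sql = 'select * from table_name'
df = pd.read_sql(sql, con=self.conn)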
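For readers without a MySQL server at hand, the same read_sql call can be tried against an in-memory SQLite database from the standard library. This is only a sketch: the `products` table and its rows are invented for the example, and SQLite stands in for MySQL.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, product_name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "apple"), (2, " pear "), (3, None)],
)
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame.
df_products = pd.read_sql("SELECT * FROM products", con=conn)
conn.close()
```

The SQL NULL in the third row arrives in the DataFrame as a missing value, which is what the null-handling steps operate on.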

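Handling nulls and whitespace

Use the str.strip method to trim whitespace and the dropna method to drop nulls.

# Trim leading and trailing whitespace; str.strip returns a new Series, so assign it back
df['product_name'] = df['product_name'].str.strip()
# Drop the rows where the `product_name` column is `NaN`
df.dropna(subset=['product_name'], inplace=True)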
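The two steps can be tried on a small, made-up DataFrame. The sample values below are assumptions chosen to show leading spaces, trailing spaces, and a missing name.

```python
import pandas as pd

# Illustrative data: padded names and one missing value.
df_names = pd.DataFrame({"product_name": ["  apple", "pear  ", None, " kiwi "]})

# str.strip does not modify the column in place, so the result is assigned back.
df_names["product_name"] = df_names["product_name"].str.strip()
# Drop the rows whose product_name is missing.
df_names = df_names.dropna(subset=["product_name"])
```

Note that str.strip only removes whitespace at the two ends of a value; spaces inside a name are left alone.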

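Handling outliers

Use the replace method to handle abnormal values.

import numpy as np

# Replace cells that consist of exactly one space with NaN
df.replace(' ', np.nan, inplace=True)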
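It may help to see what a plain (non-regex) replace actually matches. In the sketch below, with invented sample values, only the cell that is exactly one space is replaced; a regex variant is shown next to it for comparison. The object dtype is set explicitly so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Illustrative values: a normal name, one space, three spaces, an inner space.
df_blank = pd.DataFrame(
    {"product_name": ["apple", " ", "   ", "a b"]}, dtype=object
)

# Plain replace compares whole cell values, so only " " is turned into NaN.
exact = df_blank.replace(" ", np.nan)
# With regex=True, a pattern can cover any whitespace-only cell.
blank_any = df_blank.replace(r"^\s*$", np.nan, regex=True)
```

The difference between the two results is one reason a plain replace can appear to "do nothing" on data that contains runs of spaces.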

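Writing the data back to MySQL

Use the to_sql method to write the data back to MySQL.

# index=True also writes the DataFrame index as a column.
# Note: passing the raw pymysql connection here fails; see problem 1 below
df.to_sql(name=table_name, con=self.conn, if_exists='append', index=True)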
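The effect of if_exists='append' can be checked locally with an in-memory SQLite database, which pandas accepts directly as a connection. This is a sketch with SQLite standing in for MySQL; the table name `products_clean` and the rows are made up.

```python
import sqlite3

import pandas as pd

df_out = pd.DataFrame({"product_name": ["apple", "pear"]})
conn_out = sqlite3.connect(":memory:")  # stand-in for MySQL (assumption)

# The first call creates the table; the second call appends to it.
df_out.to_sql(name="products_clean", con=conn_out, if_exists="append", index=False)
df_out.to_sql(name="products_clean", con=conn_out, if_exists="append", index=False)

row_count = conn_out.execute("SELECT COUNT(*) FROM products_clean").fetchone()[0]
conn_out.close()
```

Because append never removes existing rows, running the same write twice duplicates the data, so a rerun of a cleaning job may need if_exists='replace' or a deduplication step instead.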

pandas settings

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the displayed width of a value to 100 characters; the default is 50
pd.set_option('display.max_colwidth', 100)
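set_option changes the settings for the rest of the session. When the wider display is only needed for one inspection step, pd.option_context can apply the same options temporarily. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_colwidth")

# Options set here apply only inside the with block.
with pd.option_context("display.max_columns", None, "display.max_colwidth", 100):
    inside = (
        pd.get_option("display.max_columns"),
        pd.get_option("display.max_colwidth"),
    )

# On leaving the block, the previous values are restored.
after = pd.get_option("display.max_colwidth")
```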

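Problems

1. to_sql cannot use a pymysql connection; it fails immediately with:

pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting

pandas treats any connection that is not a SQLAlchemy connectable as a sqlite3 connection, which is why the failing query targets sqlite_master. Create a SQLAlchemy engine instead and pass it to to_sql as con=engine:

from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@host:port/db")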
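One practical detail with this URL format: a user name or password that contains characters such as @ or : has to be URL-escaped, otherwise the URL is parsed incorrectly. The helper below is a hypothetical convenience function, not part of pandas or SQLAlchemy, and the credentials are invented; it only builds the string, so it runs without a database.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, db):
    """Return a SQLAlchemy-style URL for the mysql+pymysql driver.

    Hypothetical helper: escapes the user name and password so that
    special characters do not break the URL structure.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Made-up credentials containing '@' and ':'.
url = build_mysql_url("user", "p@ss:word", "localhost", 3306, "shop")

# With SQLAlchemy and PyMySQL installed, the URL would then be used as:
#   engine = create_engine(url)
#   df.to_sql(name=table_name, con=engine, if_exists="append", index=False)
```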

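2. Null handling

The data stored in MySQL contains empty values, but processing it with .str.strip() had no effect.

Using replace to turn spaces and empty values into NaN did not work either. The fix is to have replace use regular expressions.

# Remove \r, \n, \t and the \xa0 (non-breaking space) that comes from HTML
df.replace(r'\r|\t|\n|\xa0', '', regex=True, inplace=True)
# Remove spaces by replacing them with the empty string
df['product_name'] = df['product_name'].replace(r' ', '', regex=True)
# Replace empty strings with NaN; '^$' matches only an empty string, whereas an empty pattern would match every string
df['product_name'] = df['product_name'].replace(r'^$', np.nan, regex=True)
# Replace values containing garbled characters with NaN (the pattern matches any character that is not Chinese, a letter, a digit, or an underscore)
df['product_name'] = df['product_name'].replace(r'[^\u4e00-\u9fa5_a-zA-Z0-9]', np.nan, regex=True)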
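The regex-based steps can be run end to end on a handful of invented values to see what each one does. The sample strings are assumptions picked to cover control characters, inner spaces, whitespace-only cells, a non-breaking space, a valid Chinese name, and a name with a stray symbol. The object dtype is set explicitly so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

df_dirty = pd.DataFrame(
    {"product_name": ["ap\tple\r\n", "pe ar", "   ", "\xa0", "苹果1", "bad★name"]},
    dtype=object,
)

col = df_dirty["product_name"]
# 1. Strip control characters and the HTML non-breaking space.
col = col.replace(r"[\r\t\n\xa0]", "", regex=True)
# 2. Remove ordinary spaces.
col = col.replace(r" ", "", regex=True)
# 3. Cells that are now empty become NaN.
col = col.replace(r"^$", np.nan, regex=True)
# 4. Any cell still containing a character outside the allowed set becomes NaN.
col = col.replace(r"[^\u4e00-\u9fa5_a-zA-Z0-9]", np.nan, regex=True)
df_dirty["product_name"] = col
```

When the replacement value is not a string, a regex replace swaps out the whole cell as soon as the pattern is found anywhere in it, which is why step 4 discards "bad★name" entirely rather than deleting only the symbol.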
