
[Pandas] DataFrame Basics: Creating, Modifying, and Deleting

by 유스 :) 2023. 7. 10.

First, import pandas, which everything below relies on.

import pandas as pd

1. Loading a DataFrame

1) Specifying the encoding

Set the encoding explicitly so that Korean text in the file is not garbled.

The encoding below read Korean text correctly on macOS.

encoding = 'utf-8'

df = pd.read_csv('./data/train.csv', encoding = 'utf-8')
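The encoding passed to read_csv has to match the one the file was saved with. The following is a minimal sketch of what happens when it does not, assuming a hypothetical file sample.csv saved as cp949 (a common encoding for Korean CSV files exported on Windows); the file name and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with Korean column names and values
df_src = pd.DataFrame({"품목": ["사과"], "금액": [1200]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    df_src.to_csv(path, index=False, encoding="cp949")

    # Reading cp949 bytes as utf-8 fails instead of silently garbling the text
    try:
        pd.read_csv(path, encoding="utf-8")
        decoded_as_utf8 = True
    except UnicodeDecodeError:
        decoded_as_utf8 = False

    # Reading with the encoding the file was written in works
    df_back = pd.read_csv(path, encoding="cp949")

print(decoded_as_utf8)
print(df_back)
```

If utf-8 raises an error like this on a real file, trying 'cp949' or 'euc-kr' is a reasonable next step.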

2) Loading numbers without thousands separators

When numbers use a comma (,) as a thousands separator, the column is read as object dtype, and the commas prevent a direct conversion to a numeric type. Strip the commas at load time instead.

thousands = ","

df = pd.read_csv('./data/train.csv',thousands="," )
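To see the difference the thousands option makes, here is a small sketch that reads the same text twice. The CSV content is hypothetical and is read from an in-memory string instead of ./data/train.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text; the quotes keep the thousands commas from being treated as delimiters
csv_text = 'item,amount\napple,"1,200"\npear,"35,000"\n'

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["amount"].dtype)     # not numeric: the values are still strings
print(parsed["amount"].dtype)  # int64
print(parsed["amount"].sum())  # 36200
```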

3) Loading data with different column names

Pass the new column names to names, in order.
names = ['col1', 'col2']

# header=0 replaces the file's existing header row; omit it if the file has no header
df = pd.read_csv('./data/train.csv', header=0, names=['품목', '크기', '금액', '수수료'])
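One thing to watch with names: if the file already has a header row, passing names alone keeps that row as the first data row. A short sketch with hypothetical in-memory data shows the difference that header=0 makes.

```python
import io

import pandas as pd

# Hypothetical CSV text that already contains a header row
csv_text = "item,size,price,fee\napple,L,1200,100\npear,M,900,80\n"
new_names = ["품목", "크기", "금액", "수수료"]

# names only: the original header line is kept as a data row
without_header = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names + header=0: the original header line is replaced
with_header = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(len(without_header))  # 3
print(len(with_header))     # 2
```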

4) Using the first column as the index

Column positions are zero-based, so the first column is 0.

index_col = 0

df = pd.read_csv('./data/train.csv', index_col=0)
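As a quick check of how index_col counts positions, the sketch below reads hypothetical in-memory data with index_col=0 and index_col=1 and compares which column ends up as the index.

```python
import io

import pandas as pd

# Hypothetical CSV text with three columns
csv_text = "id,name,score\n101,kim,90\n102,lee,85\n"

by_first = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column
by_second = pd.read_csv(io.StringIO(csv_text), index_col=1)  # second column

print(by_first.index.name)     # id
print(by_second.index.name)    # name
print(list(by_first.columns))  # ['name', 'score']
```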

5) Loading only specific columns

usecols = ['col1', 'col2']

df = pd.read_csv('./data/train.csv', usecols=['품목', '크기'])

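6) Saving a DataFrame to a CSV file

Pay attention to the encoding: 'utf-8-sig' writes a byte order mark (BOM), which helps programs such as Excel detect UTF-8 and display Korean text correctly.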

df.to_csv('.csv', encoding = 'utf-8-sig')
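What sets 'utf-8-sig' apart from plain 'utf-8' is the three-byte BOM at the start of the file. A minimal sketch, assuming a hypothetical file name out.csv and made-up data, writes a file and inspects its first bytes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data
df_out = pd.DataFrame({"품목": ["사과", "배"], "금액": [1200, 900]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")  # hypothetical file name
    df_out.to_csv(path, index=False, encoding="utf-8-sig")

    # The file starts with the UTF-8 byte order mark
    with open(path, "rb") as f:
        first_bytes = f.read(3)

    # Reading with the same encoding strips the BOM again
    df_in = pd.read_csv(path, encoding="utf-8-sig")

print(first_bytes)            # b'\xef\xbb\xbf'
print(list(df_in.columns))    # ['품목', '금액']
```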

2. Creating a DataFrame

1) From a list and an array

import numpy as np

col_name1 = ['col1']
list1 = [1,2,3]
array1 = np.array(list1)
print('array1 shape: ', array1.shape)

# Create a DataFrame from a list
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame from a 1-D list: \n', df_list1)

# Create a DataFrame from a NumPy ndarray
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame from a 1-D array: \n', df_array1)

The code above prints the shape of array1, (3,), followed by two identical DataFrames, each with a single column col1 holding 1, 2, 3.

2) From a dictionary

# Keys become the column names (strings); values are lists holding each column's data
# (avoid naming the variable dict, which would shadow the built-in)
data_dict = {'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}
df_dict = pd.DataFrame(data_dict)
print('DataFrame from a dictionary: \n', df_dict)

3) From a nested list

a = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
df3 = pd.DataFrame(a)

4) Including NaN missing values

import numpy as np
# np.nan marks the missing value (the np.NaN alias was removed in NumPy 2.0)
a = {'company': ['abc', '회사', 123], '직원수': [400, 10, 6], '위치': ['Seoul', np.nan, 'Busan']}
df4 = pd.DataFrame(a)

The resulting DataFrame has three rows, with NaN in the second row of the '위치' column.
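Once a DataFrame contains NaN, the usual next step is to find and handle it. The sketch below rebuilds the same DataFrame and uses isna and fillna; the replacement value 'unknown' is only an illustrative choice.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({
    "company": ["abc", "회사", 123],
    "직원수": [400, 10, 6],
    "위치": ["Seoul", np.nan, "Busan"],
})

# Count missing values per column
missing_per_column = df4.isna().sum()
print(missing_per_column)

# Fill the missing location with a placeholder (illustrative value)
filled = df4.fillna({"위치": "unknown"})
print(filled["위치"].tolist())  # ['Seoul', 'unknown', 'Busan']
```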

3. Copying a DataFrame

1) Shallow Copy

Plain assignment does not duplicate the data; it only binds a second name to the same DataFrame, so modifying it through one name changes what the other name sees as well.

df = pd.DataFrame({'a': [1, 2, 3], 'b' : [4, 5, 6], 'c' : [7, 8, 9]})
df2 = df  # df2 refers to the same object as df

The change shows up in both because, as the figure on the left illustrates, a single object is shared by two variable names.


To avoid this, use a deep copy as shown below.

2) Deep Copy

df = pd.DataFrame({'a': [1, 2, 3], 'b' : [4, 5, 6], 'c' : [7, 8, 9]})

# Method 1
import copy
df2 = copy.deepcopy(df)

# Method 2
df2 = df.copy()  # deep=True by default; preferred over calling df.__deepcopy__() directly
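The difference between sharing one object and making a real copy is easy to verify. In this sketch, the names alias and independent are illustrative; one points at the same DataFrame and the other is a copy.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

alias = df              # second name for the same object
independent = df.copy() # separate copy of the data

df.loc[0, "a"] = 100    # modify the original

print(alias is df)                # True
print(alias.loc[0, "a"])          # 100
print(independent.loc[0, "a"])    # 1
```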

4. Adding and Modifying Columns and Rows

1) Column: adding a column

(1) Creating a column filled with a single value

Assign a scalar to a new column name to fill the whole column with that value.

df['col_1'] = 0  # adds a column in which every row is 0

(2) Creating a column from existing columns

A new column can be derived by transforming the values of existing columns.

titanic_df['Age_by_10'] = titanic_df['Age'] * 10
titanic_df['Family_no'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1

(3) Creating a column at a specific position

insert(position, column name, data for the column)

df.insert(2, 'col_name', 'No.') 
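A small sketch of insert on a three-column DataFrame follows. Note that insert modifies the DataFrame in place, and that inserting a name that already exists raises a ValueError unless allow_duplicates=True is passed.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Insert at position 2 (zero-based); the scalar is repeated for every row
df.insert(2, "col_name", "No.")
print(list(df.columns))  # ['a', 'b', 'col_name', 'c']

# Inserting the same column name again is rejected
try:
    df.insert(0, "col_name", 0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```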

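2) Adding a row

Neither method below takes an inplace argument. DataFrame.append returned a new DataFrame and was removed in pandas 2.0, so use pd.concat and assign the result back; assigning through df.loc modifies df directly.

# Method 1 (pd.concat in place of the removed df.append)
df = pd.concat([df, pd.DataFrame([{'a': 7, 'b': 8, 'c': 9}])], ignore_index=True)

# Method 2
df.loc[6] = [7, 8, 9]

Either method adds a row with a = 7, b = 8, c = 9.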
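Here is a runnable sketch of both ways to add a row. It uses pd.concat for the first one, and len(df) as the new label in the second so that the index stays consecutive; that label choice is a suggestion, not something the row assignment requires.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# pd.concat returns a new DataFrame; df itself is unchanged
new_row = pd.DataFrame([{"a": 7, "b": 8, "c": 9}])
appended = pd.concat([df, new_row], ignore_index=True)
print(len(df), len(appended))  # 3 4

# Assigning to a new label through .loc modifies df in place
df.loc[len(df)] = [10, 11, 12]
print(df.index.tolist())  # [0, 1, 2, 3]
```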

3) Column: renaming columns

(1) Renaming all columns

The list must contain exactly as many names as there are columns.

df.columns = ['A','B','C','D','E','F',"G",'H',"I",'J','K','L']

(2) Renaming a single column

rename(columns={'old column name': 'new column name'})

df.rename(columns={'Name':'name'}, inplace=True)
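The two renaming approaches behave differently when something does not match. A brief sketch with a hypothetical two-column DataFrame: rename without inplace returns a new DataFrame, and assigning a list of the wrong length to columns raises a ValueError.

```python
import pandas as pd

# Hypothetical two-column DataFrame
df = pd.DataFrame({"Name": ["kim"], "Age": [30]})

# Without inplace=True, rename leaves df untouched and returns a renamed copy
renamed = df.rename(columns={"Name": "name"})
print(list(df.columns))       # ['Name', 'Age']
print(list(renamed.columns))  # ['name', 'Age']

# Assigning the wrong number of names fails
try:
    df.columns = ["only_one"]
    length_checked = False
except ValueError:
    length_checked = True
print(length_checked)  # True
```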

5. Deleting from a DataFrame

drop()

axis = 0 drops rows; axis = 1 drops columns.
Pass inplace = True to modify the original; otherwise drop() returns a new DataFrame and leaves the original unchanged.

1) Dropping columns

(1) Dropping by column name

df.drop('col_name', axis=1, inplace=True)

(2) Dropping by a range of column positions

idx = list(range(3, 5))  # column positions 3 and 4
titanic_df.drop(titanic_df.columns[idx], axis=1)  # returns a new DataFrame

2) Dropping rows

(1) Dropping by index

Rows are dropped by their index labels.

df.drop([0,1,2], axis=0)

(2) Dropping rows whose index matches a condition

df.drop(df[(df['a'] < 3) & (df['c'] == 4)].index)
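To make the conditional drop concrete, the sketch below applies it to a small DataFrame with a condition that matches exactly one row (c == 8 is chosen here so that something is actually dropped). Keeping the rows where the condition is false, with ~mask, gives the same result.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Boolean mask: True only for the row where a < 3 and c == 8
mask = (df["a"] < 3) & (df["c"] == 8)

# Drop by the index labels of the matching rows
dropped = df.drop(df[mask].index)

# Equivalent: keep the rows that do not match
kept = df[~mask]

print(dropped.index.tolist())  # [0, 2]
print(dropped.equals(kept))    # True
```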

6. Working with the Index

1) Extracting the Index object

# Extract the Index object
indexes = titanic_df.index
print(indexes)
# Convert the Index object to an array of its values
print('Index object array values: \n', indexes.values)  # the Index holds its identifying labels as a 1-D array

# Like an ndarray, it supports single-value access and slicing
print(indexes.shape)
print(indexes[:5].values)
print(indexes.values[:5])

2) reset index

reset_index assigns a new consecutive integer index and adds the old index as a new column, named 'index' unless the index already has a name of its own.

df.reset_index(inplace=False) 

3) set index

Convert existing columns into the index.

df.set_index(['col1', 'col2'], inplace = False)
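set_index and reset_index undo each other, which the following sketch shows on a hypothetical DataFrame (the val column is made up). It also shows drop=True, which discards the old index instead of adding it as a column.

```python
import pandas as pd

# Hypothetical DataFrame; 'val' is an illustrative extra column
df = pd.DataFrame({"col1": ["x", "y"], "col2": [1, 2], "val": [10, 20]})

# Move two columns into a MultiIndex, then bring them back
indexed = df.set_index(["col1", "col2"])
restored = indexed.reset_index()
print(list(indexed.index.names))  # ['col1', 'col2']
print(list(restored.columns))     # ['col1', 'col2', 'val']

# After filtering, the old labels survive until the index is reset
filtered = df[df["val"] > 10]                    # index is [1]
with_old = filtered.reset_index()                # old index kept as 'index' column
without_old = filtered.reset_index(drop=True)    # old index discarded
print(list(with_old.columns))     # ['index', 'col1', 'col2', 'val']
print(without_old.index.tolist()) # [0]
```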