
[Pandas] DataFrame Basics: Creating, Modifying, and Deleting

by 유스 :) 2023. 7. 10.

    First, import pandas, the library everything below builds on.

    import pandas as pd

    1. Loading a DataFrame

    1) Specifying the encoding

    Set the encoding explicitly so that Korean text in the file is not garbled when it is read.

    The encoding below loaded Korean text without corruption on macOS.

    encoding = 'utf-8'

    df = pd.read_csv('./data/train.csv', encoding = 'utf-8')
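
    As a minimal sketch of why the encoding has to match the file, the snippet below encodes the same text two ways and reads each back. The sample text and the use of in-memory buffers instead of ./data/train.csv are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: a tiny CSV that contains Korean text.
text = "품목,금액\n사과,1000\n"

# The same text saved in two different encodings.
raw_utf8 = text.encode("utf-8")
raw_cp949 = text.encode("cp949")

# Reading with the matching encoding recovers the Korean text intact.
df_utf8 = pd.read_csv(io.BytesIO(raw_utf8), encoding="utf-8")
df_cp949 = pd.read_csv(io.BytesIO(raw_cp949), encoding="cp949")

print(df_utf8)
print(df_cp949)
```

    If a file that looks fine elsewhere still comes out garbled, it was probably saved in a different encoding (for Korean files, often cp949 or euc-kr), and passing that name to encoding is worth trying.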

    2) Loading numbers without thousands separators

    When numbers use commas (,) as thousands separators, the column is read as object dtype, and because of the commas it cannot be converted to a numeric type in one step. Strip the commas at load time instead.

    thousands = ","

    df = pd.read_csv('./data/train.csv',thousands="," )
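
    The effect can be checked on a small in-memory sample. The column names and values below are made up for illustration; the amounts are quoted because they contain the field delimiter.

```python
import io
import pandas as pd

# Hypothetical sample data: amounts are quoted because they contain commas.
csv_text = 'item,amount\napple,"1,200"\npear,"35,000"\n'

# Without thousands=, the amounts stay as strings such as "1,200".
raw = pd.read_csv(io.StringIO(csv_text))

# With thousands=",", the commas are stripped while parsing and the column is numeric.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["amount"].tolist())
print(parsed["amount"].tolist())
print(parsed["amount"].sum())
```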

    3) Renaming columns while loading

    Pass the new column names to names, in order. If the file already has a header row, also pass header=0 so that the old header is replaced rather than read as data.
    names = ['col1', 'col2']

    # New names: 품목 (item), 크기 (size), 금액 (amount), 수수료 (fee)
    df = pd.read_csv('./data/train.csv', names=['품목', '크기', '금액', '수수료'])
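
    One detail worth checking with names is what happens to a header row that is already in the file. The sketch below uses made-up data to compare names on its own with names plus header=0.

```python
import io
import pandas as pd

# Hypothetical file contents with an existing header row.
csv_text = "a,b\n1,2\n3,4\n"

# names alone: the old header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=["x", "y"])

# names with header=0: the old header line is replaced by the new names.
replaced = pd.read_csv(io.StringIO(csv_text), names=["x", "y"], header=0)

print(kept)
print(replaced)
```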

    4) Using the first column as the index

    index_col is zero-based, so the first column is 0.

    index_col = 0

    df = pd.read_csv('./data/train.csv', index_col=0)
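
    A quick way to confirm which column index_col picks is to try it on a small sample. The data below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample with three columns.
csv_text = "id,name,score\n10,kim,90\n20,lee,85\n"

first = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column becomes the index
second = pd.read_csv(io.StringIO(csv_text), index_col=1)  # second column becomes the index

print(first.index.name)
print(second.index.name)
```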

    5) Loading only the columns you need

    usecols = ['col1', 'col2']

    # Columns: 품목 (item), 크기 (size)
    df = pd.read_csv('./data/train.csv', usecols=['품목', '크기'])
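
    As a small illustration with made-up English column names, usecols keeps only the listed columns. The order in which they are listed is ignored, so the result follows the order in the file.

```python
import io
import pandas as pd

# Hypothetical sample with four columns.
csv_text = "item,size,amount,fee\npen,S,1000,50\nbook,M,12000,600\n"

# Only the two requested columns are parsed; listing order does not matter.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["size", "item"])

print(list(subset.columns))
```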

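    6) Saving a DataFrame to a CSV file

    Pay attention to the encoding when saving. 'utf-8-sig' writes a byte order mark (BOM) at the start of the file, which helps programs such as Excel recognize it as UTF-8.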

    df.to_csv('.csv', encoding = 'utf-8-sig')
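
    To see what 'utf-8-sig' changes, the sketch below writes a small DataFrame to a temporary file and inspects the first three bytes. The file name out.csv and the sample data are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({"품목": ["사과"], "금액": [1000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")  # hypothetical file name
    df_out.to_csv(path, index=False, encoding="utf-8-sig")

    # The file starts with the UTF-8 byte order mark.
    with open(path, "rb") as f:
        head = f.read(3)

    # Reading with the same encoding strips the BOM again.
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(head)
print(restored)
```

    Passing index=False here keeps the row index out of the file; without it, to_csv writes the index as an extra first column.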

    2. Creating a DataFrame

    1) Creating from a list and an array

    import numpy as np
    
    col_name1 = ['col1']
    list1 = [1,2,3]
    array1 = np.array(list1)
    print('array1 shape: ', array1.shape)
    
    # Create a DataFrame from a list
    df_list1 = pd.DataFrame(list1, columns=col_name1)
    print('DataFrame from a 1-D list: \n', df_list1)
    
    # Create a DataFrame from a NumPy ndarray
    df_array1 = pd.DataFrame(array1, columns=col_name1)
    print('DataFrame from a 1-D array: \n', df_array1)

    Both calls produce the same DataFrame: a single column col1 holding 1, 2, 3. array1.shape is (3,).

    2) Creating from a dictionary

    # Keys become the column names (strings); values are lists holding each column's data.
    # The variable is named data rather than dict to avoid shadowing the built-in.
    data = {'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}
    df_dict = pd.DataFrame(data)
    print('DataFrame from a dictionary: \n', df_dict)

    3) Creating from a nested list

    a = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
    df3 = pd.DataFrame(a)

    4) Creating with NaN missing values

    import numpy as np
    # Value 회사 means "company"; columns are 직원수 (number of employees) and 위치 (location)
    a = {'company': ['abc', '회사', 123], '직원수': [400, 10, 6], '위치': ['Seoul', np.nan, 'Busan']}
    df4 = pd.DataFrame(a)

    The result is a three-row DataFrame whose 위치 column holds NaN in the second row.
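
    Once a DataFrame contains NaN, the missing entries can be located and replaced. The sketch below uses made-up English column names; the replacement value "unknown" is only an example.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"city": ["Seoul", np.nan, "Busan"], "staff": [400, 10, 6]})

missing = df_nan["city"].isna()               # True where the value is missing
n_missing = int(missing.sum())                # number of missing values
filled = df_nan.fillna({"city": "unknown"})   # returns a new DataFrame

print(missing.tolist())
print(n_missing)
print(filled)
```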

    3. Copying a DataFrame

    1) Shallow Copy

    Assigning a DataFrame to a second name does not duplicate its data: both names refer to the same object, so modifying one also changes the other. (Strictly speaking this is plain assignment rather than a copy.)

    df = pd.DataFrame({'a': [1, 2, 3], 'b' : [4, 5, 6], 'c' : [7, 8, 9]})
    df2 = df # df2 is another name for the same DataFrame

    The change shows up in both because a single object is shared under two variable names, as the figure in the original post illustrated.
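
    The sharing can be verified directly. In this sketch the names df_a and alias are chosen for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
alias = df_a               # no data is copied; both names point to one object

alias.loc[0, "a"] = 100    # write through the second name

same_object = alias is df_a
print(same_object)
print(df_a.loc[0, "a"])    # the change is visible through the first name too
```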

    To avoid this, use a deep copy as shown below.

    2) Deep Copy

    df = pd.DataFrame({'a': [1, 2, 3], 'b' : [4, 5, 6], 'c' : [7, 8, 9]})
    
    # Method 1
    import copy
    df2 = copy.deepcopy(df)
    
    # Method 2 (idiomatic equivalent of calling df.__deepcopy__() directly)
    df2 = df.copy()  # deep=True is the default
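
    A short check that a deep copy really is independent of the original. The variable names below are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
independent = original.copy()    # deep=True is the default

independent.loc[0, "a"] = 100    # modify only the copy

print(original.loc[0, "a"])      # the original keeps its old value
print(independent.loc[0, "a"])
```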

    4. Adding and Modifying Columns and Rows

    1) Column: adding columns

    (1) Creating a column filled with a single value

    Assign a scalar to a new column name and every row of that column is filled with the same value.

    df['col_1'] = 0  # a column with every value set to 0 is added to the existing DataFrame

    (2) Creating a column from existing columns

    A new column can be derived by transforming the values of existing columns.

    titanic_df['Age_by_10'] = titanic_df['Age'] * 10
    titanic_df['Family_no'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1

    (3) Creating a column at a specific position

    insert(position, column name, data for the column)

    df.insert(2, 'col_name', 'No.') 
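
    A minimal sketch of how insert behaves, using a made-up DataFrame. It modifies the DataFrame in place, and the data can be either a scalar or a Series.

```python
import pandas as pd

df_ins = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Position 1 means the new column becomes the second column; the scalar fills every row.
df_ins.insert(1, "tag", "No.")

# A Series derived from an existing column works as the data too.
df_ins.insert(0, "double_a", df_ins["a"] * 2)

print(list(df_ins.columns))
print(df_ins)
```

    insert raises a ValueError if the column name already exists, unless allow_duplicates=True is passed.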

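    2) Adding rows

    Note that Method 1 returns a new DataFrame, so the result must be assigned back to keep it; there is no inplace option. DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead. Method 2 modifies the DataFrame in place.

    # Method 1
    df = pd.concat([df, pd.DataFrame([{'a' : 7, 'b' : 8, 'c' : 9}])], ignore_index = True)
    
    # Method 2
    df.loc[6] = [7, 8, 9]

    Either way, the new row is added at the bottom of the DataFrame.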
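
    The two ways of adding a row can be compared on a small example. The DataFrame below is made up, and len(base) is used as the new label on the assumption that the index is the default 0, 1, 2, ... range.

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})

# pd.concat returns a new DataFrame; base itself is not changed by this call.
new_row = pd.DataFrame([{"a": 7, "b": 8, "c": 9}])
extended = pd.concat([base, new_row], ignore_index=True)
rows_before_loc = len(base)

# .loc with a label that does not exist yet adds the row in place.
base.loc[len(base)] = [10, 11, 12]

print(extended)
print(base)
```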

    3) Column: renaming columns

    (1) Renaming all columns at once

    The list must contain exactly as many names as the DataFrame has columns.

    df.columns = ['A','B','C','D','E','F',"G",'H',"I",'J','K','L']

    (2) Renaming a single column

    rename(columns={'old column name' : 'new column name'})

    df.rename(columns={'Name':'name'}, inplace=True)
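
    The difference between the two approaches shows up on a small made-up DataFrame: rename touches only the columns named in the mapping, while assigning to columns needs a name for every column.

```python
import pandas as pd

df_r = pd.DataFrame({"Name": ["kim"], "Age": [30]})

# rename returns a new DataFrame unless inplace=True is passed.
renamed = df_r.rename(columns={"Name": "name"})

# Assigning a list of the wrong length to columns raises a ValueError.
try:
    df_r.columns = ["only_one"]
    length_error = False
except ValueError:
    length_error = True

print(list(renamed.columns))
print(length_error)
```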

    5. Deleting from a DataFrame

    drop()

    axis = 0 refers to rows, and axis = 1 refers to columns.
    Pass inplace = True to modify the original; otherwise drop returns a new DataFrame and leaves the original unchanged.

    1) Deleting columns

    (1) Deleting by column name

    df.drop('col_name', axis=1, inplace=True)

    (2) Deleting by a range of column positions

    idx = list(range(3, 5))  # column positions 3 and 4
    titanic_df.drop(titanic_df.columns[idx], axis=1)  # returns a new DataFrame
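
    The same pattern on a small stand-in DataFrame (titanic_df is not available here, so the data is made up):

```python
import pandas as pd

df_cols = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list("abcdef"))

idx = list(range(3, 5))                      # positions 3 and 4, i.e. columns d and e
names_to_drop = df_cols.columns[idx]         # positions are translated to column names
trimmed = df_cols.drop(names_to_drop, axis=1)

print(list(names_to_drop))
print(list(trimmed.columns))
```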

    2) Deleting rows

    (1) Deleting by index

    Rows are deleted by their index labels (the row numbers when the default integer index is in use).

    df.drop([0,1,2], axis=0)

    (2) Deleting by the index of rows that match a condition

    df.drop(df[(df['a'] < 3) & (df['c'] == 4)].index)
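
    Broken into steps on made-up data, the conditional drop looks like this. Keeping the rows where the condition is false is shown as an equivalent alternative.

```python
import pandas as pd

df_c = pd.DataFrame({"a": [1, 2, 3, 4], "c": [4, 9, 4, 4]})

# Boolean mask: True for rows where both conditions hold.
mask = (df_c["a"] < 3) & (df_c["c"] == 4)

# Drop the index labels of the matching rows.
dropped = df_c.drop(df_c[mask].index)

# Equivalent alternative: keep the rows where the mask is False.
kept = df_c[~mask]

print(dropped)
```

    Each condition needs its own parentheses because & binds more tightly than comparison operators such as < and ==.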

    6. Working with the Index

    1) Extracting the Index object

    # Extract the Index object
    indexes = titanic_df.index
    print(indexes)
    # Convert the Index object to an array of its actual values
    print('Index object array values: \n', indexes.values) # an Index holds its identifying labels as a 1-D array
    
    # Like an ndarray, it supports single-value access and slicing
    print(indexes.shape)
    print(indexes[:5].values)
    print(indexes.values[:5])

    2) reset index

    reset_index assigns a new consecutive integer index and adds the previous index as a new column named 'index'.

    df.reset_index(inplace=False) 
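
    A small made-up example of what reset_index returns, with and without keeping the old index as a column:

```python
import pandas as pd

df_i = pd.DataFrame({"v": [10, 20, 30]}, index=[5, 7, 9])

reset = df_i.reset_index()                    # old index is kept as an 'index' column
reset_dropped = df_i.reset_index(drop=True)   # old index is discarded

print(reset)
print(reset_dropped)
```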

    3) set index

    set_index turns existing columns into the index.

    df.set_index(['col1', 'col2'], inplace = False)
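
    Passing a list of two columns, as above, produces a two-level MultiIndex. The sketch below uses made-up data to show the result and how a row is then looked up by a tuple of labels.

```python
import pandas as pd

df_s = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1, 2, 1],
    "val": [10, 20, 30],
})

# The two columns move out of the data and become index levels.
indexed = df_s.set_index(["col1", "col2"])

# Look up a single value by a (col1, col2) label pair.
value = indexed.loc[("x", 2), "val"]

print(indexed)
print(value)
```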