
HW4

Daily COVID-19 confirmed cases prediction

 

Let's predict the daily number of confirmed COVID-19 cases in Spain using a recurrent neural network (RNN). Predicting the daily number of confirmed cases is a regression problem.

Dataset

• Daily incidence of the 52 Spanish provinces for 450 days, from January 1st, 2020 to March 27th, 2021

 

Q1

Let's load the dataset using the "pickle" Python module.

The pickle module implements binary protocols for serializing and deserializing a Python object structure.

Simply put, the pickle module can save and load Python objects such as lists, dictionaries, and tuples.

Usage: Save the object

import pickle

with open('filename', 'wb') as f:
    pickle.dump(data, f)

Load the object

with open('filename', 'rb') as f:
    data = pickle.load(f)

 

Download the COVID-19 data named “practice_data.txt” and load it with the pickle module.

Attach the code used to load the data and print the shape of the data. What does the data shape mean?

 

<Q1.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
import matplotlib.pyplot as plt


# Load the pickled array of daily confirmed cases (450 days x 52 provinces)
with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

print(data.shape)  # (450, 52)

# Hold out the last 100 days as test data (used from Q2 onward)
train_data = data[:350]
test_data = data[350:]

Results of the code

The Spanish COVID-19 data contains the daily confirmed cases of 52 provinces over 450 days. Hence there are 450 rows, and each row holds the confirmed-case counts of the 52 provinces in 52 columns, so the shape (450, 52) means (days, provinces).

 

Q2

The input shape of the RNN is (batch_size, time_steps, feature).

Let's preprocess the loaded data so that it has 10 time steps. First, split the data into train and test sets: use the last 100 days as test data and the rest as training data. Then, normalize the train and test data using the maximum value of the train data.
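To see what the windowing does before applying it, here is a toy sketch (a made-up 6-day, 1-feature series with n=2, not the assignment data):

import numpy as np

series = np.arange(6.0).reshape(6, 1)                          # 6 "days", 1 feature
n = 2
x = np.array([series[i:i+n] for i in range(len(series) - n)])  # shape (4, 2, 1)
y = np.array([series[i+n] for i in range(len(series) - n)])    # shape (4, 1)
print(x.shape, y.shape)  # each window of n days predicts the next day

With 350 training days and n = 10, the same logic yields 340 windows, matching the shapes reported below.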

 Attach the code and write the shape of input and target data.

 

<Q2.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
import matplotlib.pyplot as plt


with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

print(data.shape)

train_data = data[:350]
test_data = data[350:]

def normalize(train_data, test_data):
    import copy
    train_data_norm = copy.deepcopy(train_data)
    test_data_norm = copy.deepcopy(test_data)
    # Per-province maximum taken from the training data only,
    # so no information leaks from the test period
    max_val = train_data_norm.max(axis=0)
    train_data_norm /= max_val
    test_data_norm /= max_val
    print(train_data_norm)
    print(test_data_norm)
    return train_data_norm, test_data_norm

train_data_norm, test_data_norm = normalize(train_data, test_data)

def preprocess_data(data, n=10):
    x = []
    y = []
    # Sliding window: each input is n consecutive days, the target is the next day
    for i in range(data.shape[0]-n):
        last = i + n
        x.append(data[i:last])
        y.append(data[last])
    print('input_shape:', np.array(x).shape)
    print('target_shape:', np.array(y).shape)
    return np.array(x), np.array(y)

x_train, y_train = preprocess_data(train_data_norm)
x_test, y_test = preprocess_data(test_data_norm)

print(x_train)
print(x_test)

Results of the code

The shape means (batch_size, time_steps, feature). The input and target data have 10 time steps; after splitting into 350 days of train data and 100 days of test data and preprocessing with a time step of 10, the batch sizes come out to 340 (train) and 90 (test), and since the data covers 52 provinces, the feature dimension is 52.

 

Q3

Let's build an RNN model and train it with the preprocessed data.

Build the model as a simple RNN with the following hyper-parameters:

Units = 52

Activation = Sigmoid

Optimizer = Adam with 0.001 learning rate

Loss function = mean squared error

Batch size = 20

Epoch = 1000

 

Before training the model, let's find the optimal number of epochs with time series cross-validation (CV).

Unlike typical CV, time series CV never validates on data that precedes its training fold: the training window grows forward in time, and the validation block always comes after it.

This type of CV is provided in the sklearn module as "TimeSeriesSplit".

The usage of "TimeSeriesSplit" is the same as typical CV.
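A minimal sketch of how "TimeSeriesSplit" partitions indices (10 toy samples, 4 splits):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10)
for fold, (tr, va) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f'fold {fold}: train={tr}, val={va}')
# fold 0: train=[0 1]              val=[2 3]
# fold 1: train=[0 1 2 3]          val=[4 5]
# fold 2: train=[0 1 2 3 4 5]      val=[6 7]
# fold 3: train=[0 1 2 3 4 5 6 7]  val=[8 9]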

 

We need to follow the procedure below to find and use the optimal epochs:

1. Build and train the RNN model several times with the CV sets and enough epochs (1000 epochs)

2. Average the "validation" losses over all the folds (np.mean)

3. Find "the optimal epoch" that shows the minimum validation loss (np.argmin)

4. After finding the optimal epoch, train the model with the "optimal" epochs and the whole training dataset (a direct sketch of steps 1-3 follows this list).
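Read literally, steps 1-3 can be implemented by training each fold once for 1000 epochs and averaging the per-epoch validation curves; a sketch, reusing tscv, build_model, x_train, and y_train from Q3.py below:

val_curves = []
for tr, va in tscv.split(x_train):
    model = build_model()
    h = model.fit(x_train[tr], y_train[tr],
                  validation_data=(x_train[va], y_train[va]),
                  epochs=1000, batch_size=20, verbose=0)
    val_curves.append(h.history['val_loss'])         # one loss per epoch

mean_curve = np.mean(val_curves, axis=0)             # step 2: average over folds
optimal_epochs = int(np.argmin(mean_curve)) + 1      # step 3: minimum mean val loss

Q3.py instead retrains from scratch on a coarse grid of epoch counts, which is slower but follows the same idea.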

 

Attach the code that builds a model, finds the optimal epochs (CV with 4 folds), and performs the final training with the identified optimal epochs.

Attach the train and validation losses of each fold, and the final training losses

 

<Q3.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
import matplotlib.pyplot as plt


with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

print(data.shape)

train_data = data[:350]
test_data = data[350:]

def normalize(train_data, test_data):
    import copy
    train_data_norm = copy.deepcopy(train_data)
    test_data_norm = copy.deepcopy(test_data)
    max_val = train_data_norm.max(axis=0)  # per-province max from the training data
    train_data_norm /= max_val
    test_data_norm /= max_val
    print(train_data_norm)
    print(test_data_norm)
    return train_data_norm, test_data_norm

train_data_norm, test_data_norm = normalize(train_data, test_data)

def preprocess_data(data, n=10):
    x = []
    y = []
    for i in range(data.shape[0]-n):
        last = i + n
        x.append(data[i:last])
        y.append(data[last])
    print('input_shape:', np.array(x).shape)
    print('target_shape:', np.array(y).shape)
    return np.array(x), np.array(y)

x_train, y_train = preprocess_data(train_data_norm)
x_test, y_test = preprocess_data(test_data_norm)

print(x_train)
print(x_test)

def build_model():
    model = Sequential()
    model.add(SimpleRNN(52, activation='sigmoid'))
    # NOTE: Dense(1) emits one value per sample while the target has 52 columns;
    # the squared error broadcasts, so the output is driven toward the per-day
    # mean over provinces
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mse'])
    return model

def plot_loss(h, title='loss'):
    plt.plot(h.history['loss'])
    plt.plot(h.history['val_loss'])
    plt.title(title)
    plt.ylabel('loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'])

tscv = TimeSeriesSplit(n_splits=4)
epoch_grid = list(range(1, 1000+1, 25))
overall_loss = []

for epochs in epoch_grid:
    validation_loss = []
    for train_index, val_index in tscv.split(x_train):
        model = build_model()
        h = model.fit(x_train[train_index], y_train[train_index],
                      validation_data=(x_train[val_index], y_train[val_index]),
                      epochs=epochs, batch_size=20, verbose=0)
        validation_loss.append(h.history['val_loss'][-1])
        print('loss', h.history['loss'][-1], '\nval_loss:', h.history['val_loss'][-1])
        plot_loss(h)
        plt.savefig('loss.png')  # save before show(); show() empties the figure
        plt.show()
        plt.clf()

    val_loss_mean = np.mean(validation_loss)
    print(epochs, 'mean:', val_loss_mean, '\n')
    overall_loss.append(val_loss_mean)

# Step 3: the optimal epoch count minimizes the mean validation loss
optimal_epochs = epoch_grid[int(np.argmin(overall_loss))]
print('Optimal Epochs:', optimal_epochs)

# Step 4: final training with the optimal epochs on the whole training dataset
model = build_model()
h = model.fit(x_train, y_train, epochs=optimal_epochs, batch_size=20, verbose=0)
print('final training loss:', h.history['loss'][-1])

Loss of four folds: first fold

Loss of four folds: second fold

Loss of four folds: third fold

Loss of four folds: fourth fold

Loss of the final training on the whole training data set

Optimal Epochs: 900

Q4

Combining the train and test data, draw the “overall” daily confirmed cases of the target (shown in black) and the predicted results (shown in red). See the example below. 

- Overall cases means the summation over all 52 provinces in Spain, not the cases of each province.

 

<Q4.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
import matplotlib.pyplot as plt


with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

overall = data.sum(axis=1)  # overall daily cases: sum over the 52 provinces

max_val = np.max(data)  # single global maximum here, unlike the per-province max of Q2
data_norm = data / max_val

def preprocess_data(data, n=10):
    x = []
    y = []
    for i in range(data.shape[0]-n):
        last = i + n
        x.append(data[i:last])
        y.append(data[last])
    print('input_shape:', np.array(x).shape)
    print('target_shape:', np.array(y).shape)
    return np.array(x), np.array(y)

x_data, y_data = preprocess_data(data_norm)

def build_model():
    model = Sequential()
    model.add(SimpleRNN(52, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mse'])
    return model

def plot_loss(h, title='loss'):
    plt.plot(h.history['loss'])
    plt.plot(h.history['val_loss'])
    plt.title(title)
    plt.ylabel('loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'])

model = build_model()
h = model.fit(x_data, y_data, epochs=800, batch_size=20, verbose=0)
predict = model.predict(x_data)
# Undo the normalization; the single output approximates the per-province mean,
# so multiplying by 52 recovers the overall (summed) cases
predict = predict * max_val * 52

# Align predictions with calendar days: predict[j] is the forecast for day j+10,
# so predict[i-10] corresponds to day i (the first 10 days get a 0 placeholder)
predict_list = []
for i in range(x_data.shape[0]):
    if i < 10:
        predict_list.append(0.0)
    else:
        predict_list.append(float(predict[i - 10]))

predict = np.array(predict_list)
length = range(0, len(predict))

def plot_graph():
    plt.plot(length, overall[:len(predict)], color='black')  # target in black
    plt.plot(length, predict, color='red')                   # prediction in red
    plt.title('COVID 19 of Overall Spain')
    plt.xlabel('Days')
    plt.ylabel('Confirmed Cases')
    plt.legend(['Overall', 'Predict'])

plot_graph()
plt.savefig("Q4.png")

Overall cases (result of the code)

QE1

Is there any way to improve the performance?

 

Although we found the optimal epoch, there are hyper-parameters other than the epoch count that affect model performance. If we searched for the optimal hyper-parameters by varying the learning rate, the number of layers, the number of hidden neurons, and so on, and then trained with those values, we could expect somewhat better performance. In addition, using an LSTM, which as covered in class can keep long-term as well as short-term memory, should mitigate the vanishing-gradient problem and yield better performance (QE2 below tries GRU and LSTM).
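As a rough illustration of such a search (not part of the assignment; the candidate values below are arbitrary assumptions, and tscv, x_train, y_train, Sequential, SimpleRNN, Dense, Adam, np are reused from Q3.py):

def build_model_hp(units, lr):
    model = Sequential()
    model.add(SimpleRNN(units, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=lr), loss='mse')
    return model

best_cfg, best_loss = None, np.inf
for units in (26, 52, 104):            # hypothetical candidates
    for lr in (0.01, 0.001, 0.0001):   # hypothetical candidates
        fold_losses = []
        for tr, va in tscv.split(x_train):
            m = build_model_hp(units, lr)
            hist = m.fit(x_train[tr], y_train[tr],
                         validation_data=(x_train[va], y_train[va]),
                         epochs=100, batch_size=20, verbose=0)
            fold_losses.append(hist.history['val_loss'][-1])
        if np.mean(fold_losses) < best_loss:
            best_cfg, best_loss = (units, lr), np.mean(fold_losses)
print('best (units, lr):', best_cfg, 'mean val loss:', best_loss)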

 

QE2

Repeat Q3-Q4 for GRU and LSTM.

Attach the code and result figures.

Discuss which model performs best.

 

<QE2_1.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, GRU, Dense
import matplotlib.pyplot as plt


with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

print(data.shape)

train_data = data[:350]
test_data = data[350:]

def normalize(train_data, test_data):
    import copy
    train_data_norm = copy.deepcopy(train_data)
    test_data_norm = copy.deepcopy(test_data)
    max_val = train_data_norm.max(axis=0)  # per-province max from the training data
    train_data_norm /= max_val
    test_data_norm /= max_val
    print(train_data_norm)
    print(test_data_norm)
    return train_data_norm, test_data_norm

train_data_norm, test_data_norm = normalize(train_data, test_data)

def preprocess_data(data, n=10):
    x = []
    y = []
    for i in range(data.shape[0]-n):
        last = i + n
        x.append(data[i:last])
        y.append(data[last])
    print('input_shape:', np.array(x).shape)
    print('target_shape:', np.array(y).shape)
    return np.array(x), np.array(y)

x_train, y_train = preprocess_data(train_data_norm)
x_test, y_test = preprocess_data(test_data_norm)

print(x_train)
print(x_test)

def build_model():
    model = Sequential()
    model.add(GRU(52, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mse'])
    return model

def plot_loss(h, title='loss'):
    plt.plot(h.history['loss'])
    plt.plot(h.history['val_loss'])
    plt.title(title)
    plt.ylabel('loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'])

tscv = TimeSeriesSplit(n_splits=4)
overall_loss = []
min_epochs = 0
min_loss = np.inf

for epochs in range(1, 1000+1, 10):
    validation_loss = []
    print('\nEpoch: ', epochs)
    for train_index, val_index in tscv.split(x_train):
        model = build_model()
        h = model.fit(x_train[train_index], y_train[train_index],
                      validation_data=(x_train[val_index], y_train[val_index]),
                      epochs=epochs, batch_size=20, verbose=0)
        validation_loss.append(h.history['val_loss'][-1])
        print('loss', h.history['loss'][-1], '\nval_loss:', h.history['val_loss'][-1])

    val_loss_mean = np.mean(validation_loss)
    print('mean validation loss:', val_loss_mean, '\n')
    overall_loss.append(val_loss_mean)
    # Track the best (lowest) mean validation loss seen so far
    if val_loss_mean < min_loss:
        min_loss = val_loss_mean
        min_epochs = epochs
        plot_loss(h)
        plt.savefig('loss.png')  # save before show(); show() empties the figure
        plt.show()
        plt.clf()
        print('new best mean val loss:', min_loss)
print('Optimal Epochs: ', min_epochs)
print('Val Loss: ', min_loss)

 

 

Results for each fold when Epoch = 1000

(Optimal Epoch: 351)

 

QE2 (LSTM)

<QE2_2.py>

import os
import numpy as np
import pickle
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, GRU, Dense, LSTM
import matplotlib.pyplot as plt


with open('practice_data.txt', 'rb') as f:
    data = pickle.load(f)

print(data.shape)

train_data = data[:350]
test_data = data[350:]

def normalize(train_data, test_data):
    import copy
    train_data_norm = copy.deepcopy(train_data)
    test_data_norm = copy.deepcopy(test_data)
    max_val = train_data_norm.max(axis=0)  # per-province max from the training data
    train_data_norm /= max_val
    test_data_norm /= max_val
    print(train_data_norm)
    print(test_data_norm)
    return train_data_norm, test_data_norm

train_data_norm, test_data_norm = normalize(train_data, test_data)

def preprocess_data(data, n=10):
    x = []
    y = []
    for i in range(data.shape[0]-n):
        last = i + n
        x.append(data[i:last])
        y.append(data[last])
    print('input_shape:', np.array(x).shape)
    print('target_shape:', np.array(y).shape)
    return np.array(x), np.array(y)

x_train, y_train = preprocess_data(train_data_norm)
x_test, y_test = preprocess_data(test_data_norm)

print(x_train)
print(x_test)

def build_model():
    model = Sequential()
    model.add(LSTM(52, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mse'])
    return model

def plot_loss(h, title='loss'):
    plt.plot(h.history['loss'])
    plt.plot(h.history['val_loss'])
    plt.title(title)
    plt.ylabel('loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'])

tscv = TimeSeriesSplit(n_splits=4)
overall_loss = []
min_epochs = 0
min_loss = np.inf

for epochs in range(1, 1000+1, 10):
    validation_loss = []
    print('\nEpoch: ', epochs)
    for train_index, val_index in tscv.split(x_train):
        model = build_model()
        h = model.fit(x_train[train_index], y_train[train_index],
                      validation_data=(x_train[val_index], y_train[val_index]),
                      epochs=epochs, batch_size=20, verbose=0)
        validation_loss.append(h.history['val_loss'][-1])
        print('loss', h.history['loss'][-1], '\nval_loss:', h.history['val_loss'][-1])

    val_loss_mean = np.mean(validation_loss)
    print('mean validation loss:', val_loss_mean, '\n')
    overall_loss.append(val_loss_mean)
    # Track the best (lowest) mean validation loss seen so far
    if val_loss_mean < min_loss:
        min_loss = val_loss_mean
        min_epochs = epochs
        plot_loss(h)
        plt.savefig('loss.png')  # save before show(); show() empties the figure
        plt.show()
        plt.clf()
        print('new best mean val loss:', min_loss)
print('Optimal Epochs: ', min_epochs)
print('Val Loss: ', min_loss)

Results for each fold when Epoch = 1000

(Optimal Epochs: 921)

The LSTM's loss came out smaller than the GRU's. The GRU is a simplified version of the LSTM, and presumably this simplification, relative to the data at hand, is why its loss was larger. However, the loss values do not differ by much, and the GRU takes less time to train, so with a richer dataset the GRU's training-time advantage would have mattered more.
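The training-time argument comes down to model size; a quick sketch to compare parameter counts (input shape (10, 52) assumed, as in Q2):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, GRU, LSTM, Dense

for layer in (SimpleRNN, GRU, LSTM):
    m = Sequential([layer(52, activation='sigmoid', input_shape=(10, 52)), Dense(1)])
    print(layer.__name__, m.count_params())
# An LSTM cell holds roughly 4x, and a GRU roughly 3x, the parameters of a
# SimpleRNN cell with the same number of units.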
