Main objective of the project: Automate the process of detecting COVID-19 cases from chest radiograph images, using convolutional neural networks (CNN) through deep learning techniques. The complete project can be accessed here

Steps to reach the goal:
1- Data pre-processing
2- Model training and exposure of results

Step 1 - Data pre-processing

Databases used:

  • X-ray images and chest CT scans of individuals infected with COVID-19 (COHE; MORRISON; DAO, 2020): link
  • Images of lungs of individuals without any infection and with other infections (different from COVID-19) (KERMANY; ZHANG; GOLDBAUM, 2018): link

Packages used:

  • Pandas
  • Os
  • PIL
  • Numpy
  • CV2

Code used in the project:
The notebook with all the codes used in this step is available here
Note: the numbering and title of each step described in this tutorial correspond with the numbering and title contained in the notebook.

Steps to be followed:
1º StepImport the libraries to be used
2º StepLoad the dataframe for lung images of individuals with COVID-19
3º StepAnalysis of the “df” dataframe
4º StepSelect the cases related to COVID-19 on the “df” dataframe
5º StepAnalysis of the “df_covid” dataframe
6º StepCreate a list to add the values of the variable/column “filename”
7º StepCreate a list with only the image formats that exist in the image folder
8º StepCreate a function to open the images, check their dimensions and, later, save this data in a dataframe
9º PassoCreate a variable that contains as value the address of the folder where the images are saved
10º StepUse the function created to check the size of the images
11º StepConvert all images to 237 x 237px .png
12º StepCreate a list of the images that will be deleted from the folder
13º StepOpen the lung images of individuals without infection and create a list with the name of the images that exist in the image folder
14º StepConvert all images of lungs from uninfected individuals to 237 x 237px .png
15º Step - Open the lung images of individuals with other infections and create a list with the name of the images that exist in the image folder
16º Step - Convert all lung images of individuals with other infections to 237 x 237px .png
17º StepOpen the images of the lungs of individuals infected with COVID-19 in a list and transform them into an array (matrix of pixel values that represent the image))
18º StepOpen the images of the lungs of individuals without infections in a list and transform them into an array (Matrix of values of the pixels that represent the image)
19º Step - Open the images of the lungs of individuals with other infections in a list and transform them into an array (array of pixel values representing the image)
20º Step - Group arrays into a single array containing information about COVID-19, normal lung images, and with other infections
21º Step - Indicate the cases that are COVID-19, the ones that are normal and the cases with other infections and create an array
22º StepSave arrays to .npy

Tutorial 1:

1º Step

Import the libraries to be used

We imported the Pandas, Os, PIL, Numpy, and CV2 libraries as we will rely on them to pre-process the model data for COVID-19.

import pandas as pd
import os
from PIL import Image
import numpy as np
import cv2

Note: the “Pandas” library was imported as “pd”, in order to speed up the writing of the code. That is, instead of typing “pandas” when using it, I will just type “pd”. The same was done with the “numpy” library. In addition, the “PIL” library has not been imported completely, as we will not use all the functions contained therein. In this way, it facilitates the use of the library and the processing of codes/data.

2º Step

Load the dataframe for lung images of individuals with COVID-19

We load the file in .csv, called “metadata”, which accompanies the image bank provided by the researchers (COHE; MORRISON; DAO, 2020).

The command below names this dataframe “df” when loading it. In parentheses, you must enter the address of this file.

df = pd.read_csv("/Users/Neto/Desktop/Aprendizados/2020/Kaggle/corona_deep_learning/covid-chestxray-dataset-master/metadata.csv")

3º Step

Analysis of the “df” dataframe

We generated some descriptive data in order to find out how many images of COVID-19 are available on the dataframe (df). For that, we ask for a count of values from the variable/column “finding”. This variable contains the diagnosis related to each lung image.

df.finding.value_counts()

COVID-19          188
Streptococcus      17
Pneumocystis       15
SARS               11
E.Coli              4
ARDS                4
COVID-19, ARDS      2
Chlamydophila       2
No Finding          2
Legionella          2
Klebsiella          1
Name: finding, dtype: int64

It is possible to notice from the data that 188 images refer to COVID-19.

4º Step

We separated only the cases of the variable/column “finding” in the dataframe “df” that were COVID-19 since we will only use these cases in the model. We saved this selection in a new dataframe named “df_covid”.

df_covid = df[df["finding"] == "COVID-19"]

5º Step

Analysis of the “df_covid” dataframe

We asked to check the “df_covid” dataframe, in order to analyze whether the selection of COVID-19 cases was carried out correctly. For this, we ask to see the end of this dataframe. In addition, we request that only the variables/columns “finding” and “filename” be shown. The “finding” refers to the selected COVID-19 cases and the “filename” indicates the name of the COVID-19 radiography images made available by the authors of the bank in question (COHE; MORRISON; DAO, 2020). This last information was requested, as it will be used in the next step.

df_covid[["finding","filename"]].tail()
	finding	filename
307	COVID-19	covid-19-pneumonia-58-day-9.jpg
308	COVID-19	covid-19-pneumonia-58-day-10.jpg
309	COVID-19	covid-19-pneumonia-mild.JPG
310	COVID-19	covid-19-pneumonia-67.jpeg
311	COVID-19	covid-19-pneumonia-bilateral.jpg

6º Step

Create a list to add the values of the variable/column “filename”

We created a list from the variable/column “filename” located in the dataframe “df_covid”. This was called “imagesCOVID”. This list only shows the names of the images with the lungs of individuals infected with the COVID-19 virus. This list was created to facilitate the selection of the images that we will use to train the model.

imagensCOVID = df_covid["filename"].tolist()

7º Step

Create a list with only the image formats that exist in the image folder

When manually checking the folder where the images are located, only the formats “.jpg” and “.png” were noticed. However, the variable/column “filename” has among its values, images with extension “.gz”. Thus, we created a list (“imagensCovid”) with only the name of the images in the formats existing in the folder (“.jpg” and “.png”).

imagensCovid = []
for imagem in imagensCOVID:
    if imagem.endswith(".gz"):
        pass
    else:
        imagensCovid.append(imagem)
        
print(len(imagensCovid))

8º Step

Create a function to open the images, check their dimensions and, later, save this data in a dataframe

Knowing that this action will be used frequently in the data pre-processing steps of the models that will be trained, we created a function to facilitate this process. Thus, the function below (“df_dimensao”) defines the creation of a dataframe with the dimensions of the images located in a given folder.

def df_dimensao(folder_das_imagens, lista_nome_imagens):
    """Function to create a dataframe with the original dimensions of the images in a given folder.
    
    Parameters:
    
    folder_das_imagens(str): colocar a pasta onde as imagens estão salvas
    lista_nome_imagens(list): colocar a lista com o nome das imagens
    
    return
    
    df_dims(pd.DataFrame)
    
    """
    
    dic = {}
    dimensaoImagensLargura = []
    dimensaoImagensAltura = []
    nome = []
    
    if ".DS_Store" in lista_nome_imagens:
        lista_nome_imagens.remove(".DS_Store")
    for imagem in lista_nome_imagens:
        
        enderecoDaImagem = folder_das_imagens + "/" + imagem
        abrirImagem = Image.open(enderecoDaImagem)
        nome.append(imagem)
        dimensaoImagensLargura.append(abrirImagem.size[0])
        dimensaoImagensAltura.append(abrirImagem.size[1])

    dic["nome"] = nome
    dic["largura"] = dimensaoImagensLargura
    dic["altura"] = dimensaoImagensAltura
    df_dims = pd.DataFrame(dic)
    
    return df_dims

9º Step

Create a variable that contains as value the address of the folder where the images are saved

In order to use the function created in Step 8, specifically the parameter “folder_das_imagens (str)”, we must have a string variable that indicates the address of the images on the computer. For this, the code below creates a variable (“rootFolder”) indicating this location.

rootFolder = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images"

Note: in relation to the other function attribute called “lista_nome_imagens”, we will use the list created in Step 7 (“imagensCovid”).

10º Step

Use the function created to check the size of the images

From the created function, we save the values in the “dimensao” variable. Below you can see the names of each figure and its dimension (width x height) in pixels.

dimensao = df_dimensao(rootFolder, imagensCovid)
print(dimensao)
                                                  nome  largura  altura
0    auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...      882     888
1    auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...      880     891
2    auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...      882     876
3    auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...      880     874
4                                nejmc2001573_f1a.jpeg     1645    1272
..                                                 ...      ...     ...
211                    covid-19-pneumonia-58-day-9.jpg     2267    1974
212                   covid-19-pneumonia-58-day-10.jpg     2373    2336
213                        covid-19-pneumonia-mild.JPG      867     772
214                         covid-19-pneumonia-67.jpeg      492     390
215                   covid-19-pneumonia-bilateral.jpg     2680    2276

[216 rows x 3 columns]

Note: this step is important, because to execute the model, all images must have the same dimension.

11º Step

Convert all images to 237 x 237px .png

Since to run the model we need to have all images with the same dimension, we chose to reduce them all to the dimension of the smallest figure available in the image bank. In addition, to maintain a standard, we changed the format to “.png” for all figures, since some were “.jpg”.

The code below resizes the images to 237 x 237px, saves them in another folder and executes the function that we built in Step 8 to see if all dimensions have been changed.

for imagem in imagensCovid:
    enderecoDaImagem = rootFolder + "/" + imagem
    abrirImagem = Image.open(enderecoDaImagem)
    image_resize = abrirImagem.resize((237,237))
    os.chdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images/images_resize")
    image_resize.save(f'{imagem}_resize_237_237.png')
    
    
rootFolder = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images/images_resize"
imagensDaPastaResize = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images/images_resize")
df_redimensao = df_dimensao(rootFolder, imagensDaPastaResize)
print(df_redimensao)
                                                  nome  largura  altura
0    01E392EE-69F9-4E33-BFCE-E5C968654078.jpeg_resi...      237     237
1    39EE8E69-5801-48DE-B6E3-BE7D1BCF3092.jpeg_resi...      237     237
2                 lancet-case2b.jpg_resize_237_237.png      237     237
3             nejmoa2001191_f4.jpeg_resize_237_237.png      237     237
4    7C69C012-7479-493F-8722-ABC29C60A2DD.jpeg_resi...      237     237
..                                                 ...      ...     ...
211  23E99E2E-447C-46E5-8EB2-D35D12473C39.png_resiz...      237     237
212  covid-19-pneumonia-43-day2.jpeg_resize_237_237...      237     237
213    radiol.2020201160.fig6b.jpeg_resize_237_237.png      237     237
214  8FDE8DBA-CFBD-4B4C-B1A4-6F36A93B7E87.jpeg_resi...      237     237
215      covid-19-pneumonia-7-L.jpg_resize_237_237.png      237     237

[216 rows x 3 columns]

Note: as you can see, all figures have the same dimension (width x height).

12º Step

Create a list of the images that will be deleted from the folder

We created a list with the name of the images that were deleted from the folder. The authors of this model decided not to include the lateral and computed tomography images existing in the original image bank. Thus, the variable “listaImagemDeletar” presents a list with the name of these images as a value.

listaImagemDeletar = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/deletadas")
listaImagemDeletar = ['covid-19-pneumonia-30-L.jpg_resize_237_237.png',
 '396A81A5-982C-44E9-A57E-9B1DC34E2C08.jpeg_resize_237_237.png',
 'covid-19-infection-exclusive-gastrointestinal-symptoms-l.png_resize_237_237.png',
 'nejmoa2001191_f3-L.jpeg_resize_237_237.png',
 '3ED3C0E1-4FE0-4238-8112-DDFF9E20B471.jpeg_resize_237_237.png',
 'covid-19-pneumonia-38-l.jpg_resize_237_237.png',
 'a1a7d22e66f6570df523e0077c6a5a_jumbo.jpeg_resize_237_237.png',
 '254B82FC-817D-4E2F-AB6E-1351341F0E38.jpeg_resize_237_237.png',
 'covid-19-pneumonia-15-L.jpg_resize_237_237.png',
 'kjr-21-e24-g002-l-b.jpg_resize_237_237.png',
 'D5ACAA93-C779-4E22-ADFA-6A220489F840.jpeg_resize_237_237.png',
 'kjr-21-e24-g002-l-c.jpg_resize_237_237.png',
 'covid-19-pneumonia-14-L.png_resize_237_237.png',
 'kjr-21-e24-g004-l-a.jpg_resize_237_237.png',
 'nejmoa2001191_f1-L.jpeg_resize_237_237.png',
 'kjr-21-e24-g003-l-b.jpg_resize_237_237.png',
 'kjr-21-e24-g004-l-b.jpg_resize_237_237.png',
 'DE488FE1-0C44-428B-B67A-09741C1214C0.jpeg_resize_237_237.png',
 '191F3B3A-2879-4EF3-BE56-EE0D2B5AAEE3.jpeg_resize_237_237.png',
 '35AF5C3B-D04D-4B4B-92B7-CB1F67D83085.jpeg_resize_237_237.png',
 '6A7D4110-2BFC-4D9A-A2D6-E9226D91D25A.jpeg_resize_237_237.png',
 '4C4DEFD8-F55D-4588-AAD6-C59017F55966.jpeg_resize_237_237.png',
 'covid-19-caso-70-1-L.jpg_resize_237_237.png',
 '44C8E3D6-20DA-42E9-B33B-96FA6D6DE12F.jpeg_resize_237_237.png',
 'kjr-21-e24-g001-l-b.jpg_resize_237_237.png',
 'FC230FE2-1DDF-40EB-AA0D-21F950933289.jpeg_resize_237_237.png',
 '1-s2.0-S0929664620300449-gr3_lrg-a.jpg_resize_237_237.png',
 '925446AE-B3C7-4C93-941B-AC4D2FE1F455.jpeg_resize_237_237.png',
 'jkms-35-e79-g001-l-e.jpg_resize_237_237.png',
 '1-s2.0-S0929664620300449-gr3_lrg-b.jpg_resize_237_237.png',
 '21DDEBFD-7F16-4E3E-8F90-CB1B8EE82828.jpeg_resize_237_237.png',
 'covid-19-pneumonia-evolution-over-a-week-1-day0-L.jpg_resize_237_237.png',
 '1-s2.0-S0929664620300449-gr3_lrg-d.jpg_resize_237_237.png',
 '1-s2.0-S0929664620300449-gr3_lrg-c.jpg_resize_237_237.png',
 'nejmoa2001191_f5-L.jpeg_resize_237_237.png',
 'jkms-35-e79-g001-l-d.jpg_resize_237_237.png',
 'covid-19-pneumonia-22-day1-l.png_resize_237_237.png',
 'kjr-21-e24-g001-l-c.jpg_resize_237_237.png',
 '66298CBF-6F10-42D5-A688-741F6AC84A76.jpeg_resize_237_237.png',
 'covid-19-pneumonia-20-l-on-admission.jpg_resize_237_237.png',
 'covid-19-pneumonia-7-L.jpg_resize_237_237.png']

13º Step

Open the lung images of individuals without infection and create a list with the name of the images that exist in the image folder

After creating a variable called “pastaTreinoNormal” with the address of the folder with the images of the lungs of individuals without infection, we created a list (“listaImagensTreino”) with only the name and format of these images.

pastaTreinoNormal = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL"

listaImagensTreino = os.listdir(pastaTreinoNormal)

14º Step

Convert all images of lungs from uninfected individuals to 237 x 237px .png

The images of normal lungs were resized to the same dimension as the lung images with COVID-19: namely, 237 x 237px. To maintain the same pattern, we changed the format to “.png” for all figures. It is important to note that we selected only the first 100 images in the folder using the code below. This was done to maintain the training with a similar amount of image of individuals without any infection and with COVID-19.

In addition, we run the function we built in Step 8 to see if all dimensions have changed.

listaCemImagens = listaImagensTreino[0:100]
for imagem in listaCemImagens:
    enderecoDaImagem = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL"+ "/" + imagem
    abrirImagem = Image.open(enderecoDaImagem)
    image_resize = abrirImagem.resize((237,237))
    os.chdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL/images_resize_normal")
    image_resize.save(f'{imagem}_resize_237_237.png')
  
rootFolder = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL/images_resize_normal"
imagensDaPastaResize = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL/images_resize_normal")
df_redimensao = df_dimensao(rootFolder, imagensDaPastaResize)
print(df_redimensao)
                                            nome  largura  altura
0   NORMAL2-IM-1196-0001.jpeg_resize_237_237.png      237     237
1   NORMAL2-IM-0645-0001.jpeg_resize_237_237.png      237     237
2           IM-0269-0001.jpeg_resize_237_237.png      237     237
3   NORMAL2-IM-1131-0001.jpeg_resize_237_237.png      237     237
4      IM-0545-0001-0002.jpeg_resize_237_237.png      237     237
..                                           ...      ...     ...
95  NORMAL2-IM-0592-0001.jpeg_resize_237_237.png      237     237
96  NORMAL2-IM-1167-0001.jpeg_resize_237_237.png      237     237
97  NORMAL2-IM-0741-0001.jpeg_resize_237_237.png      237     237
98  NORMAL2-IM-0535-0001.jpeg_resize_237_237.png      237     237
99          IM-0119-0001.jpeg_resize_237_237.png      237     237

[100 rows x 3 columns]

Note: as you can see, all figures have the same dimension (width x height).

15º Step

Open the lung images of individuals with other infections and create a list with the name of the images that exist in the image folder

After creating a variable called “pastaTreinoOutrasInfeccoes” with the address of the folder with the images of lungs of individuals with other infections (that is, without the COVID-19), we created a list (“listaImagensTreinoOutrasInfeccoes”) with only the name and format of these images.

pastaTreinoOutrasInfeccoes = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA"

listaImagensTreinoOutrasInfeccoes = os.listdir(pastaTreinoOutrasInfeccoes)

16º Step

Convert all lung images of individuals with other infections to 237 x 237px .png

The images of lungs with other infections (ie, other than COVID-19) were resized to the same dimension as the images of lungs with COVID-19 and normal: namely, 237 x 237px. To maintain the same pattern, we changed the format to “.png” for all figures. It is important to note that we selected only the first 100 images in the folder using the code below. This was done to maintain the training with a similar amount of image of individuals with other infections, without any infection and with COVID-19.

In addition, we run the function we built in Step 8 to see if all dimensions have changed.

listaCemImagensOutrasInfeccoes = listaImagensTreinoOutrasInfeccoes[0:100]
for imagem in listaCemImagensOutrasInfeccoes:
    enderecoDaImagem = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA"+ "/" + imagem
    abrirImagem = Image.open(enderecoDaImagem)
    image_resize = abrirImagem.resize((237,237))
    os.chdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA/images_resize_infeccoes")
    image_resize.save(f'{imagem}_resize_237_237.png')

rootFolder = "/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA/images_resize_infeccoes"
imagensDaPastaResize = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA/images_resize_infeccoes")
df_redimensao = df_dimensao(rootFolder, imagensDaPastaResize)
print(df_redimensao)

                                                nome  largura  altura
0    person890_bacteria_2814.jpeg_resize_237_237.png      237     237
1   person1016_bacteria_2947.jpeg_resize_237_237.png      237     237
2    person306_bacteria_1439.jpeg_resize_237_237.png      237     237
3    person472_bacteria_2015.jpeg_resize_237_237.png      237     237
4   person1491_bacteria_3893.jpeg_resize_237_237.png      237     237
..                                               ...      ...     ...
95   person364_bacteria_1660.jpeg_resize_237_237.png      237     237
96     person1455_virus_2489.jpeg_resize_237_237.png      237     237
97     person1238_virus_2098.jpeg_resize_237_237.png      237     237
98      person620_virus_1191.jpeg_resize_237_237.png      237     237
99     person26_bacteria_122.jpeg_resize_237_237.png      237     237

[100 rows x 3 columns]

Note: as you can see, all figures have the same dimension (width x height).

17º Step

Open the images of the lungs of individuals infected with COVID-19 in a list and transform them into an array (matrix of pixel values that represent the image)

First, from the resized lung images of individuals with COVID-19 obtained in Step 11, we created a variable (“imagensCovid”) with the list of names of these images. Then, using the list of images that were not used in the model (lateral and computed tomography), referring to Step 12, these were deleted from the variable values (“imagensCovid”).

Subsequently, we created a list with the arrays called “XTrainCovid” from the resized images, that is, a list with the values referring to the image pixels that represent the lungs of individuals infected by COVID-19.

Finally, we saved the “XTrainCovid” list in an array called “xArrayCOVID”.

imagensCovid = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images/images_resize")
imagensCovid = [x for x in imagensCovid if x not in listaImagemDeletar]

if ".DS_Store" in imagensCovid:
    imagensCovid.remove(".DS_Store")

xTrainCovid = []

for image in imagensCovid:
    x = cv2.imread("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/covid-chestxray-dataset-master/images/images_resize/" + image)
    x = np.array(x)
    xTrainCovid.append(x)

xArrayCOVID = np.array(xTrainCovid)
print(xArrayCOVID.shape)

(175, 237, 237, 3)

Note: as you can see, the built array (“xArrayCOVID”) has four dimensions. The first (“175”) refers to the number of cases, that is, images of individuals with COVID-19; the second (“237”) refers to the width of the image; the third (“237”) refers to the height of the image and; the fourth (“3”), the number of color channels in the images.

18º Step

Open the images of the lungs of individuals without infections in a list and transform them into an array (Matrix of values of the pixels that represent the image)

First, from the resized images of the lungs of individuals without infections obtained in Step 13, we created a variable (“imagensNormal”) with the list of names of these images.

In a second step, we created a list with the arrays called “XTrainNormal” from the resized images, that is, a list with the values referring to the image pixels that represent the lungs of individuals without infections.

Finally, we save the “XTrainNormal” list in an array called “xArrayNormal”.

imagensNormal = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL/images_resize_normal")

if ".DS_Store" in imagensNormal:
    imagensNormal.remove(".DS_Store")

xTrainNormal = []

for image in imagensNormal:
    x = cv2.imread("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/NORMAL/images_resize_normal/" + image)
    x = np.array(x)
    xTrainNormal.append(x)

xArrayNormal = np.array(xTrainNormal)
print(xArrayNormal.shape)

(100, 237, 237, 3)

Note: as you can see, the built array (“xArrayNormal”) has four dimensions. The first (“100”) refers to the number of cases, that is, images of individuals without infections; the second (“237”) refers to the width of the image; the third (“237”) refers to the height of the image and; the fourth (“3”), the number of color channels in the images.

19º Step

Open the images of the lungs of individuals with other infections in a list and transform them into an array (array of pixel values representing the image)

First, from the resized images of the lungs of individuals with other infections obtained in Step 16, we created a variable (“imagensInfeccoes”) with the list of names of these images.

In a second step, we created a list with the arrays called “XTrainInfeccoes” from the resized images, that is, a list with the values referring to the image pixels that represent the lungs of individuals with other infections.

Finally, we save the list “XTrainInfeccoes” in an array called “xArrayInfeccoes”.

imagensInfeccoes = os.listdir("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA/images_resize_infeccoes")

if ".DS_Store" in imagensNormal:
    imagensNormal.remove(".DS_Store")

xTrainInfeccoes = []

for image in imagensInfeccoes:
    x = cv2.imread("/Users/cesarsoares/Documents/Python/COVID/Banco_de_Dados/chest_xray/train/PNEUMONIA/images_resize_infeccoes/" + image)
    x = np.array(x)
    xTrainInfeccoes.append(x)

xArrayInfeccoes = np.array(xTrainInfeccoes)
print(xArrayInfeccoes.shape)

(100, 237, 237, 3)

Note: as you can see, the built array (“xArrayInfeccoes”) has four dimensions. The first (“100”) refers to the number of cases, that is, images of individuals without infections; the second (“237”) refers to the width of the image; the third (“237”) refers to the height of the image and; the fourth (“3”), the number of color channels in the images.

20º Step

Group arrays into a single array containing information about COVID-19, normal lung images, and with other infections

We group the array of images of individuals with COVID-19 (“xArrayCOVID”), created in Step 17, with the array of images of individuals without infections (“xArrayNormal”), created in Step 18, and, finally, with the array of images of individuals with other infections (“xArrayInfeccoes”), created in Step 19. This array was saved in the variable “X_train”.

X_train = np.vstack((xArrayCOVID,xArrayNormal, xArrayInfeccoes))

21º Step

Indicate the cases that are COVID-19, the ones that are normal and the cases with other infections and create an array

Three arrays were created. The first indicates the variable “dfCOVID” with the value “1”, indicating the presence of COVID-1, and the variable “dfNormal” and the variable “dfInfeccoes” with the value “0”, indicating the absence of images with these characteristics . The second indicating “1” for “dfNormal” and “0” for “dfCOVID” and “dfInfeccoes” and the third “1” for “dfInfeccoes” and “0” for “dfCOVID” and “dfNormal”.

Finally, we group the array of images of individuals with COVID-19 (“Y_train_COVID”) with the array of images of individuals without infections (“Y_train_NORMAL”) with the array of individuals with other infections (“Y_train_INFECCOES”). This array was saved in the “Y_train” variable.

dfCOVID = np.ones((xArrayCOVID.shape[0],1))
dfNormal = np.zeros((xArrayNormal.shape[0],1))
dfInfeccoes = np.zeros((xArrayInfeccoes.shape[0],1))

Y_train_COVID = np.vstack((dfCOVID,dfNormal, dfInfeccoes))

dfCOVID = np.zeros((xArrayCOVID.shape[0],1))
dfNormal = np.ones((xArrayNormal.shape[0],1))
dfInfeccoes = np.zeros((xArrayInfeccoes.shape[0],1))

Y_train_NORMAL = np.vstack((dfCOVID,dfNormal, dfInfeccoes))

dfCOVID = np.zeros((xArrayCOVID.shape[0],1))
dfNormal = np.zeros((xArrayNormal.shape[0],1))
dfInfeccoes = np.ones((xArrayInfeccoes.shape[0],1))

Y_train_INFECCOES = np.vstack((dfCOVID,dfNormal, dfInfeccoes))

Y_train = np.hstack((Y_train_COVID, Y_train_NORMAL, Y_train_INFECCOES))

22º Step

Save arrays to .npy

To use the arrays in training the model, these were saved in “X_Train.npy” and “Y_Train.npy”.

np.save("/Users/cesarsoares/Documents/Python/COVID/X_Train.npy",X_train)
np.save("/Users/cesarsoares/Documents/Python/COVID/Y_Train.npy", Y_train)

Note: X_Train will be the input of the trained model and Y_Train will be the target, that is, the expected result of the model.(br />

Bibliography
COHEN, Joseph; MORRISON, Paul; DAO, Lan. COVID-19 Image Data Collection. arXiv:2003.11597, 2020.

KERMANY, Daniel; ZHANG, Kang; GOLDBAUM, Michael. Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification. Mendeley Data, v.2, 2018. Disponível em: http://dx.doi.org/10.17632/rscbjbr9sj.2