The Galaxy Zoo Classifier is a framework to build, compile, and train deep learning models that classify galaxies from their images. It uses convolutional neural network (CNN) architectures to learn and predict galaxy classes. Although this was the original aim of the project for which I developed the code, it could also serve as a more general framework for classifying .jpg images with CNNs.
For practical reasons, I am not including here the datasets used for model training and model testing. You can find the relevant galaxy images at the webpage of the Galaxy Zoo Kaggle challenge.
Before using the Galaxy Zoo Classifier, it is recommended to set up a Conda environment with TensorFlow. This ensures that the necessary dependencies are isolated and managed efficiently. Here are the steps to set up the environment:
- First, install Anaconda or Miniconda if you haven't already. Anaconda is a distribution of Python and R for scientific computing, while Miniconda is a smaller, "minimal installer" version that only includes Conda and its dependencies.
- Once you have Anaconda or Miniconda installed, create a new Conda environment. You can name it whatever you want, but in this example, we'll call it galaxy_env. To create the environment, open your terminal and run the following command:
  conda create --name galaxy_env
- Activate the newly created environment by running:
  conda activate galaxy_env
- Install TensorFlow in the Conda environment. You might prefer to choose the latest stable version:
  conda install -c conda-forge tensorflow
- Now that TensorFlow is installed, clone the Galaxy Zoo Classifier project into your desired directory and navigate to the project's root directory in the terminal.
- Install the other necessary dependencies by running the following command. You might need to modify the requirements.txt file to specify the version you want to install:
  pip install -r requirements.txt
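Once the steps above are done, a quick check that the environment can see the listed dependencies may save a failed run later. The snippet below is a minimal sketch, not part of the project: `missing_packages` is a hypothetical helper, and it only tests whether each package can be located, not whether its version is compatible.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # TensorFlow and NumPy are the third-party dependencies listed in this README.
    missing = missing_packages(["tensorflow", "numpy"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed dependencies can be found.")
```

Run it with the galaxy_env environment activated, so that the check uses the same interpreter as run.py.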
To run the Galaxy Zoo Classifier without saving the model or training history, you can use the following instructions:
- Prepare your data:
  - Ensure that your galaxy images are stored in the directory specified by paths.images_path.
  - Create a YAML configuration file (config.yaml) and update it with your specific paths and parameters. An example configuration file is provided.
  - Modify the configuration file according to your needs, including data preprocessing settings, model parameters, and training parameters.
- Run the code by executing the run.py script with the configuration file as a command-line argument:
  python run.py --config config.yaml --no-save_model --no-save_history --seed 48
- The code will load the configuration, preprocess the data, build the model, train it, evaluate its performance, and generate plots and a confusion matrix. However, the model and training history will not be saved.
Note that the --no-save_model and --no-save_history flags disable saving of the model and of the training history, respectively. Both saving options are enabled by default (--save_model and --save_history), so you must pass these flags explicitly to disable them. The --seed flag sets the random seed for all randomized processes. Its default value is 48, which is the value to use to reproduce the results.
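For readers who want to see how paired flags such as --save_model and --no-save_model can behave, here is a minimal sketch using the standard argparse module (Python 3.9 or later). It is an illustration under assumptions, not the actual parser in run.py: `build_parser` and the help texts are hypothetical, and only the flag names and the default seed of 48 come from the description above.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring the flags described in this README."""
    parser = argparse.ArgumentParser(description="Illustrative command-line interface")
    parser.add_argument("--config", required=True, help="Path to the YAML configuration file")
    # BooleanOptionalAction automatically adds the matching --no-save_model flag.
    parser.add_argument("--save_model", action=argparse.BooleanOptionalAction, default=True,
                        help="Save the trained model (enabled by default)")
    parser.add_argument("--save_history", action=argparse.BooleanOptionalAction, default=True,
                        help="Save the training history (enabled by default)")
    parser.add_argument("--seed", type=int, default=48, help="Random seed")
    return parser


# Parse the same flags as the example command, leaving --seed at its default.
args = build_parser().parse_args(
    ["--config", "config.yaml", "--no-save_model", "--no-save_history"]
)
print(args.save_model, args.save_history, args.seed)  # False False 48
```

With this pattern, omitting the --no-... flags leaves both options at True, which matches the default behaviour described above.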
The configuration file (config.yaml) contains the following sections. For more information about the role of each configuration parameter, please see the code documentation. A few important notes:
- The task parameter defines the task to solve. We have only considered the first and second tasks of the proposed project, with values 1 and 2 respectively.
- The one_hot_labels parameter must be set to True by hand for each task if desired.
- The conv_layers and pool_size parameters must be passed as lists (nested lists where needed), because YAML has no native tuple type. They are converted to tuples in the run.py file.
- Parameters that take a number, such as crop_size, dropout_rate, early_stop_patience and so on, must be set to null if not wanted. The same applies to the data_augmentation_params dictionary. The code expects None for unused parameters, and a YAML null is loaded as None in Python.
- If class_weights is set to True, the class weights are calculated. However, the weights are only applied if you use the custom losses.
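The list-to-tuple conversion mentioned in the notes above can be sketched as follows. This is an illustration only, not the code in run.py: `to_tuples` is a hypothetical helper, and the dictionary stands in for what a YAML loader such as PyYAML's `safe_load` would return, where sequences become Python lists and null becomes None.

```python
def to_tuples(value):
    """Recursively convert nested lists into nested tuples."""
    if isinstance(value, list):
        return tuple(to_tuples(item) for item in value)
    return value


# Stand-in for the parsed model_params section of config.yaml.
model_params = {
    "conv_layers": [[32, [3, 3]], [64, [3, 3]], [128, [3, 3]]],
    "pool_size": [2, 2],
    "dropout_rate": None,  # a YAML null arrives in Python as None
}

conv_layers = to_tuples(model_params["conv_layers"])
pool_size = to_tuples(model_params["pool_size"])

print(conv_layers)  # ((32, (3, 3)), (64, (3, 3)), (128, (3, 3)))
print(pool_size)    # (2, 2)
```

Values that are not lists, including None, pass through unchanged, so the same helper is safe to apply to optional parameters.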
paths:
  images_path: 'galaxy-zoo_data/images_training_rev1/'
  labels_path: 'galaxy-zoo_data/training_solutions_rev1.csv'
  plots_path: 'metrics_plots/'
  models_path: 'saved_models/'
data_preprocessing:
  task: 2
  min: 0.5
  one_hot_labels: False
  crop: True
  crop_size: 256
  img_size: [64, 64]
  normalize: True
  grayscale: False
  training_size: 0.80
  test_size: 0.2
model_name: 'Base_ModelII_regression'
model_params:
  model_type: 'base'
  conv_layers: [[32, [3,3]], [64, [3,3]], [128, [3,3]]]
  dense_units: [256, 128]
  batch_normalization: False
  activation: 'relu'
  pool_size: [2, 2]
  flattening: 'Flatten'
  class_weights: False
  out_activation: 'sigmoid'
  dropout_rate: 0.25
  max_out: True
  early_stop_patience: 15
  monitor: 'val_loss'
  data_augmentation_params:
    rotation_range: 90
    width_shift_range: 0.01
    height_shift_range: 0.01
    horizontal_flip: True
    vertical_flip: True
    shear_range: 0.015
    zoom_range: 0.15
train_params:
  learning_rate: 0.001
  loss_function: 'mean_squared_error'
  metrics: ['mse', 'accuracy']
  batch_size: 100
  epochs: 150
  threshold: 0.0
  take_weights_log: True

Feel free to customize the code to fit your specific needs. You can modify the model architecture, experiment with different data preprocessing techniques, or adjust the training parameters. Additionally, you can extend the code by adding new functionality or implementing advanced features.
The Galaxy Zoo Classifier relies on the following dependencies:
- TensorFlow
- NumPy
- argparse
Ensure that these dependencies are installed before running the code.
- config.yaml: Example configuration file
- run.py: Main script for running the code
- eval.py: Evaluation functions for model performance
- models.py: Definition of the GalaxyZooClassifier model
- utils.py: Utility functions for data loading and preprocessing
The Galaxy Zoo Classifier is open source and distributed under the MIT License.