Scaling Up the Kaggle Leaderboard
The competition multi-label-classification-competition-2023 challenges participants to develop an image classifier capable of predicting labels for image data samples. Optionally, the classifier may also consider captions as part of the input.
This notebook outlines the steps I took to improve the model's performance. The experiments summarized here are a condensed compilation of my submissions, covering the most significant changes made along the way.
Step 1 - Initial Submission
The goal of the first submission was simply to generate a correctly formatted submission file, so the score was low: 0.03741. You can access the notebook that accomplishes this objective through the following links:
Step 2 - Basic Vision Multilabel Model
In this step, I built a simple RESNET18 multilabel vision model using fastai. With this approach, I achieved a score of 0.80102. You can view the details in the following notebook.
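For readers new to multilabel classification, the sketch below shows how a space-separated label string can be turned into the multi-hot target vector that such a model is typically trained on. The label vocabulary is assumed to match the prediction columns that appear in Step 7, and the helper name `encode_labels` is hypothetical; it is not part of fastai or of the original notebook.

```python
# Label vocabulary, assumed to match the prediction columns used in Step 7.
LABELS = ['1', '10', '11', '13', '14', '15', '16', '17', '18', '19',
          '2', '3', '4', '5', '6', '7', '8', '9']


def encode_labels(label_string, vocab=LABELS):
    """Turn a string such as "3 4" into a multi-hot list (hypothetical helper)."""
    present = set(label_string.split())
    return [1 if label in present else 0 for label in vocab]


# An image tagged with labels 3 and 4 gets two 1s and sixteen 0s.
target = encode_labels("3 4")
print(target)
```

Because each position is an independent yes/no decision, the model outputs one probability per label rather than a single class, which is why a threshold is applied per label later on.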
Step 3 - Visual Transformer Multilabel Model
After starting with a baseline RESNET18 model, I experimented with a more advanced model that uses the transformer approach: vit_small_r26_s32_224 from the timm library. This change added about 5 points to the score, raising it to 0.85173. For more details, please refer to the following notebook.
Step 4 - Swin Base Multilabel Model
The report created by @jhoward (available here) provides insightful perspectives on vision models. Alongside RESNET and VIT, it emphasizes the superior performance of SWIN models. Motivated by this, I experimented with the swin_base_patch4_window7_224_in22k model. To make it run, I had to reduce the batch size from 128 to 32, because the larger batch size caused out-of-memory errors on the GPU. This model brought a significant improvement, raising the score to 0.88472, an increase of over 3 points. Please refer to the following notebook for details:
Step 5 - BERT Multilabel Model
Since the performance of the vision models had plateaued on the leaderboard (LB), I decided to invest in NLP models that take the image captions as input. The initial model, which uses BERT cased, achieved a score of 0.85118 on the LB. For more details, please refer to the following notebook. It’s worth noting that this score is comparable to the performance of the VIT model.
Step 6 - Multimodal Classifier
After extensively exploring high-performing vision and text models, I came to the realization that leveraging the strengths of both through a multimodal approach would yield promising results. Instead of utilizing BERT cased, I opted to fine-tune the microsoft/deberta-v3-small model. The Jupyter notebook linked below provides a detailed account of the process. However, it is worth noting that the performance of this model did not meet expectations, resulting in a score of 0.64622. I firmly believe that there is ample room for improvement, and I encourage you to share any advancements you may achieve in enhancing its performance.
Step 7 - Multimodal Ensemble Classifier
Recognizing the potential for improved performance by combining text and image modalities, I explored ensemble techniques. By taking the predictions of both the text and image models and averaging their probabilities, as demonstrated in the code below, I achieved a score of 0.88682. This shows that combining multiple modalities can improve overall performance.
import pandas as pd

PATH = '/kaggle/input/multilabel-preds'
# Files available in PATH: preds_bertcased.csv, preds_swin.csv
preds_txts = pd.read_csv(f'{PATH}/preds_bertcased.csv').iloc[:,1:]
display(preds_txts.head(3))
preds_imgs = pd.read_csv(f'{PATH}/preds_swin.csv').iloc[:,1:]
display(preds_imgs.head(3))
Output of display(preds_txts.head(3)):

| | ImageID | 1 | 10 | 11 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30000.jpg | 0.998301 | 0.000451 | 0.000210 | 0.000346 | 0.000083 | 0.003643 | 0.001506 | 0.000638 | 0.002135 | 0.000389 | 0.000791 | 0.002140 | 0.000393 | 0.000249 | 0.000438 | 0.000617 | 0.001266 | 0.001365 |
| 1 | 30001.jpg | 0.998968 | 0.001570 | 0.001198 | 0.001324 | 0.000692 | 0.164543 | 0.002550 | 0.000277 | 0.002828 | 0.000271 | 0.015516 | 0.132975 | 0.001608 | 0.000189 | 0.001716 | 0.000302 | 0.014787 | 0.002360 |
| 2 | 30002.jpg | 0.995096 | 0.004213 | 0.000795 | 0.001416 | 0.000388 | 0.002184 | 0.004425 | 0.122538 | 0.020935 | 0.001455 | 0.002439 | 0.002995 | 0.002450 | 0.000806 | 0.001965 | 0.002318 | 0.003221 | 0.002456 |

Output of display(preds_imgs.head(3)):

| | ImageID | 1 | 10 | 11 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30000.jpg | 0.999971 | 0.000002 | 8.951434e-07 | 0.000007 | 4.309512e-08 | 0.001723 | 0.000429 | 0.000111 | 0.000993 | 0.000006 | 0.000253 | 0.001222 | 7.742877e-07 | 2.734892e-07 | 0.000014 | 6.954837e-07 | 0.000204 | 0.000002 |
| 1 | 30001.jpg | 0.998041 | 0.005104 | 2.146458e-03 | 0.001849 | 3.282930e-04 | 0.070261 | 0.002508 | 0.000139 | 0.002634 | 0.000187 | 0.007488 | 0.718282 | 9.322132e-04 | 1.164360e-03 | 0.130176 | 3.507837e-05 | 0.354144 | 0.001698 |
| 2 | 30002.jpg | 0.984484 | 0.005114 | 1.032739e-03 | 0.001152 | 4.530661e-05 | 0.051483 | 0.020992 | 0.006257 | 0.031368 | 0.000459 | 0.003815 | 0.018324 | 4.192708e-04 | 6.712324e-03 | 0.000940 | 7.478876e-04 | 0.006368 | 0.001133 |

merged_df = pd.merge(preds_txts, preds_imgs, on='ImageID')
# Get the column names to be averaged
columns_to_average = ['1', '10', '11', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9']
# Create a new dataset with ImageID and averaged columns
averaged_df = pd.DataFrame({'ImageID': merged_df['ImageID']})
for column in columns_to_average:
    column_x = f"{column}_x"
    column_y = f"{column}_y"
    averaged_df[column] = (merged_df[column_x] + merged_df[column_y]) / 2

First rows of averaged_df:

| | ImageID | 1 | 10 | 11 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30000.jpg | 0.999136 | 0.000226 | 0.000106 | 0.000177 | 0.000042 | 0.002683 | 0.000967 | 0.000374 | 0.001564 | 0.000198 | 0.000522 | 0.001681 | 0.000197 | 0.000125 | 0.000226 | 0.000309 | 0.000735 | 0.000684 |
| 1 | 30001.jpg | 0.998504 | 0.003337 | 0.001672 | 0.001587 | 0.000510 | 0.117402 | 0.002529 | 0.000208 | 0.002731 | 0.000229 | 0.011502 | 0.425629 | 0.001270 | 0.000676 | 0.065946 | 0.000169 | 0.184465 | 0.002029 |
| 2 | 30002.jpg | 0.989790 | 0.004664 | 0.000914 | 0.001284 | 0.000216 | 0.026834 | 0.012708 | 0.064397 | 0.026151 | 0.000957 | 0.003127 | 0.010660 | 0.001435 | 0.003759 | 0.001453 | 0.001533 | 0.004794 | 0.001795 |

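def create_labels_df(df, threshold=0.5):
    """Build the submission frame: one space-separated label string per image."""
    df = df.copy()
    label_columns = df.columns[1:]  # every column except ImageID
    labels = []
    for i in range(len(df)):
        row = df.iloc[i]
        # Keep each label whose averaged probability exceeds the threshold
        label_list = [col for col in label_columns if row[col] > threshold]
        labels.append(" ".join(label_list))
    df["Labels"] = labels
    return df[["ImageID", "Labels"]]
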
submission_df = create_labels_df(averaged_df)

Last rows of submission_df:

| | ImageID | Labels |
| --- | --- | --- |
| 9994 | 39994.jpg | 1 |
| 9995 | 39995.jpg | 1 |
| 9996 | 39996.jpg | 3 4 |
| 9997 | 39997.jpg | 1 |
| 9998 | 39998.jpg | 1 |
| 9999 | 39999.jpg | 1 |

submission_df.to_csv('submission.csv', index=False)
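To make the effect of averaging concrete, here is a small worked example for image 30001.jpg. The probabilities are copied from the prediction outputs above (columns follow the order of columns_to_average), restricted to three labels for brevity. The helper `to_labels` is a hypothetical, dictionary-based stand-in for create_labels_df and uses only the standard library.

```python
THRESHOLD = 0.5

# Probabilities for image 30001.jpg, labels 1, 3 and 8 only.
text_probs = {'1': 0.998968, '3': 0.132975, '8': 0.014787}   # BERT cased
image_probs = {'1': 0.998041, '3': 0.718282, '8': 0.354144}  # Swin

# Simple ensemble: mean of the two models' probabilities per label.
averaged = {label: (text_probs[label] + image_probs[label]) / 2
            for label in text_probs}


def to_labels(probs, threshold=THRESHOLD):
    """Join every label whose probability exceeds the threshold (hypothetical helper)."""
    return " ".join(label for label, p in probs.items() if p > threshold)


print("averaged:", averaged)
print("image model only:", to_labels(image_probs))
print("ensemble:", to_labels(averaged))
```

For this image, the image model alone would predict labels 1 and 3, but the text model is confident that label 3 is absent, so the averaged probability for label 3 falls to about 0.43 and the ensemble predicts only label 1.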
Step 8 - K-fold Image Model and Ensemble with Text DeBERTa-v3
In the final stage of my approach, I used the entire dataset for training by applying k-fold cross-validation to the image model. By combining the predictions of this model with those of the text model using deberta-v3, I achieved my highest score in the competition, reaching 0.90265. It is noteworthy that the transition from BERT cased to deberta-v3 was as simple as changing the line lmodel = "bert-base-cased" to lmodel = "microsoft/deberta-v3". This adjustment played a crucial role in enhancing the model’s performance.
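The k-fold idea can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the notebook's actual code: the number of folds (5), the helper names `kfold_indices` and `average_predictions`, and the dummy per-fold probabilities are all hypothetical. The source does not say how the fold models were combined; averaging their test-set probabilities is one common choice and is assumed here.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx


def average_predictions(per_fold_preds):
    """Average the per-label probabilities that each fold's model gives one image."""
    n_models = len(per_fold_preds)
    return [sum(col) / n_models for col in zip(*per_fold_preds)]


# Each fold holds out a different slice, so across folds every sample is trained on.
splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), "train /", len(valid_idx), "valid")

# Made-up probabilities from three fold models for one image and two labels.
fold_preds = [[0.9, 0.2], [0.8, 0.4], [0.7, 0.3]]
print(average_predictions(fold_preds))
```

The averaged image-model probabilities would then be combined with the text model's probabilities in the same way as in Step 7.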
Final Remarks
The next logical step in my approach would have been to implement augmentations tailored to the distribution of samples per class. Unfortunately, due to time constraints, I was unable to pursue this avenue. However, I hope that my journey and experiences inspire you to further enhance your skills and excel in future competitions. Remember, continuous learning and improvement are key to success. Good luck in your future endeavors!

