Evaluating your custom model

Each custom model you train has a corresponding details page, viewable from the Hume Portal. The model details page displays metrics and visualizations to evaluate your model’s performance. This guide helps you interpret those metrics and offers guidance on ways to improve your custom model.

Custom model details

Limitations of model validation metrics

Model validation metrics are estimates based on a split of your dataset into training and evaluation sets. The larger the training set, the more reliable the metrics. However, it’s important to remember that these metrics are indicative and do not guarantee performance on unseen data.
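
For intuition, here is a minimal sketch of the kind of held-out split such metrics are computed on, using scikit-learn. This is illustrative only: it is not the exact split performed during training, and the sample texts and labels below are made up.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: text samples and their labels.
samples = [
    "I'd like a refund", "Thanks, that solved it", "This is unacceptable",
    "Great support!", "Still waiting on a reply", "Works perfectly now",
    "Very disappointed", "Couldn't be happier",
]
labels = [
    "unsatisfied", "satisfied", "unsatisfied", "satisfied",
    "unsatisfied", "satisfied", "unsatisfied", "satisfied",
]

# Hold out a portion of the data for evaluation. Metrics computed on the
# held-out portion are only an estimate of performance on truly unseen data,
# and the estimate is noisier for small datasets.
train_x, eval_x, train_y, eval_y = train_test_split(
    samples, labels, test_size=0.25, random_state=42, stratify=labels
)
```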

Assessing ‘good’ performance

  • Task-specific variance in performance metrics: with expression analysis, the complexity of your task determines the achievable range of model performance, which for classification models can technically vary from zero to perfect accuracy. Depending on that complexity, less-than-perfect performance may still be very useful as an indication of likelihood for your given target.
  • Influence of number of classes: prediction becomes more difficult as the number of classes in your dataset increases, particularly when the distinctions between classes are subtle. Conversely, chance-level accuracy is higher with fewer classes: for a 3-class problem the low-end performance is about 33% accuracy, versus 50% for a binary problem (see the sketch after this list).
  • Application-specific requirements: when establishing acceptable accuracy for a model, it’s important to consider the sensitivity and impact of its application. An appropriate accuracy threshold varies with the specific demands and potential consequences of the model’s use, requiring a nuanced understanding of how accuracy levels intersect with the objectives and risks of each unique application.
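
As a rough reference point for the class-count effect above, chance-level accuracy for a classifier that guesses uniformly at random is simply one divided by the number of classes (assuming roughly balanced classes):

```python
def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing, assuming balanced classes."""
    return 1.0 / num_classes

for n in (2, 3, 5, 10):
    print(f"{n} classes: chance accuracy ~ {chance_accuracy(n):.0%}")
# 2 classes: chance accuracy ~ 50%
# 3 classes: chance accuracy ~ 33%
# 5 classes: chance accuracy ~ 20%
# 10 classes: chance accuracy ~ 10%
```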

How is it possible that my model achieved 100% accuracy?

Achieving 100% accuracy is possible; however, especially with small datasets, it may indicate overfitting caused by feature leakage or other data anomalies. Feature leakage occurs when your model inadvertently learns from data that explicitly includes label information (e.g., sentences such as ‘I feel happy’ for the target label ‘happy’), leading to skewed results. To ensure more reliable performance, use larger datasets and check that your data does not unintentionally contain explicit information about the labels.
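
One simple way to screen for the most obvious form of feature leakage is to check whether any sample’s text literally contains its own label. The snippet below is a crude heuristic sketch (assuming your samples are short text strings), not an exhaustive leakage check:

```python
def find_label_leakage(samples, labels):
    """Return indices of samples whose text literally mentions their own label.

    A crude heuristic only: leakage can also be indirect (paraphrases, metadata,
    duplicates shared across splits), which this check will not catch.
    """
    flagged = []
    for i, (text, label) in enumerate(zip(samples, labels)):
        if label.lower() in text.lower():
            flagged.append(i)
    return flagged

samples = ["I feel happy today", "What a frustrating delay", "Lovely weather"]
labels = ["happy", "frustrated", "happy"]
print(find_label_leakage(samples, labels))  # [0] -> 'happy' appears in its own sample
```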

Advanced evaluation metrics

In addition to accuracy, advanced metrics are provided for a deeper evaluation of your custom model’s performance.

These metrics can be viewed on each custom model’s details page.
  • Accuracy: A fundamental metric in model performance evaluation which measures the proportion of correct predictions (true positives and true negatives) against the total number made. It’s straightforward and particularly useful for balanced datasets. However, accuracy can be misleading in imbalanced datasets where one class predominates, as a model might seem accurate by mainly predicting the majority class while neglecting the minority. This limitation underscores the importance of using additional metrics like precision, recall, and F1 score for a more nuanced assessment of model performance across different classes.
  • Precision: Measures how often the model’s positive predictions are correct. (e.g., when your model identifies a customer’s expression as ‘satisfied’, how often is the customer actually satisfied? Low precision means the model often misinterprets other expressions as satisfaction, leading to incorrect categorization.)
  • Recall: Measures how often the model correctly identifies actual positives. (e.g., of all the genuine expressions of satisfaction, how many does your model accurately identify as ‘satisfied’? Low recall implies the model is missing many true instances of customer satisfaction, failing to recognize them.)
  • F1: A metric that combines precision and recall, providing a balanced measure of a model’s accuracy, particularly useful in scenarios with class imbalance or when specific decision thresholds are vital.
  • Average Precision: A metric that calculates the weighted average of precision at each threshold, providing a comprehensive measure of a model’s performance across different levels of recall.
  • ROC AUC: (Area under the ROC curve) a comprehensive measure of a model’s ability to distinguish between classes across all possible thresholds, making it ideal for overall performance evaluation and comparative analysis of different models.
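
For reference, these metrics can also be reproduced offline with scikit-learn, given held-out labels and model outputs. The labels and scores below are made-up values for a hypothetical binary ‘satisfied’ classifier:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    average_precision_score, roc_auc_score,
)

# Illustrative held-out labels and model outputs (not real model results).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = satisfied, 0 = not satisfied
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # thresholded class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probability of 'satisfied'

print("Accuracy:         ", accuracy_score(y_true, y_pred))
print("Precision:        ", precision_score(y_true, y_pred))
print("Recall:           ", recall_score(y_true, y_pred))
print("F1:               ", f1_score(y_true, y_pred))
print("Average precision:", average_precision_score(y_true, y_score))
print("ROC AUC:          ", roc_auc_score(y_true, y_score))
```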

Improving model performance

  • Increase data quantity: adding more data often helps a model learn a broader range of the given target’s representations, increasing the likelihood of capturing diverse patterns, scenarios, and outliers.
  • Improve label quality: ensure that each data point in your dataset is labeled with clear, accurate, and consistent annotations. Properly defined labels are essential for reducing misinterpretations and confusion, allowing the model to accurately represent and learn from the dataset’s true characteristics. A balanced distribution of labels is also important so that the model is not biased toward a specific label (see the sketch after this list).
  • Enhance data quality: refine your dataset to ensure it is free from noise and irrelevant information. High-quality data (in terms of your target) enhances the model’s ability to make precise predictions and learn effectively from relevant features, critical in complex datasets.
  • Incorporate clear audio data: when working with models analyzing vocal expressions, ensure audio files include clear, audible spoken language. This enhances the model’s ability to accurately interpret and learn from vocal nuances. Explore various segmentation strategies and evaluate the effect that environmental sound may have on your model’s performance.
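
As noted under ‘Improve label quality’ above, a quick way to spot label imbalance before training is to count the label distribution. This is a minimal sketch, assuming your labels are available as a simple list:

```python
from collections import Counter

# Hypothetical label list for illustration.
labels = ["satisfied", "satisfied", "unsatisfied", "satisfied", "neutral", "satisfied"]

counts = Counter(labels)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label:<12} {count:>4} ({count / total:.0%})")
# A heavily skewed distribution suggests collecting more examples of the
# under-represented labels, or rebalancing the dataset before training.
```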