Data Tips: What should your data look like?

Building your Dataset

Our custom model pipeline is designed to accommodate a wide range of data types, including audio, video, and text, automatically integrating multimodal patterns of expression and language. However, not all datasets are created equal. For best results, we recommend using a dataset that meets the following standards:

Data Quantity: Use a dataset of at least 20 files. More data generally improves model performance.

Data Consistency: Make sure the media type of each file (video, audio, image, text, etc.) is consistent across the dataset. In other words, all files within a given dataset should be of the same media type.

It’s generally wise to maintain a consistent naming convention and file format for your dataset. At minimum, ensure files have appropriate extensions, such as .wav, .mp3, .aif, .mov, or .mp4.
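The quantity and consistency checks above can be run on a list of filenames before uploading. The following is a minimal sketch, not part of the pipeline itself: the function name `check_media_consistency` and the `MEDIA_TYPES` mapping are hypothetical, and the mapping covers only the extensions mentioned above.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to media type; extend as needed.
MEDIA_TYPES = {
    ".wav": "audio", ".mp3": "audio", ".aif": "audio",
    ".mov": "video", ".mp4": "video",
}

def check_media_consistency(filenames, min_files=20):
    """Return a list of human-readable problems found in the file list."""
    problems = []
    if len(filenames) < min_files:
        problems.append(
            f"only {len(filenames)} files; at least {min_files} recommended"
        )
    types = Counter()
    for name in filenames:
        ext = Path(name).suffix.lower()
        if ext not in MEDIA_TYPES:
            problems.append(f"{name}: missing or unrecognized extension")
        else:
            types[MEDIA_TYPES[ext]] += 1
    # More than one media type means the dataset mixes, e.g., audio and video.
    if len(types) > 1:
        problems.append(f"mixed media types: {dict(types)}")
    return problems

if __name__ == "__main__":
    files = [f"clip_{i:02d}.wav" for i in range(20)]
    print(check_media_consistency(files))  # an empty list means no problems
```

An empty result means the file list passed all three checks; each string in a non-empty result names one issue to fix.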

Labeling Options: Depending on your model's objective (classification or regression), you can use different labeling approaches.

  • Classification Labels: use either integers or strings as labels (e.g., "confused," "focused"). A dataset must have at least two classes (binary) and no more than 50.
  • Regression Targets: use either integer or float values as targets. For ease of interpretation, consider scaling regression targets to a fixed range, often 0 to 1, or to another scale relevant to your data.
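One common way to bring regression targets into the 0 to 1 range is min-max scaling. The sketch below is an illustration only; `min_max_scale` is a hypothetical helper, and other scaling schemes may suit your data better.

```python
def min_max_scale(targets):
    """Linearly rescale numeric targets so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(targets), max(targets)
    if hi == lo:
        # A constant target carries no signal and cannot be rescaled.
        raise ValueError("all targets are identical; cannot scale")
    return [(t - lo) / (hi - lo) for t in targets]

if __name__ == "__main__":
    print(min_max_scale([1, 3, 5]))  # [0.0, 0.5, 1.0]
```

Keep the original minimum and maximum if you later need to convert model predictions back to the original scale.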

Label Consistency: We recommend that your labels follow a consistent format; e.g., do not mix integers and strings. Also check your labels for typos, because a misspelled label is treated as a separate class (e.g., “happy” vs. “hapy”).
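A quick script can catch mixed label formats and likely typos before training. The sketch below uses Python's standard `difflib` module to flag labels that look alike; the function name `check_labels` and the similarity cutoff of 0.85 are assumptions chosen for illustration, and flagged pairs still need a manual look.

```python
import difflib

def check_labels(labels, max_classes=50, cutoff=0.85):
    """Return a list of possible problems in a list of classification labels."""
    problems = []
    # Mixed Python types (e.g., int and str) indicate an inconsistent format.
    kinds = {type(label).__name__ for label in labels}
    if len(kinds) > 1:
        problems.append(f"mixed label types: {sorted(kinds)}")
    classes = sorted({str(label) for label in labels})
    if not 2 <= len(classes) <= max_classes:
        problems.append(
            f"{len(classes)} classes found; expected between 2 and {max_classes}"
        )
    # Compare each class name with the ones after it to find near-duplicates.
    for i, label in enumerate(classes):
        for other in difflib.get_close_matches(
            label, classes[i + 1:], n=3, cutoff=cutoff
        ):
            problems.append(f"possible typo: {label!r} vs {other!r}")
    return problems

if __name__ == "__main__":
    for problem in check_labels(["happy", "hapy", "sad", "happy"]):
        print(problem)
```

A lower cutoff catches more typos but also flags more legitimately distinct labels, so adjust it to your label set.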

Label Balance: If possible, it helps to have a balanced distribution of labels in your dataset. For example, if you have 50 files and two classes, the best case is to have 25 samples per class. Generally, you need at least 10 samples per class to train a useful model, but more data per class is always better.
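To see how balanced a dataset is, count the samples per class and flag any class below the 10-sample guideline. This is a minimal sketch; `label_balance` is a hypothetical helper, and the imbalance ratio it reports (largest class size divided by smallest) is just one simple way to summarize balance.

```python
from collections import Counter

def label_balance(labels, min_per_class=10):
    """Return per-class counts, under-represented classes, and an imbalance ratio."""
    if not labels:
        raise ValueError("no labels provided")
    counts = Counter(labels)
    # Classes with fewer samples than the recommended minimum.
    under = [cls for cls, n in counts.items() if n < min_per_class]
    # 1.0 means perfectly balanced; larger values mean more imbalance.
    ratio = max(counts.values()) / min(counts.values())
    return counts, under, ratio

if __name__ == "__main__":
    labels = ["confused"] * 25 + ["focused"] * 25
    counts, under, ratio = label_balance(labels)
    print(dict(counts), under, ratio)
```

For the 50-file, two-class example above, a perfectly balanced dataset gives a ratio of 1.0 and no under-represented classes.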