Dataset

Downloading & Preprocessing

Let’s face it: video data is not easily accessible, and there aren’t many publicly available sources. In this section, we’ll guide you through downloading the necessary datasets, preprocessing the data, and ensuring it’s ready for training the Open-Sora model.

Note: Ensure you have sufficient storage space and bandwidth to download these large datasets. The total required disk space is ~37TB.

Download the Datasets

We’ll be using two primary datasets for our reproduction experiment:

  • OpenVid: Contains 1 million short video clips and corresponding captions.
    You can download the dataset from their Huggingface link (a scripted download sketch follows this list).
  • MiraData: Contains 330k long video clips and corresponding captions (other splits exist as well).
    For MiraData, we’ll follow the guidance from the author’s repository to download the 330K version of the dataset (the meta file we use is miradata_v1_330k.csv).
  • Custom Dataset: Our guide also covers how to use your own video dataset consisting of video clips and corresponding captions.
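
If you prefer to script the download, the huggingface_hub library can fetch a whole dataset repository in one call. Below is a minimal sketch, assuming the dataset lives at nkp37/OpenVid-1M on the Hub; verify the repo id against the Huggingface link above:

# Sketch: download OpenVid from the Hugging Face Hub.
# Assumption: the dataset repo id is nkp37/OpenVid-1M; verify it against
# the Huggingface link above before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nkp37/OpenVid-1M",
    repo_type="dataset",
    local_dir="/shared/datasets/OpenVid-1M",  # point this at your shared storage
)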

Dataset Summary

| Dataset  | License   | Dataset Size          | Clip Dimensions                     | Required Disk Space |
|----------|-----------|-----------------------|-------------------------------------|---------------------|
| OpenVid  | CC-BY-4.0 | 1M clips & captions   | Various resolutions & aspect ratios | 7.9TB               |
| MiraData | GPL-3.0   | 330k clips & captions | 1280x720 and 1920x1080              | 29TB                |

Preprocessing the Datasets

Both OpenVid and MiraData come with video clips and captions. Therefore, we can skip most of the preprocessing steps outlined in the Open-Sora data processing guide. However, we still need to add missing metadata to the CSV file for training purposes and filter out any large or unsupported files.

Required Columns for Training

To train using the Open-Sora code base, a CSV file with specific columns is required. The necessary columns are: path, text, num_frames, height, width, aspect_ratio, fps, resolution, and file_size.

But thankfully, there’s a script to generate most of these columns from only path and text. If you have a CSV file (dataset.csv) containing the path and text columns, you can compute the remaining required columns from these two by executing the following command:

python -m tools.datasets.datautil dataset.csv --info --remove-empty-caption

The command processes the videos in parallel and generates a new file named dataset_info_noempty.csv, which contains all the required metadata columns and excludes any entries with empty captions.
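
If you are assembling a custom dataset, a minimal sketch for producing the initial two-column dataset.csv might look like the following. It assumes every clip <name>.mp4 has a sidecar caption file <name>.txt; adjust to your own layout:

# Sketch: build the initial two-column dataset.csv for a custom dataset.
# Assumption: every clip <name>.mp4 has a sidecar caption file <name>.txt.
import csv
from pathlib import Path

root = Path("/shared/datasets/my_clips")  # adjust to your layout

with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "text"])
    for video in sorted(root.rglob("*.mp4")):
        caption = video.with_suffix(".txt")
        if caption.exists():
            writer.writerow([str(video), caption.read_text().strip()])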

Filtering Large Video Clips

To optimize training performance, we remove video clips larger than 50MB, as they are more expensive to load during training.

python -m tools.datasets.filter_large_videos dataset_info_noempty.csv 50

This results in a new file called dataset_info_noempty_le50MB.csv.
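
The repository script is the canonical way to do this, but the underlying operation is a simple pandas filter. A sketch, assuming the file_size column produced by the --info step is in bytes:

# Sketch of what the large-clip filter does; the repository script is the
# canonical version. Assumption: the file_size column is in bytes.
import pandas as pd

df = pd.read_csv("dataset_info_noempty.csv")
df = df[df["file_size"] <= 50 * 1024**2]  # keep clips of at most 50MB
df.to_csv("dataset_info_noempty_le50MB.csv", index=False)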

Filtering Broken and Unsupported Files

Open-Sora uses ffmpeg under the hood to open files on the fly. Some video clips may cause ffmpeg warnings or errors and, in the worst case, crash the training process. To prevent this, we need to filter out files that ffmpeg cannot decode. This process is CPU-intensive, so we’ll parallelize it across multiple servers.

The filtering idea is simple: read each file with ffmpeg, write any decoding errors to a file called $filename.err, and then filter based on the size of that error file.
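
Conceptually, the per-file check is a single ffmpeg decode pass with errors captured. A minimal Python sketch of that idea (the repository’s ffmpeg_check_parallel.sh, used in step 3 below, batches this over a file list):

# Sketch: decode one clip with ffmpeg and record any errors in <path>.err.
# An empty .err file means the clip decoded cleanly.
import subprocess
import sys

path = sys.argv[1]
result = subprocess.run(
    ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"],
    stderr=subprocess.PIPE,
)
with open(path + ".err", "wb") as f:
    f.write(result.stderr)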

Warning: This filtering process can be time-consuming depending on the size of your dataset and the number of nodes available for parallel processing.

Steps to Filter Out Problematic Video Clips

  1. Create filenames.txt containing all file paths:
    Extract the path column from dataset_info_noempty_le50MB.csv and save it to filenames.txt:

    python -c "import pandas as p, sys; p.read_csv(sys.argv[1]).path.to_csv(sys.argv[2], index=0, header=0)" dataset_info_noempty_le50MB.csv filenames.txt
    
  2. Split filenames.txt into Sub-Lists for Parallel Processing:
    Assuming you have 24 nodes available for checking, split filenames.txt into 24 sub-lists:
    
    split -n l/24 filenames.txt part_
    

    This will create 24 files named part_aa, part_ab, …, part_ax.

  3. Adapt and Run the ffmpeg Check Script on All Nodes:
    The following command runs the check script on every node, creating .err files alongside each video file in filenames.txt. An empty .err file indicates no errors, while a non-empty file signifies an ffmpeg error with that video. Here, nodes.txt is expected to contain one hostname per line, one for each of the 24 sub-lists.

    
    paste nodes.txt <(ls ./data_csvs/part_* | sort) | parallel --colsep '\t' ssh -tt {1} "bash $(pwd)/tools/datasets/ffmpeg_check_parallel.sh $(pwd)/{2}"
    
  4. Filter Out Files with ffmpeg Errors:
    Use the following Python script to filter out video files that have ffmpeg errors:
    
    python -m tools.datasets.ffmpeg_filter_without_errors dataset_info_noempty_le50MB.csv
    

    This will generate a new file named dataset_info_noempty_le50MB_withouterror.csv, excluding the problematic video clips and ensuring a stable training dataset.
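
    For intuition, here is a rough sketch of the logic behind this step (the repository script is authoritative): keep only the rows whose .err file exists and is empty.

    # Rough sketch of the error filter; the repository script is authoritative.
    # Assumption: each <path>.err was written next to its clip by the previous step.
    import os
    import pandas as pd

    df = pd.read_csv("dataset_info_noempty_le50MB.csv")

    def decoded_cleanly(path):
        err = path + ".err"
        return os.path.exists(err) and os.path.getsize(err) == 0

    df[df["path"].map(decoded_cleanly)].to_csv(
        "dataset_info_noempty_le50MB_withouterror.csv", index=False
    )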

Storing the Dataset on Shared Storage

After preprocessing, we need to make sure that all compute nodes have access to the preprocessed dataset, so we store it on a shared storage system accessible by all nodes.

For the remainder of this tutorial, we’ll assume that the filtered CSVs are saved in the training repository as follows:

  • the CSV for OpenVid data under OpenVid1M.csv
  • the combined CSV for OpenVid and MiraData under OpenVid1M-Miradata330k.csv
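
To produce the combined CSV, the two per-dataset files can simply be concatenated. A sketch, where Miradata330k.csv is a hypothetical name for your filtered MiraData CSV:

# Sketch: build the combined training CSV from the per-dataset CSVs.
# Assumption: "Miradata330k.csv" is a hypothetical name for your filtered
# MiraData CSV; substitute your actual file name.
import pandas as pd

openvid = pd.read_csv("OpenVid1M.csv")
miradata = pd.read_csv("Miradata330k.csv")
pd.concat([openvid, miradata], ignore_index=True).to_csv(
    "OpenVid1M-Miradata330k.csv", index=False
)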

Important: Ensure that the shared storage is mounted and accessible from all nodes in your cluster before initiating the training process.



What’s Next?
By following these steps, you’ve successfully downloaded, preprocessed, and prepared the dataset required for training the Open-Sora model. You’re now ready to proceed to the next stage: training the model on your cluster.

Proceed to the Training — Get the Ball Rolling section to begin training!