Artificial neural networks rely heavily on large datasets to train their models. However, with the constantly evolving data privacy regulations worldwide, obtaining high-quality data has become increasingly challenging, primarily due to security or privacy concerns. Read on to learn how to generate synthetic time series for your machine learning models.

The rise of generative AI and similar technologies requires businesses to innovate to stay ahead of the competition. Without access to secure training data, engineers cannot experiment and advance new technologies. The lack of diverse, secure datasets at scale makes it difficult to train predictive models. Synthesizing time series data poses unique challenges because it requires preserving both the temporal dependencies and the relationships among the variables themselves. A successful time-series model must capture not only the features and distributions within each time point but also the complex dynamics of these features over time.

Generating the synthetic data

The most critical component of time series synthesis is the generator. Instead of developing one from scratch, I will use the Gretel-synthetics generator package. This package provides a set of synthetic data generators for structured and unstructured text and time series, featuring differentially private learning.
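The package is available on PyPI. A typical setup looks like the following; the DGAN model runs on PyTorch, and exact dependency handling may vary by version, so check the project's documentation:

```shell
pip install gretel-synthetics
# The DGAN time-series model also requires PyTorch:
pip install torch
```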

Prepare the real data

To run the generator, we need to prepare real data that will be used to create the synthetic time series. The synthetic data should closely mirror the statistical properties of the real data. I used the open-source Bike Sharing Demand dataset, which contains hourly counts of rented bikes.


from sklearn.datasets import fetch_openml

def get_data():
    # Fetch the Bike Sharing Demand dataset from OpenML as a DataFrame
    bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
    df = bike_sharing.frame
    return df['count']

The returned series of hourly rental counts:

0         16
1         40
2         32
3         13
4          1
        ...
17374    119
17375     89
17376     90
17377     61
17378     49

Once the data is ready, I need to reshape it to match the requirements of the generator model class.


import numpy as np

MAX_SEQUENCE_LEN = 100
PLOT_VIEW = 600

# Get real data
bike_count = get_data()

# Prepare data format: (n_windows, MAX_SEQUENCE_LEN) -> (n_windows, MAX_SEQUENCE_LEN, 1)
features_2d = to_windows(bike_count, MAX_SEQUENCE_LEN)
features_3d = np.expand_dims(features_2d, axis=2)

The to_windows function implements a sliding window technique and converts a time series into a sequence of vectors. I detailed this function here.
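For completeness, here is a minimal sketch of the idea, assuming non-overlapping windows; the function name and signature match the snippet above, but the body is illustrative rather than the exact implementation from the linked post:

```python
import numpy as np

def to_windows(series, window_len):
    # Drop the tail so the length is a multiple of window_len,
    # then reshape the 1-D series into (n_windows, window_len).
    values = np.asarray(series, dtype=float)
    n_windows = len(values) // window_len
    return values[: n_windows * window_len].reshape(n_windows, window_len)

# A 10-point series split into windows of length 5 -> shape (2, 5)
print(to_windows(np.arange(10), 5).shape)
```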

Create the GAN

Next, I prepare the generating model. The model definition is the core function in our code. We must specify parameters such as max_sequence_len (the length of our time series), sample_len, and batch_size, among others. Documentation for DGAN and DGANConfig can be found here.

from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType
from gretel_synthetics.timeseries_dgan.dgan import DGAN

def create_model(features):
    model = DGAN(DGANConfig(
        max_sequence_len=features.shape[1],
        sample_len=20,
        batch_size=min(1000, features.shape[0]),
        apply_feature_scaling=True,
        apply_example_scaling=False,
        use_attribute_discriminator=False,
        generator_learning_rate=1e-4,
        discriminator_learning_rate=1e-4,
        epochs=100,
    ))
    model.train_numpy(
        features,
        feature_types=[OutputType.CONTINUOUS] * features.shape[2],
    )
    return model

Finally, the model is ready to run. Since training takes a significant amount of time, I save the trained model to disk so I can modify and debug the rest of the code without retraining. The SERIALISED flag controls whether the model is trained from scratch or loaded from a previous run.


import matplotlib.pyplot as plt

SERIALISED = False            # set to True to load the saved model instead of retraining
model_dump = "dgan_model.pt"  # example path for the saved model

# Create the model (or load a previously trained one)
if not SERIALISED:
    model = create_model(features_3d)
    model.save(model_dump)
else:
    model = DGAN.load(model_dump)

# Generate synthetic data: BATCHES_NUMBER windows of length MAX_SEQUENCE_LEN
BATCHES_NUMBER = 60
_, synth_chunk = model.generate_numpy(BATCHES_NUMBER)

# Stitch the generated windows back into a single series
synthetic = np.concatenate(synth_chunk, axis=0)

# Figure for comparing the real and synthetic series
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(6, 12),
                         layout='constrained')

Results

The results are promising. The synthetic series closely resemble the real one, retaining comparable statistical properties such as mean, variance, and autocorrelation. The first two are easy to compare numerically, but autocorrelation is difficult to assess without visualization. As the figure shows, the autocorrelation structure is roughly similar, though not a perfect match.

Figure: Real vs. generated time series
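A quick, library-free way to back the visual comparison with numbers is to compute the mean, variance, and lag-k autocorrelation of both series side by side. This is a minimal sketch; calling `compare_stats(bike_count, synthetic)` assumes the variables from the snippets above:

```python
import numpy as np

def lag_autocorr(x, lag=1):
    # Pearson correlation between a series and its lag-shifted copy
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

def compare_stats(real, synth, lag=1):
    # Print mean, variance, and lag-k autocorrelation for both series
    for name, series in (("real", real), ("synthetic", synth)):
        s = np.asarray(series, dtype=float)
        print(f"{name:>9}: mean={s.mean():8.2f}  "
              f"var={s.var():10.2f}  acf[{lag}]={lag_autocorr(s, lag):.3f}")

# Usage (assuming bike_count and synthetic from the snippets above):
# compare_stats(bike_count, synthetic)
```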

Summary

Generating synthetic time series can be challenging. Several open-source tools are available for this purpose, such as ydata-synthetic. In this post, I used the Gretel-synthetics generator. Initial tests indicate that it can produce synthetic time series of decent quality, though a comprehensive evaluation would require additional tests with various types of time series.