Artificial neural networks rely heavily on large datasets to train their models. However, with the constantly evolving data privacy regulations worldwide, obtaining high-quality data has become increasingly challenging, primarily due to security or privacy concerns. Read on to learn how to generate synthetic time series for your machine learning models.
The rise of generative AI and similar technologies requires businesses to innovate to stay ahead of the competition. Without access to secure training data, engineers cannot experiment and advance new technologies, and the lack of diverse, secure datasets at scale makes it difficult to train predictive models. Synthesizing time series data poses unique challenges: a successful model must capture not only the features and their distributions at each time point, but also how those features evolve over time, i.e., both the relationships among the variables and their temporal dynamics.
The most critical component of time series synthesis is the generator. Instead of developing one from scratch, I will use the gretel-synthetics package, which provides synthetic data generators for structured and unstructured text as well as time series, with support for differentially private learning.
To run the generator, we need to prepare real data that will be used to create the synthetic time series. The synthetic data should closely mirror the statistical properties of the real data. I used the open-source Bike Sharing Demand dataset, which, as the name suggests, contains hourly counts of rented bikes.
from sklearn.datasets import fetch_openml

def get_data():
    # Fetch the Bike Sharing Demand dataset from OpenML as a pandas DataFrame
    bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
    df = bike_sharing.frame
    # Keep only the hourly rental counts
    return df['count']
0 16
1 40
2 32
3 13
4 1
...
17374 119
17375 89
17376 90
17377 61
17378 49
Once the data is ready, I need to reshape it to match the requirements of the generator model class.
import numpy as np

MAX_SEQUENCE_LEN = 100
PLOT_VIEW = 600

# Get real data
bike_count = get_data()

# Prepare data format: windows first, then add a feature axis
# so the result has shape (examples, time, features)
features_2d = to_windows(bike_count, MAX_SEQUENCE_LEN)
features_3d = np.expand_dims(features_2d, axis=2)
The to_windows function implements a sliding window technique and converts a time series into a sequence of vectors. I detailed this function here.
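The full write-up of that function is linked above; for completeness, a minimal sketch of such a sliding-window helper might look like the following (the signature, the `stride` parameter, and its default are my assumptions, not necessarily the original implementation):

```python
import numpy as np

def to_windows(series, window_size, stride=1):
    # Sketch of a sliding-window transform: split a 1-D series into
    # fixed-length windows. stride=1 yields overlapping windows;
    # stride=window_size yields disjoint chunks.
    values = np.asarray(series)
    n_windows = (len(values) - window_size) // stride + 1
    return np.stack([values[i * stride : i * stride + window_size]
                     for i in range(n_windows)])
```

For example, `to_windows(range(10), 4, stride=2)` produces four windows of length four, the second being `[2, 3, 4, 5]`.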
from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType
from gretel_synthetics.timeseries_dgan.dgan import DGAN

def create_model(features):
    # Configure the DGAN model; max_sequence_len must be divisible by sample_len
    model = DGAN(DGANConfig(
        max_sequence_len=features.shape[1],
        sample_len=20,
        batch_size=min(1000, features.shape[0]),
        apply_feature_scaling=True,
        apply_example_scaling=False,
        use_attribute_discriminator=False,
        generator_learning_rate=1e-4,
        discriminator_learning_rate=1e-4,
        epochs=100,
    ))
    # Train on the 3-D array of shape (examples, time, features);
    # every feature here is a continuous variable
    model.train_numpy(
        features,
        feature_types=[OutputType.CONTINUOUS] * features.shape[2],
    )
    return model
Finally, the model is ready to run. Since training takes a significant amount of time, I save the model to facilitate code modifications and debugging. I use the SERIALISED flag for this purpose.
SERIALISED = False            # set to True to reuse a previously trained model
model_dump = "dgan_model.pt"  # arbitrary path for the saved model

# Create the model, or load a previously saved one
if not SERIALISED:
    model = create_model(features_3d)
    model.save(model_dump)
else:
    model = DGAN.load(model_dump)

# Generate synthetic data: BATCHES_NUMBER examples,
# each MAX_SEQUENCE_LEN time steps long
BATCHES_NUMBER = 60
_, synth_chunk = model.generate_numpy(BATCHES_NUMBER)

# Stitch the generated windows into one long series
synthetic = np.concatenate(synth_chunk, axis=0)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(6, 12),
                         layout='constrained')
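One plausible way to fill those four panels is to show a slice of the real series next to a few synthetic windows. The sketch below is self-contained, so it substitutes generated stand-in arrays for the article's `bike_count` and `synthetic` variables; the panel layout, the stand-in data, and the output filename are all illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

PLOT_VIEW = 600
rng = np.random.default_rng(0)
# Stand-ins for the real and generated series from the article
real = np.sin(np.linspace(0, 20, 2000)) * 50 + 100
synthetic = real[:1800] + rng.normal(0, 5, 1800)

fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(6, 12),
                         layout='constrained')
# Top panel: a PLOT_VIEW-point slice of the real series
axes[0].plot(real[:PLOT_VIEW])
axes[0].set_title('Real data')
# Remaining panels: consecutive PLOT_VIEW-point slices of the synthetic series
for i, ax in enumerate(axes[1:], start=1):
    start = (i - 1) * PLOT_VIEW
    ax.plot(synthetic[start:start + PLOT_VIEW])
    ax.set_title(f'Synthetic sample {i}')
fig.savefig('comparison.png')
```

Plotting the two side by side like this is a quick visual sanity check that the generated windows reproduce the scale and short-term dynamics of the real series.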