Data Generation Templates
Welcome to Starfishdata.ai Data Generation Templates
Data Generation Templates in Starfish provide a structured way to create and manage reusable workflows for generating synthetic data. These templates encapsulate the logic, input schemas, and output schemas required for data generation tasks, enabling developers to easily define and execute complex data generation processes. The templates are registered and managed through a central registry, allowing for easy discovery and reuse. src/starfish/data_template/template_gen.py
The system supports templates with pre- and post-hooks and uses Pydantic models for input and output schema definitions, which provides type safety and data validation. Templates can be combined with data_factory to create scalable data pipelines. src/starfish/data_template/templates/starfish/math_problem_gen_wf.py, src/starfish/data_template/template_gen.py
Template Registry
The data_gen_template object acts as a registry for data generation templates. It allows templates to be registered, listed, and retrieved. src/starfish/data_template/template_gen.py
Listing Templates
The list() method of the data_gen_template object returns a list of available templates. Each template is identified by a name, which typically follows the format subfolder_name/template_name. src/starfish/data_template/template_gen.py
The snippet at src/starfish/data_template/examples.py:9-10 demonstrates how to list the available templates.
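The original snippet is not reproduced on this page. As a rough illustration of the registry-and-list pattern described above, the following self-contained sketch uses a hypothetical TemplateRegistry class, not Starfish's actual data_gen_template implementation. The class, its add() method, and the placeholder functions are assumptions made for illustration only.

```python
from typing import Callable, Dict, List


class TemplateRegistry:
    """Hypothetical stand-in for a template registry (not Starfish's real class)."""

    def __init__(self) -> None:
        # Maps "subfolder_name/template_name" to the registered callable.
        self._templates: Dict[str, Callable] = {}

    def add(self, name: str, func: Callable) -> None:
        # Enforce the "subfolder_name/template_name" naming format.
        subfolder, sep, template = name.partition("/")
        if not (subfolder and sep and template):
            raise ValueError(f"expected 'subfolder_name/template_name', got {name!r}")
        self._templates[name] = func

    def list(self) -> List[str]:
        # Return the registered names in a stable, sorted order.
        return sorted(self._templates)


registry = TemplateRegistry()
registry.add("starfish/get_city_info_wf", lambda data: data)   # placeholder function
registry.add("community/topic_generator", lambda data: data)   # placeholder function

available = registry.list()
print(available)
```

Sorting the names is a design choice of this sketch; it keeps the listing stable regardless of registration order.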
Registering Templates
The @data_gen_template.register decorator registers a function as a data generation template. It takes several arguments, including the template name, input schema, output schema, description, author, Starfish version, and dependencies. src/starfish/data_template/template_gen.py, src/starfish/data_template/templates/community/topic_generator.py:20-26
The snippet at src/starfish/data_template/templates/community/topic_generator.py:20-32 shows an example of registering a template named community/topic_generator.
Retrieving Templates
The get() method of the data_gen_template object retrieves a registered template by its name. This method returns the registered function, which can then be executed with appropriate input data. src/starfish/data_template/template_gen.py
The snippet at src/starfish/data_template/examples.py:25-27 retrieves the community/topic_generator template and executes it.
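The retrieval code is likewise not shown here. The lookup-then-execute flow can be sketched as follows, using a plain dictionary and a hypothetical get() helper in place of the real registry. How the object returned by Starfish's get() is actually invoked is not stated on this page, so the direct call below is an assumption, as are the input field names.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry contents: template name -> callable.
_TEMPLATES: Dict[str, Callable[[dict], Any]] = {}


def topic_generator(input_data: dict) -> List[str]:
    # Placeholder logic standing in for a model call.
    name = input_data["community_name"]
    return [f"{name} discussion {i + 1}" for i in range(input_data["num_topics"])]


_TEMPLATES["community/topic_generator"] = topic_generator


def get(name: str) -> Callable[[dict], Any]:
    """Look up a template by name; on failure, report the known names."""
    try:
        return _TEMPLATES[name]
    except KeyError:
        known = ", ".join(sorted(_TEMPLATES)) or "<none>"
        raise KeyError(f"unknown template {name!r}; available: {known}") from None


template = get("community/topic_generator")
topics = template({"community_name": "Gardening", "num_topics": 3})
print(topics)
```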
Template Structure
Data generation templates typically consist of the following components:
- Input Schema: Defines the structure and data types of the input parameters required by the template. This is typically defined using Pydantic models. src/starfish/data_template/templates/community/topic_generator.py:4-9
- Output Schema: Defines the structure and data types of the data generated by the template. This is also typically defined using Pydantic models. src/starfish/data_template/templates/community/topic_generator.py:11-16
- Template Function: Contains the core logic for generating data. This function takes input data conforming to the input schema and returns data conforming to the output schema. src/starfish/data_template/templates/community/topic_generator.py:30-43
- Dependencies: Specifies any external libraries or packages required by the template. src/starfish/data_template/templates/community/topic_generator.py:25
Diagram (not reproduced here): the basic structure of a data generation template. Sources: src/starfish/data_template/templates/community/topic_generator.py:4-43, src/starfish/data_template/template_gen.py
Input and Output Schemas
Pydantic models are used to define the input and output schemas for data generation templates. These models provide type safety and data validation, ensuring that the template receives valid input data and generates data in the expected format. src/starfish/data_template/templates/community/topic_generator.py:4-16
The snippet at src/starfish/data_template/templates/community/topic_generator.py:4-16 shows an example of defining input and output schemas using Pydantic models.
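The schema definitions themselves are not reproduced on this page. The sketch below illustrates the idea of validated input and output schemas using standard-library dataclasses as a stand-in for Pydantic, so that it runs without third-party packages. With Pydantic, the classes would subclass BaseModel and the checks written by hand here would be handled by the library. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicGeneratorInput:
    """Hypothetical input schema; field names are illustrative."""
    community_name: str
    num_topics: int = 3

    def __post_init__(self) -> None:
        # Manual checks standing in for the validation Pydantic performs.
        if not isinstance(self.community_name, str) or not self.community_name.strip():
            raise ValueError("community_name must be a non-empty string")
        if isinstance(self.num_topics, bool) or not isinstance(self.num_topics, int):
            raise TypeError("num_topics must be an integer")
        if self.num_topics < 1:
            raise ValueError("num_topics must be at least 1")


@dataclass
class TopicGeneratorOutput:
    """Hypothetical output schema: the list of generated topics."""
    topics: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not all(isinstance(topic, str) for topic in self.topics):
            raise TypeError("every topic must be a string")


valid = TopicGeneratorInput(community_name="Gardening", num_topics=2)

try:
    TopicGeneratorInput(community_name="Gardening", num_topics=0)
except ValueError as exc:
    rejected = str(exc)  # invalid input is refused before any generation runs
```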
Template Function Implementation
The template function contains the core logic for generating data. This function takes input data conforming to the input schema and returns data conforming to the output schema. The function can use any necessary libraries or algorithms to generate the data. src/starfish/data_template/templates/community/topic_generator.py:30-43
The snippet at src/starfish/data_template/templates/community/topic_generator.py:30-43 shows an example of a template function that generates topics for community discussions.
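The function body is not reproduced here. As an illustration of the input-schema-in, output-schema-out contract described above, the following sketch defines a hypothetical generate_topics function. It replaces the AI model call with a fixed list of discussion angles so the result is deterministic; the schema classes, field names, and angle strings are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopicInput:
    """Hypothetical input schema."""
    community_name: str
    num_topics: int


@dataclass(frozen=True)
class TopicOutput:
    """Hypothetical output schema."""
    topics: List[str]


# Fixed angles used instead of a model call so the sketch is deterministic.
_ANGLES = ["getting started", "common mistakes", "tools and resources", "success stories"]


def generate_topics(input_data: TopicInput) -> TopicOutput:
    """Return num_topics discussion topics for the given community."""
    topics = [
        f"{input_data.community_name}: {_ANGLES[i % len(_ANGLES)]}"
        for i in range(input_data.num_topics)
    ]
    return TopicOutput(topics=topics)


result = generate_topics(TopicInput(community_name="Home Brewing", num_topics=2))
print(result.topics)
```

Keeping the function's signature tied to the two schema types is what lets a registry validate data on the way in and on the way out without knowing anything about the generation logic.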
Workflow Integration
Data generation templates can be integrated into data generation workflows using the data_factory decorator. This decorator allows the template function to be executed in parallel across multiple inputs, enabling scalable data generation. src/starfish/data_template/templates/starfish/get_city_info_wf.py, src/starfish/data_template/templates/community/topic_generator.py:33
The snippet at src/starfish/data_template/templates/community/topic_generator.py:35-37 shows an example of using the data_factory decorator to create a parallel data processing function.
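The data_factory usage is not shown on this page, and its real signature is not documented here. To convey the general idea of a decorator that runs one function in parallel across many inputs, the sketch below uses a hypothetical parallel_over_inputs decorator built on the standard library's ThreadPoolExecutor. It is a swapped-in technique for illustration and not Starfish's data_factory; the function and parameter names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps
from typing import Any, Callable, Dict, Iterable, List


def parallel_over_inputs(max_workers: int = 4):
    """Hypothetical decorator: run the wrapped function once per input record."""
    def decorator(func: Callable[..., Any]) -> Callable[[Iterable[Dict[str, Any]]], List[Any]]:
        @wraps(func)
        def run_all(records: Iterable[Dict[str, Any]]) -> List[Any]:
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # map() yields results in input order even if calls finish out of order.
                return list(pool.map(lambda record: func(**record), records))
        return run_all
    return decorator


@parallel_over_inputs(max_workers=2)
def describe_city(city_name: str, region_code: str) -> Dict[str, str]:
    # Placeholder for a slow call such as a model or API request.
    return {"city": city_name, "label": f"{city_name} ({region_code})"}


results = describe_city([
    {"city_name": "Lisbon", "region_code": "PT"},
    {"city_name": "Osaka", "region_code": "JP"},
    {"city_name": "Quito", "region_code": "EC"},
])
print([item["label"] for item in results])
```

Threads suit this sketch because the work being parallelized is typically I/O-bound (waiting on a model or an API); a real pipeline would also need retry and error handling, which are omitted here.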
Diagram (not reproduced here): how data generation templates are integrated into data generation workflows using data_factory. Sources: src/starfish/data_template/templates/community/topic_generator.py, src/starfish/data_template/templates/starfish/get_city_info_wf.py
Example Templates
The starfish repository includes several example data generation templates, including:
- community/topic_generator: Generates relevant topics for community discussions using AI models. src/starfish/data_template/templates/community/topic_generator.py
- starfish/math_problem_gen_wf: Generates math problem-solution pairs. src/starfish/data_template/templates/starfish/math_problem_gen_wf.py
- starfish/get_city_info_wf: Retrieves information about cities. src/starfish/data_template/templates/starfish/get_city_info_wf.py
- community/topic_generator_success: Generates relevant topics for community discussions using AI models. src/starfish/data_template/templates/community/topic_generator_success.py
These templates provide a starting point for creating custom data generation workflows. src/starfish/data_template/examples.py
Conclusion
Data Generation Templates provide a flexible and scalable way to generate synthetic data in Starfish. Pydantic models define the schemas, and the data_factory decorator integrates templates into workflows, so developers can build and run complex data generation processes with validated inputs and outputs. The template registry makes templates easy to discover and reuse, which simplifies the development of data generation pipelines. src/starfish/data_template/template_gen.py