Data Generation Templates

Data Generation Templates in Starfish are reusable workflows for generating synthetic data. Each template bundles its generation logic with an input schema and an output schema, so a complex generation process can be defined once and executed repeatedly. Templates are registered in a central registry, which makes them discoverable and reusable. src/starfish/data_template/template_gen.py

Templates can define pre- and post-hooks, and they use Pydantic models for their input and output schemas, which provides type checking and data validation. Templates can also be combined with data_factory to build scalable data pipelines. src/starfish/data_template/templates/starfish/math_problem_gen_wf.py, src/starfish/data_template/template_gen.py
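The pre- and post-hook idea can be pictured with a small, library-independent sketch. Every name here (with_hooks, normalize_input, drop_empty, generate_questions) is hypothetical and is not Starfish's API; the sketch only shows the order in which a pre-hook, the template body, and a post-hook would run.

```python
from typing import Any, Callable, Optional

Hook = Callable[[Any], Any]


def with_hooks(pre: Optional[Hook] = None, post: Optional[Hook] = None):
    """Hypothetical decorator: apply `pre` to the input and `post` to the output."""

    def decorator(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
        def wrapper(input_data: Any) -> Any:
            if pre is not None:
                input_data = pre(input_data)  # runs before the template body
            result = func(input_data)
            if post is not None:
                result = post(result)  # runs after the template body
            return result

        return wrapper

    return decorator


def normalize_input(data: dict) -> dict:
    # Pre-hook: strip surrounding whitespace from the topic.
    return {**data, "topic": data["topic"].strip()}


def drop_empty(items: list) -> list:
    # Post-hook: remove empty strings from the generated items.
    return [item for item in items if item]


@with_hooks(pre=normalize_input, post=drop_empty)
def generate_questions(data: dict) -> list:
    # Stand-in for real generation logic; deliberately emits one empty item.
    return [f"What is {data['topic']}?", ""]


questions = generate_questions({"topic": "  algebra "})
print(questions)
```

Keeping hooks separate from the template body means input clean-up and output filtering can be reused across templates without touching the generation logic.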

Template Registry

The data_gen_template object acts as a registry for data generation templates. It allows templates to be registered, listed, and retrieved. src/starfish/data_template/template_gen.py
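As a rough mental model of the register, list, and get operations, the following is a minimal stdlib-only sketch. The TemplateRegistry and Template classes and the community/echo template are hypothetical illustrations and do not reflect how data_gen_template is actually implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Template:
    """Hypothetical record for one registered template."""

    name: str
    func: Callable[[Any], Any]
    description: str = ""

    def run(self, input_data: Any) -> Any:
        return self.func(input_data)


class TemplateRegistry:
    """Hypothetical registry keyed by 'subfolder_name/template_name'."""

    def __init__(self) -> None:
        self._templates: Dict[str, Template] = {}

    def register(self, name: str, description: str = ""):
        def decorator(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
            if name in self._templates:
                raise ValueError(f"Template {name!r} is already registered")
            self._templates[name] = Template(name, func, description)
            return func

        return decorator

    def list(self) -> List[str]:
        # Return the registered names in a stable, sorted order.
        return sorted(self._templates)

    def get(self, name: str) -> Template:
        try:
            return self._templates[name]
        except KeyError:
            raise KeyError(f"Unknown template {name!r}") from None


registry = TemplateRegistry()


@registry.register(name="community/echo", description="Returns its input")
def echo(input_data: Any) -> Any:
    return input_data


print(registry.list())
print(registry.get("community/echo").run({"x": 1}))
```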

Listing Templates

The list() method of the data_gen_template object returns a list of available templates. The templates are identified by a name, which typically follows the format subfolder_name/template_name. src/starfish/data_template/template_gen.py

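# Import path inferred from the module cited above (template_gen.py)
from starfish.data_template.template_gen import data_gen_template

result = data_gen_template.list()
print(result)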

This code snippet demonstrates how to list the available templates. src/starfish/data_template/examples.py:9-10

Registering Templates

The @data_gen_template.register decorator is used to register a function as a data generation template. This decorator takes several arguments, including the name of the template, input schema, output schema, description, author, Starfish version, and dependencies. src/starfish/data_template/template_gen.py, src/starfish/data_template/templates/community/topic_generator.py:20-26

@data_gen_template.register(
    name="community/topic_generator",
    input_schema=TopicGeneratorInput,
    output_schema=TopicGeneratorOutput,
    description="Generates relevant topics for community discussions using AI models",
    author="Your Name",
    starfish_version="0.1.0",
    dependencies=["transformers_1>=4.0.0"],
)
def topic_generator(input_data: TopicGeneratorInput) -> TopicGeneratorOutput:
    # Template implementation
    ...

This code snippet shows an example of registering a template named community/topic_generator. src/starfish/data_template/templates/community/topic_generator.py:20-32

Retrieving Templates

The get() method of the data_gen_template object retrieves a registered template by its name. The returned template can then be executed with appropriate input data through its run() method. src/starfish/data_template/template_gen.py

topic_generator = data_gen_template.get("community/topic_generator")
result = topic_generator.run(input_data)
print(result)

This code snippet retrieves the community/topic_generator template and executes it. src/starfish/data_template/examples.py:25-27

Template Structure

Data generation templates typically consist of the following components:

  1. Input Schema: Defines the structure and data types of the input parameters required by the template. This is typically defined using Pydantic models. src/starfish/data_template/templates/community/topic_generator.py:4-9
  2. Output Schema: Defines the structure and data types of the data generated by the template. This is also typically defined using Pydantic models. src/starfish/data_template/templates/community/topic_generator.py:11-16
  3. Template Function: Contains the core logic for generating data. This function takes input data conforming to the input schema and returns data conforming to the output schema. src/starfish/data_template/templates/community/topic_generator.py:30-43
  4. Dependencies: Specifies any external libraries or packages required by the template. src/starfish/data_template/templates/community/topic_generator.py:25

[Diagram: basic structure of a data generation template.] Sources: src/starfish/data_template/templates/community/topic_generator.py:4-43, src/starfish/data_template/template_gen.py
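The dependencies component listed above could, for instance, be checked before a template runs. The sketch below is an assumption about how such a check might look, not Starfish's behaviour: missing_dependencies is a hypothetical helper, it uses only the standard library, and it checks whether a distribution is installed while ignoring the version specifier.

```python
import re
from importlib import metadata
from typing import List


def missing_dependencies(requirements: List[str]) -> List[str]:
    """Return the requirement strings whose distribution is not installed.

    Only the distribution name is checked; version specifiers such as
    ">=4.0.0" are ignored in this sketch.
    """
    missing = []
    for requirement in requirements:
        # Keep the name part that precedes any specifier, extra, or marker.
        name = re.split(r"[<>=!~\s\[;]", requirement.strip(), maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(requirement)
    return missing


# A deliberately implausible package name, so it is reported as missing.
print(missing_dependencies(["surely-not-installed-package-xyz123>=1.0"]))
```

A production check would also need to compare versions, for which a dedicated requirement parser is a better fit than a regular expression.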

Input and Output Schemas

Pydantic models are used to define the input and output schemas for data generation templates. These models provide type safety and data validation, ensuring that the template receives valid input data and generates data in the expected format. src/starfish/data_template/templates/community/topic_generator.py:4-16

from pydantic import BaseModel

# Define input schema
class TopicGeneratorInput(BaseModel):
    community_name: str
    seed_topics: list[str]
    num_topics: int
    language: str = "en"


# Define output schema
class TopicGeneratorOutput(BaseModel):
    generated_topics: list[str]
    success: bool
    message: str

This code snippet shows an example of defining input and output schemas using Pydantic models. src/starfish/data_template/templates/community/topic_generator.py:4-16
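To make the validation behaviour concrete, here is a short sketch that reuses the input schema above. It assumes Pydantic is installed; the sample values (python-users, the seed topics) are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class TopicGeneratorInput(BaseModel):
    community_name: str
    seed_topics: list[str]
    num_topics: int
    language: str = "en"


# Valid input: the omitted `language` field falls back to its default.
valid = TopicGeneratorInput(
    community_name="python-users",
    seed_topics=["typing", "asyncio"],
    num_topics=3,
)
print(valid.language)

# Invalid input: `num_topics` cannot be interpreted as an integer,
# so the model is never constructed.
try:
    TopicGeneratorInput(
        community_name="python-users",
        seed_topics=["typing"],
        num_topics="many",
    )
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)
```

Because validation happens when the model is constructed, a template function that receives a TopicGeneratorInput can rely on its fields having the declared types.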

Template Function Implementation

The template function holds the core generation logic: it accepts input that conforms to the input schema and returns output that conforms to the output schema, and it may use whatever libraries or algorithms the task needs. In the example below, failures are caught and reported through the success and message fields of the output rather than raised to the caller. src/starfish/data_template/templates/community/topic_generator.py:30-43

def topic_generator(input_data: TopicGeneratorInput) -> TopicGeneratorOutput:
    try:
        # Step 1: Generate initial topics
        generated_topics = generate_initial_topics(input_data)

        # Step 2: Process topics in parallel
        @data_factory(max_concurrency=10)
        async def process_topics(topics: list[str]) -> list[str]:
            return [refine_topic(topic) for topic in topics]

        refined_topics = process_topics.run(generated_topics)

        return TopicGeneratorOutput(generated_topics=refined_topics, success=True, message="Topics generated successfully")
    except Exception as e:
        return TopicGeneratorOutput(generated_topics=[], success=False, message=str(e))

This code snippet shows an example of a template function that generates topics for community discussions. src/starfish/data_template/templates/community/topic_generator.py:30-43
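The error-reporting pattern in that function can be tried in isolation. The following sketch is a simplified, synchronous stand-in: TopicResult replaces TopicGeneratorOutput with a plain dataclass, and generate_initial_topics and refine_topic are hypothetical stubs, since the real helpers are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicResult:
    """Stand-in for TopicGeneratorOutput, kept stdlib-only for this sketch."""

    generated_topics: List[str] = field(default_factory=list)
    success: bool = True
    message: str = ""


def generate_initial_topics(seed_topics: List[str], num_topics: int) -> List[str]:
    # Hypothetical stub: a real template would call a model here.
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    return [f"{seed} basics" for seed in seed_topics][:num_topics]


def refine_topic(topic: str) -> str:
    # Hypothetical stub: title-case the topic.
    return topic.title()


def topic_generator(seed_topics: List[str], num_topics: int) -> TopicResult:
    try:
        topics = generate_initial_topics(seed_topics, num_topics)
        refined = [refine_topic(topic) for topic in topics]
        return TopicResult(refined, True, "Topics generated successfully")
    except Exception as exc:
        # Failures are reported in the result instead of propagating.
        return TopicResult([], False, str(exc))


ok = topic_generator(["python", "rust"], 2)
bad = topic_generator(["python"], 0)
print(ok)
print(bad)
```

Returning a result object on failure keeps the output shape uniform, at the cost of requiring callers to check the success field before using generated_topics.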

Workflow Integration

Data generation templates can be integrated into data generation workflows using the data_factory decorator. This decorator allows the template function to be executed in parallel across multiple inputs, enabling scalable data generation. src/starfish/data_template/templates/starfish/get_city_info_wf.py, src/starfish/data_template/templates/community/topic_generator.py:33

@data_factory(max_concurrency=10)
async def process_topics(topics: list[str]) -> list[str]:
    return [refine_topic(topic) for topic in topics]

This code snippet shows an example of using the data_factory decorator to create a parallel data processing function. src/starfish/data_template/templates/community/topic_generator.py:35-37
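The general idea of capping parallel work can be sketched with asyncio alone. This is not how data_factory is implemented; run_bounded is a hypothetical helper that uses a semaphore to limit in-flight tasks, and refine_topic here is a stub standing in for an I/O-bound call.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    worker: Callable[[T], Awaitable[R]],
    items: List[T],
    max_concurrency: int,
) -> List[R]:
    """Run `worker` over `items` with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as `items`.
    return await asyncio.gather(*(guarded(item) for item in items))


async def refine_topic(topic: str) -> str:
    # Hypothetical stand-in for an I/O-bound call such as a model request.
    await asyncio.sleep(0.01)
    return topic.title()


refined = asyncio.run(
    run_bounded(refine_topic, ["open source", "code review"], max_concurrency=10)
)
print(refined)
```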

[Diagram: integration of data generation templates into data generation workflows using data_factory.] Sources: src/starfish/data_template/templates/community/topic_generator.py, src/starfish/data_template/templates/starfish/get_city_info_wf.py

Example Templates

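The starfish repository includes several example data generation templates, among them the template files cited in this document under src/starfish/data_template/templates/: community/topic_generator.py, starfish/get_city_info_wf.py, and starfish/math_problem_gen_wf.py. These templates provide a starting point for creating custom data generation workflows. src/starfish/data_template/examples.py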

Conclusion

Data Generation Templates provide a flexible and scalable way to generate synthetic data in Starfish. Pydantic models define each template's input and output schemas, the data_factory decorator runs template logic in parallel, and the template registry makes templates discoverable and reusable across data generation pipelines. src/starfish/data_template/template_gen.py