Introduction

Related topics: Installation, Architecture Overview

Introduction

Starfish is a Python library designed to streamline the creation of synthetic data. It adapts to user workflows by combining structured LLM outputs with efficient parallel processing, letting users define their data structure and scale seamlessly from experiments to production. Sources: README.md

The library provides tools for structured outputs, model flexibility, dynamic prompts, easy scaling, resilient pipelines, and complete control over the data generation process. Starfish supports structured data through JSON schemas or Pydantic models and is compatible with various LLM providers. Sources: README.md

Key Features

Starfish offers several key features that facilitate synthetic data generation.

Structured Outputs

Starfish provides first-class support for structured data through JSON schemas or Pydantic models. This allows users to define the structure of the generated data, ensuring consistency and ease of use. Sources: README.md

from starfish import StructuredLLM
from pydantic import BaseModel

class QnASchema(BaseModel):
    question: str
    answer: str

json_schema = [
    {'name': 'question', 'type': 'str'},
    {'name': 'answer', 'type': 'str'}, 
]

qna_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="Generate facts about {{city}}",
    output_schema=QnASchema  # or json_schema
)

This code snippet demonstrates how to define structured outputs using Pydantic models and JSON schemas with StructuredLLM. Sources: README.md
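
Once a StructuredLLM is defined, it can be invoked asynchronously. The sketch below reuses the qna_llm instance above and the response.data attribute shown in the scaling example later on; treat it as an illustrative usage pattern rather than a full API reference.

import asyncio

async def main():
    # Generate records for a single input; the parsed, schema-validated
    # records are available on the response's .data attribute.
    response = await qna_llm.run(city="San Francisco")
    print(response.data)

asyncio.run(main())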

Model Flexibility

Starfish is model-agnostic: any LLM provider can be used, including local models, OpenAI, Anthropic, or custom implementations via LiteLLM. Sources: README.md
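
Because models are addressed through LiteLLM-style identifiers, switching providers is typically just a change of model_name. The identifier strings below follow LiteLLM's provider prefixes and are illustrative assumptions; use whichever models your environment has configured.

# Same prompt and schema, different backends (illustrative LiteLLM-style identifiers)
openai_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="Generate facts about {{city}}",
    output_schema=QnASchema,
)
local_llm = StructuredLLM(
    model_name="ollama/llama3",
    prompt="Generate facts about {{city}}",
    output_schema=QnASchema,
)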

Dynamic Prompts

The library features dynamic prompts with built-in Jinja2 templates. This enables users to create flexible and customizable prompts for data generation. Sources: README.md

# Excerpt from src/starfish/llm/prompt/prompt_template.py
COMPLETE_PROMPTS = {
    "data_gen": """
You are a data generation expert. Your primary objective is to create
high-quality synthetic data that strictly adheres to the provided guidelines.
...
""",
}

This code shows an example of a complete prompt using Jinja2 templates for dynamic data generation. Sources: src/starfish/llm/prompt/prompt_template.py:4-42
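
Placeholders such as {{city}} are ordinary Jinja2 variables, so they render with standard Jinja2 semantics. A minimal illustration using Jinja2 directly (outside the Starfish API):

from jinja2 import Template

template = Template("Generate facts about {{city}}")
print(template.render(city="Tokyo"))  # -> Generate facts about Tokyo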

Easy Scaling

Starfish allows users to transform any function to run in parallel across thousands of inputs with a single decorator. This simplifies the process of scaling data generation tasks. Sources: README.md

from starfish import data_factory

@data_factory(max_concurrency=50)
async def parallel_qna_llm(city):
    response = await qna_llm.run(city=city)
    return response.data

cities = ["San Francisco", "New York", "Tokyo", "Paris", "London"] * 20
results = parallel_qna_llm.run(city=cities)

This snippet illustrates how to use the @data_factory decorator to scale a function for parallel execution. Sources: README.md
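
Assuming each element of results is one generated record shaped by the output schema (consistent with response.data above), the output can be consumed like any list of dictionaries:

for record in results:
    # e.g. {"question": "...", "answer": "..."} under QnASchema
    print(record["question"], "->", record["answer"])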

Resilient Pipeline

Starfish includes automatic retries, error handling, and job resumption. This ensures that data generation pipelines are resilient and can be paused and continued at any time. Sources: README.md

Complete Control

Starfish allows users to share state across pipelines and extend functionality with custom hooks, providing complete control over the data generation process. Sources: README.md

Storage Layer

The storage layer in Starfish is responsible for persisting metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends and a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: tests/data_factory/storage/README.md

Local Storage Implementation

The local storage implementation uses SQLite for metadata and JSON files for data artifacts. It includes tables for projects, jobs, and records. Sources: tests/data_factory/storage/README.md

# Excerpt from src/starfish/data_factory/storage/local/local_storage.py
class LocalStorage:
    async def save_project(self, project_data: Project) -> None:
        await self._metadata_handler.save_project_impl(project_data)

    async def get_project(self, project_id: str) -> Optional[Project]:
        return await self._metadata_handler.get_project_impl(project_id)
This code snippet shows how the LocalStorage class is used to save and retrieve project data. Sources: src/starfish/data_factory/storage/local/local_storage.py:56-62
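
A hypothetical usage sketch: the storage_uri constructor argument below is an assumption made for illustration and may differ from the actual LocalStorage signature.

import asyncio

from starfish.data_factory.storage.local.local_storage import LocalStorage
from starfish.data_factory.storage.models import Project

async def main():
    storage = LocalStorage(storage_uri="file:///tmp/starfish_test_db")  # assumed parameter name
    project = Project(name="Demo Project")
    await storage.save_project(project)
    fetched = await storage.get_project(project.project_id)
    print(fetched)

asyncio.run(main())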

Data Models

Starfish uses data models to represent projects, jobs, and records. These models are defined using Pydantic and include fields for metadata and data storage. Sources: src/starfish/data_factory/storage/models.py

from pydantic import BaseModel, Field
import uuid
from typing import Optional, Dict, Any

class Project(BaseModel):
    project_id: str = Field(default_factory=lambda: str(uuid.uuid4()), description="Unique project identifier.")
    name: str = Field(..., description="Project name.")
    description: Optional[str] = Field(None, description="Optional project description.")

This code snippet shows the definition of the Project data model. Sources: src/starfish/data_factory/storage/models.py:7-11
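
Because project_id uses a default_factory, omitting it yields a freshly generated UUID:

project = Project(name="Demo")
print(project.project_id)  # auto-generated UUID string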

Storage Workflow

The storage workflow involves creating projects, master jobs, and execution jobs, and then generating records. The metadata and data for these components are stored in the storage layer. Sources: tests/data_factory/storage/local/test_local_storage.py

import uuid

from starfish.data_factory.storage.models import GenerationMasterJob, Project

# `storage` is a test fixture providing a LocalStorage instance;
# `config_ref` and `TEST_DB_URI` are defined elsewhere in the test module.
async def test_complete_workflow(storage):
    project = Project(project_id=str(uuid.uuid4()), name="Workflow Test Project")
    await storage.save_project(project)

    master_job_id = str(uuid.uuid4())
    master_job = GenerationMasterJob(
        master_job_id=master_job_id,
        project_id=project.project_id,
        name="Workflow Test Job",
        status="pending",
        request_config_ref=config_ref,
        output_schema={"type": "object"},
        storage_uri=TEST_DB_URI,
        target_record_count=100,
    )
    await storage.log_master_job_start(master_job)

This code demonstrates the creation of a project and a master job in the storage workflow. Sources: tests/data_factory/storage/local/test_local_storage.py:42-61

Mermaid Diagram of Storage Workflow

This diagram illustrates the flow of operations in the storage workflow, from creating a project to logging the end of the master job. Sources: tests/data_factory/storage/local/test_local_storage.py, src/starfish/data_factory/storage/local/local_storage.py

Conclusion

Starfish provides a comprehensive set of tools and features for synthetic data generation, including structured outputs, model flexibility, easy scaling, and a resilient pipeline. The storage layer ensures that metadata and data artifacts are persisted and managed effectively. The combination of these features makes Starfish a powerful library for creating high-quality synthetic data. Sources: README.md


Installation

Related topics: Introduction, Configuration

Installation

Starfish is a Python library designed to facilitate the creation of synthetic data. Installation primarily involves using pip to install the core library. Optional dependencies are available for specific file parsing capabilities. This page provides a guide to installing Starfish and its optional components. Sources: README.md

Core Installation

The base Starfish library can be installed using pip. This provides the core functionalities for synthetic data generation.

pip install starfish-core

Sources: README.md
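
To verify the installation, import the package. The distribution name is starfish-core, but the import name is starfish, as shown in the examples throughout this documentation.

python -c "from starfish import StructuredLLM, data_factory"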

Optional Dependencies

Starfish supports optional dependencies for specific file parsers. These can be installed individually or all together. Sources: README.md

Installing Specific Parsers

To install support for a specific file type, use the corresponding extra specifier with pip.

pip install "starfish-core[pdf]"       # PDF support
pip install "starfish-core[docx]"      # Word document support
pip install "starfish-core[ppt]"       # PowerPoint support
pip install "starfish-core[excel]"     # Excel support
pip install "starfish-core[youtube]"   # YouTube support

Sources: README.md

Installing All Parser Dependencies

To install all supported parser dependencies at once, use the all extra.

pip install "starfish-core[all]"

Sources: README.md

Configuration

Starfish relies on environment variables for configuration. A .env.template file is provided to assist with initial setup. Sources: README.md

Setting Up Environment Variables

  1. Copy the .env.template file to .env.

    cp .env.template .env

    Sources: README.md

  2. Edit the .env file to include API keys, model configurations, and other runtime parameters (see the example below).

    nano .env  # or use your preferred editor

    Sources: README.md
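
A minimal sketch of what the resulting .env might contain. The variable names below follow the common convention for OpenAI- and Anthropic-compatible clients and are assumptions; the exact keys required depend on which providers you configure.

# .env (illustrative values only)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...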


Architecture Overview

Related topics: Data Factory, Structured LLM

Architecture Overview

The Starfish project employs a modular architecture centered around synthetic data generation. It leverages Large Language Model (LLM) capabilities, structured data outputs, and scalable data pipelines. The architecture supports flexibility in model selection, dynamic prompt engineering, and resilient job execution. Key components include the StructuredLLM for type-safe LLM interactions, the data_factory decorator for parallel processing, and a pluggable storage layer for managing metadata and data artifacts.

This document provides a high-level overview of the architecture, focusing on the interaction between these components. It also details the storage mechanisms used to persist project metadata, job configurations, and generated data.

Structured LLM Component

The StructuredLLM component facilitates structured data extraction from LLMs using JSON schemas or Pydantic models. It allows for type-safe outputs from any LLM provider, including local models, OpenAI, and Anthropic. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py

Key Features

  • Model Flexibility: Supports various LLM providers via LiteLLM.
  • Dynamic Prompts: Uses Jinja2 templates for dynamic prompt generation.
  • Structured Outputs: Enables structured data output through JSON schemas or Pydantic models.

Architecture Diagram

This diagram illustrates the flow from user input to a validated structured output using an LLM. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py

Code Snippet

from starfish import StructuredLLM
from pydantic import BaseModel

class QnASchema(BaseModel):
    question: str
    answer: str

qna_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="Generate facts about {{city}}",
    output_schema=QnASchema
)

This snippet demonstrates how to define a structured LLM with a Pydantic model for type safety. Sources: src/starfish/llm/structured_llm.py

Data Factory Component

The data_factory decorator transforms any function into a scalable data pipeline, enabling parallel processing across thousands of inputs. It provides automatic retries, error handling, and job resumption. Sources: src/starfish/data_factory/factory.py

Key Features

  • Easy Scaling: Transforms functions for parallel execution with a single decorator.
  • Resilient Pipeline: Includes automatic retries, error handling, and job resumption.
  • Complete Control: Allows sharing state across the pipeline and extending functionality with custom hooks.

Architecture Diagram

This diagram illustrates how the data_factory enables scalable and resilient data processing. Sources: src/starfish/data_factory/factory.py

Code Snippet

from starfish import data_factory

@data_factory(max_concurrency=50)
async def parallel_qna_llm(city):
    response = await qna_llm.run(city=city)
    return response.data

This snippet shows how to use the data_factory decorator to create a parallel data pipeline. Sources: src/starfish/data_factory/factory.py

Storage Layer

The storage layer persists metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends, with a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py, tests/data_factory/storage/test_storage_main.py

Key Features

  • Pluggable Interface: Supports different storage backends.
  • Hybrid Implementation: Uses SQLite for metadata and JSON files for data.
  • Comprehensive APIs: Provides APIs for storing projects, jobs, and records.

Architecture Diagram

This diagram illustrates the storage architecture, including metadata and data artifact persistence. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py

Data Models

The storage layer uses the following data models:

| Field | Type | Description | Source |
|---|---|---|---|
| project_id | str | Unique identifier for the project. | src/starfish/data_factory/storage/models.py |
| name | str | User-friendly name for the project. | src/starfish/data_factory/storage/models.py |
| master_job_id | str | Unique identifier for the master job. | src/starfish/data_factory/storage/models.py |
| status | str | Overall status of the job request. | src/starfish/data_factory/storage/models.py |
| request_config_ref | str | Reference to external request config JSON. | src/starfish/data_factory/storage/models.py |
| output_schema | dict | JSON definition of expected primary data structure. | src/starfish/data_factory/storage/models.py |
| storage_uri | str | Primary storage location config. | src/starfish/data_factory/storage/models.py |
| job_id | str | Unique identifier for the execution job. | src/starfish/data_factory/storage/models.py |
| run_config | str | Configuration for the execution job. | src/starfish/data_factory/storage/models.py |
| record_uid | str | Unique identifier for the record. | src/starfish/data_factory/storage/models.py |
| output_ref | str | Reference to the output data. | src/starfish/data_factory/storage/models.py |

Code Snippet

import uuid

from starfish.data_factory.storage.models import Project, GenerationMasterJob

project = Project(project_id=str(uuid.uuid4()), name="Test Project")
master_job = GenerationMasterJob(
    master_job_id=str(uuid.uuid4()),
    project_id=project.project_id,
    name="Test Job",
    status="pending",
    request_config_ref="config.json",
    output_schema={"type": "object"},
    storage_uri="file:///tmp/starfish_test_db",
    target_record_count=100,
)

This snippet demonstrates how to create Project and GenerationMasterJob data model instances. Sources: src/starfish/data_factory/storage/models.py
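
Since these are Pydantic models, they serialize naturally to dictionaries or JSON for persistence in the metadata store. The sketch below assumes Pydantic v2's model_dump/model_dump_json methods (Pydantic v1 uses .dict()/.json() instead).

# Serialize model instances for storage or logging (Pydantic v2 API assumed)
print(master_job.model_dump()["status"])  # "pending"
print(project.model_dump_json())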

Prompt Engineering

Starfish utilizes dynamic prompts with built-in Jinja2 templates for flexible and context-aware data generation. This allows for adapting prompts based on input parameters and external data. Sources: src/starfish/llm/prompt/prompt_template.py

Code Snippet

COMPLETE_PROMPTS = {
    "data_gen": """
You are a data generation expert. Your primary objective is to create
high-quality synthetic data that strictly adheres to the provided guidelines.

user_instruction: {{user_instruction}}
"""
}

This snippet shows an example of a complete prompt template using Jinja2 syntax. Sources: src/starfish/llm/prompt/prompt_template.py
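
Rendering the data_gen template follows the same Jinja2 semantics; the snippet below fills the {{user_instruction}} placeholder directly with Jinja2 (outside the Starfish API) purely for illustration.

from jinja2 import Template

from starfish.llm.prompt.prompt_template import COMPLETE_PROMPTS

rendered = Template(COMPLETE_PROMPTS["data_gen"]).render(
    user_instruction="Generate 5 question/answer pairs about astronomy."
)
print(rendered)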

Conclusion

The Starfish architecture combines the power of LLMs with scalable data pipelines and a robust storage layer. The StructuredLLM component provides type-safe interactions with LLMs, while the data_factory enables parallel processing and resilient job execution. The storage layer ensures persistence of metadata and data artifacts, supporting various storage backends. This modular architecture enables flexible and scalable synthetic data generation.