Introduction
Related Pages
Related topics: Installation, Architecture Overview
Starfish is a Python library designed to streamline the creation of synthetic data. It adapts to user workflows by combining structured LLM outputs with efficient parallel processing. Starfish allows users to define data structures and scale seamlessly from experiments to production. The library provides tools for structured outputs, model flexibility, dynamic prompts, easy scaling, resilient pipelines, and complete control over the data generation process. Starfish supports structured data through JSON schemas or Pydantic models and is compatible with various LLM providers. Sources: README.md
Key Features
Starfish offers several key features that facilitate synthetic data generation.
Structured Outputs
Starfish provides first-class support for structured data through JSON schemas or Pydantic models. This allows users to define the structure of the generated data, ensuring consistency and ease of use. The StructuredLLM component is the entry point for these type-safe outputs. Sources: README.md
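Since the original snippet is not reproduced here, the idea can be sketched without Starfish's actual API: an LLM is asked to emit JSON matching a declared schema, and the raw text is validated into a typed object. The QAPair model below is hypothetical, with a stdlib dataclass standing in for a Pydantic model.

```python
import json
from dataclasses import dataclass

# Hypothetical schema for generated Q&A data; Starfish itself would
# accept a Pydantic model or JSON schema in this role.
@dataclass
class QAPair:
    question: str
    answer: str

# Raw text as an LLM might return it; validating it into a typed
# object is the essence of "structured outputs".
raw = '{"question": "What is 2 + 2?", "answer": "4"}'
pair = QAPair(**json.loads(raw))
```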
Model Flexibility
Starfish is designed to be model-agnostic, allowing users to use any LLM provider. This includes local models, OpenAI, Anthropic, or custom implementations via LiteLLM. Sources: README.md
Dynamic Prompts
The library features dynamic prompts with built-in Jinja2 templates. This enables users to create flexible and customizable prompts for data generation. Sources: README.md
Easy Scaling
Starfish allows users to transform any function to run in parallel across thousands of inputs with a single decorator. This simplifies the process of scaling data generation tasks. The @data_factory decorator scales a function for parallel execution. Sources: README.md
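The decorator pattern can be sketched with a conceptual stand-in built on the standard library; parallel_over below is hypothetical and is not Starfish's @data_factory API, but it shows the same idea of fanning one function out over many inputs.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual stand-in for Starfish's @data_factory decorator:
# wrap a per-row function so it runs across all inputs in parallel.
def parallel_over(inputs, max_workers=8):
    def wrap(fn):
        def run():
            with ThreadPoolExecutor(max_workers=max_workers) as ex:
                # ex.map preserves input order in its results.
                return list(ex.map(fn, inputs))
        return run
    return wrap

@parallel_over(inputs=[{"city": c} for c in ["Paris", "Tokyo", "Cairo"]])
def generate_fact(row):
    return {"city": row["city"], "fact": f"A fact about {row['city']}"}

results = generate_fact()
```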
Resilient Pipeline
Starfish includes automatic retries, error handling, and job resumption. This ensures that data generation pipelines are resilient and can be paused and continued at any time. Sources: README.md
Complete Control
Starfish allows users to share state across pipelines and extend functionality with custom hooks, providing complete control over the data generation process. Sources: README.md
Storage Layer
The storage layer in Starfish is responsible for persisting metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends and a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: tests/data_factory/storage/README.md
Local Storage Implementation
The local storage implementation uses SQLite for metadata and JSON files for data artifacts. It includes tables for projects, jobs, and records. The LocalStorage class is used to save and retrieve project data. Sources: tests/data_factory/storage/README.md, src/starfish/data_factory/storage/local/local_storage.py:56-62
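The hybrid approach can be sketched with the standard library alone; this is an illustration of the SQLite-plus-JSON split, not the LocalStorage class's actual API.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Sketch of the hybrid local-storage idea: SQLite holds metadata rows,
# while JSON files on disk hold the generated data artifacts.
base = Path(tempfile.mkdtemp())
db = sqlite3.connect(str(base / "metadata.db"))
db.execute("CREATE TABLE projects (project_id TEXT PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO projects VALUES (?, ?)", ("proj-001", "demo-project"))
db.commit()

# A generated record goes to its own JSON file, referenced from metadata.
record = {"record_uid": "rec-001", "output": {"question": "...", "answer": "..."}}
(base / "rec-001.json").write_text(json.dumps(record))

# Retrieval: look up metadata in SQLite, then load the JSON artifact.
row = db.execute(
    "SELECT name FROM projects WHERE project_id = ?", ("proj-001",)
).fetchone()
```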
Data Models
Starfish uses data models to represent projects, jobs, and records. These models are defined using Pydantic and include fields for metadata and data storage. The Project data model is one example. Sources: src/starfish/data_factory/storage/models.py:7-11
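A hypothetical reconstruction of the Project model's shape, using field names from the storage documentation; the real class in models.py is a Pydantic model, sketched here as a stdlib dataclass.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative reconstruction only; field names follow the storage docs,
# but the actual definition lives in models.py as a Pydantic model.
@dataclass
class Project:
    project_id: str   # unique identifier for the project
    name: str         # user-friendly name for the project
    description: Optional[str] = None

project = Project(project_id="proj-001", name="demo-project")
```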
Storage Workflow
The storage workflow involves creating projects, master jobs, and execution jobs, and then generating records. The metadata and data for these components are stored in the storage layer. Sources: tests/data_factory/storage/local/test_local_storage.py
Mermaid Diagram of Storage Workflow
This diagram illustrates the flow of operations in the storage workflow, from creating a project to logging the end of the master job. Sources: tests/data_factory/storage/local/test_local_storage.py, src/starfish/data_factory/storage/local/local_storage.py
Conclusion
Starfish provides a comprehensive set of tools and features for synthetic data generation, including structured outputs, model flexibility, easy scaling, and a resilient pipeline. The storage layer ensures that metadata and data artifacts are persisted and managed effectively. The combination of these features makes Starfish a powerful library for creating high-quality synthetic data. Sources: README.md
Installation
Related Pages
Related topics: Introduction, Configuration
Starfish is a Python library designed to facilitate the creation of synthetic data. Installation primarily involves using pip to install the core library. Optional dependencies are available for specific file parsing capabilities. This page provides a guide to installing Starfish and its optional components. Sources: README.md
Core Installation
The base Starfish library can be installed using pip. This provides the core functionalities for synthetic data generation.
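Assuming the library is published under the package name starfish-core (the README does not reproduce the command here), the core installation would look like:

```shell
pip install starfish-core
```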
Optional Dependencies
Starfish supports optional dependencies for specific file parsers. These can be installed individually or all together. Sources: README.md
Installing Specific Parsers
To install support for a specific file type, use the corresponding extra specifier with pip.
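For example, assuming the package name starfish-core and a hypothetical "pdf" extra (the actual extra names are not listed in this page; substitute the parser you need):

```shell
# "pdf" is an illustrative extra name, not confirmed by this page.
pip install "starfish-core[pdf]"
```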
Installing All Parser Dependencies
To install all supported parser dependencies at once, use the "all" extra.
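Assuming the package name starfish-core, this would be:

```shell
pip install "starfish-core[all]"
```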
Configuration
Starfish relies on environment variables for configuration. A .env.template file is provided to assist with initial setup. Sources: README.md
Setting Up Environment Variables
- Copy the .env.template file to .env. Sources: README.md
- Edit the .env file to include API keys, model configurations, and other runtime parameters. Sources: README.md
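The two steps above amount to, from the repository root:

```shell
cp .env.template .env
# then open .env and fill in API keys, model configurations,
# and other runtime parameters
```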
Architecture Overview
Related Pages
Related topics: Data Factory, Structured LLM
The Starfish project employs a modular architecture centered around synthetic data generation. It leverages Large Language Model (LLM) capabilities, structured data outputs, and scalable data pipelines. The architecture supports flexibility in model selection, dynamic prompt engineering, and resilient job execution. Key components include the StructuredLLM for type-safe LLM interactions, the data_factory decorator for parallel processing, and a pluggable storage layer for managing metadata and data artifacts.
This document provides a high-level overview of the architecture, focusing on the interaction between these components. It also details the storage mechanisms used to persist project metadata, job configurations, and generated data.
Structured LLM Component
The StructuredLLM component facilitates structured data extraction from LLMs using JSON schemas or Pydantic models. It allows for type-safe outputs from any LLM provider, including local models, OpenAI, and Anthropic. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py
Key Features
- Model Flexibility: Supports various LLM providers via LiteLLM.
- Dynamic Prompts: Uses Jinja2 templates for dynamic prompt generation.
- Structured Outputs: Enables structured data output through JSON schemas or Pydantic models.
Architecture Diagram
This diagram illustrates the flow from user input to a validated structured output using an LLM. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py
Code Snippet
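The original snippet is not reproduced here; the prompt-to-validated-output flow can be sketched with a stand-in model call. Everything below (fake_llm, structured_call) is illustrative, not Starfish's API.

```python
import json

# Stand-in for a real provider call routed through LiteLLM.
def fake_llm(prompt: str) -> str:
    return '{"question": "What is the capital of France?", "answer": "Paris"}'

# Call the model, parse its text as JSON, and validate that the
# required keys of the declared schema are present.
def structured_call(prompt: str, required_keys: set) -> dict:
    data = json.loads(fake_llm(prompt))
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"LLM output missing keys: {missing}")
    return data

result = structured_call("Generate one Q&A pair.", {"question", "answer"})
```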
Data Factory Component
The data_factory decorator transforms any function into a scalable data pipeline, enabling parallel processing across thousands of inputs. It provides automatic retries, error handling, and job resumption. Sources: src/starfish/data_factory/factory.py
Key Features
- Easy Scaling: Transforms functions for parallel execution with a single decorator.
- Resilient Pipeline: Includes automatic retries, error handling, and job resumption.
- Complete Control: Allows sharing state across the pipeline and extending functionality with custom hooks.
Architecture Diagram
This diagram illustrates how the data_factory enables scalable and resilient data processing. Sources: src/starfish/data_factory/factory.py
Code Snippet
The data_factory decorator creates a parallel data pipeline. Sources: src/starfish/data_factory/factory.py
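The resilience aspect of the pipeline can be sketched with a simple retry wrapper; with_retries below is illustrative and greatly simplified (the real data_factory also handles async execution and job resumption).

```python
# Retry a per-row function a bounded number of times before giving up,
# mirroring the automatic-retry behavior of a resilient pipeline.
def with_retries(fn, attempts=3):
    def run(x):
        last_error = None
        for _ in range(attempts):
            try:
                return fn(x)
            except Exception as e:
                last_error = e
        raise last_error
    return run

# A flaky worker that fails twice with a transient error, then succeeds.
calls = {"count": 0}
def flaky_double(x):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return x * 2

result = with_retries(flaky_double)(21)
```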
Storage Layer
The storage layer persists metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends, with a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py, tests/data_factory/storage/test_storage_main.py
Key Features
- Pluggable Interface: Supports different storage backends.
- Hybrid Implementation: Uses SQLite for metadata and JSON files for data.
- Comprehensive APIs: Provides APIs for storing projects, jobs, and records.
Architecture Diagram
This diagram illustrates the storage architecture, including metadata and data artifact persistence. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py
Data Models
The storage layer uses the following data models:

| Field | Type | Description | Source |
|---|---|---|---|
| project_id | str | Unique identifier for the project. | src/starfish/data_factory/storage/models.py |
| name | str | User-friendly name for the project. | src/starfish/data_factory/storage/models.py |
| master_job_id | str | Unique identifier for the master job. | src/starfish/data_factory/storage/models.py |
| status | str | Overall status of the job request. | src/starfish/data_factory/storage/models.py |
| request_config_ref | str | Reference to external request config JSON. | src/starfish/data_factory/storage/models.py |
| output_schema | dict | JSON definition of expected primary data structure. | src/starfish/data_factory/storage/models.py |
| storage_uri | str | Primary storage location config. | src/starfish/data_factory/storage/models.py |
| job_id | str | Unique identifier for the execution job. | src/starfish/data_factory/storage/models.py |
| run_config | str | Configuration for the execution job. | src/starfish/data_factory/storage/models.py |
| record_uid | str | Unique identifier for the record. | src/starfish/data_factory/storage/models.py |
| output_ref | str | Reference to the output data. | src/starfish/data_factory/storage/models.py |
Code Snippet
Creating Project and GenerationMasterJob data model instances. Sources: src/starfish/data_factory/storage/models.py
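A hypothetical sketch of such instances, built from the field names in the table above; the real classes are Pydantic models in models.py, represented here as stdlib dataclasses with illustrative identifier values.

```python
from dataclasses import dataclass

# Illustrative stand-ins for the Pydantic models in models.py;
# only a subset of fields from the table above is shown.
@dataclass
class Project:
    project_id: str
    name: str

@dataclass
class GenerationMasterJob:
    master_job_id: str
    project_id: str
    status: str
    storage_uri: str

proj = Project(project_id="proj-001", name="demo-project")
job = GenerationMasterJob(
    master_job_id="mjob-001",
    project_id=proj.project_id,
    status="pending",
    storage_uri="file:///tmp/starfish",
)
```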
Prompt Engineering
Starfish utilizes dynamic prompts with built-in Jinja2 templates for flexible and context-aware data generation. This allows for adapting prompts based on input parameters and external data. Sources: src/starfish/llm/prompt/prompt_template.py
Code Snippet
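The original snippet is not reproduced here; a minimal sketch of dynamic prompt rendering with Jinja2 (the engine the docs name) looks like the following. The template text itself is illustrative, not taken from Starfish's source.

```python
from jinja2 import Template

# Render a prompt whose content adapts to input parameters.
template = Template("Generate {{ num }} Q&A pairs about {{ topic }}.")
prompt = template.render(num=3, topic="astronomy")
```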
Conclusion
The Starfish architecture combines the power of LLMs with scalable data pipelines and a robust storage layer. The StructuredLLM component provides type-safe interactions with LLMs, while the data_factory decorator enables parallel processing and resilient job execution. The storage layer ensures persistence of metadata and data artifacts, supporting various storage backends. This modular architecture enables flexible and scalable synthetic data generation.