Introduction
Welcome to the Starfishdata.ai documentation.
Related Pages
Related topics: Installation, Architecture Overview
Starfish is a Python library designed to streamline the creation of synthetic data. It adapts to user workflows by combining structured LLM outputs with efficient parallel processing, letting users define their data structure and scale seamlessly from experiments to production. Sources: README.md
The library provides tools for structured outputs, model flexibility, dynamic prompts, easy scaling, resilient pipelines, and complete control over the data generation process. Starfish supports structured data through JSON schemas or Pydantic models and is compatible with various LLM providers. Sources: README.md
Key Features
Starfish offers several key features that facilitate synthetic data generation.
Structured Outputs
Starfish provides first-class support for structured data through JSON schemas or Pydantic models. This allows users to define the structure of the generated data, ensuring consistency and ease of use. Sources: README.md
This code snippet demonstrates how to define structured outputs using Pydantic models or JSON schemas with StructuredLLM. Sources: README.md
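A minimal sketch, assuming StructuredLLM accepts model_name, prompt, and output_schema arguments and exposes an async run() method (the exact constructor and return shape may differ from the real API):

```python
import asyncio

from pydantic import BaseModel

from starfish import StructuredLLM  # assumed import path


class QAPair(BaseModel):
    question: str
    answer: str


# Assumed constructor shape: a LiteLLM-style model id, a Jinja2 prompt,
# and a Pydantic model (or JSON schema) describing the output.
qa_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="Generate a question and answer about {{topic}}.",
    output_schema=QAPair,
)


async def main():
    response = await qa_llm.run(topic="photosynthesis")
    print(response.data)  # schema-validated output


asyncio.run(main())
```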
Model Flexibility
Starfish is designed to be model-agnostic, allowing users to use any LLM provider. This includes local models, OpenAI, Anthropic, or custom implementations via LiteLLM. Sources: README.md
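Because routing goes through LiteLLM, switching providers should only require a different model identifier. A sketch, reusing the assumed constructor from the example above (the plain-dict schema form is also an assumption):

```python
from starfish import StructuredLLM  # assumed import path


def make_llm(model_name: str) -> StructuredLLM:
    # Same prompt and schema; only the backing model changes.
    return StructuredLLM(
        model_name=model_name,
        prompt="Summarize {{text}} in one sentence.",
        output_schema={"summary": "str"},
    )


cloud_llm = make_llm("openai/gpt-4o-mini")                  # OpenAI
claude_llm = make_llm("anthropic/claude-3-haiku-20240307")  # Anthropic
local_llm = make_llm("ollama/llama3")                       # local model
```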
Dynamic Prompts
The library features dynamic prompts with built-in Jinja2 templates. This enables users to create flexible and customizable prompts for data generation. Sources: README.md
This code shows an example of a complete prompt using Jinja2 templates for dynamic data generation. Sources: src/starfish/llm/prompt/prompt_template.py:4-42
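An illustrative Jinja2 template of the kind the prompt module uses (the actual template in prompt_template.py differs):

```python
from jinja2 import Template

# Variables and conditionals let one template adapt to different inputs.
template = Template(
    "You are a synthetic data generator.\n"
    "Generate {{ num_records }} records about {{ topic }}.\n"
    "{% if examples %}Match the style of these examples:\n"
    "{% for ex in examples %}- {{ ex }}\n{% endfor %}{% endif %}"
)

print(template.render(num_records=3, topic="astronomy",
                      examples=["What is a pulsar?"]))
```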
Easy Scaling
Starfish allows users to transform any function to run in parallel across thousands of inputs with a single decorator. This simplifies the process of scaling data generation tasks. Sources: README.md
This snippet illustrates how to use the @data_factory decorator to scale a function for parallel execution. Sources: README.md
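A minimal sketch, assuming the decorator is applied as @data_factory(...) and gives the wrapped function a .run() method that fans out over lists of inputs (the max_concurrency parameter name is an assumption):

```python
from starfish import data_factory  # assumed import path


@data_factory(max_concurrency=50)
async def generate_city_fact(city: str):
    # Any async work fits here: an LLM call, an API request, etc.
    return [{"city": city, "fact": f"A fact about {city}."}]


# Runs the function in parallel across all inputs.
results = generate_city_fact.run(city=["Paris", "Tokyo", "Nairobi"])
```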
Resilient Pipeline
Starfish includes automatic retries, error handling, and job resumption. This ensures that data generation pipelines are resilient and can be paused and continued at any time. Sources: README.md
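A hypothetical sketch of resumption; the resume_from_checkpoint name is not confirmed by the source and stands in for whatever resumption API the library exposes:

```python
from starfish import data_factory  # assumed import path

# Suppose a long-running job was interrupted; its master job id was
# persisted by the storage layer when the run started.
results = data_factory.resume_from_checkpoint(  # hypothetical method name
    master_job_id="mj-456",
)
```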
Complete Control
Starfish allows users to share state across pipelines and extend functionality with custom hooks, providing complete control over the data generation process. Sources: README.md
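An entirely hypothetical sketch of how shared state and hooks might look; neither the state parameter nor the on_record_complete hook name is confirmed by the source:

```python
from starfish import data_factory  # assumed import path


def log_record(record):
    # Hypothetical hook: invoked once per completed record.
    print("completed:", record)


@data_factory(max_concurrency=10, on_record_complete=[log_record])  # hypothetical parameter
async def generate(topic: str, state: dict):
    # Hypothetical shared state: visible to every parallel task.
    state.setdefault("topics_seen", []).append(topic)
    return [{"topic": topic}]
```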
Storage Layer
The storage layer in Starfish is responsible for persisting metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends and a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: tests/data_factory/storage/README.md
Local Storage Implementation
The local storage implementation uses SQLite for metadata and JSON files for data artifacts. It includes tables for projects, jobs, and records. Sources: tests/data_factory/storage/README.md
This code snippet shows how the LocalStorage class is used to save and retrieve project data. Sources: src/starfish/data_factory/storage/local/local_storage.py:56-62
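A hedged sketch: the class and module paths come from the cited source, but the method names (setup, save_project, get_project) and their async signatures are assumptions:

```python
import asyncio

from starfish.data_factory.storage.local.local_storage import LocalStorage
from starfish.data_factory.storage.models import Project


async def main():
    storage = LocalStorage(storage_uri="file:///tmp/starfish_demo")
    await storage.setup()  # assumed: creates SQLite tables and data dirs

    project = Project(project_id="proj-123", name="demo-project")
    await storage.save_project(project)

    loaded = await storage.get_project("proj-123")
    print(loaded.name)  # "demo-project"


asyncio.run(main())
```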
Data Models
Starfish uses data models to represent projects, jobs, and records. These models are defined using Pydantic and include fields for metadata and data storage. Sources: src/starfish/data_factory/storage/models.py
This code snippet shows the definition of the Project data model. Sources: src/starfish/data_factory/storage/models.py:7-11
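A hedged reconstruction of the Project model; project_id and name come from the field table in the Architecture Overview, and anything beyond them is a guess:

```python
from pydantic import BaseModel, Field


class Project(BaseModel):
    project_id: str = Field(..., description="Unique identifier for the project.")
    name: str = Field(..., description="User-friendly name for the project.")
```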
Storage Workflow
The storage workflow involves creating projects, master jobs, and execution jobs, and then generating records. The metadata and data for these components are stored in the storage layer. Sources: tests/data_factory/storage/local/test_local_storage.py
This code demonstrates the creation of a project and a master job in the storage workflow. Sources: tests/data_factory/storage/local/test_local_storage.py:42-61
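A hedged sketch of that workflow, continuing the LocalStorage example above; save_master_job and the GenerationMasterJob fields are assumptions consistent with the data-model table in the Architecture Overview:

```python
import asyncio

from starfish.data_factory.storage.local.local_storage import LocalStorage
from starfish.data_factory.storage.models import GenerationMasterJob, Project


async def main():
    storage = LocalStorage(storage_uri="file:///tmp/starfish_demo")
    await storage.setup()

    # Step 1: create the project that owns the generation run.
    await storage.save_project(Project(project_id="proj-123", name="demo-project"))

    # Step 2: create the master job that tracks the overall request.
    await storage.save_master_job(
        GenerationMasterJob(
            master_job_id="mj-456",
            project_id="proj-123",
            status="running",
            storage_uri="file:///tmp/starfish_demo",
        )
    )


asyncio.run(main())
```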
Mermaid Diagram of Storage Workflow
This diagram illustrates the flow of operations in the storage workflow, from creating a project to logging the end of the master job. Sources: tests/data_factory/storage/local/test_local_storage.py, src/starfish/data_factory/storage/local/local_storage.py
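A reconstructed sketch of that sequence, using the same assumed method names as the code sketches above:

```mermaid
sequenceDiagram
    participant C as Client / Test
    participant S as LocalStorage
    C->>S: save_project(project)
    C->>S: save_master_job(master_job)
    C->>S: save_execution_job(job)
    C->>S: save_records(records)
    C->>S: update execution job status
    C->>S: log master job end
```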
Conclusion
Starfish provides a comprehensive set of tools and features for synthetic data generation, including structured outputs, model flexibility, easy scaling, and a resilient pipeline. The storage layer ensures that metadata and data artifacts are persisted and managed effectively. The combination of these features makes Starfish a powerful library for creating high-quality synthetic data. Sources: README.md
Installation
Related Pages
Related topics: Introduction, Configuration
Starfish is a Python library designed to facilitate the creation of synthetic data. Installation primarily involves using pip to install the core library. Optional dependencies are available for specific file parsing capabilities. This page provides a guide to installing Starfish and its optional components. Sources: README.md
Core Installation
The base Starfish library can be installed using pip. This provides the core functionality for synthetic data generation.
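Assuming the package is published as starfish-core, per the README:

```bash
pip install starfish-core
```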
Sources: README.md
Optional Dependencies
Starfish supports optional dependencies for specific file parsers. These can be installed individually or all together. Sources: README.md
Installing Specific Parsers
To install support for a specific file type, use the corresponding extra specifier with pip.
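For example (the pdf extra name is illustrative; check the project's packaging metadata for the actual extras):

```bash
pip install "starfish-core[pdf]"
```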
Sources: README.md
Installing All Parser Dependencies
To install all supported parser dependencies at once, use the all extra.
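Assuming the same starfish-core package name as above:

```bash
pip install "starfish-core[all]"
```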
Sources: README.md
Configuration
Starfish relies on environment variables for configuration. A .env.template file is provided to assist with initial setup. Sources: README.md
Setting Up Environment Variables
- Copy the .env.template file to .env, as shown below. Sources: README.md
- Edit the .env file to include API keys, model configurations, and other runtime parameters. Sources: README.md
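A minimal setup might look like this (the variable name is an example; the authoritative list lives in .env.template):

```bash
cp .env.template .env
# Then edit .env with your provider credentials, e.g.:
# OPENAI_API_KEY=sk-...
```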
Architecture Overview
Related Pages
Related topics: Data Factory, Structured LLM
The Starfish project employs a modular architecture centered around synthetic data generation. It leverages Language Model (LLM) capabilities, structured data outputs, and scalable data pipelines. The architecture supports flexibility in model selection, dynamic prompt engineering, and resilient job execution. Key components include StructuredLLM for type-safe LLM interactions, the data_factory decorator for parallel processing, and a pluggable storage layer for managing metadata and data artifacts.
This document provides a high-level overview of the architecture, focusing on the interaction between these components. It also details the storage mechanisms used to persist project metadata, job configurations, and generated data.
Structured LLM Component
The StructuredLLM component facilitates structured data extraction from LLMs using JSON schemas or Pydantic models. It allows for type-safe outputs from any LLM provider, including local models, OpenAI, and Anthropic. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py
Key Features
- Model Flexibility: Supports various LLM providers via LiteLLM.
- Dynamic Prompts: Uses Jinja2 templates for dynamic prompt generation.
- Structured Outputs: Enables structured data output through JSON schemas or Pydantic models.
Architecture Diagram
This diagram illustrates the flow from user input to a validated structured output using an LLM. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py
Code Snippet
This snippet demonstrates how to define a structured LLM with a Pydantic model for type safety. Sources: src/starfish/llm/structured_llm.py
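A minimal sketch under the same assumptions as the Introduction example (constructor shape and import path unverified):

```python
from pydantic import BaseModel

from starfish import StructuredLLM  # assumed import path


class Fact(BaseModel):
    statement: str
    confidence: float


# The Pydantic schema makes the LLM output type-safe: responses are parsed
# and validated against Fact before being returned.
fact_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",
    prompt="State one fact about {{subject}} with a confidence score.",
    output_schema=Fact,
)
```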
Data Factory Component
The data_factory decorator transforms any function into a scalable data pipeline, enabling parallel processing across thousands of inputs. It provides automatic retries, error handling, and job resumption. Sources: src/starfish/data_factory/factory.py
Key Features
- Easy Scaling: Transforms functions for parallel execution with a single decorator.
- Resilient Pipeline: Includes automatic retries, error handling, and job resumption.
- Complete Control: Allows sharing state across the pipeline and extending functionality with custom hooks.
Architecture Diagram
This diagram illustrates how data_factory enables scalable and resilient data processing. Sources: src/starfish/data_factory/factory.py
Code Snippet
This snippet shows how to use the data_factory decorator to create a parallel data pipeline. Sources: src/starfish/data_factory/factory.py
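A hedged sketch that combines both components: a StructuredLLM call wrapped in a data_factory pipeline so it runs in parallel across many topics (API shapes carried over from the earlier assumptions):

```python
from starfish import StructuredLLM, data_factory  # assumed import paths


@data_factory(max_concurrency=20)
async def generate_qa(topic: str):
    qa_llm = StructuredLLM(
        model_name="openai/gpt-4o-mini",
        prompt="Write a question and answer about {{topic}}.",
        output_schema={"question": "str", "answer": "str"},
    )
    response = await qa_llm.run(topic=topic)
    return response.data


# Each topic becomes a parallel task, with retries handled by the pipeline.
results = generate_qa.run(topic=["volcanoes", "tides", "auroras"])
```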
Storage Layer
The storage layer persists metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends, with a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py, tests/data_factory/storage/test_storage_main.py
Key Features
- Pluggable Interface: Supports different storage backends.
- Hybrid Implementation: Uses SQLite for metadata and JSON files for data.
- Comprehensive APIs: Provides APIs for storing projects, jobs, and records.
Architecture Diagram
This diagram illustrates the storage architecture, including metadata and data artifact persistence. Sources: src/starfish/data_factory/storage/local/local_storage.py, src/starfish/data_factory/storage/models.py
Data Models
The storage layer uses Pydantic data models for projects, master jobs, execution jobs, and records; their key fields are summarized below:
| Field | Type | Description | Source |
|---|---|---|---|
| project_id | str | Unique identifier for the project. | src/starfish/data_factory/storage/models.py |
| name | str | User-friendly name for the project. | src/starfish/data_factory/storage/models.py |
| master_job_id | str | Unique identifier for the master job. | src/starfish/data_factory/storage/models.py |
| status | str | Overall status of the job request. | src/starfish/data_factory/storage/models.py |
| request_config_ref | str | Reference to external request config JSON. | src/starfish/data_factory/storage/models.py |
| output_schema | dict | JSON definition of expected primary data structure. | src/starfish/data_factory/storage/models.py |
| storage_uri | str | Primary storage location config. | src/starfish/data_factory/storage/models.py |
| job_id | str | Unique identifier for the execution job. | src/starfish/data_factory/storage/models.py |
| run_config | str | Configuration for the execution job. | src/starfish/data_factory/storage/models.py |
| record_uid | str | Unique identifier for the record. | src/starfish/data_factory/storage/models.py |
| output_ref | str | Reference to the output data. | src/starfish/data_factory/storage/models.py |
Code Snippet
This snippet demonstrates how to create Project and GenerationMasterJob data model instances. Sources: src/starfish/data_factory/storage/models.py
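A hedged sketch: field names come from the table above, but the required-field sets of the real models may differ:

```python
from starfish.data_factory.storage.models import GenerationMasterJob, Project

project = Project(project_id="proj-123", name="demo-project")

master_job = GenerationMasterJob(
    master_job_id="mj-456",
    project_id=project.project_id,  # assumption: jobs reference their project
    status="pending",
    request_config_ref="file:///tmp/starfish_demo/request_config.json",
    output_schema={"question": "str", "answer": "str"},
    storage_uri="file:///tmp/starfish_demo",
)
```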
Prompt Engineering
Starfish utilizes dynamic prompts with built-in Jinja2 templates for flexible and context-aware data generation. This allows for adapting prompts based on input parameters and external data. Sources: src/starfish/llm/prompt/prompt_template.py
Code Snippet
This snippet shows an example of a complete prompt template using Jinja2 syntax. Sources: src/starfish/llm/prompt/prompt_template.py
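An illustrative rendering of such a template (the real template in prompt_template.py differs; this only shows the general shape of instructions plus Jinja2 placeholders):

```python
from jinja2 import Template

# A prompt skeleton: user instructions plus a schema description that the
# library would substitute in before calling the model.
COMPLETE_PROMPT = Template(
    "{{ user_instruction }}\n\n"
    "Return the result as JSON matching this schema:\n"
    "{{ schema_description }}"
)

print(
    COMPLETE_PROMPT.render(
        user_instruction="Generate a Q&A pair about machine learning.",
        schema_description='{"question": "str", "answer": "str"}',
    )
)
```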
Conclusion
The Starfish architecture combines the power of LLMs with scalable data pipelines and a robust storage layer. The StructuredLLM component provides type-safe interactions with LLMs, while the data_factory decorator enables parallel processing and resilient job execution. The storage layer ensures persistence of metadata and data artifacts, supporting various storage backends. This modular architecture enables flexible and scalable synthetic data generation.