Welcome to Starfishdata.ai Documentation
Starfish provides a few core building blocks:

- `StructuredLLM`: a component for type-safe, structured LLM interactions. Sources: README.md
- `@data_factory`: a decorator that scales a function for parallel execution. Sources: README.md
- `LocalStorage`: a class used to save and retrieve project data. Sources: src/starfish/data_factory/storage/local/local_storage.py:56-62
- `Project`: a data model for project metadata. Sources: src/starfish/data_factory/storage/models.py:7-11
Starfish is installed with `pip`. Optional dependencies are available for specific file parsing capabilities. This page provides a guide to installing Starfish and its optional components. Sources: README.md

Install the base package with `pip`; this provides the core functionality for synthetic data generation. Optional dependencies are installed the same way, and the `all` extra pulls in every optional dependency at once.
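A minimal sketch of the install commands. The PyPI package name `starfish-core` is an assumption; confirm it in README.md:

```bash
# Core library only (package name assumed; see README.md)
pip install starfish-core

# All optional dependencies, via the `all` extra
pip install "starfish-core[all]"
```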
A `.env.template` file is provided to assist with initial setup. Sources: README.md

Copy the `.env.template` file to `.env`, then update the `.env` file to include API keys, model configurations, and other runtime parameters.
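For example (the variable names in the comments are illustrative placeholders, not a definitive list; the actual keys are listed in `.env.template`):

```bash
# Create a local configuration from the template
cp .env.template .env

# Then edit .env to set API keys and model configuration, e.g.:
#   OPENAI_API_KEY=sk-...
#   ANTHROPIC_API_KEY=...
```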
Starfish is built around three core components: `StructuredLLM` for type-safe LLM interactions, the `data_factory` decorator for parallel processing, and a pluggable storage layer for managing metadata and data artifacts.
This document provides a high-level overview of the architecture, focusing on the interaction between these components. It also details the storage mechanisms used to persist project metadata, job configurations, and generated data.
The `StructuredLLM` component facilitates structured data extraction from LLMs using JSON schemas or Pydantic models. It allows for type-safe outputs from any LLM provider, including local models, OpenAI, and Anthropic. Sources: src/starfish/llm/structured_llm.py, src/starfish/llm/prompt/prompt_template.py
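A minimal sketch of typical usage. The constructor parameters (`model_name`, `prompt`, `output_schema`), the awaitable `run()` method, and the `.data` attribute are assumptions drawn from README-style examples; consult src/starfish/llm/structured_llm.py for the exact API:

```python
import asyncio

from starfish import StructuredLLM  # import path assumed

# Declare the prompt template and the expected output structure.
fact_llm = StructuredLLM(
    model_name="openai/gpt-4o-mini",  # any supported provider/model
    prompt="Funny facts about the city {{city_name}}.",
    output_schema=[{"name": "fact", "type": "str"}],  # or a Pydantic model
)

async def main():
    # run() is assumed to return typed, schema-conformant data
    # rather than raw text.
    response = await fact_llm.run(city_name="New York")
    print(response.data)

asyncio.run(main())
```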
The `data_factory` decorator transforms any function into a scalable data pipeline, enabling parallel processing across thousands of inputs. It provides automatic retries, error handling, and job resumption, which is what makes data processing with `data_factory` both scalable and resilient. A sketch of using the decorator to create a parallel data pipeline follows. Sources: src/starfish/data_factory/factory.py
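The `max_concurrency` parameter and the `.run()` entry point on the decorated function are assumptions based on README-style usage; see src/starfish/data_factory/factory.py for the actual interface:

```python
import asyncio

from starfish import data_factory  # import path assumed

@data_factory(max_concurrency=10)  # parameter name assumed
async def generate_city_facts(city_name: str):
    # In practice this body would call a StructuredLLM instance;
    # a placeholder keeps the sketch self-contained.
    await asyncio.sleep(0)  # stand-in for an LLM call
    return {"city": city_name, "fact": f"A fact about {city_name}"}

# Fan out over many inputs in parallel; retries, error handling,
# and resumption are handled by the factory per the description above.
results = generate_city_facts.run(city_name=["New York", "London", "Tokyo"])
```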
The storage data models expose the following fields:

| Field | Type | Description | Source |
|---|---|---|---|
| `project_id` | `str` | Unique identifier for the project. | src/starfish/data_factory/storage/models.py |
| `name` | `str` | User-friendly name for the project. | src/starfish/data_factory/storage/models.py |
| `master_job_id` | `str` | Unique identifier for the master job. | src/starfish/data_factory/storage/models.py |
| `status` | `str` | Overall status of the job request. | src/starfish/data_factory/storage/models.py |
| `request_config_ref` | `str` | Reference to external request config JSON. | src/starfish/data_factory/storage/models.py |
| `output_schema` | `dict` | JSON definition of expected primary data structure. | src/starfish/data_factory/storage/models.py |
| `storage_uri` | `str` | Primary storage location config. | src/starfish/data_factory/storage/models.py |
| `job_id` | `str` | Unique identifier for the execution job. | src/starfish/data_factory/storage/models.py |
| `run_config` | `str` | Configuration for the execution job. | src/starfish/data_factory/storage/models.py |
| `record_uid` | `str` | Unique identifier for the record. | src/starfish/data_factory/storage/models.py |
| `output_ref` | `str` | Reference to the output data. | src/starfish/data_factory/storage/models.py |
These fields belong to the `Project` and `GenerationMasterJob` data model instances persisted by the storage layer, along with the per-job and per-record models. Sources: src/starfish/data_factory/storage/models.py
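As a rough sketch, the fields above map onto models along these lines. This is an illustrative reconstruction, not the actual definitions (which live in src/starfish/data_factory/storage/models.py); the Pydantic base class and the grouping of fields are assumptions:

```python
from pydantic import BaseModel

class Project(BaseModel):
    project_id: str  # unique identifier for the project
    name: str        # user-friendly name for the project

class GenerationMasterJob(BaseModel):
    master_job_id: str       # unique identifier for the master job
    status: str              # overall status of the job request
    request_config_ref: str  # reference to external request config JSON
    output_schema: dict      # JSON definition of the expected output
    storage_uri: str         # primary storage location config
```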
The `StructuredLLM` component provides type-safe interactions with LLMs, while the `data_factory` decorator enables parallel processing and resilient job execution. The storage layer ensures persistence of metadata and data artifacts, supporting various storage backends. This modular architecture enables flexible and scalable synthetic data generation.