Scale LLM data generation
The Data Factory is built from several components: the `@data_factory` decorator, task runners, state management, storage interfaces, and metadata models.
The diagram above illustrates the high-level architecture of the Data Factory. A user-defined function is decorated with `@data_factory`, which configures and manages the execution of the function. The decorated function is then executed by a Task Runner, which handles parallel processing, retries, and error handling. The Task Runner interacts with an LLM or other data generation process, and the resulting data artifacts are stored using a Storage Interface. State Management allows state to be shared across the pipeline, and Hooks allow the pipeline to be extended with custom logic. Metadata about the data generation process is also stored via the Storage Interface. Sources: src/starfish/data_factory/factory.py, src/starfish/data_factory/task_runner.py, src/starfish/data_factory/storage/models.py
The `@data_factory` decorator is central to the Data Factory. It transforms a regular function into a scalable data pipeline. The decorator accepts several parameters to configure the data generation process, including the following (see the usage sketch after this list):
- `max_concurrency`: The maximum number of concurrent tasks.
- `storage`: The storage type used to persist metadata and data artifacts.
- `initial_state_values`: Initial values for state management.
- `on_record_complete`: Hooks to execute when a record is completed.
- `on_record_error`: Hooks to execute when a record encounters an error.
- `show_progress`: A flag to display progress information.
- `task_runner_timeout`: Timeout for the task runner.

Sources: src/starfish/data_factory/factory.py
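The sketch below shows how these parameters might be passed when decorating a function. It is illustrative only: the import path, the values, the hook signature, and the `generate_qna` function are assumptions rather than code taken from the repository.

```python
from starfish import data_factory  # assumed import path

# Hypothetical hook signature; what hooks actually receive is defined in factory.py.
async def log_completion(record, state):
    print(f"record complete: {record}")

@data_factory(
    max_concurrency=10,                        # at most 10 tasks run at once
    storage="local",                           # persist metadata and artifacts locally
    initial_state_values={"records_seen": 0},  # seed values for shared state
    on_record_complete=[log_completion],       # hooks run after each record finishes
    on_record_error=[],                        # hooks run when a record fails
    show_progress=True,                        # display progress information
    task_runner_timeout=60,                    # per-task timeout (seconds)
)
async def generate_qna(city_name: str):
    # Call an LLM or any other generator here and return a list of records.
    return [{"city": city_name, "question": "...", "answer": "..."}]
```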
The Task Runner uses an `asyncio.Semaphore` to limit the number of concurrent tasks. It also manages the state of each task and persists metadata about the task execution. Sources: src/starfish/data_factory/task_runner.py
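The core pattern here is standard asyncio concurrency limiting. The standalone sketch below shows the general technique (it is not the task_runner.py implementation): a semaphore caps how many coroutines run at once, and a per-task timeout mirrors the role of `task_runner_timeout`.

```python
import asyncio

async def run_with_limit(task_fn, inputs, max_concurrency=10, timeout=60):
    # Semaphore allows at most `max_concurrency` tasks to run concurrently.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            # Enforce a per-task timeout, similar to task_runner_timeout.
            return await asyncio.wait_for(task_fn(item), timeout=timeout)

    # Exceptions are returned rather than raised so one failing task
    # does not cancel the remaining tasks.
    return await asyncio.gather(*(run_one(i) for i in inputs), return_exceptions=True)
```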
The `State` class provides methods for setting, getting, and updating state variables. The state is persisted using the storage interface, allowing for job resumption and error recovery. Sources: src/starfish/data_factory/utils/state.py
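As a rough mental model (not the starfish implementation), a shared state object of this kind can be pictured as a lock-protected dictionary with get, set, and update operations; the method names below are assumptions, and the real class in src/starfish/data_factory/utils/state.py will differ.

```python
import asyncio

class SharedState:
    """Simplified illustration of a lock-protected shared state store."""

    def __init__(self, initial_values=None):
        self._data = dict(initial_values or {})
        self._lock = asyncio.Lock()

    async def get(self, key, default=None):
        async with self._lock:
            return self._data.get(key, default)

    async def set(self, key, value):
        async with self._lock:
            self._data[key] = value

    async def update(self, key, fn):
        # Apply `fn` to the current value atomically, e.g. an increment.
        async with self._lock:
            self._data[key] = fn(self._data.get(key))
```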
The `LocalStorage` class implements the Storage Interface using SQLite for metadata and JSON files for data artifacts. It provides methods for saving and retrieving projects, jobs, and records. Sources: src/starfish/data_factory/storage/local/local_storage.py
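The split between SQLite metadata and JSON artifacts can be illustrated with a generic sketch; the class and method names below are hypothetical and do not mirror the `LocalStorage` API.

```python
import json
import sqlite3
from pathlib import Path

class SimpleLocalStorage:
    """Illustrative only: metadata rows go to SQLite, record payloads to JSON files."""

    def __init__(self, base_dir="./data_factory_storage"):
        self.base = Path(base_dir)
        self.base.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(self.base / "metadata.db")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records (record_id TEXT PRIMARY KEY, job_id TEXT, path TEXT)"
        )

    def save_record(self, record_id, job_id, data):
        # Write the record payload to its own JSON file.
        path = self.base / f"{record_id}.json"
        path.write_text(json.dumps(data))
        # Store only the metadata (ids and file location) in SQLite.
        self.db.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?, ?)", (record_id, job_id, str(path))
        )
        self.db.commit()

    def load_record(self, record_id):
        row = self.db.execute(
            "SELECT path FROM records WHERE record_id = ?", (record_id,)
        ).fetchone()
        return json.loads(Path(row[0]).read_text()) if row else None
```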
The `models.py` file defines the data models used to persist metadata about the data generation process. These models include the following (a sketch of the overall shape follows the list):
- `Project`: Represents a data generation project.
- `GenerationMasterJob`: Represents a master job for data generation.
- `GenerationJob`: Represents an execution job for data generation.
- `Record`: Represents a single record generated by the data generation process.

Sources: src/starfish/data_factory/storage/models.py
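To make the hierarchy concrete, the sketch below models the same four entities as plain dataclasses. The field names, types, and relationships are assumptions for illustration; the actual models in src/starfish/data_factory/storage/models.py will differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Project:
    project_id: str
    name: str

@dataclass
class GenerationMasterJob:
    master_job_id: str
    project_id: str                                   # each master job belongs to a project
    config: Dict[str, Any] = field(default_factory=dict)

@dataclass
class GenerationJob:
    job_id: str
    master_job_id: str                                # execution jobs roll up to a master job
    status: str = "pending"

@dataclass
class Record:
    record_id: str
    job_id: str                                       # each record is produced by one execution job
    data: Dict[str, Any] = field(default_factory=dict)
```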
The flow begins when the user calls a function decorated with the `@data_factory` decorator. Data Factory submits tasks to the Task Runner, which executes the data generation function, saves record data to the Storage, logs record metadata, and returns the results to the user. Sources: src/starfish/data_factory/factory.py, src/starfish/data_factory/task_runner.py, src/starfish/data_factory/storage/local/local_storage.py
The example in the README demonstrates how to use the `@data_factory` decorator to transform a function into a scalable data pipeline. The `parallel_qna_llm` function is decorated with `@data_factory`, which configures it to run with a maximum concurrency of 50. The `run` method is then called with a list of cities, which are processed in parallel. Sources: README.md
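A reconstruction of that example, based only on the description above rather than copied from README.md, might look like the following. The LLM call inside the function, the city list, and whether `run` takes the list positionally or as a keyword argument are assumptions.

```python
from starfish import data_factory  # assumed import path

@data_factory(max_concurrency=50)
async def parallel_qna_llm(city_name: str):
    # Placeholder for the real LLM call that generates Q&A pairs about a city.
    return [{"question": f"What is {city_name} known for?", "answer": "..."}]

# `run` fans the inputs out across up to 50 concurrent tasks.
cities = ["San Francisco", "New York", "Tokyo", "Paris", "London"]
data = parallel_qna_llm.run(city_name=cities)
```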