Storage Layer

The Storage Layer in Starfish is responsible for persisting metadata and data artifacts generated during synthetic data creation jobs. It provides a pluggable interface for different storage backends and a local implementation using SQLite for metadata and JSON files for data. The storage layer offers APIs for storing projects, jobs, and records, ensuring data integrity and facilitating efficient retrieval. The primary goal is to manage and organize the data produced by the Data Factory component. Sources: tests/data_factory/storage/README.md

Architecture

The storage layer architecture consists of a core LocalStorage class that interacts with metadata and data handlers. The metadata handler manages the persistence of project, job, and record metadata using SQLite. The data handler is responsible for storing and retrieving data artifacts, such as request configurations and record data, using JSON files. Sources: src/starfish/data_factory/storage/local/local_storage.py, tests/data_factory/storage/README.md

The LocalStorage orchestrates interactions between the metadata and data handlers, providing a unified interface for the rest of the system. Sources: src/starfish/data_factory/storage/local/local_storage.py
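
The sketch below illustrates this delegation pattern in miniature, with the two handlers injected as plain dependencies. The constructor signature and handler interfaces shown here are assumptions for illustration, not the actual Starfish API.

```python
from typing import Any, Dict


class LocalStorage:
    """Facade that routes metadata to one handler and data artifacts to another."""

    def __init__(self, metadata_handler, data_handler):
        # In Starfish these would be the SQLite-backed metadata handler and
        # the JSON-file data handler; here they are injected stand-ins.
        self.metadata_handler = metadata_handler
        self.data_handler = data_handler

    def save_project(self, project_data) -> None:
        # Project metadata is persisted through the metadata handler.
        self.metadata_handler.save_project(project_data)

    def save_record_data(self, record_uid: str, master_job_id: str,
                         job_id: str, data: Dict[str, Any]) -> str:
        # Record payloads go through the data handler; the returned
        # path acts as a reference to the stored JSON file.
        return self.data_handler.save_record_data(record_uid, master_job_id, job_id, data)
```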

Components

LocalStorage

LocalStorage is the main class that implements the storage layer functionality. It provides methods for setting up the storage, saving and retrieving projects, logging master and execution jobs, and managing record data. It uses MetadataHandler and DataHandler for specific tasks. Sources: src/starfish/data_factory/storage/local/local_storage.py

Key features of LocalStorage (a brief usage sketch follows the list):

  • Setup: Initializes the storage, creating necessary directories and database tables.
  • Project Management: Saves and retrieves project metadata.
  • Job Management: Logs the start and end of master and execution jobs, updates job statuses.
  • Record Management: Saves and retrieves record data, logs record metadata.
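
The hypothetical sketch below exercises these feature areas end to end. The Project attribute name and the synchronous call style are assumptions; the real methods may be coroutines that must be awaited.

```python
def demo_lifecycle(storage, project) -> None:
    """Exercise the feature areas above against an already-constructed LocalStorage."""
    storage.setup()                    # create directories and DB tables
    storage.save_project(project)      # persist project metadata
    # 'project_id' is an assumed attribute name on the Project model.
    fetched = storage.get_project(project.project_id)
    assert fetched is not None
    storage.close()                    # release the underlying resources
```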

Metadata Handler

The metadata handler is responsible for interacting with the SQLite database to store and retrieve metadata related to projects, jobs, and records. Sources: src/starfish/data_factory/storage/local/local_storage.py

Key functions of the metadata handler include the following (see the SQLite sketch after the list):

  • Saving and retrieving project information.
  • Logging the start and end of master and execution jobs.
  • Updating the status of master jobs.
  • Retrieving master and execution job details.
  • Managing record metadata, including saving, retrieving, and counting records.
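
A minimal sketch of the project-related portion of such a handler, using only the standard library; the table layout and column names are assumptions, not Starfish's actual schema.

```python
import sqlite3
from typing import Optional, Tuple


class SQLiteMetadataHandler:
    """Illustrative metadata handler backed by a single SQLite database."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS projects ("
            "project_id TEXT PRIMARY KEY, name TEXT, metadata_json TEXT)"
        )

    def save_project(self, project_id: str, name: str, metadata_json: str) -> None:
        # INSERT OR REPLACE keeps repeated saves of the same project idempotent.
        self.conn.execute(
            "INSERT OR REPLACE INTO projects VALUES (?, ?, ?)",
            (project_id, name, metadata_json),
        )
        self.conn.commit()

    def get_project(self, project_id: str) -> Optional[Tuple[str, str, str]]:
        cur = self.conn.execute(
            "SELECT project_id, name, metadata_json FROM projects WHERE project_id = ?",
            (project_id,),
        )
        return cur.fetchone()

    def close(self) -> None:
        self.conn.close()
```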

Data Handler

The data handler manages the storage and retrieval of data artifacts, such as request configurations and record data. It uses JSON files to persist these artifacts in a structured manner. Sources: src/starfish/data_factory/storage/local/local_storage.py

Key functions of the data handler include (illustrated in the sketch below):

  • Saving and retrieving request configurations.
  • Saving and retrieving record data.
  • Generating paths for request configurations.
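
A minimal sketch of such a handler, assuming a per-job directory layout under a base path; the actual file organization in Starfish may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict


class JSONDataHandler:
    """Illustrative data handler that persists artifacts as JSON files."""

    def __init__(self, base_dir: str):
        self.base_dir = Path(base_dir)

    def generate_request_config_path(self, master_job_id: str) -> str:
        # One config file per master job, e.g. <base>/configs/<id>.json
        return str(self.base_dir / "configs" / f"{master_job_id}.json")

    def save_record_data(self, record_uid: str, master_job_id: str,
                         job_id: str, data: Dict[str, Any]) -> str:
        # Group record files by master job and execution job.
        path = self.base_dir / "data" / master_job_id / job_id / f"{record_uid}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data))
        return str(path)  # returned as the output reference

    def get_record_data(self, output_ref: str) -> Dict[str, Any]:
        return json.loads(Path(output_ref).read_text())
```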

Data Flow

The data flow within the storage layer involves interactions between the LocalStorage, the metadata handler, and the data handler; the sketch after the numbered steps ties them together.

  1. Project Creation: The client initiates the creation of a project by calling the save_project method in LocalStorage. The LocalStorage then delegates the task to the metadata handler, which interacts with the SQLite database to persist the project metadata. Sources: src/starfish/data_factory/storage/local/local_storage.py:104-105, src/starfish/data_factory/storage/local/local_storage.py:107-108
  2. Job Logging: When a master or execution job starts, the LocalStorage uses the metadata handler to log the job’s start information in the SQLite database. Similarly, when a job ends, the metadata handler updates the job’s status and summary information. Sources: src/starfish/data_factory/storage/local/local_storage.py:113-114, src/starfish/data_factory/storage/local/local_storage.py:116-118
  3. Record Storage: To store record data, the LocalStorage calls the data handler’s save_record_data method. The data handler then writes the record data to a JSON file and returns a reference to the file. Sources: src/starfish/data_factory/storage/local/local_storage.py:98-100
  4. Data Retrieval: When data is requested, the LocalStorage uses the appropriate handler to retrieve the data from either the SQLite database (for metadata) or the JSON files (for data artifacts). Sources: src/starfish/data_factory/storage/local/local_storage.py:109-110, src/starfish/data_factory/storage/local/local_storage.py:101-102
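
The sketch below walks these four steps using the documented method names. The model attribute names, the "completed" status string, and the synchronous call style are assumptions for illustration.

```python
from datetime import datetime, timezone


def run_flow(storage, project, master_job, exec_job, record) -> None:
    """Walk the four data-flow steps against an already-constructed storage object."""
    storage.save_project(project)                 # 1. project metadata -> SQLite
    storage.log_master_job_start(master_job)      # 2. job start -> SQLite
    storage.log_execution_job_start(exec_job)

    # 3. record payload -> JSON file; the reference is kept with the metadata
    output_ref = storage.save_record_data(
        record.record_uid, master_job.master_job_id, exec_job.job_id, {"text": "example"}
    )
    storage.log_record_metadata(record)

    now = datetime.now(timezone.utc)
    storage.log_execution_job_end(exec_job.job_id, "completed", {"completed": 1}, now, now, None)
    storage.log_master_job_end(master_job.master_job_id, "completed", {"records": 1}, now, now)

    # 4. retrieval goes back through the same handlers
    assert storage.get_record_data(output_ref)["text"] == "example"
```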

LocalStorage API

The LocalStorage class provides the following key methods, grouped by functionality:

Storage Management

  • setup(): Initializes metadata DB schema and base file directories
  • close(): Closes underlying connections/resources

Configuration Persistence

  • save_request_config(config_ref: str, config_data: Dict[str, Any]) -> str: Saves request configuration
  • get_request_config(config_ref: str) -> Dict[str, Any]: Retrieves request configuration
  • generate_request_config_path(master_job_id: str) -> str: Generates path for request config

Data Artifact Persistence

  • save_record_data(record_uid: str, master_job_id: str, job_id: str, data: Dict[str, Any]) -> str: Saves record data
  • get_record_data(output_ref: str) -> Dict[str, Any]: Retrieves record data

Project Management

  • save_project(project_data: Project): Saves project metadata
  • get_project(project_id: str) -> Optional[Project]: Retrieves project metadata
  • list_projects(limit: Optional[int], offset: Optional[int]) -> List[Project]: Lists projects

Job Management

  • log_master_job_start(job_data: GenerationMasterJob): Logs master job start
  • log_master_job_end(master_job_id: str, final_status: str, summary: Optional[Dict[str, Any]], end_time: datetime, update_time: datetime): Logs master job end
  • update_master_job_status(master_job_id: str, status: str, update_time: datetime): Updates master job status
  • get_master_job(master_job_id: str) -> Optional[GenerationMasterJob]: Retrieves master job
  • list_master_jobs(project_id: Optional[str], status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationMasterJob]: Lists master jobs

Execution Job Management

  • log_execution_job_start(job_data: GenerationJob): Logs execution job start
  • log_execution_job_end(job_id: str, final_status: str, counts: Dict[str, int], end_time: datetime, update_time: datetime, error_message: Optional[str]): Logs execution job end
  • get_execution_job(job_id: str) -> Optional[GenerationJob]: Retrieves execution job
  • list_execution_jobs(master_job_id: str, status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationJob]: Lists execution jobs

Record Management

  • log_record_metadata(record_data: Record): Logs record metadata
  • get_record_metadata(record_uid: str) -> Optional[Record]: Retrieves record metadata
  • get_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]], limit: Optional[int], offset: Optional[int]) -> List[Record]: Retrieves records for a master job
  • count_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]]) -> Dict[str, int]: Counts records for master job
  • list_record_metadata(master_job_uuid: str, job_uuid: str) -> List[Record]: Lists record metadata
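
The record-management methods compose naturally into small reporting helpers. The function below is hypothetical; it assumes the documented signatures and pages through records with limit/offset.

```python
def summarize_master_job(storage, master_job_id: str) -> None:
    """Print record counts for a master job, then page through its records."""
    counts = storage.count_records_for_master_job(master_job_id, status_filter=None)
    print(f"record counts for {master_job_id}: {counts}")

    offset = 0
    while True:
        page = storage.get_records_for_master_job(
            master_job_id, status_filter=None, limit=100, offset=offset
        )
        if not page:
            break
        for record in page:
            print(record)
        offset += len(page)
```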