Storage Layer
The Storage Layer in Starfish is responsible for persisting metadata and data artifacts generated during synthetic data creation jobs. It provides a pluggable interface for different storage backends and a local implementation using SQLite for metadata and JSON files for data. The storage layer offers APIs for storing projects, jobs, and records, ensuring data integrity and facilitating efficient retrieval. The primary goal is to manage and organize the data produced by the Data Factory component. Sources: tests/data_factory/storage/README.md
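To make the pluggable design concrete, here is a minimal sketch of what such a backend interface could look like. The base-class name BaseStorage and the abstract-method layout are illustrative assumptions, not the actual Starfish source; only the method names echo the API listing later on this page.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseStorage(ABC):  # hypothetical name for the pluggable interface
    @abstractmethod
    def setup(self) -> None:
        """Initialize the backend (DB schema, directories, connections)."""

    @abstractmethod
    def save_record_data(self, record_uid: str, master_job_id: str,
                         job_id: str, data: Dict[str, Any]) -> str:
        """Persist one record's data and return a reference to it."""

    @abstractmethod
    def get_record_data(self, output_ref: str) -> Dict[str, Any]:
        """Load record data previously saved under output_ref."""

    @abstractmethod
    def close(self) -> None:
        """Release any underlying connections or file handles."""
```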
Architecture
The storage layer architecture consists of a core LocalStorage class that interacts with metadata and data handlers. The metadata handler manages the persistence of project, job, and record metadata using SQLite. The data handler is responsible for storing and retrieving data artifacts, such as request configurations and record data, using JSON files. Sources: src/starfish/data_factory/storage/local/local_storage.py, tests/data_factory/storage/README.md
The LocalStorage orchestrates interactions between the metadata and data handlers, providing a unified interface for the rest of the system. Sources: src/starfish/data_factory/storage/local/local_storage.py
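A minimal sketch of that orchestration, assuming handler classes named MetadataHandler and DataHandler (fuller sketches appear under Components below); all names other than LocalStorage are assumptions:

```python
class MetadataHandler:
    """SQLite-backed metadata persistence (sketched under Components)."""
    def __init__(self, db_path: str): self.db_path = db_path
    def create_tables(self) -> None: ...
    def close(self) -> None: ...


class DataHandler:
    """JSON-file data artifact persistence (sketched under Components)."""
    def __init__(self, data_dir: str): self.data_dir = data_dir
    def ensure_directories(self) -> None: ...


class LocalStorage:
    """Unified facade: metadata goes to SQLite, artifacts to JSON files."""

    def __init__(self, base_path: str):
        self.metadata_handler = MetadataHandler(db_path=f"{base_path}/metadata.db")
        self.data_handler = DataHandler(data_dir=f"{base_path}/data")

    def setup(self) -> None:
        # One call prepares both backends: DB schema plus artifact directories.
        self.metadata_handler.create_tables()
        self.data_handler.ensure_directories()

    def close(self) -> None:
        self.metadata_handler.close()
```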
Components
LocalStorage
LocalStorage is the main class that implements the storage layer functionality. It provides methods for setting up the storage, saving and retrieving projects, logging master and execution jobs, and managing record data. It uses MetadataHandler and DataHandler for specific tasks. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key features of LocalStorage:
- Setup: Initializes the storage, creating necessary directories and database tables.
- Project Management: Saves and retrieves project metadata.
- Job Management: Logs the start and end of master and execution jobs, updates job statuses.
- Record Management: Saves and retrieves record data, logs record metadata.
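As a hedged usage sketch of these features, a caller might walk one record through the storage like this. The model field names accessed below (e.g. record.record_uid) are assumptions; the method names come from the API listing later on this page.

```python
def persist_one_record(storage, project, master_job, exec_job, record) -> str:
    """Walk one record through the storage lifecycle.

    `storage` is a LocalStorage instance; the model arguments correspond to
    the Project / GenerationMasterJob / GenerationJob / Record types named
    in the API listing. Field access here is assumed, not verified.
    """
    storage.save_project(project)                 # persist project metadata
    storage.log_master_job_start(master_job)      # master job begins
    storage.log_execution_job_start(exec_job)     # execution job begins
    output_ref = storage.save_record_data(
        record_uid=record.record_uid,
        master_job_id=master_job.master_job_id,
        job_id=exec_job.job_id,
        data={"text": "generated sample"},        # illustrative payload
    )
    storage.log_record_metadata(record)           # record row references output_ref
    return output_ref
```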
Metadata Handler
The metadata handler is responsible for interacting with the SQLite database to store and retrieve metadata related to projects, jobs, and records. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key functions of the metadata handler include:
- Saving and retrieving project information.
- Logging the start and end of master and execution jobs.
- Updating the status of master jobs.
- Retrieving master and execution job details.
- Managing record metadata, including saving, retrieving, and counting records.
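A self-contained sketch of such a handler using Python's built-in sqlite3 module. The table schema and column names here are assumptions for illustration, not the actual Starfish schema, and the real handler defines more tables (execution jobs, records).

```python
import sqlite3


class MetadataHandler:
    """SQLite-backed persistence for project, job, and record metadata."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)

    def create_tables(self) -> None:
        # Hypothetical schema for the two core tables.
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS projects (
                   project_id TEXT PRIMARY KEY,
                   name       TEXT,
                   created_at TEXT)"""
        )
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS master_jobs (
                   master_job_id TEXT PRIMARY KEY,
                   project_id    TEXT REFERENCES projects(project_id),
                   status        TEXT,
                   start_time    TEXT,
                   update_time   TEXT)"""
        )
        self.conn.commit()

    def save_project(self, project_id: str, name: str, created_at: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO projects VALUES (?, ?, ?)",
            (project_id, name, created_at),
        )
        self.conn.commit()

    def update_master_job_status(self, master_job_id: str, status: str,
                                 update_time: str) -> None:
        # Status transitions (e.g. running -> completed) are plain UPDATEs.
        self.conn.execute(
            "UPDATE master_jobs SET status = ?, update_time = ? "
            "WHERE master_job_id = ?",
            (status, update_time, master_job_id),
        )
        self.conn.commit()

    def close(self) -> None:
        self.conn.close()
```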
Data Handler
The data handler manages the storage and retrieval of data artifacts, such as request configurations and record data. It uses JSON files to persist these artifacts in a structured manner. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key functions of the data handler include:
- Saving and retrieving request configurations.
- Saving and retrieving record data.
- Generating paths for request configurations.
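A self-contained sketch covering these functions. The on-disk layout (data/&lt;master_job_id&gt;/&lt;record_uid&gt;.json) is an assumption for illustration; the actual handler may organize files differently.

```python
import json
import os
from typing import Any, Dict


class DataHandler:
    """JSON-file persistence for request configs and record data."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    def generate_request_config_path(self, master_job_id: str) -> str:
        # One config file per master job, alongside its record data.
        return os.path.join(self.data_dir, master_job_id, "request_config.json")

    def save_record_data(self, record_uid: str, master_job_id: str,
                         job_id: str, data: Dict[str, Any]) -> str:
        path = os.path.join(self.data_dir, master_job_id, f"{record_uid}.json")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump({"job_id": job_id, "data": data}, f)
        return path  # the reference stored in record metadata

    def get_record_data(self, output_ref: str) -> Dict[str, Any]:
        with open(output_ref) as f:
            return json.load(f)
```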
Data Flow
The data flow within the storage layer involves interactions between the LocalStorage, metadata handler, and data handler:
- Project Creation: The client initiates project creation by calling the save_project method on LocalStorage, which delegates to the metadata handler to persist the project metadata in the SQLite database. Sources: src/starfish/data_factory/storage/local/local_storage.py:104-105, src/starfish/data_factory/storage/local/local_storage.py:107-108
- Job Logging: When a master or execution job starts, LocalStorage uses the metadata handler to log the job's start information in the SQLite database. Similarly, when a job ends, the metadata handler updates the job's status and summary information. Sources: src/starfish/data_factory/storage/local/local_storage.py:113-114, src/starfish/data_factory/storage/local/local_storage.py:116-118
- Record Storage: To store record data, LocalStorage calls the data handler's save_record_data method, which writes the record data to a JSON file and returns a reference to that file. Sources: src/starfish/data_factory/storage/local/local_storage.py:98-100
- Data Retrieval: When data is requested, LocalStorage uses the appropriate handler to retrieve it from either the SQLite database (for metadata) or the JSON files (for data artifacts); a sketch of this retrieval path follows the list. Sources: src/starfish/data_factory/storage/local/local_storage.py:109-110, src/starfish/data_factory/storage/local/local_storage.py:101-102
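For example, the two-step retrieval path could look like this sketch, where `storage` is a LocalStorage instance; Record.output_ref as the field holding the JSON file reference is an assumption.

```python
def load_record(storage, record_uid: str):
    record = storage.get_record_metadata(record_uid)   # SQLite lookup
    if record is None:
        return None                                    # unknown record
    return storage.get_record_data(record.output_ref)  # JSON file read
```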
API Endpoints
The LocalStorage class provides the following key API methods, grouped by functionality:
Storage Management
- setup(): Initializes the metadata DB schema and base file directories
- close(): Closes underlying connections/resources
Configuration Persistence
- save_request_config(config_ref: str, config_data: Dict[str, Any]) -> str: Saves a request configuration
- get_request_config(config_ref: str) -> Dict[str, Any]: Retrieves a request configuration
- generate_request_config_path(master_job_id: str) -> str: Generates the path for a request config
Data Artifact Persistence
- save_record_data(record_uid: str, master_job_id: str, job_id: str, data: Dict[str, Any]) -> str: Saves record data
- get_record_data(output_ref: str) -> Dict[str, Any]: Retrieves record data
Project Management
- save_project(project_data: Project): Saves project metadata
- get_project(project_id: str) -> Optional[Project]: Retrieves project metadata
- list_projects(limit: Optional[int], offset: Optional[int]) -> List[Project]: Lists projects
Job Management
- log_master_job_start(job_data: GenerationMasterJob): Logs master job start
- log_master_job_end(master_job_id: str, final_status: str, summary: Optional[Dict[str, Any]], end_time: datetime, update_time: datetime): Logs master job end
- update_master_job_status(master_job_id: str, status: str, update_time: datetime): Updates master job status
- get_master_job(master_job_id: str) -> Optional[GenerationMasterJob]: Retrieves a master job
- list_master_jobs(project_id: Optional[str], status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationMasterJob]: Lists master jobs
Execution Job Management
- log_execution_job_start(job_data: GenerationJob): Logs execution job start
- log_execution_job_end(job_id: str, final_status: str, counts: Dict[str, int], end_time: datetime, update_time: datetime, error_message: Optional[str]): Logs execution job end
- get_execution_job(job_id: str) -> Optional[GenerationJob]: Retrieves an execution job
- list_execution_jobs(master_job_id: str, status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationJob]: Lists execution jobs
Record Management
- log_record_metadata(record_data: Record): Logs record metadata
- get_record_metadata(record_uid: str) -> Optional[Record]: Retrieves record metadata
- get_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]], limit: Optional[int], offset: Optional[int]) -> List[Record]: Gets records for a master job
- count_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]]) -> Dict[str, int]: Counts records for a master job, by status
- list_record_metadata(master_job_uuid: str, job_uuid: str) -> List[Record]: Lists record metadata for a job
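Putting a few of these endpoints together, here is a hedged sketch of summarizing a master job; `storage` is a LocalStorage instance, and the idea that counts are keyed by status string is an assumption based on the Dict[str, int] return type above.

```python
def summarize_master_job(storage, master_job_id: str) -> None:
    master = storage.get_master_job(master_job_id)
    if master is None:
        print(f"no master job {master_job_id}")
        return
    jobs = storage.list_execution_jobs(master_job_id, status_filter=None,
                                       limit=None, offset=None)
    counts = storage.count_records_for_master_job(master_job_id,
                                                  status_filter=None)
    print(f"master job {master_job_id}: {len(jobs)} execution jobs")
    for status, n in counts.items():
        print(f"  {status}: {n} records")
```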