Storage Layer
The Storage Layer in Starfish is responsible for persisting metadata and data artifacts generated during synthetic data creation jobs. It provides a pluggable interface for different storage backends and a local implementation using SQLite for metadata and JSON files for data. The storage layer offers APIs for storing projects, jobs, and records, ensuring data integrity and facilitating efficient retrieval. The primary goal is to manage and organize the data produced by the Data Factory component. Sources: tests/data_factory/storage/README.md
Architecture
The storage layer architecture consists of a core `LocalStorage` class that interacts with metadata and data handlers. The metadata handler manages the persistence of project, job, and record metadata using SQLite. The data handler is responsible for storing and retrieving data artifacts, such as request configurations and record data, using JSON files. Sources: src/starfish/data_factory/storage/local/local_storage.py, tests/data_factory/storage/README.md
The `LocalStorage` class orchestrates interactions between the metadata and data handlers, providing a unified interface for the rest of the system. Sources: src/starfish/data_factory/storage/local/local_storage.py
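This composition can be sketched as follows. Note this is a minimal illustrative sketch, not the actual Starfish source: the handler internals and method signatures here are assumptions; only the names `LocalStorage`, `MetadataHandler`, and `DataHandler` come from the document.

```python
# Illustrative sketch: LocalStorage as a facade over two handlers.
# Handler internals are simplified (in-memory) for demonstration.
from typing import Any, Dict


class MetadataHandler:
    """Persists project/job/record metadata (SQLite in the real code)."""

    def __init__(self) -> None:
        self.projects: Dict[str, Dict[str, Any]] = {}

    def save_project(self, project_id: str, data: Dict[str, Any]) -> None:
        self.projects[project_id] = data


class DataHandler:
    """Persists data artifacts (JSON files in the real code)."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def save_record_data(self, record_uid: str, data: Dict[str, Any]) -> str:
        self.records[record_uid] = data
        # Return a reference the caller can later use to retrieve the artifact.
        return f"records/{record_uid}.json"


class LocalStorage:
    """Unified interface that delegates to the two handlers."""

    def __init__(self) -> None:
        self.metadata = MetadataHandler()
        self.data = DataHandler()

    def save_project(self, project_id: str, data: Dict[str, Any]) -> None:
        self.metadata.save_project(project_id, data)

    def save_record_data(self, record_uid: str, data: Dict[str, Any]) -> str:
        return self.data.save_record_data(record_uid, data)
```

The facade keeps callers decoupled from where metadata and artifacts actually live, which is what makes the backend pluggable.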
Components
LocalStorage
`LocalStorage` is the main class that implements the storage layer functionality. It provides methods for setting up the storage, saving and retrieving projects, logging master and execution jobs, and managing record data. It uses `MetadataHandler` and `DataHandler` for specific tasks. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key features of `LocalStorage`:
- Setup: Initializes the storage, creating necessary directories and database tables.
- Project Management: Saves and retrieves project metadata.
- Job Management: Logs the start and end of master and execution jobs, updates job statuses.
- Record Management: Saves and retrieves record data, logs record metadata.
Metadata Handler
The metadata handler is responsible for interacting with the SQLite database to store and retrieve metadata related to projects, jobs, and records. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key functions of the metadata handler include:
- Saving and retrieving project information.
- Logging the start and end of master and execution jobs.
- Updating the status of master jobs.
- Retrieving master and execution job details.
- Managing record metadata, including saving, retrieving, and counting records.
Data Handler
The data handler manages the storage and retrieval of data artifacts, such as request configurations and record data. It uses JSON files to persist these artifacts in a structured manner. Sources: src/starfish/data_factory/storage/local/local_storage.py
Key functions of the data handler include:
- Saving and retrieving request configurations.
- Saving and retrieving record data.
- Generating paths for request configurations.
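A minimal sketch of the JSON round trip, assuming a one-file-per-record layout (the actual directory structure and naming scheme in Starfish may differ):

```python
# Sketch of a JSON-file data handler: save returns a path reference,
# get loads the artifact back from that reference.
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_record_data(base_dir: Path, record_uid: str, data: Dict[str, Any]) -> str:
    """Write record data as a JSON file and return a reference to it."""
    path = base_dir / f"{record_uid}.json"
    path.write_text(json.dumps(data))
    return str(path)


def get_record_data(output_ref: str) -> Dict[str, Any]:
    """Load record data back from its path reference."""
    return json.loads(Path(output_ref).read_text())


base = Path(tempfile.mkdtemp())
ref = save_record_data(base, "rec-001", {"prompt": "hi", "completion": "hello"})
roundtrip = get_record_data(ref)
```

Returning a reference string rather than the data itself lets the metadata database stay small: it stores only the pointer, while the artifact lives on disk.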
Data Flow
The data flow within the storage layer involves interactions between the `LocalStorage`, the metadata handler, and the data handler.
- Project Creation: The client initiates the creation of a project by calling the `save_project` method in `LocalStorage`. `LocalStorage` then delegates the task to the metadata handler, which interacts with the SQLite database to persist the project metadata. Sources: src/starfish/data_factory/storage/local/local_storage.py:104-105, src/starfish/data_factory/storage/local/local_storage.py:107-108
- Job Logging: When a master or execution job starts, `LocalStorage` uses the metadata handler to log the job's start information in the SQLite database. Similarly, when a job ends, the metadata handler updates the job's status and summary information. Sources: src/starfish/data_factory/storage/local/local_storage.py:113-114, src/starfish/data_factory/storage/local/local_storage.py:116-118
- Record Storage: To store record data, `LocalStorage` calls the data handler's `save_record_data` method. The data handler then writes the record data to a JSON file and returns a reference to the file. Sources: src/starfish/data_factory/storage/local/local_storage.py:98-100
- Data Retrieval: When data is requested, `LocalStorage` uses the appropriate handler to retrieve the data from either the SQLite database (for metadata) or the JSON files (for data artifacts). Sources: src/starfish/data_factory/storage/local/local_storage.py:109-110, src/starfish/data_factory/storage/local/local_storage.py:101-102
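The four steps above can be traced end to end with an in-memory SQLite database for metadata and a temp JSON file for the record artifact. All table names, columns, and IDs here are illustrative assumptions:

```python
# End-to-end sketch of the storage data flow (names are illustrative).
import json
import sqlite3
import tempfile
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (project_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT)")
data_dir = Path(tempfile.mkdtemp())

# 1. Project creation: metadata handler persists project metadata in SQLite.
conn.execute("INSERT INTO projects VALUES (?, ?)", ("p1", "demo"))

# 2. Job logging: record the job start; the status is updated when it ends.
conn.execute("INSERT INTO jobs VALUES (?, ?)", ("j1", "running"))

# 3. Record storage: data handler writes a JSON file and keeps a reference.
output_ref = data_dir / "r1.json"
output_ref.write_text(json.dumps({"text": "synthetic sample"}))
conn.execute("UPDATE jobs SET status = ? WHERE job_id = ?", ("completed", "j1"))

# 4. Data retrieval: metadata from SQLite, artifacts from the JSON file.
status = conn.execute("SELECT status FROM jobs WHERE job_id = 'j1'").fetchone()[0]
record = json.loads(output_ref.read_text())
```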
API Endpoints
The `LocalStorage` class provides the following key API endpoints, grouped by functionality:
Storage Management
- `setup()`: Initializes metadata DB schema and base file directories
- `close()`: Closes underlying connections/resources
Configuration Persistence
- `save_request_config(config_ref: str, config_data: Dict[str, Any]) -> str`: Saves request configuration
- `get_request_config(config_ref: str) -> Dict[str, Any]`: Retrieves request configuration
- `generate_request_config_path(master_job_id: str) -> str`: Generates path for request config
Data Artifact Persistence
- `save_record_data(record_uid: str, master_job_id: str, job_id: str, data: Dict[str, Any]) -> str`: Saves record data
- `get_record_data(output_ref: str) -> Dict[str, Any]`: Retrieves record data
Project Management
- `save_project(project_data: Project)`: Saves project metadata
- `get_project(project_id: str) -> Optional[Project]`: Retrieves project metadata
- `list_projects(limit: Optional[int], offset: Optional[int]) -> List[Project]`: Lists projects
Job Management
- `log_master_job_start(job_data: GenerationMasterJob)`: Logs master job start
- `log_master_job_end(master_job_id: str, final_status: str, summary: Optional[Dict[str, Any]], end_time: datetime, update_time: datetime)`: Logs master job end
- `update_master_job_status(master_job_id: str, status: str, update_time: datetime)`: Updates master job status
- `get_master_job(master_job_id: str) -> Optional[GenerationMasterJob]`: Retrieves master job
- `list_master_jobs(project_id: Optional[str], status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationMasterJob]`: Lists master jobs
Execution Job Management
- `log_execution_job_start(job_data: GenerationJob)`: Logs execution job start
- `log_execution_job_end(job_id: str, final_status: str, counts: Dict[str, int], end_time: datetime, update_time: datetime, error_message: Optional[str])`: Logs execution job end
- `get_execution_job(job_id: str) -> Optional[GenerationJob]`: Retrieves execution job
- `list_execution_jobs(master_job_id: str, status_filter: Optional[List[str]], limit: Optional[int], offset: Optional[int]) -> List[GenerationJob]`: Lists execution jobs
Record Management
- `log_record_metadata(record_data: Record)`: Logs record metadata
- `get_record_metadata(record_uid: str) -> Optional[Record]`: Retrieves record metadata
- `get_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]], limit: Optional[int], offset: Optional[int]) -> List[Record]`: Gets records for master job
- `count_records_for_master_job(master_job_id: str, status_filter: Optional[List[StatusRecord]]) -> Dict[str, int]`: Counts records for master job
- `list_record_metadata(master_job_uuid: str, job_uuid: str) -> List[Record]`: Lists record metadata