Configuration

Configuring Starfish has two parts. Environment variables set API keys, model configurations, and other runtime parameters. The storage layer, which persists metadata and data artifacts, has its own settings and on-disk layout. This page covers both.

Environment Variables

Starfish is configured through environment variables. A .env.template file is provided as a starting point; it includes settings for API keys, model configurations, and other runtime parameters. Copy the template to .env and edit it with your own values. Sources: README.md

Setting Up Environment Variables

To configure the Starfish project, follow these steps:

  1. Copy the .env.template file to .env:

    cp .env.template .env

  2. Edit the .env file to set the necessary API keys and configurations:

    nano .env  # or use your preferred editor

Sources: README.md
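To make the shape of a .env file concrete, here is a minimal sketch of reading simple KEY=VALUE lines using only the Python standard library. It is not how Starfish itself loads the file (the source does not say), and EXAMPLE_API_KEY is a hypothetical placeholder; the real variable names are the ones listed in .env.template.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Illustrative only: handles plain and quoted values, not the full
    syntax that dedicated dotenv loaders support.
    """
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop one layer of surrounding quotes, if present.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# EXAMPLE_API_KEY is a made-up name used only for this sketch.
sample = '# comment\nEXAMPLE_API_KEY="abc"\n\nTELEMETRY_ENABLED=false\n'
settings = parse_env_text(sample)
```

In practice the text would come from the file itself, for example `parse_env_text(open(".env", encoding="utf-8").read())`.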

Telemetry Configuration

Starfish collects minimal and anonymous telemetry data to help improve the library. Participation is optional, and users can opt out by setting TELEMETRY_ENABLED=false in their environment variables. Sources: README.md
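As an illustration of the opt-out switch, the sketch below shows one way a program could interpret TELEMETRY_ENABLED. The default of "enabled" follows from the opt-out wording above; accepting spellings other than `false` (such as `0` or `off`) is an assumption of this sketch, not documented Starfish behavior, and `telemetry_enabled` is a hypothetical helper name.

```python
import os

# Spellings treated as "off" here; only "false" is confirmed by the README.
_FALSE_VALUES = {"false", "0", "no", "off"}


def telemetry_enabled(environ=None):
    """Return False only when TELEMETRY_ENABLED is set to a false-like value."""
    env = os.environ if environ is None else environ
    raw = env.get("TELEMETRY_ENABLED", "true")
    return raw.strip().lower() not in _FALSE_VALUES
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test, for example `telemetry_enabled({"TELEMETRY_ENABLED": "false"})`.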

Storage Layer Configuration

The storage layer is responsible for persisting metadata and data artifacts for synthetic data generation jobs. It provides a pluggable interface for different storage backends and a hybrid local implementation using SQLite for metadata and JSON files for data. Sources: tests/data_factory/storage/README.md

Local Storage Configuration

The local storage implementation uses SQLite for metadata and JSON files for data artifacts. The tests use separate test databases (by default in /tmp/starfish_test_* directories) to avoid interfering with production data. Sources: tests/data_factory/storage/README.md

Setting Up Local Storage

The LocalStorage class in src/starfish/data_factory/storage/local/local_storage.py handles the local storage implementation. The setup method creates the necessary directories and database. Sources: tests/data_factory/storage/local/test_local_storage.py

@pytest.mark.asyncio
async def test_storage_setup():
    """Test that storage setup creates necessary directories and database."""
    # Ensure clean directory
    if os.path.exists(TEST_DB_DIR):
        shutil.rmtree(TEST_DB_DIR)

    storage = LocalStorage(TEST_DB_URI)
    await storage.setup()

    # Check that dirs exist
    assert os.path.exists(os.path.join(TEST_DB_DIR, "configs"))
    assert os.path.exists(os.path.join(TEST_DB_DIR, "data"))
    assert os.path.exists(os.path.join(TEST_DB_DIR, "metadata.db"))

    await storage.close()
    shutil.rmtree(TEST_DB_DIR)

Sources: tests/data_factory/storage/local/test_local_storage.py:17-35

Configuration Paths

The local storage implementation uses the following directory structure:

  • Configs: {storage_uri}/configs/{master_job_id}.request.json
  • Record Data: {storage_uri}/data/{record_uid[:2]}/{record_uid[2:4]}/{record_uid}.json

Sources: src/starfish/data_factory/storage/models.py
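The two path templates above can be sketched as small helper functions. The function names are hypothetical, and `storage_root` is assumed to be a plain directory path (that is, the storage URI with any `file://` scheme already removed); the sharding by the first four characters of the record UID follows the template directly.

```python
from pathlib import PurePosixPath


def config_path(storage_root, master_job_id):
    """Build {storage_root}/configs/{master_job_id}.request.json."""
    return str(PurePosixPath(storage_root) / "configs" / f"{master_job_id}.request.json")


def record_data_path(storage_root, record_uid):
    """Build {storage_root}/data/{uid[:2]}/{uid[2:4]}/{uid}.json.

    The two-level prefix spreads records across subdirectories so that
    no single directory holds every record file.
    """
    return str(
        PurePosixPath(storage_root)
        / "data"
        / record_uid[:2]
        / record_uid[2:4]
        / f"{record_uid}.json"
    )


example = record_data_path("/tmp/starfish", "abcd1234")
# example == "/tmp/starfish/data/ab/cd/abcd1234.json"
```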

Data Handler

The FileSystemDataHandler class in src/starfish/data_factory/storage/local/data_handler.py manages interactions with data and config files on the local filesystem. It ensures that all top-level data directories exist. Sources: src/starfish/data_factory/storage/local/data_handler.py

class FileSystemDataHandler:
    """Manages interactions with data/config files on the local filesystem."""

    def __init__(self, data_base_path: str):
        """Args:
        data_base_path: The root directory where configs/, data/, etc. will live.
        """
        self.data_base_path = data_base_path
        self.config_path = os.path.join(self.data_base_path, CONFIGS_DIR)
        self.record_data_path = os.path.join(self.data_base_path, DATA_DIR)
        self.assoc_path = os.path.join(self.data_base_path, ASSOCIATIONS_DIR)
        # TODO: Consider locks if implementing JSONL appends for associations

    async def ensure_base_dirs(self):
        """Ensure all top-level data directories exist."""
        logger.debug("Ensuring base data directories exist...")
        await self._ensure_dir(self.config_path + os.sep)  # Trailing sep ensures dir
        await self._ensure_dir(self.record_data_path + os.sep)
        # TODO: Add associations directory back later
        # await self._ensure_dir(self.assoc_path + os.sep)

    async def _ensure_dir(self, path: str):
        """Asynchronously ensures a directory exists."""
        dir_path = os.path.dirname(path)
        if not await aio_os.path.isdir(dir_path):
            try:
                await aio_os.makedirs(dir_path, exist_ok=True)
                logger.debug(f"Created directory: {dir_path}")
            except Exception as e:
                if not await aio_os.path.isdir(dir_path):
                    # Race condition
                    logger.warning(f"Directory creation failed for {dir_path} due to a race condition.")
                    raise

Sources: src/starfish/data_factory/storage/local/data_handler.py:14-48
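The "Trailing sep ensures dir" comment in ensure_base_dirs is easier to follow with a small demonstration. _ensure_dir calls os.path.dirname on its argument, so without a trailing separator the last path component would be treated as a file name and dropped. The sketch below uses posixpath so the result is the same on every platform; the example paths are made up.

```python
import posixpath

base = "/srv/starfish/configs"  # hypothetical directory path

# Without a trailing separator, dirname returns the parent directory.
without_sep = posixpath.dirname(base)

# With a trailing separator, dirname returns the directory itself,
# which is what _ensure_dir then creates.
with_sep = posixpath.dirname(base + "/")

# without_sep == "/srv/starfish"
# with_sep == "/srv/starfish/configs"
```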

Data Handler Directories

Directory         Description
CONFIGS_DIR       Directory where request configuration files are stored.
DATA_DIR          Directory where record data files are stored.
ASSOCIATIONS_DIR  Directory where associations files are stored (currently not in use).

Sources: src/starfish/data_factory/storage/local/data_handler.py

Configuration Workflow

The configuration workflow for the storage layer creates a project, a master job, and an execution job, saves the necessary data, and then completes the jobs.

Sources: tests/data_factory/storage/local/test_local_storage.py, tests/data_factory/storage/test_storage_main.py

Test Configuration

The tests use a specific configuration for the storage layer. The TEST_DB_DIR and TEST_DB_URI variables define the location of the test database. The TEST_MODE variable determines whether to run a basic or full test. Sources: tests/data_factory/storage/test_storage_main.py

# Test database location - can be overridden with env var
TEST_DB_DIR = os.environ.get("STARFISH_TEST_DB_DIR", "/tmp/starfish_test_db")
TEST_DB_URI = f"file://{TEST_DB_DIR}"

# Test mode - 'basic' (quick test) or 'full' (comprehensive)
TEST_MODE = os.environ.get("STARFISH_TEST_MODE", "basic")

Sources: tests/data_factory/storage/test_storage_main.py:18-23
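The override behavior can be sketched as a function that resolves the same settings from any mapping of environment variables. The defaults and variable names come from the excerpt above; the `resolve_test_config` helper and its rejection of unknown modes are assumptions added for illustration and are not part of the test suite.

```python
import os


def resolve_test_config(environ=None):
    """Resolve test storage settings, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    db_dir = env.get("STARFISH_TEST_DB_DIR", "/tmp/starfish_test_db")
    mode = env.get("STARFISH_TEST_MODE", "basic")
    # Validation is an addition of this sketch; the original code does not check.
    if mode not in ("basic", "full"):
        raise ValueError(f"unsupported STARFISH_TEST_MODE: {mode!r}")
    return {"db_dir": db_dir, "db_uri": f"file://{db_dir}", "mode": mode}


defaults = resolve_test_config({})
overridden = resolve_test_config(
    {"STARFISH_TEST_DB_DIR": "/tmp/my_db", "STARFISH_TEST_MODE": "full"}
)
```

Setting the two variables in the shell before running the tests has the same effect as the `overridden` call here.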

Conclusion

Configuring Starfish comes down to two things: environment variables, which set API keys, model configurations, and other runtime parameters, and the storage layer, which persists metadata and data artifacts for synthetic data generation jobs. The local storage implementation keeps metadata in SQLite and data artifacts in JSON files. The storage workflow creates a project, a master job, and an execution job, saves the necessary data, and then completes the jobs.