These are a collection of thoughts and issues we've run into as part of integrating zenml into our environment. Overall, I'd like to stress that we're excited that something like zenml exists and about how the project is taking shape. The team seems thoughtful and is thinking about the right issues. The notes that follow will focus mainly on the pain points, but keep in mind that we're really happy about the work being put into creating zenml.

I'm going to focus primarily on our experience trying to run one trivial pipeline that captures the core elements of many of our other pipelines.^[The pipeline and step development experience probably deserves its own set of notes.] The pipeline we're using to drive validation of the stack setup is a simple `snowflake query -> computation -> slack alert`.

## The Setup

### Our stack

- **Python** 3.9 (***all*** our data science projects are managed with poetry but use a private package registry)
- **Zenml** 0.10.0
- **Kubeflow** 1.5 (multi-tenant) on AWS EKS 1.22
- **Cloud** AWS
- **Environments** local, ci, dev, stage, prod

### User experiences

Overall, we're trying to allow the following experiences:

#### Official pipelines

Run pipelines from code that's checked into main.

- These might use a schedule or not (e.g. triggered by an event)
- They might run in env=ci (e.g. nightly continuous testing) or in env=prod (e.g. data quality monitoring, data transformations, feast incremental materialize, or other rollups)
- They might run in dev / stage as part of a rollout process

#### Cloud development

Because setting up a local environment can be hard in some cases, we run Kubeflow Notebooks with a docker image that's set up with all the dependencies we need and has the correct session tokens to *internally* reach the KF components (via the internal k8s istio SVCs).

```ad-note
When we're talking about running a pipeline from a "cloud notebook", we mean a jupyter terminal on a Kubeflow notebook instance that's running in the correct KF namespace and is authorized to communicate with the various KF components. We don't mean we're trying to run a pipeline from a jupyter notebook code cell^[but we would like to be able to do that and we'll touch on it below].
```

#### Local dev / cloud execution

We want to be able to develop and run pipelines locally but otherwise orchestrate in KFP and use all of the other components in the cloud stack.

#### Local orchestration

We want to be able to use the local orchestrator in a bunch of cases (e.g. quicker iteration when the dataset is small, or we're already running on an image that has the appropriate HW or a copy of the data we need). In this case we still want to use the remainder of the cloud stack, e.g. MLflow for tracking.

#### Unit testing

We want to be able to run a fully local stack that also mocks out certain components to easily create a suite of unit tests, e.g. fake AWS using `localstack` and a fake slack alerter.

## The pain points

I've tried to categorize the pain points we had to overcome into the following categories:

- Code Stability
- KFP
- Stack validation (or lack thereof)
- Lack of specificity about what should run where
- Secrets management
- Component configuration
- Dependencies and image building

### Code Stability

A good example of this was [#736](https://github.com/zenml-io/zenml/issues/736). This sort of thing could easily be solved by a unit test, or by structuring the code so that it doesn't use a hard-coded string but instead uses the module and class directly to generate the fully qualified name.
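Roughly the kind of thing we have in mind (a minimal sketch, not tied to zenml's internals; `SomeStepOperator` below is just a stand-in name):

```python
# A minimal sketch of the idea: derive the fully qualified name from the
# class object itself rather than hard-coding it as a string, so a rename
# or refactor fails loudly in a unit test instead of drifting silently.
def fully_qualified_name(cls: type) -> str:
    return f"{cls.__module__}.{cls.__qualname__}"


class SomeStepOperator:  # stand-in for the real class
    pass


assert fully_qualified_name(SomeStepOperator).endswith("SomeStepOperator")
```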
This sort of issue is pretty frustrating: it took time to dig into, and it makes the team lose trust in the case I'm trying to make for adopting zenml because the project gets perceived as buggy and incomplete.

### KFP

Ahh Kubeflow! Some of the issues here are related to "what should run where" and "component configuration". Let's focus on the "Cloud development" experience above. When we're creating and triggering a pipeline from a terminal on a notebook instance that's running in the cluster, all that really needs to happen is `kfp.Client()`. The KFP client is smart enough to inspect the environment to figure out the kfp server endpoint (by using the ***internal*** SVC) and the Service Account token (a minimal sketch of this is at the end of this section).

Related issue https://github.com/zenml-io/zenml/issues/720

**Workarounds**

- We had to create a bogus kube config file and context to satisfy the `KubeflowOrchestrator.validator`. In particular, we also had to install `kubectl` into the notebook image just to make `_validate_local_requirements` pass. When we're running within the cluster, there's no need to run `kubectl` or have an active context. We also had to set the orchestrator component's `kubernetes_context: bogus_context`.
- We had to set `skip_cluster_provisioning=True` and `skip_ui_daemon_provisioning=True` to prevent an invocation of `kubectl` port forwarding.
- We had to set `kubeflow_hostname: ""` rather than something like `http://ml-pipeline.kubeflow.svc.cluster.local:8888`. If you look at `kfp._client._load_config`, you'll notice that if the host is explicitly set, it does an early return at lines `274-276`. This means it won't use the in-cluster Service Account token credentials (lines `286-289`), which call `client._get_config_with_default_credentials`.
- We had to give the KFP step runner pods auth access back to the ml-pipelines API server. As far as we can tell, the only reason this is needed is that, when invoking the step entry point, `base_orchestrator.run_step` wants a `run_name`. The KF orchestrator gets this via `get_run_name`, which creates a `kfp.Client()` that fetches the current run name using the already accessible `KFP_RUN_ID`.
    - It's unclear why `run_step` needs a name rather than just using the run id.
    - If it really does need a name... we're already in the entry point for the step, so everything we need should be baked into the environment of the step; needing to access and authenticate from a KFP step pod back to the ml-pipelines API isn't ideal.
- Custom base images were completely unaware of the stack's container registry domain. We had to fully specify the ECR domain and region to allow creating a pipeline with a custom base image.

Because so many things ended up depending on where we're creating the pipeline, and because of the workarounds made above, we can't easily use the same profile components for the **Local dev / cloud execution** workflow, which complicates how we streamline this experience for Data Scientists. Even though we're trying to target the same KFP environment, we either need different zenml profiles or different kubeflow components for the same KFP.

At the very least, I think some of the workarounds and internal implementation details of stack up and stack validate would be simplified if local k3d were pulled out into a different orchestrator flavor. Local KF == port-forward, while local k3d means spinning one up on the local node. This allows a much cleaner story for validation of remote and in_cluster KF.
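To make the in-cluster behaviour concrete, here's a minimal sketch based on our reading of the KFP SDK (not zenml code) of the difference between letting `kfp.Client()` self-configure and pinning the host explicitly:

```python
import kfp

# In-cluster (e.g. from a Kubeflow Notebook terminal): no host argument.
# The client inspects the environment and uses the internal ml-pipeline
# SVC plus the mounted Service Account token.
client = kfp.Client()

# Explicitly passing a host short-circuits that detection (the early return
# in kfp._client._load_config), which is why we had to leave
# kubeflow_hostname empty rather than point it at the internal SVC.
# client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")
```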
### KF metadata store

We had to hack this to make it work. When we're running `in_cluster` as part of our cloud development experience, we can't `stack up` to port-forward the metadata store (but there's already an env var set for the mlmd grpc service endpoint).

Related issue https://github.com/zenml-io/zenml/issues/756

**Workaround** Before running the pipeline, we do the following:

```python
import os

from zenml.repository import Repository

repo = Repository()
if repo.active_stack.metadata_store.FLAVOR == "kubeflow":
    # Write our own pid into the component's pid file so zenml believes the
    # metadata store port-forward daemon is already running.
    # noinspection PyUnresolvedReferences
    with open(repo.active_stack.metadata_store._pid_file_path, "w") as f:
        f.write(str(os.getpid()))
```

This tricks the metadata stack component into thinking it's already running. I set the pid in the pidfile to the current python process that's creating the pipeline, so if something happens to try to shut down the metadata component, in the worst case it will kill the current process.

I understand the reasoning in [#756](https://github.com/zenml-io/zenml/issues/756) but I think this needs more nuance and is related to the **what should run where** topic.

1. I don't need the metadata store proxied to create and run a pipeline.
2. Specifically, in the production pipeline use case mentioned above, we likely have no need for any pipeline post-processing steps; we just need to trigger a pipeline when an event occurs.
3. Maybe there need to be richer phase distinctions under the covers for `stack up`, and a way to express whether my current workflow will have any post-processing steps. E.g. a stack component could be tagged with all the lifecycle phases it participates in (possibly even marked as required vs optional). This would allow creating and running a pipeline without a proxied metadata store. Then, if I try to create a repository and fetch historical run information, *that* could raise an error saying the required stack components aren't available and should be started.
4. Similar to the issues we had to work around to get `kfp.Client` working `in_cluster`, there should be similar fixes for the in-cluster KF metadata store, where the `in_cluster` notebook instance already knows the `METADATA_STORE_HOST`, e.g. https://blog.kubeflow.org/jupyter/2020/10/01/lineage.html

### Stack validation (or lack thereof)

In some ways, the current stack validators try to do too much (see the metadata store above). In other ways, the validators don't do enough. Because pipelines can take a long time to run, I wish there were more robust stack validation as part of pipeline pre-flight. For example, assume I'm running a long data fetch and computation and finally sending out a slack alert. I only find out that the `slack_token` is wrong at the moment I try to send the slack alert.

1. When creating the pipeline, there needs to be some richer pre-flight validation.
2. Because pipeline creation happens in a different environment, it might make sense to create some sort of pre-flight validation step, e.g. validating the slack alerter would check that it can authenticate against the slack API (without sending a message) from a pod running a step (a rough sketch of such a check follows below).

Maybe the right way to think about this is more like a stack integration test suite. We could write a bunch of pipelines and steps that check the core functionality of the stack as configured - a stack validation pipeline that's dynamic based on the registered stack components. It would be great if zenml provided this so we didn't have to build it. We'd also use this when validating new versions of zenml, and I'd imagine it would be a useful integration test for zenml development itself.
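For illustration, here's the kind of alerter pre-flight check we mean, sketched with the `slack_sdk` package (this isn't zenml code and the function name is ours):

```python
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError


def preflight_slack_check(slack_token: str) -> None:
    """Fail fast if the token can't authenticate, without posting a message."""
    try:
        # auth.test only verifies the credentials; it doesn't send anything.
        WebClient(token=slack_token).auth_test()
    except SlackApiError as err:
        raise RuntimeError(f"Slack alerter pre-flight check failed: {err}") from err
```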
### Lack of specificity about what should run where

We've touched on this at a couple of points above:

- the kubeflow orchestrator might either try to locally run k3d or port-forward
- a kfp step might try to create a `kfp.Client`
- it's not clear what `stack up` will do (e.g. does it try to start remote resources or just verify that we can connect to them)

I think there needs to be a clearer definition of what's supposed to run where, and those constructs should be enforced in the implementation.

### Secrets management

This is specifically about using AWS Secrets Manager. Because secrets in a single AWS project share a global namespace, it's common practice to name them with a hierarchical prefix. Because we're setting up multiple kubeflow profiles in the same kubeflow cluster (and hence the same AWS project), we need to manage multiple zenml secrets. We're currently manually enforcing the following naming convention for secrets in AWS Secrets Manager:

```
zenml/<ZENML_PROFILE>/<COMPONENT_NAME>/<SECRET_NAME>
```

This needs more thought, but where I think I'm leaning is that an AWS zenml secrets manager should probably take an extra configuration parameter that scopes all its secrets to `zenml/<ZENML_PROFILE>/...`. That way, if a user just runs `zenml secret add ...` it won't pollute the global namespace and potentially clash. This also allows much better auditing and control of secrets rotation because we can track which secrets are in use by each zenml profile.

I think this needs to be given more thought in regards to [#752](https://github.com/zenml-io/zenml/issues/752) to make sure that works smoothly. Specifically, fetching a secret this way would apply the prefix, e.g. if the slack component were configured to look up the `slack` secret, that would resolve to `zenml/<ZENML_PROFILE>/slack`. It's probably a good idea to default to name-spacing secrets by component, e.g. `<COMPONENT_NAME>/slack` -> `zenml/<ZENML_PROFILE>/<COMPONENT_NAME>/<SECRET_NAME>`. I'd just see this as a best practice for picking default secret names (e.g. when prompting a user to enter the `slack_token` when they're setting up a slack alerter component) and not a hard requirement.
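As a sketch of what we mean by scoping (assuming `boto3`; the helper and its arguments are hypothetical, not an existing zenml API):

```python
import boto3


def read_scoped_secret(profile: str, component: str, secret: str) -> str:
    """Resolve a short secret name against the profile/component prefix and fetch it."""
    # e.g. ("prod", "slack", "slack_token") -> "zenml/prod/slack/slack_token"
    name = f"zenml/{profile}/{component}/{secret}"
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=name)["SecretString"]
```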
### Component configuration

As mentioned above, component configuration seems to mix up a few concerns. Ideally, for a single profile, the same stack component (e.g. the kubeflow orchestrator) would have a single definition that could be used everywhere (triggering nightly continuous testing, launching experiments from a cloud notebook, or launching experiments from a local laptop). If we have to manage variants of stack components for what's effectively the same cloud service, it becomes very confusing to document and explain to Data Scientists how to run stuff.

### Dependencies and image building

OK, this one's probably the most complex and contentious topic. Before diving into things, I fully appreciate what kind of a rat's nest python dependency management can be. Part of making sure zenml is successful is that, as adoption grows, you don't have an endless deluge of users submitting issues that are just related to python dependency problems when building the docker images.^[For what it's worth, we've had a lot of success with our Data Scientists documenting a simple set of best practices for poetry projects and using poetry-kernel.]

The elephant in the room: the current way zenml resolves dependencies and builds docker images simply doesn't work in our workflow. We had to apply several workarounds to get our first KFP pipeline running end to end.

1. All our projects use poetry.
2. We are in a repo that has several projects and libraries (similar to what's described in https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa). In many cases we'd like to have a poetry path dev dependency.
3. The split between `zenml integration install` and poetry really doesn't play well.
    1. To properly resolve dependencies (e.g. if a pipeline lists required integrations), the dependencies need to be available in the currently active python venv. This means, at the very least, we'd have to install the zenml integration dependencies *inside* the poetry-managed venv, e.g. by running `poetry run zenml integration install`.
    2. A common step in our workflow is to run `poetry install --remove-untracked` to make sure our venv matches the poetry.lock file and that our builds are reproducible.

Our projects look like this:

```console
/libs
  /my-lib
    .venv
    pyproject.toml
    /src
    /tests
    ...
/projects
  /my-project
    .venv
    .zen
    pyproject.toml  # might have a source dev dependency on ../../libs/**
    /notebooks
    /src/my_project/pipelines/my_pipeline.py
    /src/my_project/steps/my_step.py
    /tests/...
    ...
```

I think we can probably break this up into a few topics:

- Docker image building
- Zenml integration dependencies and poetry

#### Docker image building

We needed several workarounds to get the docker images to build.

- We couldn't use `poetry export` to generate a `requirements.txt`:
    - because we use a private package registry, the first line of the generated `requirements.txt` is the index URL. The Dockerfile generator blindly assumes that each line in a requirements.txt is a requirement spec, which isn't true. Even if this did work, that line would also bake the HTTP basic auth for our private package registry into the docker image.
    - more importantly, because the generated Dockerfile tries to `RUN pip install <ALL_DEPS>` and `poetry export` includes all transitive dependencies (and SHA hashes), the line is longer than the 4kb docker `RUN` command limit.
    - **workaround** we hand-generated a requirements.txt
- We use a private package registry.
    - **workaround** we build a custom base docker image that has a custom pip config to override the default `PIP_INDEX_URL`. We pass in the secret via the k8s kubeflow kustomize configs.
- We've already run `poetry install` and have `./.venv` available in the docker build context. It would be nice to just re-use this to speed up the pipeline build-and-run turnaround time.
    - This still doesn't solve for `.pth` dev dependencies, since those live outside the docker build context, but it's incredibly useful for MLOps engineers to be able to make cross-cutting changes to libraries *and* pipelines so they can test changes, then publish a new library version to our package registry so Data Scientists can depend on a published version.
- The KFP step entry point doesn't understand the python project `src` layout. We use the `src` layout for all our projects, so our pipelines live at `./src/my_project/pipelines/my_pipeline`. When building the entry point, the `zenml.entrypoints.step_entrypoint` `--step_source` doesn't get set correctly. It should be set to `my_project.pipelines.my_pipeline` but instead gets set to `src.my_project.pipelines.my_pipeline`.
    - **workaround** we currently do the following:

```
cd src
ln -s ../.zen  # creates a symlink ./src/.zen -> ./.zen/
poetry run python -m my_project.pipelines.my_pipeline  # has a __main__()
```

I'm not sure yet what the best way to handle this is. Possibly some "advanced mode" where we're able to hook into the docker image build process, with the understanding that we're overriding the default build process and we're on our own to ensure the resulting environment works. Alternatively, maybe there's a cleaner way to support poetry as a first-class citizen? This might be similar to how poethepoet or pipx try to detect if they're already running inside an active virtual environment and then get the effective requirements via `pip freeze`. Maybe it's as simple as specifying a `venv_path` instead of a `requirements.txt` (this has tradeoffs for reproducibility from source; it would only be reproducible with the same docker image). If the Dockerfile generator didn't try to parse the supplied `requirements.txt` and instead simply ran `RUN python -m pip install -r requirements.txt`, then we could make sure to always run `poetry export` prior to a pipeline run and we wouldn't hit the max docker command length limit.
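To illustrate that last suggestion, here's a minimal sketch (ours, not zenml's actual generator) of a Dockerfile template that copies the exported requirements file into the image and installs from it, instead of expanding every pinned dependency into a single `RUN` instruction:

```python
# Hypothetical sketch of a simpler Dockerfile generator: install from the
# requirements file rather than inlining every dependency into one RUN line,
# which sidesteps the docker command-length limit.
DOCKERFILE_TEMPLATE = """\
FROM {base_image}
COPY requirements.txt /tmp/requirements.txt
RUN python -m pip install --no-cache-dir -r /tmp/requirements.txt
"""


def render_dockerfile(base_image: str) -> str:
    return DOCKERFILE_TEMPLATE.format(base_image=base_image)
```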
#### Zenml integration dependencies and poetry

I don't think that `zenml integration install` works very well. Partly, I think this is because it ends up installing dependencies into a different environment than the project's (e.g. if you installed zenml via `pipx install zenml`, the integration dependencies will likely end up in that pipx-managed venv). Partly, I think this is because required integrations seem to mix multiple concerns. For example, a pipeline should be stack agnostic: I want the ability to say that a pipeline requires an alerter, a feature store, or maybe a specific set of secrets. Instead, the mechanism we have is to require a slack alerter or a feast feature store.

As an MLOps engineer, `zenml integration install` (possibly) makes sense as I'm configuring a profile and stack. As a Data Scientist, I want to be able to just run based on my current environment, e.g. in our pre-built notebook images we set the active profile and stack based on the environment where the notebook is running and on user permissions. We've also checked in `poetry.lock` files for projects. Ideally I'd like something as easy as `poetry install && poetry run ...`. That means I'd need some way of adding the integration dependencies into the poetry dependencies. A poetry plugin could be a way of achieving that. We've held off on poetry plugins until poetry 1.2 eventually comes out (I'm not sure when that's supposed to happen).

---

- Links: [[MLOps]] [[Zenml]] [[Python poetry]] [[Python monorepo techniques]] [[Kubeflow]] [[Kubernetes]] [[Secrets management]]
- Created at: [[2022-07-16]]