It’s much easier to test a Temporal Workflow in Python by first invoking the contents of the individual Activities, in the shell or via a separate script, and then composing them into a Workflow. I need to see if there’s a better way to surface exceptions and failures through Temporal directly to make the feedback loop faster.
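Here is a rough sketch of what I mean by running the Activity contents on their own first. The function names (`fetch_order`, `count_items`, `run_steps`) and the data are made up for illustration. In a real project the two activity functions would carry the `@activity.defn` decorator from `temporalio`; I've left it off so the snippet runs with only the standard library.

```python
import asyncio


# Hypothetical activity bodies. In a real project each would be decorated
# with @activity.defn from temporalio; omitted here so this runs standalone.
async def fetch_order(order_id: str) -> dict:
    if not order_id:
        raise ValueError("order_id must not be empty")
    # Stand-in for a real lookup (assumed data, not from any real service).
    return {"id": order_id, "items": ["widget", "gadget"]}


async def count_items(order: dict) -> int:
    return len(order["items"])


async def run_steps(order_id: str) -> int:
    # Same call order the Workflow would use, but without Temporal in the
    # loop, so any exception shows up right away as a normal traceback.
    order = await fetch_order(order_id)
    return await count_items(order)


if __name__ == "__main__":
    print(asyncio.run(run_steps("order-1")))
```

Once each step behaves in isolation, moving the same sequence of calls into a Workflow is mostly mechanical, and any failures that remain are more likely to be about Temporal than about the Activity logic.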

From this paper:

62% of the generated code contains API misuses, which would cause unexpected consequences if the code is introduced into real-world software

This work further reinforces some recent thoughts on the importance of measuring the quality of a language model’s output for a use case.

It had been a while since I thought about this 😬