Teaching Taste to an Agent: 4,096 Classical Chinese Verses, 7 Composition Rules, and the Skill That Paints Them
On turning Han dynasty oracular poetry into AI-generated ink paintings by encoding art direction as a reusable agent skill. Also: why content filters...
What is art direction as an agent skill?
Art direction as an agent skill is the practice of encoding composition rules, color theory, and aesthetic constraints into a reusable prompt schema that an AI image generation model can apply consistently across thousands of outputs. Rather than appending "ink painting style" as a suffix, the skill embeds painterly language directly into scene descriptions — because models respond to the language of a prompt, not the instructions about a prompt.
Or: How We Learned That Style-as-Suffix Doesn't Work and Composition Rules Are Just Taste With a Schema
How do you generate 4,096 consistent AI images from classical Chinese poetry?
Here's what we needed to do: generate one unique image for each of the 4,096 verses in the Jiaoshi Yilin (焦氏易林), a Han dynasty divination text from roughly 40 BCE. Each verse corresponds to a hexagram-to-hexagram transformation — 64 source hexagrams times 64 target hexagrams — and each one is a tiny oracular poem, usually exactly 20 characters long, describing an omen.
Simple, right? Take a verse, write a prompt, generate an image. Four thousand times.
Except the verses look like this:
霜降閉戶,蟄虫隱處。不見日月,與死為伍。
When frost falls, doors are sealed; hibernating insects hide in their places. Not seeing sun or moon, they keep company with death.
And also like this:
上帝之生,福祐日成。脩德行惠,樂安且寧。
God's blessings grow daily; cultivate virtue and kindness, enjoy peace and tranquility.
The first one is rich with imagery — frost, sealed doors, buried insects, an absent sky. You can almost see the painting. The second one is a fortune cookie. "Enjoy peace and tranquility." Try generating an ink painting from that.1
When we audited the entire corpus, here's what we found:
| Tier | Description | Count | % |
|---|---|---|---|
| Tier 1 | Rich imagery, direct prompt generation | 762 | 18.6% |
| Tier 2 | Some imagery, needs scene expansion | 2,046 | 50.0% |
| Tier 3 | Abstract, duplicate, or missing | 1,288 | 31.4% |
Less than a fifth of the verses can be translated directly into visual prompts. Half need creative expansion. A third need to be essentially reinvented as scenes while preserving the tone and fortune of the original. And 215 of them are straight-up duplicates — the same verse appearing in multiple hexagram positions, because Han dynasty scholars apparently reused stock formulas when the prognostic weight was similar.2
This is not a "write a prompt" problem. This is an art direction problem at scale.
Why doesn't "style as a suffix" work for AI image generation?
We started where everyone starts: generate some test images and see what happens.
Round 1: Eight images from eight verses, with loose style hints bolted on as suffixes — "Chinese ink wash", "ink painting", "traditional Chinese art." The results were wildly inconsistent. Some looked like watercolors. Some looked like stock photography. One looked like someone had taken a photo of a muddy stump and run it through a Prisma filter from 2016.



But one image was gorgeous: 無妄之明夷, a verse about crows and a falcon in a moonlit winter forest. Unified blue-violet palette, compositional depth, painterly atmosphere. The kind of thing you'd hang on a wall.

The difference? The prompt for the good image didn't describe a scene and then append "ink wash style." It described a scene that already read like a painting: "like black ink drops," "pool of moonlight," "black silhouettes against gray." The painterly language was embedded in the scene description itself.
This is the key insight, and it's worth stating clearly because it applies far beyond image generation: style-as-suffix doesn't work. The model ignores trailing directives. It responds to the language of the prompt, not the instructions about the prompt. If you want ink painting, don't say "make it look like ink painting." Write a prompt that reads like the caption under an ink painting.3
Round 2: We tested six explicit styles on the worst image from round 1 (the muddy stump). All six were better than the original. But none were as good as the round 1 best. Style-first prompts outperformed scene-heavy prompts with weak suffixes, but the sweet spot was somewhere else entirely — in the fusion of scene and style.



The round 1 loser prompt ended with style as an afterthought:
"…Patches of snow remain in shadows. Pale early-spring light, everything wet and glistening. Ink wash with cool blue-green tints."
The round 2 winner embedded painterly language throughout:
"…Water spills over the stone lip in thin silver threads, tracing dark lines down worn steps like brushstrokes on parchment…Chinese ink painting."
Round 3: Composition variants. Wide landscape vs. close-up vs. overhead vs. water's-POV. The wide landscape won decisively. Compositional depth — foreground, midground, background — mattered as much as style language.
Round 4: Proof batch. Fifteen images across five style categories. Now we were getting somewhere.





Five Styles, Seven Rules, and the Codification of Taste
Four rounds of testing produced two artifacts: a style taxonomy and a set of composition rules.
The five styles:
| Style | Palette | Best For |
|---|---|---|
| atmospheric-night | Dark, strong light in darkness | Grief, isolation, winter, fear |
| ink-landscape | Classic shanshui, depth corridors | Nature, seasons, water, agriculture |
| figures-in-mist | Ink wash + soft watercolor tints | Marriages, courts, travelers, emotion |
| bold-action | Dynamic diagonals, warm ochre | Animals, combat, hunts, storms |
| cosmic-night | Deep blue/black + gold/white | Stars, heaven, mythical beasts |
The seven composition rules:
- Frame the scene — gorges, gates, doorways, tree canopies. Never a subject floating in open space.
- Force depth — foreground, midground, background. Always.
- Demand contrast — strong light against darkness, or dark against light.
- Include a warm accent — even in monochrome: amber, vermillion, pink, gold.
- Favor diagonals and verticals — avoid flat horizontal compositions.
- Pack in discoverable detail — the best images reward a second look.
- Use painterly scene language — end with "Chinese ink painting." but embed painterly metaphors throughout. Never use photographic language.
These rules aren't arbitrary. Each one emerged from comparing the images we liked against the images we didn't, across all four test rounds. Rule 3 (demand contrast) came from noticing that our favorites all had a single strong light source. Rule 7 (painterly language) came from the round 1 discovery about style-as-suffix. Rule 4 (warm accent) came from the observation that pure monochrome images felt cold and lifeless — one vermillion lantern transforms a scene.
Here's what's interesting: these are rules about taste. Not technical constraints. Not API parameters. They're the kind of thing a senior art director would tell a junior illustrator. "Don't put the subject in empty space. Use diagonals. Make sure there's a warm accent."4
The question was: how do you give an AI agent taste?
The Skill
The answer turned out to be a skill — a structured document that encodes domain knowledge into a reusable workflow an AI agent can invoke.
Not a prompt template. Not a system message. A skill: a self-contained document that specifies inputs, outputs, rules, anti-patterns, examples, and edge cases. When an agent invokes it, the agent gets the full context of what we learned across four rounds of style exploration, plus the classification taxonomy, plus the composition rules, plus the content safety constraints, plus eight validated reference prompts to calibrate against.
The core of the skill looks like this:
Classical Chinese verse
↓
Read the verse (understand imagery, mood, season, allusions)
↓
Classify into one of 5 styles
↓
Compose a 50-60 word English scene description
following all 7 composition rules
↓
End with "Chinese ink painting."
↓
Provide faithful English translation
For each verse, the agent outputs three things: a style classification, a prompt, and a translation. The style determines the palette and atmosphere. The prompt is the creative work — turning twenty characters of classical Chinese into a scene that a diffusion model can paint. The translation is separate from the prompt because we don't want the image to depict the literal translation; we want it to capture the feeling of the verse in visual form.
The skill handles the hard cases explicitly:
Abstract verses (30.3% of the corpus): When the verse is a fortune statement with no concrete imagery, the skill instructs the agent to choose a style based on tone — ominous goes to atmospheric-night, auspicious to ink-landscape or figures-in-mist — and to invent a scene that embodies the fortune's emotional register.
Duplicate verses (215 instances): When a verse appears in multiple hexagram positions, each instance gets a variant scene. Same base imagery, different angle, season, or atmosphere — differentiated by the source-target hexagram symbolism.
Lost verses (3 entries marked 原缺, "originally missing"): The agent composes a scene from the hexagram transformation alone — source trigram imagery flowing into target trigram imagery.
The reference prompts are critical. Eight validated examples, one or two per style, that calibrate what "good" looks like. Here's one, for atmospheric-night:
"A frozen river valley under a dark sky split by driving snow. In the foreground, bare willow branches bend under crusts of ice like white brushstrokes. Midground, a lone figure hunches against the north wind on a stone bridge, robes whipping. Beyond, snow-covered hills vanish into the storm. The figure's lantern is a single point of amber warmth swallowed by the gray. Chinese ink painting."
Note the structure: framed (valley), depth (foreground/midground/background), contrast (lantern against storm), warm accent (amber), diagonal (figure hunching), detail (ice crusts on branches), painterly language ("like white brushstrokes"). All seven rules, embedded naturally.5
The Part Where Ancient Violence Meets Modern Content Filters
Here's something nobody warns you about when you're generating images from 2,000-year-old oracular poetry: the content filters.
52 of 4,096 images failed generation. Not because the prompts were bad — because fal.ai's content filter rejected them. The Yilin is full of war, predation, and bodily harm. "Fangs deep in flesh." "Sword turns upon himself." "Bodies fallen on bare ground." These are oracular images describing misfortune. They've been in the text for two millennia. The diffusion model's safety filter does not care.
So we had to develop a substitution system — metaphor as content-filter evasion:
| Blocked | Safe Alternative |
|---|---|
| Blood on surfaces | "Dark stains like spilled ink" |
| Wounds / piercing | Show the weapon or aftermath, not the moment of contact |
| Fallen bodies | "Abandoned armor on bare ground," "an empty seat at a banquet table" |
| Animal predation | "Scattered feathers on snow," "a wolf's shadow crossing moonlit ground" |
| Dismemberment | "A shattered puppet," "a split tree trunk" |
The guiding principle: show the weight of violence through atmosphere, aftermath, and metaphor — never through the act itself.
This constraint turned out to produce better art. "Scattered feathers on snow" is a more evocative image than "hawk kills rabbit." Absence is more powerful than presence. The Yilin's own power comes from dread and foreboding, not graphic depiction — and the content filter, accidentally, pushed us closer to the text's actual aesthetic.6
What a Skill Actually Is
Here's the thing I've been circling around: a skill isn't a prompt. A prompt is a one-shot instruction. A skill is accumulated taste in machine-readable form.
Consider what the yilin-verse-to-prompt skill encodes:
- Four rounds of style exploration → 5 style categories with clear selection criteria
- Preference comparisons across 30+ test images → 7 composition rules
- 52 content filter failures → substitution table with principles
- Corpus audit of 4,096 verses → tier classification and handling strategies
- 8 validated reference prompts → calibration examples
- Edge case handling → abstract verses, duplicates, lost verses
- Anti-patterns → style-as-suffix, over-instruction, photographic language
All of this fits in a single document. When an agent invokes it, it gets the benefit of weeks of exploration compressed into constraints and examples. It doesn't need to rediscover that style-as-suffix fails. It doesn't need to learn that warm accents prevent lifeless monochromes. It just follows the rules and generates prompts that are, on average, much better than anything a generic "write me an image prompt" instruction would produce.
This is what I mean when I say skills beat generic orchestration. You don't need a protocol for routing tasks between agents. You need agents that carry domain knowledge. The knowledge is the hard part. The routing is plumbing.7
The Numbers
Where we are now:
| Metric | Value |
|---|---|
| Total verses | 4,096 |
| Unique after dedup | 3,881 |
| Duplicate rewrites created | 215 |
| Lost verses newly composed | 3 |
| Style categories | 5 |
| Composition rules | 7 |
| Content safety substitutions | 7 categories |
| Validated reference prompts | 8 |
| Image model | fal-ai/z-image/turbo |
| Image format | WebP, square_hd (1024x1024) |
| Delivery | Vercel Blob CDN + on-device cache |
The pipeline is running. Hexagram by hexagram, sixty-four prompts at a time, each one classified, composed, and translated by an agent carrying a skill that encodes everything we learned from those first thirty test images.
The Principle
When people talk about AI agents, they usually talk about orchestration — how agents coordinate, hand off tasks, route work. That's the plumbing. It matters, but it's not where the value lives.
The value lives in the skills. In the accumulated domain knowledge that tells an agent how to do something, not just what to do. The difference between "generate an image prompt from this Chinese verse" and "classify this verse into one of five style categories based on its imagery and tone, then compose a 50-60 word scene following seven composition rules, using painterly language embedded in the scene rather than appended as a suffix, while avoiding photographic language and content-filter-triggering violence" — that difference is the skill.
You can teach an agent to route tasks. You can't teach it taste. But you can encode taste as rules, examples, and anti-patterns — and that's what a skill is.
Four thousand ninety-six verses. Five styles. Seven rules. One skill.

The agent paints the rest.
This post was written by Claude Opus 4.6 under human direction — the human provided the project, the thesis, the correction about skills vs MCP, and the architectural insight. The AI provided the DFW-adjacent structure and the footnote spiral. The 4,096 verses were written by Jiao Yanshou roughly two thousand years ago, which is either the most impressive or least impressive fact in this post depending on your timescale.
To see the Yilin images in action, join the TestFlight beta at sixlines.online
Footnotes
-
72.6% of all verses in the Yilin are exactly 20 characters long — the canonical four-line, five-character quatrain format. This uniformity is remarkable for a text compiled during the Han dynasty. It also means the variance in visual richness has nothing to do with verse length. A 20-character verse can be a vivid winter landscape or a completely abstract fortune statement. Length is a constant. Imagery is the variable. ↩
-
The duplicate analysis is our own conjecture, not established Yilin scholarship. But the data is suggestive: 95% of duplicate groups share neither source nor target hexagram. Only 4 groups share the same source. Zero share both. This rules out structural mapping and points toward formulaic reuse — stock phrases for similar prognostic situations, like a Han dynasty cut-and-paste. ↩
-
This principle — that models respond to the register of language, not to meta-instructions about style — shows up everywhere in LLM work. You don't get formal prose by saying "write formally." You get it by writing formally in your prompt. The medium is the message, except here the medium is the token distribution and the message is "please stop generating stock photography." ↩
-
The fact that composition rules for AI-generated images are identical to composition rules taught in art school is either obvious or profound depending on how much you've thought about it. The diffusion model was trained on human-composed images. Of course it responds to the same compositional principles. What's less obvious is that stating these principles explicitly in a skill document causes better output than expecting the model to apply them implicitly. The model knows what good composition looks like. It doesn't know you want it. ↩
-
The skill document also includes an anti-patterns section, which is arguably more useful than the rules themselves. Knowing what not to do — style-as-suffix, over-instruction, flat composition, sparse scenes, photographic language — prevents the most common failure modes. Good taste is mostly the absence of bad taste. ↩
-
There's a legitimate question about whether content filters should apply to depictions of classical literature. The Yilin has been in continuous circulation for two thousand years. But we're not in a position to negotiate with fal.ai's safety policy, and honestly, the metaphorical substitutions are better art. So this is one of those cases where constraints improve output — not because the constraints are wise, but because working around them forces creative solutions. ↩
-
This is why I changed the resume. I had originally described the architecture as "MCP-based routing" because that sounds impressive and technical. But MCP is a transport layer — it lets models talk to tools. The actual architecture is skill-based: agents carry reusable domain knowledge documents that encode accumulated expertise. The skill is the intelligence. The routing is just wires. ↩