OC 3 – Chapter 14: Project Evaluation

«The great tragedy of creativity — the breaking of a beautiful artifact on a neglected reality.»¹
Daniel Wessel

During project realization, ideas encounter reality — they collide with constraints. The resistance and friction are the feedback that is needed to improve the artifact. Like a smith’s hammer hitting metal or a knife on a whetstone, they shape the ideas and the project into something of value.²

That is the main goal of feedback — and of evaluations more specifically — improving the work.

Thus, an evaluation is not an abstract judgment or casual opinion. It is reality-based feedback that provides the needed friction to shape creative work into something that actually works.

A good evaluation arranges contact with reality so the next decision is better informed. So to improve creative work, the question is: What conditions make that reality feedback happen sooner, more cheaply, and more honestly?

Willingness to Improve

Few people want negative feedback, that is, feedback that shows that things do not work or are plainly wrong. However, being able to seek out and engage with such feedback honestly is the only way to improve ideas and projects deliberately. This willingness includes openness, honesty, and humility, which allow people to take feedback seriously but not personally (see Craft to Create on page 54).

This willingness to improve also preserves momentum when the quality of creative work varies. Many factors influence how good something is: the quality of the ideas, focus, materials, and so on. Sometimes these factors align to create the best conditions, sometimes they misalign and create the worst, and usually they land somewhere in between.³ There should be overall improvement over time, but occasional failures are to be expected. We simply do not usually see them in others, because the worst results vanish quietly or remain in R&D labs or artists’ studios. The path to successful artifacts therefore runs through many attempts, failures, corrections, and restarts.

It is this non-defensive contact with reality — the ability to see what is actually true without immediately protecting your ego, your story, or your habits — that allows for quick course corrections, fast learning, less wasted time on dead strategies, and better investment of your energy where the returns are real.

Note that this is not self-hatred or a «woe is me» attitude, but accurate observation with low ego interference.⁴ From the outside, it may look humble but not self-flagellating. In reality, it is an extremely aggressive strategy for improving the work. It makes it possible to say, cleanly and without melodrama, what the real problem is, even if the problem is your own work as it currently stands, and then actually solve it.

Curiosity is very helpful here, for example in questions such as «What might I be missing?» and «How might I do better?» It also helps to keep believing in the idea while deliberately looking for misconceptions, issues, and limitations.⁵

While feedback can point to a lack of knowledge or skill, it is directed at the work, not at identity. Success takes time, and criticism keeps things alive and makes them better. Given how much successful creative work and personal improvement depend on feedback, the worst thing that can happen is not getting any. Even harsh and valid — but not abusive — feedback means that the other person believes you can do better.⁶

Resistance Criteria

Creative work improves only when it meets resistance, when there is friction and pushback. That resistance can come from your own view of what the project should be and from your standards for how it should be realized (see Standards, Release and Kill Criteria on page 179). It can also come from the work itself — wood splintering, code not compiling. Or it can come from others, for example feedback from the field or the target audience, who ultimately decide whether something is creative. Usually, during realization, the criteria shift from one’s own standards toward field or target-audience feedback.

Without some criteria — explicit or implicit — there is nothing for the work to push against. Everything remains equally acceptable, which means nothing can be meaningfully revised. Thus, these criteria are not abstract ideals, but tools that generate usable feedback. They make the gap between your current work and what you want to achieve visible, and therefore workable.

In practice, you can use criteria that allow you to ask:

Does this version solve the problem better than the previous one?
Where does it fail under real use?
What specifically needs to change next?

The answers guide the iteration. You can also evaluate the resistance criteria themselves by how well they guide these iterations. As they are feedback instruments, not moral obligations, they need to be calibrated to generate movement, not abstract judgment. If they prevent you from producing work, they are too rigid. If they never force revision, they are too weak.

In many domains, these criteria — especially one’s own standards — improve with experience. Learning what is good, what counts as quality, even developing taste, is part of learning a domain. This is very obvious in domains such as music, where well-played versus misplayed notes provide immediate feedback, but it happens in every domain.

Repeated Reality Contact

In creative work, evaluation is a series of contacts with reality during the realization of an idea. Thus, an iterative approach with repeated reality contact is usually best suited to improving the work (see Realization Approaches on page 195).

These iterations can happen at the level of an idea, in an Internal Simulation or External Representation (e.g., a sketch or a draft), then become more concrete in a prototype, or even take the form of different versions of a realized and released project (e.g., apps, books, painting series).⁷

Each evaluated iteration generates knowledge that guides the next one by reducing uncertainty. It allows you to find out what is already happening and where the friction is, to try adjustments, and to see whether they improve the situation.

Feedback is most useful when it arrives early enough to influence direction and concretely enough to suggest change. The further along the work is, the easier it becomes to get accurate data, because the artifact is more concrete. But the cost of change also rises, because more has been invested — resources, time, ego. The worst case is feedback at the end of development that calls the whole idea into question.

Timing of Evaluations

Evaluations can focus on different aspects of feedback:

Exploratory Feedback («Ex-Ante Evaluation» or «Preliminary Analysis»): Feedback while shaping the idea, usually before the implementation is started. The focus is on understanding the target audience and the problem — the situation, the requirements, the design space. See Understanding a Situation on page 221.
Directional Feedback («Formative Evaluation»): Feedback during realization, once the work begins to take form. It is used to shape possible solutions — to decide what to pursue, drop, or adjust. This is often done after each iteration. A few people (5–10) are usually enough to spot major issues, especially if members of the target audience can interact with mockups or prototypes.⁸
Validation Feedback («Summative Evaluation»): Feedback that examines whether a near-finished or finished project actually works in real use, or «in the wild». It is less about exploring possibilities and more about checking whether the work holds up under real conditions. Because it usually involves more people, it requires methods that scale easily (e.g., survey questions rather than direct observation).
Project Work Feedback («Post-Mortem»): Feedback concerned less with improving that particular project than with improving future ones. The work on that project is examined from the initial idea to the period after release. What can be learned for future projects? What actionable feedback should be integrated?
Continuous Feedback («Continuous Monitoring»): Feedback generated continuously by the project itself, e.g., user feedback, usage numbers, visitor numbers, and so on. If possible, it also comes from the developers themselves if they are part of the target audience («Eat your own dog food»⁹).
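The remark under Directional Feedback that a few people are usually enough to spot major issues can be made plausible with a small calculation. The following is a minimal sketch, not a method taken from this chapter. It assumes that each tester independently notices a given issue with the same probability, and the value 0.3 for that probability is an illustrative assumption. The function name share_of_issues_found is hypothetical.

```python
def share_of_issues_found(n_testers: int, p_detect: float) -> float:
    """Expected share of issues noticed at least once by n_testers.

    Assumption (not from the text): every tester notices a given issue
    independently with the same probability p_detect.
    """
    if n_testers < 0:
        raise ValueError("n_testers must not be negative")
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    # An issue is missed only if every single tester misses it.
    return 1.0 - (1.0 - p_detect) ** n_testers


# Illustrative value only: each tester notices about 30% of the issues.
P_DETECT = 0.3

for n in (1, 3, 5, 10, 20):
    share = share_of_issues_found(n, P_DETECT)
    print(f"{n:2d} testers: about {share:.0%} of issues seen at least once")
```

Under these assumptions the curve flattens quickly, so additional testers in the same round add less and less. If issues are rare or hard to notice (a low detection probability), noticeably more people are needed, which is why the sketch should be read as a rule of thumb rather than a guarantee.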

While early feedback is easier to integrate and comes with lower change costs, early ideas are also fragile. With little invested in them, they can be dropped too easily, even if they are good, simply because the feedback was too harsh.¹⁰ So at the very least, the idea should be robust enough to survive contact with reality and withstand criticism, for example by first exploring it before seeking feedback, looking for likely criticism, knowing its strengths, or being willing to defend it.

Feedback is useless if it destroys the idea it is supposed to improve.

Exposing ideas and projects to feedback prevents you from continuing in a vacuum — without it, projects drift and usefulness is likely to decline. But with too much feedback, solutions never settle. The aim is to introduce feedback where it changes decisions, not where it merely produces opinions.

Valid Feedback

To actually improve ideas, projects, or future work, the feedback needs to be valid. However, getting valid feedback is surprisingly difficult. Even one’s own judgment of the work can switch quickly between «This is shit» and «This is the shit».

Relevant Factors in General

For some projects, especially in science and engineering, the standards are high. Measurements must be objective, reliable, and valid, and these are only the main quality criteria. There are other criteria as well. Things become even more complicated when causal effects are involved, for example whether an artifact actually causes a certain improvement, such as a fitness app leading to more exercise and a fitter body. Working at that level requires training in methodology and statistics.¹¹

However, in many cases, being directionally correct is a sufficient standard. The question is not «Is that true?» but «Does this change improve my output enough to keep it?»¹²

Factors that influence this criterion are:

Understanding: Both the idea or project and the goal of the feedback have to be understood. Ideas are usually rich in one’s own mind, but others need an externalized version in order to respond to them. It is therefore usually worth first spending time externalizing them well, e.g., with an annotated sketch or an elevator pitch. Then provide information about the goal of the work, the audience, what you have done, and what you plan to do.
Source of the Feedback: In general, the closer the source is to the field or the target audience, the more likely the feedback is to be valid.¹³ Expert opinion can be badly wrong, in both directions. Experts can argue that something that works is impossible (e.g., heavier-than-air flight), while entrepreneurs fail with ideas that «should have worked».
Representativeness: Target audiences are rarely homogeneous, so good representation matters (e.g., novice vs. expert users). Mind-reading is usually far off («It would be useful for that group that I do not belong to.»). It is better to ask representatives of that group directly.
Realistic Circumstances: The greater the gap between the evaluation situation and later actual use, the less trustworthy the feedback is. For example, if users first receive an explanation but will not get one when the artifact is in the store, the feedback will likely be off. Artifacts are best evaluated under realistic conditions, e.g., people can look at them, try them out, and interact with them. This does not mean the artifact has to be finished — only that it should look and act finished enough (e.g., mockup, prototype, Wizard-of-Oz).
Want vs. Need: The target audience can often say accurately what they want, e.g., an easier communication app. But they usually lack the expert knowledge to say what they actually need, e.g., not a better-designed app but a workflow that makes much of the communication unnecessary. Sometimes they may even hesitate to state what they want, because it is awkward — for example, a professional not wanting to admit that he needs more support in doing a task.¹⁴ Thus, needs require interpretation, and possible solutions have to be evaluated in practice.¹⁵
Watching Out for Biases: Irrelevant aspects easily distort feedback. For example, we look for confirming information (confirmation bias) and are more invested in things we made ourselves (IKEA effect); cool new things often produce highly positive short-term reactions that do not survive long-term use (novelty effect); people try to be nice when giving feedback even though «nice» is neither good nor useful (demand characteristics); and people behave differently when they know they are being observed (Hawthorne effect). This does not mean feedback is false or useless (that would be another bias, the fallacy fallacy), only that it should be treated skeptically.
Checking for Side Effects, Second-Order Effects, Long-Term Effects, and Non-Events: Because an artifact changes a system (your life and other people’s lives), it can have side effects and long-term effects. First-order effects are the direct, intended outcomes. Second-order effects are indirect consequences created by the existence or use of what was made. They are often delayed, non-linear, and outside your control. For example, the artifact may shift attention or behavior, destabilize a system, create dependency, or provoke similar and escalating reactions.¹⁶ These effects are often missed unless they are deliberately checked for. Similarly, noticing what did not happen, but should have happened, requires deliberate attention, because non-events do not draw attention.

There are many different ways to get feedback. For example, you can ask questions (surveys, interviews) or observe behavior.¹⁷ Questions are usually the easiest way to get feedback. Behavior is often more revealing and harder to fake than words.¹⁸ However, observing behavior is harder, because we easily interpret it in ways that are biased by our expectations.¹⁹ Combining different methods can compensate for their respective weaknesses.²⁰

Feedback from a Single Source

Feedback can provide useful insights even if it comes from a single source, e.g., one person or a single AI counsel. However, there are strong constraints if it is from a single source:

Grain of Salt: It is only one perspective. The target audience is usually broader, and the source is unlikely to represent the full variance. Even experts are often wrong.
Check Understanding: To verify understanding, the counsel should first summarize the idea or artifact, then point out positive aspects, then aspects to improve, and then possible next steps. If you need social support, ask for that explicitly. You can also test people’s ability to spot and communicate errors honestly by deliberately inserting one and seeing whether they mention it.
Different Perspective: The person should bring a different perspective (actual viewpoint diversity) and provide arguments and evidence. You need someone willing to tell you if your baby is ugly or weak — and, ideally, where to apply the scalpel or how to train it. In other words, the person should be good, not nice. That is a demanding standard and usually requires someone other than an easily available friend or acquaintance. It should also be someone whose advice you are actually willing to take.
Wants You to Become Better: Counsel should first try to understand the work, then criticize it. They should want it to become better, and provide actionable feedback in a way that keeps the receiver open to it. At its best, this is honesty that turns the situation into something positive. Because feedback usually contains negative information, it should be given in private. Public feedback usually only produces defensiveness.
Counsel or Co-Creator: Feedback differs in how strongly it shapes the work. Instrumental help points out strengths and flaws, but leaves you to develop the solution yourself. That lets you become competent and retain ownership. Executive help means someone doing part of the work for you, which can easily drift into co-creation. It helps to decide beforehand what role you want.
Ethics: Only ask for feedback if you are willing and able to use it, because otherwise you are wasting that person’s time and goodwill.
Kind of Feedback: Be explicit about the kind of feedback you want. For example, whether you want the idea itself challenged, the implementation criticized, or specific criteria such as originality, understandability, or impact assessed. If you do not get useful feedback, check whether it is the right audience, whether the idea is clearly presented, and whether you need to provide concrete entry points («I am interested to know whether …»).

Evaluation Targets

The goal of feedback is always to improve the work. Two main targets are understanding a situation in order to come up with suitable solutions and assessing the value of a solution. Not every project needs all of these tools. More expressive domains often use looser but still reality-bound forms of evaluation.

Understanding a Situation

Situations can be quite complex, so the following questions help ensure that the main aspects are covered. Not all of them will be relevant for every project. Answers can come from the target group itself (e.g., surveys, interviews, behavior observations), but also from prior work, literature, and other sources.

Target Audience: Who are the relevant groups and actors? What are their characteristics, goals, capabilities, and constraints (e.g., interests, technical skills, financial situation)? Are there subgroups (e.g., novices vs. experts)? Who will use this directly? Who is affected indirectly?
Tasks: What does the target audience actually do or have to do, step by step, including the workarounds they use? What decisions are made? Where do errors occur? What is frequent, rare, stressful, or interruptible? The answers reveal where to simplify, automate, or support judgment.
Problems, Needs, Pain Points: What is wrong with the current state? Is a design intervention appropriate? What is the actual problem — not the requested feature? What people want may not be what they need. Who experiences the problem, and how severely?
Goal and Outcome Analysis: What does «success» mean for the target audience? What changes if the solution works? What behaviors should become easier, faster, safer, or more meaningful?
Workflow/System: How do tasks, actors, tools, and information interact across a broader system? What happens before and after the interaction?
Competition: How is the problem already addressed, e.g., through tools, habits, substitutes? What do users do today instead? Which behaviors would have to change? This includes the status quo, including doing nothing, which is often a powerful competitor.
Constraints: Which non-negotiable limits shape the design space? What cannot change, e.g., technical, legal, cultural, or financial constraints? What are the hard versus soft constraints?
Risks: What could go wrong when another solution is used? What would the consequences be? Which errors and misuses must be avoided, and which safety nets are needed?
Adoption/Change: Would another solution be accepted by the target audience? Which learning curve is acceptable? Where are the likely resistance points, e.g., training demand or low perceived benefit?

As with organizing creativity itself, context usually has a strong influence on possible solutions.

Context while Interacting

Physical: What is the material environment in which the solution is used? Lighting, noise, temperature, or movement conditions? Which objects compete for space? For example, a tablet on a construction site has to tolerate dirt and drops, whereas in an office it does not.
Temporal: What is the time structure around the activity? Time pressure, frequency of use, continuous use or interruptions, or seasonal use? How long can attention be sustained? For example, emergency workflows where seconds matter versus rarely used software that needs stronger guidance.
Social: What are the roles, expectations, and presence of other people? Is the use individual, collaborative, or supervised? What about visibility, privacy, or authority? For example, a personal mental health app (avoiding embarrassment) versus a shared editor (awareness of others crucial).
Technical: What is the surrounding technological ecosystem? Which systems, devices, and infrastructure coexist? For example, office software has to fit into complex infrastructure, while outdoor navigation has to work remotely.

Shaping the Context

Organization: What are the structures, incentives, and constraints? What does the organization actually reward, measure, or forbid? Who owns errors? Who approves changes? For example, a home app versus an organizational workflow app aligned with KPIs.²¹
Culture: What are the shared meanings, habits, and expectations? What assumptions exist about authority, risk, meanings, or metaphors? For example, using baseball metaphors outside the US.
Economic: What cost structures and value perceptions affect adoption and use? Who pays versus who benefits? What is considered worth the effort? For example, a billing tool may save time but increase transparency, so users resist it.
Legal: What is allowed, recorded, or validated? Which documentation or liability standards apply? For example, a public-administration app conflicting with data-protection requirements.
Lifecycle: What happens across the product’s lifespan? Maintenance, repair, disposal, or environmental consequences? What happens after deployment? For example, products may be installed in the field but impossible to update there.
Cognitive/Emotional: What is the user’s mental state, stress, motivation, or confidence? Is he anxious, bored, overloaded, or confident? Is the task perceived as meaningful or imposed? For example, software used for exploration or play versus software for serious, low-error-tolerance work.

Whenever you try to understand a situation in order to create a solution, you are diagnosing motivations, frictions, risks, and behavioral change. It does not matter whether the project is product development, writing a thesis, organizing a workshop, or planning a romantic evening.

If the situation is misunderstood, the creative solution will likely mismatch the conditions of reality or the motives of the people inside it. For example, that romantic evening has social and cultural constraints. If those are ignored, the evening may be technically excellent but emotionally tone-deaf.

Value of a Solution

A more complete understanding of the situation allows for better ideas. But the only way to test whether that understanding actually translated into better solutions is by evaluating them. Does the specific solution actually have value for the target audience?

Depending on the artifact, what makes it valuable can vary. Usually, some version of usability (effectiveness, efficiency, learnability), user experience (satisfaction), and acceptance can be applied. These criteria are especially useful for artifacts that are directly used. In other domains, analogous criteria apply.

Effectiveness: Does it achieve the goal? For example, if you want to convey knowledge in a blog post, is the reader able to act on that knowledge correctly? Or if you want to evoke a particular sensation with a painting or poem, does it actually have that effect?
Efficiency: How much effort is needed to achieve the goal? For example, if you develop an app, can it be used quickly and without errors? In a text, can a reader access what you want to convey quickly?
Learnability: Is it easy to learn how to use it? This matters especially when no similar artifacts exist yet, for example when you introduce a new navigation concept in an app.
Satisfaction: Do people enjoy using it? Does it feel right? Function is crucial and usually best developed first, but form also matters. Style and substance are connected.²² Beauty matters as well²³ — it has power and compels beyond rationality.²⁴ This also applies to work that deals with negative aspects of life. Depending on how it is done and in what context, people may be willing or even glad to experience negative emotion (e.g., sad-film paradox).
Acceptance: Is the target audience willing to use it? A strong indicator is whether they are also willing to pay for it. Payment may be made in money, time, effort, or some other costly currency («value for value»). Acceptance is easily distorted by a «wow» effect, because early enthusiasm often does not survive long-term use (e.g., novelty effect).²⁵

These aspects can be assessed through questions and through behavioral indicators, e.g., how people use the artifact, how often, whether they reach their goal, make errors, improve quickly, enjoy it, recommend it, or pay for it.
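For artifacts that are used directly, some of these behavioral indicators can be summarized with very little tooling. The following is a minimal sketch under stated assumptions: the Session record, its fields, and the summarize function are hypothetical names, and the numbers are made-up example data, not results. Effectiveness is read here as the share of sessions in which the goal was reached, and efficiency as time and errors per session.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    """One observed use of the artifact (hypothetical record format)."""
    reached_goal: bool
    seconds: float
    errors: int


def summarize(sessions: list[Session]) -> dict[str, Optional[float]]:
    """Turn observed sessions into rough effectiveness and efficiency figures."""
    if not sessions:
        raise ValueError("at least one session is needed")
    successful = [s for s in sessions if s.reached_goal]
    return {
        # Effectiveness: how often people actually reached their goal.
        "goal_rate": len(successful) / len(sessions),
        # Efficiency: typical time needed, counted only when the goal was reached.
        "median_seconds": median([s.seconds for s in successful]) if successful else None,
        # Efficiency: average number of errors, counted over all sessions.
        "errors_per_session": sum(s.errors for s in sessions) / len(sessions),
    }


# Made-up example data for illustration only.
observed = [
    Session(reached_goal=True, seconds=40, errors=0),
    Session(reached_goal=True, seconds=60, errors=1),
    Session(reached_goal=False, seconds=120, errors=3),
    Session(reached_goal=True, seconds=50, errors=0),
]
print(summarize(observed))
```

Comparing such a summary between two iterations gives a directional answer to whether a change was worth keeping. Satisfaction and acceptance are not covered by this sketch and still need questions or longer-term indicators such as recommendations or payment.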

Besides these specific aspects, the following general feedback questions often produce useful insights:

«What is good and should be kept?»
«What could be improved?»
«What am I not seeing here?»
«If you were me, which questions would you ask?»
«Anything else that might be relevant that I did not ask, should have asked, or that you want to bring up?»

Assessing the value of a solution becomes easier with experience. In that sense, these evaluations are themselves iterative and improve with feedback.

Dealing with Feedback

Whether feedback comes from a counsel or from a formal evaluation, the question is the same: How do you use it to improve the work?

Understanding First: When you get feedback, listen first. Do not defend the idea, even though that will be hard. It is your idea, so of course you are protective of it. Ask questions instead of making assertions. If something was misunderstood, that is valuable feedback, because it tells you that you need to present the idea more clearly next time.
Acknowledge the Emotions: Feedback can feel uplifting or soul-crushing — both are biases. The first step is usually to acknowledge the emotion. Pride and happiness, doubt and anger, all show that you care about the work. And as for negative feedback, even harsh criticism at least shows that more is expected. Only indifference is deadly.²⁶
Relevance of Feedback: Mine the feedback for what is useful for the current and future work. The most important feedback concerns showstoppers — things that do not work or threaten the idea or artifact itself. That includes people not understanding the idea, the solution not working, or the artifact having no value to the audience. Unless these are outliers or come from people outside the target audience, they have to be addressed. Beyond showstoppers, differentiate by importance — high priority, important but not critical, only if there is time, nice to have. If everything is important, nothing is.
Draining the Poison: Feedback ranges from honest criticism that wants the work to improve, to positive or negative mindless comments, to trolling that just wants an emotional reaction. Some people are very good at spotting critical issues but bad at phrasing them. Then the information has to be separated from the style. As harsh feedback is aversive to read, you can use a filter, e.g., a friend, colleague, or an AI, to remove the emotional tone and focus on the usable content. It can also reduce interpretative feedback and foreground operational feedback.
Weight of Feedback: Some feedback consists of assertions without argument or evidence. Such comments gain weight if other people make them too. Some comments are just matters of taste, which are less relevant unless they prevent use by the target audience. Squelchers are vague assertions that can be thrown at almost any idea («We’ve done fine without it» or «We’ve always done it this way»). These usually express a personal or organizational dislike of change. In that case, the feedback is less about the idea itself and more about likely resistance to adoption.
Only a Suggestion for Improvement: Feedback is always an approximation of the truth, shaped by the questions, methods, and conditions behind it. Feedback can be wrong, and you can always find something, good or bad. Unless it identifies a showstopper, feedback is a suggestion for improvement, not a final verdict. Also note that people are far more likely to comment negatively than positively, and negative feedback tends to linger longer.
Interpretation before Implementation: Feedback should rarely be implemented exactly as given. Otherwise, because tastes differ and goal conflicts are common, you risk oscillating between incompatible versions of the artifact. Instead, look at the overall pattern and decide what the feedback means for the next iteration.
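The sorting steps described above (showstoppers first, then by importance, with more weight for points raised by several people) can be sketched as a small triage routine. This is only an illustration under stated assumptions: the names Priority, Comment, and triage are hypothetical, the priority levels mirror the ones named under Relevance of Feedback, and skipping comments from outside the target audience is a deliberate simplification.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means more urgent."""
    SHOWSTOPPER = 0
    HIGH = 1
    IMPORTANT = 2
    IF_TIME = 3
    NICE_TO_HAVE = 4


@dataclass
class Comment:
    issue: str                  # short label for what the comment is about
    priority: Priority          # your own interpretation of its importance
    from_target_audience: bool


def triage(comments: list[Comment]) -> list[tuple[str, Priority, int]]:
    """Group comments by issue and order them for the next iteration."""
    grouped: dict[str, tuple[Priority, int]] = {}
    for c in comments:
        if not c.from_target_audience:
            continue  # simplification: outside voices are set aside here
        priority, mentions = grouped.get(c.issue, (c.priority, 0))
        # Keep the most urgent rating and count how many people raised the issue.
        grouped[c.issue] = (min(priority, c.priority), mentions + 1)
    # Most urgent first; within the same urgency, most frequently raised first.
    return sorted(
        ((issue, priority, mentions) for issue, (priority, mentions) in grouped.items()),
        key=lambda item: (item[1], -item[2]),
    )


# Made-up example comments for illustration only.
feedback = [
    Comment("onboarding unclear", Priority.HIGH, True),
    Comment("export crashes", Priority.SHOWSTOPPER, True),
    Comment("onboarding unclear", Priority.IMPORTANT, True),
    Comment("dislikes the color", Priority.NICE_TO_HAVE, True),
    Comment("wants enterprise login", Priority.HIGH, False),
]
for issue, priority, mentions in triage(feedback):
    print(f"{priority.name:<12} x{mentions}  {issue}")
```

The routine only orders the material. Deciding what the overall pattern means for the next iteration remains an act of interpretation and is not something the sketch attempts.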

Common Failure Modes of Evaluation

Common failure modes in evaluation are Analysis Paralysis, Tainted Identity, Self-Deception, Using Feedback for Other Purposes, and Not Achieving Standards.

Analysis Paralysis

It is easy to get lost in evaluation — you can always ask more people, ask different questions, change things slightly, and evaluate again. But the value of evaluation lies in informing the next step — and that step is meant to lead toward eventual release.

At some point, evaluation also has to be done by the field, outside your control (see Project Release on page 233), because that is where the most valuable feedback ultimately comes from.

Tainted Identity

A creative project should be important and meaningful. But identity issues can easily confuse feedback.

Merged Identity: When creators identify themselves with their work, objectivity is lost («I am my work.»). Because the project and the person become one, even well-formulated feedback on the work feels like an attack on the person. That makes improving the project unlikely and often ruins the journey as well. So while the work should be taken seriously, some distance from one’s identity is necessary. The work is just one step in a longer creative path. Seen this way, even a misstep is still just that: something to correct and improve over time.
Identity Contamination: Feedback can also spill over and begin shaping identity («Other people’s reactions to my work define who I think I am.»). Interpretative feedback risks contaminating identity, e.g., opinions, meanings, reactions, or identity-shaping narratives («People love you!», «You disappointed fans!»). The work turns into a judgment on who you are. No feedback should be allowed to do that, because identity rests on more than any one artifact. Operational feedback is safer, more objective, and more useful for shaping the quality of the work rather than the identity of the creator, e.g., download numbers, sales, visitor numbers, usage patterns.

Self-Deception

Capable, creative people are also capable of sophisticated self-deception. They are just very good at using their intelligence and creativity against reality.

Because friction with reality can force change, or even break an idea or project, some people delay reality contact. They continue developing the work without correction, which increases doubt and fear, which further delays reality contact, and so on. The longer the delay, the more the work drifts away from something potentially useful, and the more painful and costly it becomes to change later.²⁷

Similarly, evaluations can be biased to produce only positive feedback: by asking the wrong questions, asking the wrong people, or creating unrealistic conditions. The result is positive feedback during development and failure upon release. That failure is then usually rationalized away, because it was «unexpected», the person «couldn’t have known», or worse, «the target group is at fault».²⁸

All these biases do is prevent the project from becoming better and more valuable to the target audience. A special case is when people assume that their ideas or projects are worse than they actually are. This assumption is often untested or tested only under heavily biased conditions, such as asking people who find flaws in everything. The consequence is that they deny themselves and others a potentially strong project.

Using Feedback for Other Purposes

Feedback is decision input. Its purpose is to improve current and future work.

However, some people want to use feedback for social validation or psychological comfort. These may be legitimate needs, but mixing them with feedback destroys its value.

A strict separation between decision input and social support — ideally through different people — keeps both intact and lets each serve its function cleanly.

Not Achieving Standards

Evaluation shows whether standards were met, missed, or exceeded. That alone is useful information, because it allows you to diagnose why and adapt the craft or the standards later.

The kind of standards matters if discouragement is to be avoided. As a beginner, it makes little sense to compare yourself with mastery — the distance is too great and mastery will look unreachable. A more useful comparison is with your own earlier performance, oriented toward improvement and toward the standards of the domain. That helps you see progress without becoming discouraged.

Social comparison is only necessary if you want to work professionally. Then the question becomes: «Are you good enough relative to the competition?» But depending on the target audience, that may not matter. If you create for family and friends, then their standards and tastes are what count.

Underflow, Optimal Flow, and Overflow

Both underflow and overflow can happen easily in evaluation — from avoiding contact with reality to getting lost in too much evaluation data (Table 20).

Aspect	Underflow	Optimal Flow	Overflow
Willingness to Improve	seeing success and failure as judgment on the person	evaluation results show what is needed to grow and improve	staying in a creative endeavor with little chance of improvement
Resistance Criteria	vague criteria, no clear indication of progress or what to change	clear criteria and standards that guide iterations	overly specific criteria that do not fit the work
Repeated Reality Contact	avoiding reality contact to «protect» ideas or the project, drifting away from usefulness over time	evaluations before, during, and after realization, so the project develops in the right direction	evaluation so frequent that solutions do not stabilize and growth phases disappear
Valid Feedback	no or low-quality criteria, lots of noise, weak decision support	useful, directionally correct feedback, with awareness of its limits	discounting feedback because the truth standard is set too high
Understanding a Situation	stereotypes, unchecked assumptions, overconfidence	realistic understanding; looking at what is happening	getting lost in analysis, refusal to generalize at all
Value of a Solution	assuming value, confusing «nice» with «valuable»	clear criteria for what the work should achieve, cleanly evaluated	perfectionism, not allowing the artifact to stand on its own
Feedback from a Single Source	using whoever is convenient rather than useful, following counsel uncritically	deliberately choosing a source that can give actionable feedback	discounting well-argued input, impossible standards, too much weight
Dealing with Feedback	using only convenient feedback	openness to feedback and deliberate decisions	conflicting feedback without integration or prioritization
Common Failure Modes of Evaluation	Self-Deception, Using Feedback for Other Purposes, ignoring feedback	using evaluation as decision support for improving the work	Analysis Paralysis, Tainted Identity, Not Achieving Standards

Table 20: Project Evaluation — underflow, optimal flow, and overflow.

Evaluations are how creative projects maintain contact with reality and develop in the right direction.

Done well, they keep the work on track.

Where it fits into your current creative process:

Update your ▯ Creative System Map.
Mark whether it constrains output, i.e., is a potential candidate for an ▯ Integration Worksheet trial.

Endnotes

A variant of Thomas Henry Huxley’s «The great tragedy of science — the slaying of a beautiful hypothesis by an ugly fact.» ↩
Or, put differently, evaluation is the gym for ideas. They get stronger through hard work — not because that is comfortable, but because it is not. ↩
Compare «regression to the mean». ↩
For example, no elaborate explanations for why something should have worked but did not because, supposedly, the customers were the problem. No misreading of one’s own ability, misidentification of the problem, overestimation of one’s own work or of the current value of the product, avoidance of the market’s verdict, or confusion of intention with execution. ↩
Well put by Richard Bradley, «… because I was inclined to believe it, I abandoned my critical judgment. I lowered my guard. The lesson I learned: One must be most critical, in the best sense of that word, about what one is already inclined to believe.» ↩
Beautifully put by Randy Pausch, «When you see yourself doing something badly and nobody’s bothering to tell you anymore, that’s a bad place to be. You may not want to hear it, but your critics are often the ones telling you they still love you and care about you, and want to make you better.» ↩
Note that these iterations can also take the form of sketches, simulations, or prototypes. There is no need to build different bridge designs and test them under full load just to see whether they hold. The same holds for apps: sketches, prototypes, or Wizard-of-Oz tests may already provide sufficient contact with reality. ↩
Behavior is usually more truthful than self-description. At this stage, asking more and more people quickly leads to diminishing returns. Doing another iteration based on the feedback and then evaluating that version usually provides more actionable feedback. ↩
Or, more socially acceptable: «Drink your own Champagne». The idea is that if you develop or produce something, using it yourself, if applicable, makes it easier to improve the work. However, there is a risk of overgeneralization, because the creators understand the artifact much better than the average user. It is therefore easy to overlook «trivial» issues. For example, engineers who developed a rotary clothes dryer included the instruction to pull the cord at a 30° angle in order to expand the dryer. Less formally educated customers neither understood 30° nor considered it relevant, so they pulled the cord straight up, which was also the easiest way to do it — the affordance worked against the instruction. The result was that the dryer was ripped out of its socket and gravity did what gravity does. Video recordings of these accidents were needed to convince the engineers that «pull at a 30° angle» was not «trivial». As usual, changing the environment is the better solution — change the mechanism so that the pulling angle no longer matters. ↩
As Charlie Brower put it, «A new idea is delicate. It can be killed by a sneer or a yawn; it can be stabbed to death by a quip, and worried to death by a frown on the right man’s brow.» ↩
For example, suppose you release a fitness app and the target audience exercises more. Was that due to the app? Perhaps. But perhaps it was because the weather improved, or you tested it in spring and people start to move more going into summer, or because there was a general trend toward more exercise, or because they also used something else that explains the effect. To make the justified claim that it was the app, you would need a randomized controlled trial (RCT). People would be recruited and randomly assigned either to a condition with the intervention (the fitness app, experimental group) or to one without it (usually a placebo or informational app, control group). Group sizes must be large enough to detect effects statistically, the trial must run long enough to rule out newness effects, participants must not know which group they are in, manipulation checks must be conducted, measurements must be valid, and much more. Doing RCTs correctly requires methodological and domain expertise far beyond the scope of this book. ↩
Readers from empirical disciplines, especially psychology, are probably a bit outraged now. Yes, randomized controlled trials are the only proper way to examine causal effects, and if, for example, a fitness app is advertised with «leads to improved health», I would want to see that well-conducted study. But for most projects, that is simply too much — too complex, too hard to do correctly, and too costly in time and resources. You could absolutely get better data, but the effort required would often make them useless for improving the work. Unless it is for a scientific publication, satisficing is often enough. ↩
As Wernher von Braun put it, «One good test is worth a thousand expert opinions.» ↩
Professions differ strongly in their error culture. A common example is surgeons versus pilots: how likely they are to believe they can work well while tired, and how willing they are to accept correction from others. Aviation safety is impressive here, but then again, pilots cannot simply bury their mistakes, and they suffer from them as well. ↩
Or, in another context, a partner might say that she wants a «nice» evening, but be vague about what that actually means. Identifying the requirements — e.g., what counts as good food, quiet or action, and so on — provides guidance that can then be refined through iterations. ↩
For example, a social networking site built on attention-economy principles can shift attention and behavior toward spending more time online, weaken existing social routines by displacing face-to-face contact, create dependency as interaction becomes increasingly platform-mediated, and trigger further escalation when competitors adopt the same mechanisms. ↩
Depending on the domain, more specialized methods may be available, e.g., eye tracking, physiological measures (heart rate, skin conductivity), physical and digital traces (signs of intensive use, log files), and so on. ↩
A main problem is that language is much easier to control — people embellish, obfuscate, and lie, including for apparently good reasons such as not wanting to hurt another person’s feelings. There are also many things that are hard to notice and verbalize, for example which contextual factors shape behavior. ↩
Ask one person to describe what another person is doing, and you will often get interpretations of the behavior rather than objective descriptions of the behavior itself. ↩
But even combined data never speak for themselves. They were generated in a specific way. Akin to throwing a net into the ocean — where you throw it and how wide the mesh is determine what you catch. Beautiful metaphor by Sir Arthur Eddington: «Let us suppose that an ichthyologist is exploring the life of the ocean. He casts a net into the water and brings up a fishy assortment. Surveying his catch, he proceeds in the usual manner of a scientist to systematise what it reveals. He arrives at two generalisations: (1) No sea-creature is less than two inches long. (2) All sea-creatures have gills. These are both true of his catch, and he assumes tentatively that they will remain true however often he repeats it.» Again, this is not an argument against evaluation, but an argument for care in conducting evaluations and interpreting results. ↩
As a practical example, a software project in Java failed in a hospital because the IT department prohibited Java applications for security reasons. Such organizational issues are often underestimated, because they are hard to change and can affect the legitimacy of organizations. ↩
The connection between style and substance was beautifully expressed by either Stephen Fry or Christopher Hitchens, «A true thing, badly expressed, is a lie.» ↩
As Roger Scruton put it, when he argued for the fundamental importance of beauty: «Beauty isn’t this casual thing that you might choose to be interested in or not — just as someone might choose to be interested in chocolate or something! It is, if you like, the thing that attaches us to the world in the first place. It is the thing that is telling us: ‹You belong here.›» ↩
That beauty compels beyond rationality makes it dangerous in science and engineering, because it can distract from the ultimate evaluation standard — reality. In statics, beauty does not matter. At the same time, beauty is not irrelevant in architecture, because it influences what happens to artifacts (e.g., graffiti, vandalism) and what artifacts evoke in people (e.g., whether they motivate us to greatness or numb us). As someone put it, «Once you understand that beauty motivates men to greatness … modern architecture makes a lot more sense.» ↩
The best time to buy a used home trainer cheaply is a few weeks after New Year. ↩
Alley (1996). ↩
As Steve Krug put it, «Testing with one user early in the project is better than testing with 50 near the end.» ↩
For example, Ashleigh Brilliant’s «My play was a complete success. The audience was a failure.» or PEBKAC (problem exists between keyboard and chair). As a practical example, a museum exhibition was created to explain nanotechnology, but it left most visitors confused about the topic. The curator’s reaction: «Great, now they understand the complexity of nanotechnology.» ↩
