cart;horse: Most startups fail, this one did too. Tried too much with too little, with a little too much ego.

Coordinal Research’s goal was to build an automated safety research platform. A researcher writes “Replicate X result from paper Y with tweak Z” and the system provisions a sandboxed compute environment, gathers context, writes code, runs experiments, and returns a research report plus oversight trail.

The bet was that AI-accelerated research is coming whether we like it or not, and having this differentially happen for safety work faster is good. The bet didn’t pay off — at least not in the form I tried to make it.

The main content is the condensed version. The full chronology, the final push, and the technical artifacts I shipped are appendices for anyone who wants them. See also our Manifund page for the original pitch/ask.

The arc, in brief.

The startup. Fall 2024 to Fall 2025: out of MATS 6.0, through Catalyze, into cofounding with Jacques. Most of Q1–Q2 2025 was grant applications and demo building (no dice on grants). In April a funder I’d previously pitched came back via a coworking connection. After a carefully-crafted email, we closed $125K on an MFN SAFE. Incorporated Delaware C-corp, put on some MATS workshops, built more, joined the 50/50 accelerator. Split with Jacques in October, started with Leo in November, split with Leo in late January. Two splits and a lot of admin between them.

The startdown. Q1 2026: a final push on two things. (1) Ship the user-facing app at coordinal.org/app. (2) Demonstrate SOTA on RE-Bench. The RE-Bench work went well — full-suite normalized average from 0.547 to 1.624 over a month, ~$30k of compute, 6/7 tasks reliably producing real non-reward-hacked results with Sonnet 4. The app demo didn’t — a friend couldn’t figure out the interface, which clarified I was much further from a shareable product than I’d thought. Burned out, I decided to stop unless funded. Coefficient Giving eventually declined the $1M budget I’d sent.

The salvage. A working full-stack platform on AWS, an agent orchestration system with real observability, an RE-Bench eval pipeline, and a tour through the entire legal/admin layer of incorporating and unwinding a small tech org. Detailed writeup in the appendix. Could have been worse.

Lessons.

Counterfactual safety from outside is structurally hard. Most platform/tooling-shaped safety work gets built by default by the frontier labs and the big agent frameworks, faster than an outside org can ship equivalents. An early scaffold I had was eaten by Claude Code; my tmux+worktree manager (geewit) is now essentially a CC feature; various hooks and oversight tools have since shown up in agent frameworks and provider tooling. For-profit AI safety has its own issues: as everyone in the 50/50 AI safety stream stumbled into, most fundable for-profit safety work is security middleware. Non-profits have a complementary problem: there’s effectively one big funder, and your job is to convince them. The broader worry is structural: selection pressure across the ecosystem — toward research credentials and known/agreed-on bets on the non-profit side, toward middleware and product on the for-profit side — systematically under-produces the kind of counterfactual work I think the field most needs.

Forever fundraising, and never salient enough to stop. $125K felt like a lot. It isn’t: more than half goes to compute and contracting, and you can’t pay two Bay Area salaries from the rest for more than six months. Six-month grant cycles with no in-cycle contact, while burning $5k/mo of personal runway, are hostile. The maybe-interesting asymmetric lesson: small grants below the threshold of the work may be net-negative, because they keep you alive enough to keep trying without enough to actually do the thing. I figured out too late that for the work to succeed I need proper support and to not be thinking about fundraising for at least 6-12 months at a stretch. The ecosystem isn’t really set up to provide that, and I think it should be.

I should have played the game more. I thought BOTECs for alignment work were dumb — how would estimating differential x-risk reduction basis points mean anything? I figured smart funders would just get the obvious argument around reducing elicitation overhang. They didn’t, and ego stopped me from making up the numbers anyway. Probably should have.

I tried to do too much. Org-builder + research engineer + senior cloud/infra engineer is three jobs. Counterfactual cost was close to $1M/yr at industry rates; I was doing it for free, often alone, while fundraising. The conviction-vs-flexibility tradeoff cut against me too: I cared about the thing getting built, not the org, so I couldn’t easily pivot when feedback pushed that way. Maybe no org at all would have been better — independent funding, mild affiliation, much less admin. I should have set sharper milestone-shaped goals and ignored more sidequests (MATS workshops, oversight tools, etc.).

I built for myself, and didn’t show anyone. Standard mistake; doesn’t make it less true. My internal bar is so high that I rarely thought anything was ready to show, so scope creep + 100 self-generated P0 tickets meant I sat on the work. I should have hill-climbed on concrete, externally-legible outputs (e.g., the RE-Bench numbers I generated at the end) six to ten months earlier — way easier salience, fast feedback loops, the kind of thing fundraising actually responds to.

Cofounder fit: trust your gut. I ignored mine. “You need a cofounder” is partly a hedge for funders and VCs. On the founder side, given how often startups fail anyway, you spend a lot of time with cofounders who won’t work out. Both my splits were the right call; I should have made them faster, or trusted my “full body yes” calibration earlier and gone solo from the start. (I’m genuinely lucky both splits were with great people who operated in good faith — community norms made this a lot easier than it could have been.)

What’s Next.

So I’m winding down for now. The org will be in some hibernative state (I still may want to do something in the future, the AWS credits are still useful, the codebase could be returned to). The SAFE remains in place and if we reactivate the terms still hold. The METR thread eventually faded. We talked about collaboration, they pointed me at open roles, I applied and didn’t make it through the work trial. I’m doing some contract work building uplift and automation tooling, and am thinking about where and what is the best counterfactual use of my time and skills. It’s hard for me to not just full-send on what I think is the most important/ambitious/long-term thing, and right now I’m leaning towards automated philosophy or advocacy, so we’ll see where I end up…

I think I’m bearish on starting organizations right now given the funding landscape and what I think is necessary for an org to succeed and have a counterfactual impact, but I’d be happy to talk to anyone about how their situation may be different! Feel free to reach out for a chat about any of this. I’m particularly interested in if any of these things could point toward ecosystem-level improvements to make it more likely that others in similar positions succeed, or at least don’t waste too much of their precious time.


Appendix A: The Startup.

I had just finished MATS 6.0 in the Fall of 2024, was mildly working on some follow-up work, and decided to quit my MLE job, which wasn’t really aligned with directly addressing the technical problems. There had of course been talk about needing more startups and founders, and Catalyze was starting, so I applied. As part of the application, I built a weighted factor model. It was rough, but it pointed me in a direction I hadn’t really considered, one that I realized was severely neglected and tractable (in my eyes): building out capacity for research automation differentially pointed at safety sooner rather than later. Catalyze was a great place to slowly figure out what it would look like to build an org. There was an emphasis on cofounder matching, and I got to meet awesome people, learn from them, and start working with someone. We ended up splitting towards the end of January, and that’s when Jacques and I began working together.

For most of Q1 and Q2 2025 Jacques and I spent a lot of time figuring out what we wanted to do and how, splitting duties roughly along CEO/CTO lines, respectively. We spent a lot of time on grant applications. They say not to spend more than X hours; that is not reality. We worked on pitches for ARIA, OpenPhil’s TAIS (Technical AI Safety) RFP, SFF (Survival and Flourishing Fund), and Foresight. During this time I was also building out an MVP whose scope kept growing. People at EAG (Effective Altruism Global) Bay Area 2025 were interested and generally excited, but it wasn’t at a point to ‘Wow’. A funder who saw the demo saw some potential but suggested that I focus on making the interface much better (it was essentially a CLI with a rough pure JS ‘viewer’). I started spending time on that, as well as figuring out how to create an async demo I could share remotely (read: I started building a web app and figuring out what the cloud infrastructure would look like…). To be clear, this was a platform for autonomous safety research. As you might expect, this requires live compute for a ‘project’; it isn’t just a PostgreSQL database with a Next.js frontend. I needed to be able to allocate a full compute stack to run what was, at the time, a prototype agent framework with full read/write/execute permissions, custom tooling, internet access, filesystem management, and GPU access. You can see the issue here…

This was all burning personal runway, using my own money to spend on compute, and before Claude Code had been released or was any better than the scaffold I had built custom. In April, a co-working space connection reintroduced me to a funder who had originally passed during my Catalyze pitch (Jacques’ as well). The pitches weren’t great, and I understood that decision. Fortunately, with a carefully crafted email (spent about 3 hours on this one email), I was able to convince this funder that we were worthy of seed funding.

This was awesome! With some back and forth, plus some additional support from a few others, we were able to secure an initial investment of $125K. We took some time getting these funds to actually do something with; we still had massive uncertainty that wasn’t clarifying regarding whether to go non-profit or for-profit (see Jacques’s great post on this at the time). At some point I was tired of this and I just spent a day creating a Delaware C-Corp via Stripe Atlas. At the time, I was a bit frustrated with the non-profit AIS funding ecosystem, particularly around grant response times, and we were becoming a bit more concerned that scaling in the future would be difficult unless we had proper seed funding (that could potentially only come at a fast rate if we were a for-profit). Incorporated as 50/50 founders (I CTO, Jacques CEO), opened the bank account, received the funding via an MFN SAFE (Most Favored Nation Simple Agreement for Future Equity). This at least gave us some breathing room to pay for all of the tools we were using (at the time a lot of experimentation, close to $2k/month for each of us (see Jacques’ posts for more detail)) plus the compute costs (AWS EC2/S3/etc. alongside straight tokens).

With this money we felt some amount of comfort to build, but were frugal and conservative, because we had uncertainty about if we were going to pivot, Jacques’ ability to work as a Canadian citizen, and if we would need those funds fully for compute to push for the next fundraise. During the summer we put on some workshops on using Claude Code for MATS: I think these were generally not that successful, but it was a great learning experience.

We had decided to pursue the for-profit direction a bit more in August, building parallel MVPs (Jacques more toward uplift and Claude Code extensions, myself more toward full-stack research automation) and applying to accelerators and networking for VC intros. We were hoping to submit some results to Neel’s MATS stream application, but scope creep and failing to retrofit to the type of research needed for the work trial meant this failed. We started the 50/50 accelerator in September, alongside sprinting on a new product direction we felt could work very well as a for-profit startup (agent oversight products: at the time no one was using Claude’s hooks and we saw potentially high leverage in converting our internal tools to a product). 50/50 was great, but as we and everyone else in the AI safety ‘stream’ learned, you have to massively sacrifice on the safety vision if you want strong funding, or you have to just work on non-dilutive funding (e.g., fellowships, grants) until you have proof-of-concept and product-market fit to pivot out to venture support.

(I’m not going to characterize the cofounder splits too much, and I did share this with both of them before posting. You can assume standard reasons why such things don’t work out.)

In October Jacques and I split. Outside of the obvious reasons, this was tough — winding down a 50/50 incorporation cleanly takes time and care. We worked through it amicably.

I was fully set on only building the vision in my head at this point, and did not want to be distracted at all by for-profit incentives. I also met Leo around this time, and in November we started working together. I think we generally had alignment, and we spent a lot of time figuring out if we were on the same page about tons of stuff. We decided to go forward, and started talking with OpenPhil/Coefficient Giving in December. These discussions were promising but still slow.

Towards the end of January I started recognizing the same feeling I had had in October, and decided to part ways with Leo. Again this sucked. Leo’s awesome, but it just wasn’t clicking. I was also becoming highly uncertain about what was important, about the original theory of change, and about whether I wanted to continue working on the startup. This split was slightly easier admin-wise, but still somewhat costly in time and money (IP agreement language, fair compensation, etc.).

Appendix B: The Startdown.

At this point I decided to make a final push on two pieces. I had been extremely distracted from what I felt was the core output that I wanted: to demonstrate automated safety research was possible. So I put my full effort into 1) building out the full end-to-end app that anyone could use, and 2) running my backend scaffold and system on RE-Bench to demonstrate SOTA performance.

So I grinded this out, hoping to be able to use the outputs of this to convince Coefficient Giving to fund me. I spent most of my time working on the frontend-backend infrastructure, debugging full-stack deployments, setting up account management, research project management, monitoring and oversight interfaces, and getting the end-to-end workflow of an end-user in a reasonable state. I also decided it was time to spend money (AWS Quota limits stopped me from taking advantage of Activate Credits, so I had to just go with Lambda Labs out of pocket) on running and hill-climbing on RE-Bench. RE-Bench results were rapidly improving as I patched holes in my interface, figured out workarounds for their Inspect integrations, identified scaffold improvements, etc. I ended up running about 17 full-suite runs over the 7 tasks, with about 77 other single-task probes. Removing outliers, my full-suite average norm went from 0.547 to 1.624 over the course of a month, spending about $30k on compute during this period. I’m also happy with the reliability improvements over that time, with 6/7 tasks confirmed to be real results without technical issues or reward hacking every single run. The underlying model here was claude-sonnet-4-20250514, so I was fairly excited about these results.
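For readers unfamiliar with how a "full-suite average norm" comes about, here is a minimal sketch. It assumes the linear normalization described in the RE-Bench paper, where the task's starting solution maps to 0 and the reference solution maps to 1; the paper's exact per-task transforms may differ, and the function names and all numbers below are hypothetical illustrations, not my actual pipeline or results.

```python
from statistics import mean


def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Linearly rescale a raw task score so that the starting solution
    scores 0.0 and the reference solution scores 1.0.

    Works for lower-is-better metrics too, because the sign of
    (reference - starting) flips along with the numerator.
    """
    if reference == starting:
        raise ValueError("reference and starting scores must differ")
    return (raw - starting) / (reference - starting)


def suite_average(task_scores: dict[str, float]) -> float:
    """Unweighted mean of per-task normalized scores across the suite."""
    return mean(task_scores.values())


# Hypothetical two-task suite (made-up numbers, not real RE-Bench data).
scores = {
    # Higher-is-better metric: halfway between starting and reference.
    "task_a": normalized_score(raw=7.0, starting=5.0, reference=9.0),
    # Lower-is-better metric (e.g. a loss): the agent beat the reference.
    "task_b": normalized_score(raw=1.0, starting=4.0, reference=2.0),
}
print(scores, suite_average(scores))
```

Under this convention a suite average above 1.0 means the agent's solutions beat the reference solutions on average, which is why a move from roughly 0.5 to roughly 1.6 felt significant to me.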

Mid-February I hit a semi-self-imposed deadline to demo the user-facing app to a friend. It was generally working, you could copy-paste a google doc into a box and kick it off after logging in to https://coordinal.org/app/ . This would do a ton of cloud and project setup, provision a VM and container, mount any existing project files, and kick off the scaffold. The oversight was still a bit janky, but it generally worked.

They couldn’t figure out the interface, obviously. They didn’t know how to use it, there were too many settings to choose from, it was unclear what the scope of a project or run should be (even though I was intentionally trying to have it be “literally put anything in this box”), and I didn’t have a practiced pitch or tutorial ready to explain it. The demo confusion clarified I was much further from a shareable product demo than I’d thought.

I was burned out at this point and realized it was time to stop. This wasn’t working and wasn’t sustainable. I went for a walk and generally decided that I wouldn’t be working any more on this unless it was funded. It was clear I couldn’t keep doing this all alone with no money if I wanted to build this platform, and if I just wanted to do capability elicitation I should consider working with/for METR (we had started some discussions at this time). I shared the RE-Bench results above with them, but didn’t get a clear follow-up, and didn’t push — newer model cards already match or beat these, so there isn’t much point pursuing it further.

Over the next few weeks I started doing some soul-searching; the models were just getting better and any alpha I was bringing was getting more and more swallowed by Claude Code and related tools plus better models. I was pretty burned out and was not very interested in continuing working alone, so I just started looking for other existing orgs that I felt excited about, whose mission I could help/accelerate. Discussions with CG continued, they were interested in a budget for an org that would work on building out automated research replication capacity. I sent them an ambitious budget for $1M for a year, that roughly broke down to salaries for 3 engineers/researchers, ~2 FTE equiv. contractors, 0.5 FTE COO, and then $250k compute for building out PaperBench-style CI/CD workflows and infrastructure. They eventually said no (April), citing not enough research experience; I was tired enough and had already mentally moved on so let it be.

Appendix C: The Salvage.

Holy shit I learned a lot. It’s not all bad! Look at all of these things I’ve built, new skills I have, topics and ideas I can speak confidently about, ways of thinking that will contribute strongly to any future work:

  • Shipped a real full-stack web app with cloud infrastructure to the public internet. coordinal.org/app/ — Clerk-based auth, multi-tenant RBAC, real-time job monitoring, file browsing, goal-hierarchy views, benchmark analytics dashboards. Mostly vibe-coded (React/TypeScript/Vite/Tailwind frontend, Flask backend, WebSockets); the full codebase is a monorepo with 350k LoC (probably closer to 1M written and discarded across 18 months of agent-assisted iteration). I now actually know how to deploy something real behind CloudFront with auth, not just a static GitHub page. Warm EC2 instance pools with GPU fallback across p3/g4dn/g5 (1-5s warm starts vs. 60-90s cold-start provisioning), EFS-shared workspace storage, S3 for run artifacts and progress tracking, CloudFront + Route53 + SES + Secrets Manager. Pulumi Automation API for IaC, GitHub Actions OIDC for AWS auth (no long-lived creds anywhere), multi-architecture ECR builds. I can now estimate AWS Bedrock quotas, hit them, file the right increase requests, and route around the ones I can’t get.

  • Built a real agent orchestration platform with run observability. Hierarchical multi-agent system and depth-limited delegation. Four configurable continuation policies (single-shot, fixed-iteration, time-budget, goal-driven), plus custom message-history compaction with prompt-cache integration so long runs survive past the context window. Three-container Docker architecture with strict isolation between orchestrator, agent, and workspace; JJ (Jujutsu) integration for immutable time-travel snapshots at every agent decision boundary. Full OpenTelemetry/Logfire instrumentation with a hierarchical span structure (run → agent/goal → tool → checkpoint), automatic reasoning capture at every key decision point, and structured post-run reports with root-cause analysis. Every tool call, every delegation, every error is traceable.

  • A real RE-Bench eval pipeline, plus the results. End-to-end pipeline on top of the Inspect AI framework, using METR’s Inspect integrations for the tasks: normalized scoring per their paper’s formula, S3-based distributed progress tracking, per-task TOML configs, multi-replicate execution, score-log sanitization so the agent can’t binary-search the ground truth. Ran ~17 full-suite passes and ~77 single-task probes. By the end, 6/7 tasks were reliably producing real, non-reward-hacked results, with 3/7 significantly above the reference implementation. Bigger win: I now know what scaffold hardening actually looks like in practice, which meant eliminating ground-truth leakage, gracefully recovering from context-overflow API errors, locking down score-log permissions, and fixing scoring-script fallthrough bugs.

  • The non-technical scaffolding of a tech org. Incorporating a Delaware C-corp via Stripe Atlas, Google Workspace admin, backoffice and legal vendor management, payroll, state and federal taxes, IP agreements, cofounder agreements. Annoying, but I now know how the legal/admin layer of a tech org actually works and how to navigate two cofounder splits without destroying the relationships or the org.
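To make the four continuation policies from the orchestration bullet above a bit more concrete, here is a rough sketch of how such a policy check could be modeled. This is not the actual Coordinal code: every class, field, and function name is hypothetical, and the iteration cap on the goal-driven policy is my own assumption for the illustration.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContinuationPolicy(Enum):
    SINGLE_SHOT = "single-shot"
    FIXED_ITERATION = "fixed-iteration"
    TIME_BUDGET = "time-budget"
    GOAL_DRIVEN = "goal-driven"


@dataclass
class PolicyConfig:
    policy: ContinuationPolicy
    max_iterations: int = 1      # used by fixed-iteration and goal-driven
    time_budget_s: float = 0.0   # used by time-budget


@dataclass
class RunState:
    iterations_done: int
    started_at: float            # time.monotonic() captured at run start
    goal_met: bool = False


def should_continue(cfg: PolicyConfig, state: RunState,
                    now: Optional[float] = None) -> bool:
    """Decide whether the agent loop should run another iteration."""
    now = time.monotonic() if now is None else now
    if cfg.policy is ContinuationPolicy.SINGLE_SHOT:
        return state.iterations_done < 1
    if cfg.policy is ContinuationPolicy.FIXED_ITERATION:
        return state.iterations_done < cfg.max_iterations
    if cfg.policy is ContinuationPolicy.TIME_BUDGET:
        return (now - state.started_at) < cfg.time_budget_s
    if cfg.policy is ContinuationPolicy.GOAL_DRIVEN:
        # Assumption: keep a hard iteration cap so a goal that is never
        # satisfied cannot loop forever.
        return (not state.goal_met
                and state.iterations_done < cfg.max_iterations)
    raise ValueError(f"unknown policy: {cfg.policy}")
```

Passing `now` explicitly keeps the time-budget branch deterministic in tests; a real loop would simply call `should_continue(cfg, state)` before each iteration.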

Not to mention all of the mini-projects that failed but taught me something anyway. Pretty sweet, I think.