AI use within software development is no longer a fringe experiment — it is the default mode of operation. Stack Overflow’s 2025 Developer Survey found that 84% of developers are already using or actively planning to use AI tools in their workflow — up from 76% the year prior. Developers who write their code entirely without AI assistance are now a shrinking minority. Yet buried in that same survey is a more interesting finding: developer trust in AI tools is low. 46% of respondents said they actively distrust the accuracy of AI output, up from 31% the year before.
This is not inherently a bad thing. AI has transformed the way we work — we can cover much more ground without having to be deeply involved in every implementation detail. And it keeps improving. A well-prompted LLM can scaffold a service, write a test suite, or draft a migration in the time it would have taken a senior engineer to open the right file. We can stay at higher altitudes: architecture, tradeoffs, sequencing. We move faster and therefore ship more.
But this productivity gain comes with a cost…
The Discipline Problem
There are two failure modes that AI-assisted development quietly introduces.
The first is the obvious one: slop. If you are not disciplined about reviewing what the AI produces — really reading it, questioning it, verifying it — low-quality code seeps into your codebase. It looks right. It passes a surface-level review. But it carries hidden assumptions, misunderstands your domain, or solves a subtly different problem than the one you posed. Discipline around output review is the antidote, and the industry broadly understands this.
The second failure mode is more subtle: cognitive atrophy. The more we delegate implementation details to AI, the less we exercise the mental muscles that handle those details. When you never have to think through a concurrency primitive, a memory layout decision, or an edge case in your parsing logic — because the LLM handles it — you gradually lose the fluency to evaluate whether those choices are correct. Your ability to catch the slop erodes.
There is a third failure mode that sits underneath both of these: you can’t learn what you never encounter. When AI handles the unfamiliar parts of a task, you are robbed of the friction that would have forced you to understand them. The gaps in your knowledge don’t just persist — they become invisible. You stop knowing what you don’t know. And the more capable the AI and the less disciplined its user, the wider this blind spot grows.
Andrej Karpathy put it precisely: you can outsource your thinking, but never your understanding. The danger is not that AI thinks for you — it’s that you mistake the AI’s thinking for your own understanding, and eventually forget the difference.
What DevLog+ Is
DevLog+ is a self-hosted, single-user developer journal built to answer one question: what do I actually know, and what should I learn next?
It is not a note-taking app with AI features bolted on. It is a structured learning engine with two connected feedback loops. The first is a Learning Engine that reads your journal entries, synthesizes them into a living Knowledge Profile, and continuously maps your strengths, weak spots, and the frontier of what you’re actively learning versus what you should be learning next. The second is a Practice Engine that generates weekly micro-projects (currently in Go) calibrated to your actual skill level, to prevent the atrophy that comes from spending too long in the comfortable middle of your competence.
These two engines are connected through a shared feedback loop: everything you do — journal, quiz, build, read — feeds back into the profile, which refines what each engine produces next.
The Core Workflow
The workflow is deliberately simple:
- Journal — you write about what you worked on, what confused you, what clicked.
- Profile Update — an LLM pipeline reads your entries and updates your Knowledge Profile.
- Quiz — ten questions generated from the profile, graded by an LLM judge, feedback incorporated.
- Reading Recommendations — curated links targeting your weak spots and next-frontier topics, drawn from domains you trust.
- Micro-Project — a self-contained coding exercise, calibrated to your profile, with tasks that mix bugs, features, refactors, and optimizations.
- Feedback Loop — thumbs-up/down and free-text notes attached to anything the system surfaces, correcting the profile in real time.
The output is not a list of notes. It’s a structured, AI-maintained picture of your engineering mind.
The Knowledge Profile
The Knowledge Profile is the central artifact in DevLog+. It is read-only from your perspective — the system maintains it, not you — which is intentional. Self-assessments are notoriously unreliable. The profile is built from evidence: what you’ve journaled, how you’ve performed on quizzes, what feedback you’ve given. Every topic in the profile has a confidence score and a category:
- Demonstrated Strength — topics you’ve shown you understand well and consistently
- Weak Spot — areas where quiz performance, journal uncertainty, or your own feedback signals a gap
- Current Frontier — topics you’re actively encountering and building familiarity with
- Next Frontier — adjacent topics the system recommends you deliberately explore
- Recurring Theme — topics that appear repeatedly across your work, whether or not you’ve mastered them
- Unresolved — items the system isn’t confident it can classify, surfaced for your review
This taxonomy is the engine that drives everything else. The quiz doesn’t ask random questions — it targets weak spots and frontier topics. The micro-projects don’t just practice what you know — they’re calibrated to push you into the next frontier. The reading list doesn’t surface what’s popular — it surfaces what you need.
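To make that taxonomy concrete, here is a minimal sketch of what a single profile entry could look like as a Pydantic model. The class and field names are illustrative assumptions, not DevLog+’s actual schema:

```python
from enum import Enum
from pydantic import BaseModel, Field


class TopicCategory(str, Enum):
    DEMONSTRATED_STRENGTH = "demonstrated_strength"
    WEAK_SPOT = "weak_spot"
    CURRENT_FRONTIER = "current_frontier"
    NEXT_FRONTIER = "next_frontier"
    RECURRING_THEME = "recurring_theme"
    UNRESOLVED = "unresolved"


class ProfileTopic(BaseModel):
    """One entry in the Knowledge Profile, maintained by the system rather than the user."""
    name: str                                      # e.g. "Go channels and select"
    category: TopicCategory
    confidence: float = Field(ge=0.0, le=1.0)      # evidence-backed confidence score
    evidence: list[str] = Field(default_factory=list)  # journal entries, quiz answers, feedback
```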
Quizzes and the LLM Judge
Quizzes are free-text, not multiple choice. Multiple choice tests recognition; free-text tests understanding. Each weekly quiz contains ten questions drawn from the Knowledge Profile, and your answers are evaluated by a separate LLM judge that scores each response across three dimensions: correctness (full, partial, or incorrect), depth, and expressed confidence.
Those scores feed directly back into the profile update pipeline. A strong quiz answer is evidence that bolsters a topic’s confidence score. A weak one is evidence that surfaces it as a weak spot or flags it for the triage queue. The quiz is not just assessment — it’s a profile refinement mechanism.
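As a rough sketch of what the judge hands back and how it could nudge the profile, something like the following would capture those three dimensions. The types, field names, and the toy weighting are assumptions for illustration, not the project’s actual code:

```python
from enum import Enum
from pydantic import BaseModel, Field


class Correctness(str, Enum):
    FULL = "full"
    PARTIAL = "partial"
    INCORRECT = "incorrect"


class JudgeVerdict(BaseModel):
    """Structured score for one free-text answer, produced by the LLM judge."""
    question_id: str
    topic: str
    correctness: Correctness
    depth: int = Field(ge=1, le=5)                 # how thorough the reasoning was
    expressed_confidence: int = Field(ge=1, le=5)  # how sure the answer sounded
    rationale: str                                 # short justification, kept for review


def confidence_delta(v: JudgeVerdict) -> float:
    """Toy mapping from a verdict to an adjustment of the topic's confidence score."""
    base = {Correctness.FULL: 0.10, Correctness.PARTIAL: 0.02, Correctness.INCORRECT: -0.10}
    return base[v.correctness] * (v.depth / 5)
```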
Reading Recommendations
Reading recommendations are generated from profile gaps and growth areas, drawn from a curated allowlist of domains you trust. The system doesn’t just ask an LLM to suggest links and display them — it validates every URL concurrently with HTTP HEAD and GET requests before surfacing anything. LLM-hallucinated or stale URLs are silently dropped. What you see actually exists.
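For a sense of what that validation step looks like, here is a hedged sketch of concurrent URL checks with httpx and asyncio. The helper name, timeout, and HEAD-then-GET fallback are assumptions, not the exact implementation:

```python
import asyncio
import httpx


async def validate_urls(urls: list[str], timeout: float = 5.0) -> list[str]:
    """Return only the URLs that respond successfully; hallucinated or dead links are dropped."""
    async with httpx.AsyncClient(follow_redirects=True, timeout=timeout) as client:

        async def check(url: str) -> str | None:
            try:
                resp = await client.head(url)
                if resp.status_code >= 400:        # some servers reject HEAD; fall back to GET
                    resp = await client.get(url)
                return url if resp.status_code < 400 else None
            except httpx.HTTPError:
                return None

        results = await asyncio.gather(*(check(u) for u in urls))
    return [u for u in results if u is not None]
```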
Three recommendation types are generated: next frontier reads (to open up adjacent areas), weak spot reads (to shore up gaps), and deep dives (to go deeper in areas you’re strong but could go further).
Micro-Projects
Each week, the practice engine generates a new project designed around your current skill level. Each project is a self-contained codebase with starter code, a working test suite, and a set of tasks: some are bugs to fix, some are features to add, some are refactors or performance optimizations. The LLM generates the project, then a post-generation compile check runs via the CLI to verify the code actually compiles before the project is issued.
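A minimal sketch of what such a compile check could look like, assuming the generated project is written to a directory and the Go toolchain is on the PATH; the function is illustrative, not the actual pipeline code:

```python
import subprocess
from pathlib import Path


def compiles(project_dir: Path) -> bool:
    """Run `go build ./...` in the generated project and report whether it compiles."""
    result = subprocess.run(
        ["go", "build", "./..."],
        cwd=project_dir,
        capture_output=True,
        text=True,
        timeout=120,
    )
    if result.returncode != 0:
        # Compile errors are surfaced so the generation pipeline can retry or repair.
        print(result.stderr)
    return result.returncode == 0
```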
When you submit, the work is evaluated by an LLM grader before the next project is issued. The evaluation informs the next project’s difficulty calibration and feeds back into the Knowledge Profile.
The Stack
DevLog+ is built with technologies chosen for clarity, reliability, and long-term maintainability over trend-chasing.
Backend: Python 3.12 + FastAPI + SQLAlchemy (async)
FastAPI gives you automatic OpenAPI generation, first-class async support, and strong Pydantic integration for request/response validation. SQLAlchemy with asyncpg keeps database operations non-blocking across the board. The architecture is strictly layered — pipelines, services, prompts, models, and config are separated with enforced import boundaries, verified at test time via make test-arch. No circular dependencies, no ad-hoc cross-layer calls.
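To give a flavor of the stack, here is a minimal, self-contained sketch of an async endpoint in this style. The model, table, connection string, and route are invented for illustration and are not DevLog+’s actual API:

```python
from datetime import datetime

from fastapi import Depends, FastAPI
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class JournalEntry(Base):                        # illustrative model, not DevLog+'s schema
    __tablename__ = "journal_entries"
    id: Mapped[int] = mapped_column(primary_key=True)
    body: Mapped[str]
    created_at: Mapped[datetime]


engine = create_async_engine("postgresql+asyncpg://devlog:devlog@localhost/devlog")
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)
app = FastAPI()


async def get_session():
    async with SessionLocal() as session:        # one session per request, closed afterwards
        yield session


@app.get("/entries")
async def list_entries(session: AsyncSession = Depends(get_session)):
    # The query runs on asyncpg, so nothing in the request path blocks the event loop.
    result = await session.execute(select(JournalEntry).order_by(JournalEntry.created_at.desc()))
    return [{"id": e.id, "created_at": e.created_at, "body": e.body} for e in result.scalars()]
```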
Frontend: React 18 + TypeScript + Vite + Tailwind CSS
Standard modern frontend stack. The API client is generated from the OpenAPI schema (openapi-typescript) so frontend types stay in sync with the backend contract automatically. Vite provides fast HMR in development. A Prism mock server enables frontend-only development without a running backend.
Database: PostgreSQL 16 + pgvector
A deliberate choice to keep the data layer simple: one database, no separate vector store. pgvector extends Postgres with vector columns; DevLog+ stores 1536-dimension embeddings for semantic similarity, used for topic deduplication and related-topic lookups during profile updates. Alembic handles migrations uniformly. The tradeoff is less mature approximate nearest-neighbor indexing than a dedicated vector database, but for a single-user application the performance is entirely adequate and the operational simplicity is worth it.
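Here is a small sketch of the kind of related-topic lookup this enables, using pgvector’s SQLAlchemy integration. The table and column names are assumptions for illustration:

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy import select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Topic(Base):                               # illustrative model, not DevLog+'s schema
    __tablename__ = "topics"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    embedding = mapped_column(Vector(1536))      # one 1536-dimension embedding per topic


def related_topics_query(query_embedding: list[float], limit: int = 5):
    """Nearest topics by cosine distance, for deduplication and related-topic lookups."""
    return (
        select(Topic)
        .order_by(Topic.embedding.cosine_distance(query_embedding))
        .limit(limit)
    )
```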
Observability: Langfuse
Every LLM call in DevLog+ is traced through Langfuse — full prompt and completion history, token counts, cost per pipeline run, and latency breakdowns per node. This is not optional infrastructure; it is necessary infrastructure. Without it, debugging a bad profile update or a poorly calibrated project generation means staring at logs and guessing. With it, you can see exactly what context the LLM received, what it returned, and what it cost. Langfuse traces support feedback attachment, so pipeline outputs can be linked back to the quiz answers or journal entries that triggered them.
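A minimal sketch of how a pipeline node can be traced, assuming the decorator API from Langfuse’s Python SDK (v2-style import); the function below is illustrative, not the actual pipeline code:

```python
from langfuse.decorators import observe  # assumes the Langfuse Python SDK v2 decorator API


@observe()  # records this call as a trace/span: inputs, outputs, latency, nesting
def update_profile(entry_text: str) -> dict:
    # ... call the LLM and build the profile delta; the LLM call itself can be wrapped
    # (e.g. via Langfuse's OpenAI integration) so prompts, completions, token counts,
    # and cost land on the same trace.
    return {"topics_touched": [], "source": entry_text[:80]}
```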
Testing: pytest + pytest-bdd + Vitest + Stryker
The test strategy is layered to match the architecture: unit tests for services, integration tests against a real Postgres instance (no mocks), architecture tests that enforce the import DAG, and BDD scenario tests written in Gherkin that serve as living acceptance specifications. LLM pipeline nodes have separate evaluation scripts (make eval) that measure accuracy and latency with explicit cost tracking.
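As an illustration of what an architecture test can enforce, here is a toy pytest check that fails when a lower layer imports from a higher one. The layer ordering and package root are assumptions; this is not the actual make test-arch implementation:

```python
import ast
from pathlib import Path

# Assumed layer order: a module may import only from layers at or below its own.
LAYERS = ["config", "models", "prompts", "services", "pipelines"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def imported_modules(path: Path) -> list[str]:
    """Collect dotted module paths imported by a source file."""
    tree = ast.parse(path.read_text())
    modules = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return modules


def test_import_dag():
    src = Path("app")                            # hypothetical package root
    for layer in LAYERS:
        for path in (src / layer).rglob("*.py"):
            for module in imported_modules(path):
                for other, rank in RANK.items():
                    if module.startswith(f"app.{other}") and rank > RANK[layer]:
                        raise AssertionError(f"{path} ({layer}) imports higher layer {other}")
```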
OpenRouter: One API, Every Model
All LLM calls in DevLog+ route through OpenRouter, not directly through Anthropic or any other provider’s API.
OpenRouter is a unified LLM routing layer that gives you access to dozens of models — Claude, GPT-4o, Gemini, Mistral, Llama, and others — through a single OpenAI-compatible API. You authenticate once with an OpenRouter key and switch models by changing a string in your config.
The reason to use it for a project like this is straightforward: model flexibility without code changes. Each of DevLog+’s seven pipelines can be configured to use a different model via environment variables (LLM_MODEL_QUIZ_GENERATION, LLM_MODEL_PROJECT_GENERATION, etc.). If you want to experiment with a cheaper model for topic extraction and a stronger one for project evaluation, that’s a one-line config change. If a new model releases and you want to benchmark it against your current setup, there’s no integration work — you change an env var and run the evaluation scripts.
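Because OpenRouter speaks the OpenAI wire format, the integration can be as small as pointing the standard OpenAI client at a different base URL. A hedged sketch, with an example model string and an assumed OPENROUTER_API_KEY variable:

```python
import os

from openai import OpenAI

# One client for every pipeline; only the model string differs per pipeline.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


def generate_quiz(prompt: str) -> str:
    response = client.chat.completions.create(
        # Falls back to an example model id; swap it by changing the env var, not the code.
        model=os.environ.get("LLM_MODEL_QUIZ_GENERATION", "anthropic/claude-3.5-sonnet"),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```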
For a self-hosted personal tool where cost and model selection matter more than enterprise SLA guarantees, OpenRouter is the right abstraction level.
Conclusion
DevLog+ is, at its core, a bet that the most valuable thing a developer can do in the age of AI is stay honest about what they actually know. AI gives us leverage, but leverage amplifies both competence and incompetence. The engineer who knows their weak spots and works systematically to close them gets more from AI assistance.
The system won’t make you a better engineer automatically. Writing in the journal still requires reflection. Completing the quizzes still requires effort. Working through the micro-projects still requires you to actually write code. What DevLog+ does is reduce the activation energy for all of that — it removes the coordination overhead of figuring out what to study and provides a feedback loop that self-corrects over time.
If you spend any meaningful portion of your working life writing software with AI assistance, the question worth sitting with is: when the AI is not available, what can you do? DevLog+ is my answer to making sure that question has a good answer.
GitHub: https://github.com/Ullauri/devlogplus
Further Reading
On AI and Developer Productivity
- Stack Overflow Developer Survey 2025 — AI Section — The definitive annual survey on AI tool adoption, trust trends, and developer sentiment across 49,000+ respondents
- Developers remain willing but reluctant to use AI (Stack Overflow Blog, 2025) — Summary and analysis of the 2025 survey findings, including the growing trust gap
On Cognitive Offloading and Learning
- Karpathy Skills: The LLM Coding Manifesto (BrightCoding, April 2026) — Karpathy’s current framing of agentic engineering: developers as orchestrators who need deep enough understanding to catch what agents get wrong
- The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading (arXiv, 2026) — Peer-reviewed research showing that delegating coding tasks to AI produces working code but erodes conceptual understanding; sustained AI use risks degrading the very skills the productivity gains depend on
On Deliberate Practice
- Deliberate Practice for Software Engineers (dasroot.net, January 2026) — How the core principles of deliberate practice — specific goals, targeted feedback, and structured repetition — apply directly to software engineers working alongside AI tools
On the Stack
- pgvector: Open-source vector similarity search for Postgres — The extension powering semantic search in DevLog+’s Knowledge Profile
- Langfuse: Open-source LLM engineering platform — LLM observability, prompt management, and evaluation — essential infrastructure for any serious LLM application
- OpenRouter API Documentation — The unified routing layer used to manage all LLM calls in DevLog+
- FastAPI Documentation — The Python web framework powering the DevLog+ backend