Go's Scheduler: G, M, P

Go runs your goroutines on operating-system threads, but it does the scheduling itself, in user space, without asking the kernel which goroutine runs next. Three things make that work: G, M, and P.

The three entities

A G is a goroutine: the unit of work, with its own small stack that starts at about 2 KB and grows as needed. You can have millions of them. An M is a machine, which is Go’s name for an OS thread. The kernel schedules M’s onto CPU cores and knows only about threads; it has no idea goroutines exist. A P is a processor: a scheduling context, and the permission to run Go code. There are exactly GOMAXPROCS of them.

The arrangement is M:N. Many goroutines multiplex onto a smaller set of threads, through a fixed number of processors. The kernel schedules M onto cores, the runtime schedules G onto M, and an M can run Go code only while it holds a P.

What a P actually is

A P is not the thread and not the core. It is a bundle of per-P runtime state plus the right to execute Go code. Two things inside it matter most. The first is a local run queue: the P’s own list of runnable goroutines, a 256-slot ring buffer. The thread holding the P pops its next goroutine from this queue without taking a lock. The only other parties that ever touch the queue are idle P’s stealing work, and they coordinate through atomic operations instead of a mutex. The second is a per-P allocator cache, the mcache: a stash of memory spans that lets a goroutine allocate small objects without taking a global allocator lock.
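To make the ring buffer concrete, here is a minimal single-threaded sketch of a 256-slot local run queue. The names localRunQueue, put, get, and g are hypothetical, chosen for illustration. The real runtime uses atomic operations on the head and tail so that other P's can steal safely, and it spills to the global queue on overflow; this sketch omits both and simply reports when the queue is full.

```go
package main

import "fmt"

const runqSize = 256 // slot count taken from the text above

// g is a stand-in for a goroutine (hypothetical, for illustration only).
type g struct{ id int }

// localRunQueue is a fixed-size ring buffer. head and tail only ever grow;
// a slot index is the counter modulo runqSize.
type localRunQueue struct {
	head, tail uint32
	slots      [runqSize]*g
}

// put appends gp and reports false when all slots are occupied.
func (q *localRunQueue) put(gp *g) bool {
	if q.tail-q.head == runqSize {
		return false
	}
	q.slots[q.tail%runqSize] = gp
	q.tail++
	return true
}

// get removes and returns the oldest queued goroutine, or nil if empty.
func (q *localRunQueue) get() *g {
	if q.head == q.tail {
		return nil
	}
	gp := q.slots[q.head%runqSize]
	q.head++
	return gp
}

// len reports how many goroutines are currently queued.
func (q *localRunQueue) len() int { return int(q.tail - q.head) }

func main() {
	var q localRunQueue
	overflow := 0
	for i := 0; i < 300; i++ {
		if !q.put(&g{id: i}) {
			overflow++ // in this sketch the caller would use the global queue
		}
	}
	fmt.Println("queued locally:", q.len(), "overflowed:", overflow)
	fmt.Println("next to run: goroutine", q.get().id)
}
```

Offering 300 goroutines to one queue leaves 256 queued locally and 44 rejected, and the first one out is the first one in.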

The reason behind this is historical. Early Go had only G and M, and a single global run queue behind one lock. Every thread contended for that lock to get its next goroutine, which did not scale on multicore. The P was added to give each execution slot its own lock-free queue and allocator cache. That is the whole reason “running Go code” means “an M holding a P”: without a P, an M is a bare thread with no queue to pull work from and no cache to allocate from. The P is what turns a thread into a thread that can run Go efficiently.

The per-P state exists to make the running path fast, and it is deliberately not tied to goroutine identity. That decoupling is what makes goroutine migration and work-stealing cheap (discussed below).

GOMAXPROCS

GOMAXPROCS is the number of P slots the runtime creates. That is the entire mapping, and a few things follow from it.

It caps parallelism, not concurrency. With GOMAXPROCS=4 you can have 100,000 goroutines in flight, but at any instant at most 4 of them are executing Go code. Concurrency is how many goroutines are in progress. Parallelism is how many run at the same instant. P is the parallelism knob.

It does not cap the thread count. A thread blocked in a syscall holds no P, so a program doing heavy blocking I/O can have many more live M than GOMAXPROCS.

The netpoller

Most blocking in real programs is network I/O, and Go handles it without blocking a thread at all.

The sockets in Go’s net package are non-blocking underneath, and each one is registered with the netpoller, a thin wrapper over the OS event-notification mechanism: epoll on Linux, kqueue on BSD and macOS, IOCP on Windows. When a goroutine makes a network call and the data is not ready, the runtime parks the goroutine (marks it waiting) on that file descriptor. The goroutine detaches from its M, which frees the thread and its P to pick up another runnable goroutine right away. The thread never blocks. One OS thread can serve thousands of goroutines waiting on the network.

When the kernel signals that the descriptor is ready, the netpoller marks the goroutine runnable again, and the scheduler resumes it on some M and P, at the exact point where it parked. For the duration of the wait the goroutine holds neither a thread nor a core. It is a small parked structure costing a few KB. The runtime does the parking and waking. The kernel only delivers the readiness signal.
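From the program's side, all of this looks like an ordinary blocking Read. The sketch below, which assumes loopback TCP is available, opens a number of connections, leaves one goroutine blocked in Read on each, and then wakes them all by writing a byte. wakeParkedReaders is a hypothetical helper; the net calls are standard library.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"sync/atomic"
)

// wakeParkedReaders opens n loopback TCP connections, blocks one goroutine
// in Read on the server side of each, then writes one byte per connection.
// It returns how many readers received their byte.
func wakeParkedReaders(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: let the OS choose
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var wg sync.WaitGroup
	var woken int64
	clients := make([]net.Conn, 0, n)
	defer func() {
		for _, c := range clients {
			c.Close()
		}
	}()

	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		clients = append(clients, c)
		s, err := ln.Accept()
		if err != nil {
			return 0, err
		}
		wg.Add(1)
		go func(s net.Conn) {
			defer wg.Done()
			defer s.Close()
			buf := make([]byte, 1)
			// Looks blocking; the goroutine waits here until data arrives.
			if _, err := s.Read(buf); err == nil {
				atomic.AddInt64(&woken, 1)
			}
		}(s)
	}

	// Every reader is now waiting (or about to wait). Wake them all.
	for _, c := range clients {
		if _, err := c.Write([]byte{'x'}); err != nil {
			return 0, err
		}
	}
	wg.Wait()
	return int(atomic.LoadInt64(&woken)), nil
}

func main() {
	n, err := wakeParkedReaders(100)
	if err != nil {
		fmt.Println("loopback networking unavailable:", err)
		return
	}
	fmt.Println("readers parked and then woken:", n)
}
```

Nothing in the reader goroutine mentions epoll, callbacks, or readiness. The code is written as if Read blocks, and the runtime supplies the rest.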

Blocking syscalls

Some calls cannot be made non-blocking: certain file I/O, and calls into C through cgo. Here the thread genuinely blocks in the kernel, and the netpoller trick does not apply.

So the M blocks, and that thread is stuck until the call returns. But the P does not wait with it. A background monitor thread, sysmon, together with the handoff logic, detaches the P from the blocked M and hands it to another M, starting a fresh thread if none is idle. The P keeps moving: its queued goroutines keep running on the new thread while the original thread sits in the syscall. When the call returns, the now-unblocked M tries to acquire a P so it can finish its goroutine. If none is free, the goroutine goes onto the global run queue and the M goes to sleep in a pool of idle threads, ready to be reused.
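The handoff can be observed with a small experiment, sketched here for Unix-like systems only (it uses syscall.Pipe with the Unix signature). The program limits itself to one P, blocks one goroutine in a raw read on an empty pipe, and checks that the calling goroutine still makes progress. progressDuringSyscall is a hypothetical helper, and the experiment illustrates the behavior without proving anything about the mechanism behind it.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// progressDuringSyscall restricts the program to a single P, blocks a
// goroutine in a raw read(2) on an empty pipe, and counts how many loop
// iterations the calling goroutine completes in the meantime.
func progressDuringSyscall(iterations int) (int, error) {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	entering := make(chan struct{})
	returned := make(chan error, 1)
	go func() {
		close(entering)
		buf := make([]byte, 1)
		// A raw, blocking read: the thread running this goroutine waits
		// in the kernel until a byte is written to the pipe.
		_, err := syscall.Read(fds[0], buf)
		returned <- err
	}()
	<-entering

	done := 0
	for i := 0; i < iterations; i++ {
		runtime.Gosched() // give the reader every chance to enter its syscall
		done++
	}

	// Unblock the reader and collect its result.
	if _, err := syscall.Write(fds[1], []byte{1}); err != nil {
		return done, err
	}
	return done, <-returned
}

func main() {
	n, err := progressDuringSyscall(1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("iterations completed with one P while a thread was blocked:", n)
}
```

With only one P, the loop could not run to completion if the blocked thread kept that P for the duration of the read.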

This is why the P is the scarce, contended resource and the real limit on parallelism. Freezing a P’s whole queue of runnable goroutines because one thread is waiting on the kernel would be a waste.

Work-stealing

Per-P run queues take the lock off the common path, but they drift apart. One P can drain its queue while another still has a backlog, and a P with nothing left could sit idle next to a pile of runnable work.

When a P’s local queue empties, it does not go to sleep right away. Its thread looks for work elsewhere, in a fixed order: it checks the global run queue (the slow, locked path used for overflow and balancing), then polls the netpoller for goroutines whose I/O has completed, and only then does it steal, picking another P at random and taking half of that P’s queue.

Taking half, rather than one goroutine or the whole queue, is what makes it converge. If one P holds all the work, the first steal splits it in two, the next splits each half again, and the queues even out in a few hops. No central dispatcher hands work around. Each idle P rebalances the system itself, and only the rare trip to the global queue takes a lock.
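A toy simulation shows the convergence. It tracks only queue lengths, and it makes one simplifying assumption that differs from the description above: each idle P steals from the currently longest queue instead of a random victim, which keeps the result deterministic. rebalance and longest are hypothetical names.

```go
package main

import "fmt"

// longest returns the index of the longest queue (the first one on ties).
func longest(queues []int) int {
	best := 0
	for i, n := range queues {
		if n > queues[best] {
			best = i
		}
	}
	return best
}

// rebalance makes one pass over the P's. Every P whose queue is empty
// steals half (rounded up) of the longest queue. The input is not modified.
func rebalance(queues []int) []int {
	out := append([]int(nil), queues...)
	for thief := range out {
		if out[thief] != 0 {
			continue // this P still has local work and does not steal
		}
		victim := longest(out)
		n := out[victim] - out[victim]/2 // half, rounded up
		out[victim] -= n
		out[thief] += n
	}
	return out
}

func main() {
	// One P holds all the work; three are idle.
	fmt.Println(rebalance([]int{256, 0, 0, 0})) // prints [64 64 64 64]
}
```

Three steals are enough to level four queues, and no step required a view of the whole system beyond choosing a victim.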

The mcache across migration

A goroutine can resume on a different P than the one it started on. That raises a question about the allocator cache: if the mcache is per-P, what happens to it when a goroutine moves? Nothing, because the mcache belongs to the P, not the goroutine.

The mcache is a per-P free list of unused memory spans, a stash for handing out small allocations without a lock. It does not hold allocated objects. Once allocated, an object is a normal heap object owned by the garbage collector, and a heap pointer is independent of which P allocated it. So when a goroutine moves from P1 to P2, its existing objects stay valid, and it borrows P2’s mcache for any new allocations. Nothing migrates or flushes.

The lock-free guarantee survives because it is a property of the P. A P is attached to one M at a time, so no other thread touches that P’s mcache. A goroutine runs on one P at a time, so a migrating goroutine only ever borrows one mcache, never two at once. The spans a goroutine half-used on P1 stay with P1 and serve whatever runs there next. Nothing is stranded.

There is one cache per P, so the count is GOMAXPROCS no matter how many goroutines exist. The cache scales with the number of workers, which is small and fixed, instead of the amount of work, which is unbounded.

Function color

This model is why Go has no “function color” problem. In Python’s asyncio, an async function is a different color from a sync one: you must never block the event-loop thread, so blocking and non-blocking code stay rigorously separated, and async-ness spreads up the call stack. (The term comes from Bob Nystrom’s “What Color is Your Function?”)

Go has no such split. A goroutine writes ordinary, blocking-looking code, like http.Get(...), and the runtime is the event loop, invisible and running across every core. It handles a blocking goroutine by parking it (network I/O, through the netpoller) or by detaching its P to another thread (a true syscall). You never mark a function as blocking, because the runtime makes blocking cheap wherever it happens.
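As a small illustration, the same ordinary function can be called directly or from many goroutines with no change to its signature. In this sketch slowSquare and squareAll are hypothetical names, and time.Sleep stands in for a blocking operation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slowSquare is ordinary blocking-looking code: it waits, then returns.
// Nothing in its signature marks it as asynchronous.
func slowSquare(n int) int {
	time.Sleep(10 * time.Millisecond) // stand-in for a blocking operation
	return n * n
}

// squareAll calls the same function from one goroutine per input.
func squareAll(inputs []int) []int {
	out := make([]int, len(inputs))
	var wg sync.WaitGroup
	for i, n := range inputs {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			out[i] = slowSquare(n) // each goroutine writes its own slot
		}(i, n)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(slowSquare(3))                 // called directly
	fmt.Println(squareAll([]int{1, 2, 3, 4})) // called concurrently
}
```

The caller decides whether the work runs concurrently. The function itself never has to know.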

What this leaves out

A few pieces sit just outside this tour, each worth its own treatment.

Async preemption (Go 1.14+): a signal-based mechanism that pauses a goroutine hogging its P (a tight CPU loop with no function calls or allocation points) so it cannot starve others sharing that P. Earlier Go preempted only at function-call safepoints, which such a loop could evade. This is the piece that makes fair sharing of the GOMAXPROCS slots hold up.
The allocator hierarchy: mcache (per-P, lock-free) to mcentral (per-size-class, locked) to mheap (global), organized by size classes.
GC interaction: how the scheduler coordinates with the garbage collector, including mcache flushing, stop-the-world phases, and assist.
sysmon: the full set of duties of the background monitor thread, which include preemption, polling the network, syscall handoff, and forcing GC.

In one paragraph

Go runs millions of cheap goroutines (G) on a small pool of OS threads (M), through exactly GOMAXPROCS processors (P). A P is a lock-free local run queue plus a per-P allocator cache: the permission and the machinery to run Go fast, which is why an M needs one. GOMAXPROCS caps parallelism (simultaneous execution), not concurrency (goroutines in flight) and not thread count. On network I/O the netpoller parks the goroutine and frees the thread entirely. On an unavoidable blocking syscall the thread blocks but the runtime hands its P to another thread so the work continues. Goroutines migrate between P’s freely, because per-P state is tied to the execution slot, not to goroutine identity, and allocated objects live on the GC-managed heap independent of any P. That whole design is why Go has no async/await function color: the runtime is an invisible, multicore event loop that makes blocking cheap everywhere.