The "PLAI" book explains how to do it trough a rewrite/desugaring transformation: [0]
The heart of [a CPS] transformation is [...] to turn every one-argument function, f, into one with an extra argument. This extra argument is the continuation, which represents the rest of the computation. The continuation is itself a function of one argument. This argument takes the value that would have been returned by f and passes it to the rest of the computation. f, instead of returning a value, instead passes the value it would have returned to its continuation.
CPS is a general transformation, which we can apply to any program. Because it’s a program transformation, we can think of it as a special kind of desugaring: in particular, instead of transforming programs from a larger language to a smaller one (as macros do), or from one language to entirely another (as compilers do), it transforms programs within the same language: from the full language to a more restricted version that obeys the [... CPS pattern]. As a result, we can reuse an evaluator for the full language to also evaluate programs in the CPS subset.
There's a more terse description on this site too [1]. And I have to agree with this:
[...] there are many methods for transforming into CPS [... but] learning CPS transformation remains difficult.
Also I remember stumbling upon this article [2] which kind of gave me an excuse to not pay too much attention to CPS :-) (...but not really :-p, mostly the subject got too hairy for me and I lost interest :-p). If I remember right, the article argues in favor of "delimited continuations" as opposed to implementing full general call/cc.
This doesn't go into detail of how to make stackfull continuations work (and indeed when it is necessary to make a new stack). If you're working in C, you need to use inline assembly for this. setjmp and longjmp are not stackfull.
CEK[1] machines are very clean way of implementing continuations or other features that can be implemented in terms of them in a language with no support for lambdas or closures. they are also reasonably fast too. i implemented a simple unlambda[2] interpreter in haskell using the cek/cesk style and it was ~20% faster and more memory efficient than the fastest c interpreter i could find.