Without RLE I get 196370 sprites/second (or 865M pixels/second);
with RLE I get 554059 sprites/second (or 2442M pixels/second).
That is on a single core of a 2.7 GHz Intel Core i5. So if one core is used for game logic and another for rendering, targeting 30 frames per second, one can draw around 18468 medium-sized sprites per frame.
That applies only to directly copied pixels. Blending reduces the number to 2114 sprites per frame. Alpha blending in general appears to be a rather costly operation, because I'm currently doing it properly, with gamma correction. With improper gamma (blending the encoded values directly), it is 5202 sprites/frame.
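The per-frame budget for directly copied pixels is just the measured throughput divided by the frame rate. A minimal sketch of that arithmetic follows; the helper names `sprites_per_frame` and `pixels_per_sprite` are made up for illustration, and the pixel figures are the rounded "M" values quoted above, so the per-sprite size is only approximate.

```c
#include <assert.h>

/* Hypothetical helper: sprites that fit into one frame at a given
   throughput (sprites/second) and target frame rate. Rounds down. */
static long long sprites_per_frame(long long sprites_per_second, int fps)
{
    assert(fps > 0);
    return sprites_per_second / fps;
}

/* Hypothetical helper: average sprite size implied by the two
   throughput figures (pixels/second and sprites/second). Rounds down. */
static long long pixels_per_sprite(long long pixels_per_second,
                                   long long sprites_per_second)
{
    assert(sprites_per_second > 0);
    return pixels_per_second / sprites_per_second;
}
```

With the RLE figure, `sprites_per_frame(554059, 30)` gives 18468; without RLE the same call on 196370 gives 6545. Both pixel rates work out to roughly 4400 pixels per sprite, i.e. something like a 66x66 sprite, which is presumably what "medium-sized" means here.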
So there is a large difference between

sr = (((dr-sr)*sa)>>8) + sr;
sg = (((dg-sg)*sa)>>8) + sg;
sb = (((db-sb)*sa)>>8) + sb;

and

st = pab_lut[255-sa];
dt = pab_lut[sa];
sr = piglut[int2cfp(st[sr]+dt[dr])];
sg = piglut[int2cfp(st[sg]+dt[dg])];
sb = piglut[int2cfp(st[sb]+dt[db])];
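To show why the two variants differ in both cost and result, here is a self-contained sketch of the same idea: decode to linear light, mix, re-encode through a table. It is not a reconstruction of `pab_lut`, `piglut` or `int2cfp`, whose definitions are not shown. Everything in it is an assumption: the names `encode_lut`, `init_encode_lut`, `blend_naive` and `blend_gamma2` are hypothetical, it uses a gamma of 2.0 (real sRGB is closer to 2.2) so that it stays integer-only, and `w` is the weight of the first argument, which need not match the convention for `sa` in the snippets above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: linear-light value 0..65025 -> encoded 0..255.
   Assumes gamma 2.0, so decoding is x*x and encoding is a rounded
   square root. */
static uint8_t encode_lut[255 * 255 + 1];
static int lut_ready = 0;

static void init_encode_lut(void)
{
    int r = 0;
    for (int v = 0; v <= 255 * 255; v++) {
        while (r * r + r < v)   /* advance r to the integer nearest sqrt(v) */
            r++;
        encode_lut[v] = (uint8_t)r;
    }
    lut_ready = 1;
}

/* Naive blend: interpolate the encoded values directly.
   w is the weight of a, in 0..255. */
static uint8_t blend_naive(uint8_t a, uint8_t b, uint8_t w)
{
    return (uint8_t)((a * w + b * (255 - w) + 127) / 255);
}

/* Gamma-aware blend: decode (square), mix in linear light,
   re-encode through the table. */
static uint8_t blend_gamma2(uint8_t a, uint8_t b, uint8_t w)
{
    assert(lut_ready);
    int lin = (a * a * w + b * b * (255 - w) + 127) / 255;
    return encode_lut[lin];
}
```

After `init_encode_lut()`, mixing white (255) and black (0) at `w = 128` gives 128 with `blend_naive` and 181 with `blend_gamma2`. The gamma-aware version needs an extra multiply and a dependent table lookup per channel, which is in line with the slowdown reported above.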