As far as I understand, this is a deep technical limitation, most likely tied to the number of bits used for some value in the feature's implementation. An outline shader typically needs an input texture with all of the geometry to be outlined rendered into it. A shader that supports multiple outline groups therefore probably needs an input texture with every group rendered into it, each with its own distinct value (as a simplified example, red-channel values of 0.1, 0.2, 0.3, and so on). A pixel in a memory-optimized texture can only hold so many distinct values, and other implementation details can introduce further problems. This is just speculation, of course, but it is probably close to the reason why raising the limit is not trivial.
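To make the idea above a bit more concrete, here is a rough Python sketch of how group values in a single-channel mask could work. Everything in it is an assumption for illustration: `CHANNEL_BITS`, `encode_group`, and `is_outline_pixel` are made-up names, the 8-bit channel is just one plausible choice, and this is not how the engine actually does it.

```python
# Hypothetical sketch: every name here is illustrative, not engine code.
CHANNEL_BITS = 8                  # assumed: one 8-bit channel per pixel
MAX_VALUES = 2 ** CHANNEL_BITS    # 256 distinct values fit in that channel
BACKGROUND = 0                    # reserved for "no outline group here"


def encode_group(group_index: int) -> int:
    """Map a 0-based outline group index to a non-zero channel value."""
    value = group_index + 1       # shift by one so 0 stays the background
    if value >= MAX_VALUES:
        raise ValueError("group does not fit in the channel")
    return value


def is_outline_pixel(mask, x, y):
    """A pixel is on an edge if any 4-neighbour holds a different value."""
    here = mask[y][x]
    height, width = len(mask), len(mask[0])
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height and mask[ny][nx] != here:
            return True
    return False


# Two groups "rendered" into a tiny 6x3 mask.
a, b = encode_group(0), encode_group(1)
mask = [
    [BACKGROUND, a, a, b, b, BACKGROUND],
    [BACKGROUND, a, a, b, b, BACKGROUND],
    [BACKGROUND] * 6,
]
```

In this naive scheme the channel width is a hard ceiling: once the values run out, no more groups can be told apart. A real implementation may well reserve some of those values or bits for other purposes, which would lower the ceiling further.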
Just doubling the limit would likely increase the memory footprint significantly, at least on mobile devices (and giving mobile a different limit would break a lot of developer expectations, which is bad). It would also complicate the code behind the scenes, which likely comes with a non-negligible performance cost.
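For a rough sense of scale on the memory point, here is some back-of-the-envelope arithmetic in Python. It assumes a full-screen mask at 1920x1080 and assumes that raising the limit would mean moving from an 8-bit to a 16-bit channel; both are my assumptions, and `mask_bytes` is a made-up helper, not anything from the engine.

```python
def mask_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size in bytes of an uncompressed single-channel mask texture."""
    return width * height * bits_per_pixel // 8


# Assumed resolution; real devices and render scales will differ.
width, height = 1920, 1080

eight_bit = mask_bytes(width, height, 8)     # 2,073,600 bytes
sixteen_bit = mask_bytes(width, height, 16)  # 4,147,200 bytes

print(f"8-bit mask:  {eight_bit / 1_000_000:.2f} MB")
print(f"16-bit mask: {sixteen_bit / 1_000_000:.2f} MB")
```

The absolute numbers are small on a desktop, but the texture doubles in size, and so does the bandwidth needed to write and sample it every frame, which matters more on mobile GPUs.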
If you look at the render passes in RenderDoc, you can see that on DX11 the effect generates an 8-bit, red-channel-only stencil texture with all of the outline groups rendered into it, using different R values according to depth from the camera. That lines up with the speculation above.