FastWrite/FastRead Panic Error
FastWrite Panic
The implementation of FastWrite consists of three steps: (1) scanning the object to calculate the required memory, (2) allocating memory, and (3) completing serialization. This approach eliminates the need for slice appending, reducing memory allocation and copying, resulting in significantly improved performance.
However, if there is a concurrency bug in the user code between step 1 (calculating the required memory) and step 3 (serialization), where another goroutine concurrently writes to the object being serialized (request parameters for the client or response values for the server), it can lead to Kitex reading an inconsistent object, resulting in errors or even panics.
Note:
- Although the error occurs in the generated
FastWriteField
method of Kitex, the root cause of the error lies in the user code, which is a typical concurrency issue. - These issues are concurrency bugs in the user code, not bugs in Kitex. FastWrite simply exposes these issues more easily.
- A typical scenario is when a field referenced in the request or response object is reused, as explained in the “Case Analysis” section below.
- Importantly, the absence of errors or panics does not guarantee the absence of concurrency issues. In reality, inconsistent data may have been constructed and passed to upstream or downstream components.
Typical Symptoms
The following error messages may occur:
runtime error: index out of range [3] with length 3
runtime error: slice bounds out of range [86:59]
runtime error: invalid memory address or nil pointer dereference
Case Analysis
Client Panic
A typical scenario involves a field referenced within the request being a global variable (or a cached object) that can be concurrently written to.
The simplified error message is as follows:
KITEX: panic, ..., error=runtime error: invalid memory address or nil pointer dereference
panic(...):
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProcotol.WriteBinaryNocopy(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField2(...)
The panic stack includes fastWriteField2
, which is a typical case of concurrent read and write in business logic. The simplified business code is as follows:
key := "default" // reflect.StringHeader{Data=0xXXX, Len=7}
fallbackKey = "" // reflect.StringHeader{Data=nil, Len=0}
wg := sync.WaitGroup{}
for _, task := range taskList {
wg.Add(1)
go func() {
defer wg.Done()
if someCondition {
key = fallbackKey // `key` may be read by another goroutine
}
kitexClient.GetByKey(ctx, &Request{Key: key})
}()
}
wg.Wait()
Analysis:
- The
key
variable of typestring
may be read/written by different goroutines within the loop. - Reading/writing a
string
is not thread-safe: In the Go Runtime implementation, areflect.StringHeader{Data uintptr, Len int}
is used with two fields that need to be assigned separately. - Kitex’s FastWrite may read a “partially written string” (Data = nil, Len = 7) and pass it to the
fastWriteField2
method. - When
fastWriteField2
attempts to read 7 characters fromData = nil
, it triggers anil pointer dereference
panic.
Note:
- Similar issues can occur with slices, which correspond to
reflect.SliceHeader
and haveData
,Len
, andCap
fields. - Concurrent read and write of maps can result in a fatal error in Go.
- The absence of errors or panics does not guarantee the absence of concurrency issues. In reality, inconsistent data may have been constructed and passed to upstream or downstream components: For example, if
fallbackKey = "123456789"
in the code above, the serialized data may be read as “1234567”. IffallbackKey = "123"
, it could result in out-of-bounds data access or even panics.
Server Panic
A typical scenario is when a field referenced in the response is a global variable (or a cached object), and that object may be concurrently written to.
The error message for this case is as follows (slightly simplified):
KITEX: panic happened, ..., error=<Error: runtime error: index out of range [3] with length 1>
panic(...)
encoding/binary.bigEndian.PutUint32(...)
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProtocol.WriteI32(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField3(...)
From the panic stack, it can be seen that the fastWriteField3
method generated by Kitex triggers the panic. It is a typical case of concurrent read and write in business logic. The simplified business code can be as follows:
var localCache sync.Map{}
func (*Handler) GetByKey(ctx context.Context, req *xxx.Request) (*xxx.Response, error) {
resp := localCache.Get(req.Key)
resp.UserID = req.UserID
return resp, nil
}
Analysis:
PutUint32(b []byte, v uint32)
checks ifb[3]
is readable before writingv
tob
. The panic occurs here, indicating an out-of-range access.FastWrite
has already traversedresp
to determine the required length, indicating that between “(1) calculating required memory” and “(3) serialization,” a field inresp
has been modified, requiring more memory space.- Let’s assume the server receives two requests simultaneously (both with
Key
as “X”):A.Request: {Key="X", UserID = "123" }
B.Request: {Key="X", UserID = "123456"}
- The process may execute in the following order:
FastWrite
for request A calculates that it needs N+3 bytes (N represents the space required by other fields inresp
) and allocates the memory.- Request B executes line 5 and replaces
UserID
with “123456” (note that the response is cached, which means A will also return this object). - Request A starts serializing
resp
into the allocated space. When it tries to write the last field,UserID
, it finds that there is not enough space available (it now needs N+6 bytes):index out of range [3] with length 1
Note:
- The absence of errors or panics does not imply the absence of concurrency issues. In reality, abnormal data may have been constructed and passed to upstream and downstream components: For example, if the execution order of A and B mentioned above is reversed, no error will occur, but the expected value to be written in the response message is “123456,” while the actual value written is “123,” and the Kitex client can decode it correctly.
Troubleshooting Suggestions
Self-inspect the code based on the aforementioned case
If there is concurrent read and write on the Request
or Response
(including objects directly or indirectly referenced by them), fix this issue first.
Use the Go Race Detector
If possible, use the -race
flag to identify and eliminate concurrency issues in the code. Refer to https://go.dev/blog/race-detector for more details.
Note: Be cautious when using it in a production environment as it can significantly impact performance.
Identify the problematic field based on the error message
Based on the panic stack, you can identify the specific field by looking at the field type and fastWriteFieldN
method.
For example, if the innermost method in the panic stack is: kitex_gen/some_package.(***Base**).**fastWriteField3**(...)
, it means that the error is caused by the field with index 3 in the Base
type. In the following IDL, the field at index 3 is the Addr
field of type string
:
struct Base {
1: string LogID = "",
2: string Caller = "",
3: string Addr = "",
}
If the panic error message contains invalid memory address or nil pointer dereference
, it indicates that the Addr
field may have been concurrently written with a null value by another goroutine.
If the panic error message is different (e.g., index out of range [3] with length 3
), it may indicate that another field is being concurrently read and written (variable length) and is occupying more memory, causing insufficient memory allocation when writing the field. You can refer to the “Comparing Two Sampling Results” section below to locate the issue.
Client Side
The proposed solution for the client side is as follows:
-
First Sampling: Before sending the request, serialize the entire request body into JSON.
- Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.
-
Second Sampling (two approaches for reference):
a. Panic Sampling: Check if a panic has occurred and then serialize the request body into JSON.
- Check if the returned error from the client contains panic information.
- Check if a panic has occurred in the middleware. Refer to the “Kitex - Panic 处理” section in the Kitex - Panic Handling documentation.
b. Delayed Sampling: Create a goroutine that sleeps for a certain amount of time and then serialize the request body into JSON.
- The longer the sleep time, the higher the probability of finding differences.
-
Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.
Note:
- If a certain performance loss (additional JSON marshaling, comparison) is acceptable, consider comparing all requests. Some requests may not cause panics but could still produce inconsistent data.
- Injecting troubleshooting through middleware is simpler and does not require modifying the business code.
Server Side
On the server side, as the server has exited all middleware during encoding, it is not possible to capture panics or perform sampling comparisons using middleware.
The proposed solution for the server side is as follows:
- Custom PayloadCodec:
- Perform sampling and panic comparison during the Marshal operation.
- Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.
type codecForPanic struct {
payloadCodec remote.PayloadCodec
}
func (c *codecForPanic) Marshal(ctx context.Context, message remote.Message, out remote.ByteBuffer) error {
var before, after []byte
var err error
defer func() {
if err := recover(); err != nil {
klog.Errorf("panic: %v", err)
after, _ = json.Marshal(message.Data())
if bytes.Compare(before, after) != 0 {
klog.Errorf("before = %s, after = %s", before, after)
}
}
}()
before, err = json.Marshal(message.Data()) // Note the performance loss
if err != nil {
klog.Errorf("json encode before Marshal failed: err = %v", err)
}
return c.payloadCodec.Marshal(ctx, message, out)
}
func (c *codecForPanic) Unmarshal(ctx context.Context, message remote.Message, in remote.ByteBuffer) error {
// Recover and compare here
return c.payloadCodec.Unmarshal(ctx, message, in)
}
func (c *codecForPanic) Name() string {
return "codecForPanic"
}
- Specify the custom PayloadCodec during server initialization:
svr := test.NewServer(new(TestServiceImpl),
server.WithPayloadCodec(&codecForPanic{
payloadCodec: thrift.NewThriftCodecWithConfig(thrift.FastRead | thrift.FastWrite),
}),
// Other options
)
- Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.
If panics occur infrequently, you can also consider starting a new goroutine, sleeping for a period of time, and then performing the comparison. This approach makes it easier to identify the modified sections.
FastRead
Typical Symptoms
runtime error: slice bounds out of range [1600217702:1678338]
Cause
The received message (Client: response, Server: request) is encoded incorrectly.
Troubleshooting Suggestions
Refer to the troubleshooting suggestions for FastWrite and investigate the corresponding endpoint (if the error occurs on the server side, check the client; vice versa).
FAQ
How to Disable FastWrite/Read
It is not recommended to disable it: Race conditions in the business logic can lead to inconsistent business data and unpredictable consequences.
If there is a genuine need, you can choose one of the following options:
- Disable the
FastCodec
in Thrift using options:
// Server side
svr := xxxService.NewServer(handler, server.WithPayloadCodec(
thrift.NewThriftCodecDisableFastMode(true, false)))
// Client side
cli, err := xxxService.NewClient("", client.WithPayloadCodec(
thrift.NewThriftCodecDisableFastMode(true, false)))
- Do not generate
FastCodec
encoding/decoding code: Use the-no-fast-api
parameter of the Kitex command-line tool to regenerate the code.