TCP Details

TCPDUMP Setup and Processing

Start with WTSS Access Logs.

Look for 502 (or which ever type you need to look for) errors. I use this awk command.

awk '($9 ~ /502/)' access.log | awk '{print $7}' | sort | uniq -c | sort -r | more

From the generated list, identify a request that you wish to analyze.

You could start with something you are already targeting (like a specific call/endpoint that you know is breaking)

In my specific case, I have a plethora of varied requests that are showing a 502 error. So I start with 2 characteristics that I am looking for

It should be UI based, so it ignores obvious non conferment requests e.g. API calls, messages to the queue etc.

I look for UI requests that have specific identifying information. An example:

Path Entry in NGINX Access Log: /ic/integration/home/faces/global?Adf-Window-Id=n5l17geg8&Adf-Page-Id=0

Here n5l17geg8 is a unique identifier that hopefully separates this from other requests to this endpoint.

faces is a word pattern that to me tells that this is a UI call. I am being very approximate, depending on your circumstances, you may need to guarantee this.

Now run the following command:

awk '($9 ~ /502/)' access.log | grep “n5l17geg8”. This allows us to narrow down to a specific request we can then debug.

In our case we will start looking at the following request:

"10.0.64.3" "-" "-" [22/Mar/2019:07:25:32 +0000] "POST /ic/integration/home/faces/global?Adf-Window-Id=n5l17geg8&Adf-Page-Id=0HTTP/1.1" "502" "563" "https://xxxxx/ic/integration/home/faces/global" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36" "4225" "6551" "-" "wdcdevqsic-xxxxx" "170.251.60.100, 10.0.64.5" "xxxxxxx" "9999" "21.183" "10.0.64.8:443" "0.000" "21.183" "21.183" "502" "-" "-" "-" "-" "-"

The important parts of the request are underlined. Please note that sometime the content size can also be seen as a unique identifier for the request. In the worst case, time at which the request came in is your best friend, unless someone is pounding the same URL.

Subtracting time taken from time local.

07:25:32 - 21.183 ~ around 07:25:10/11.

We take this time and the path to search it in the wireshark UI. Search in the pane like this:

http.request.uri eq "/ic/integration/home/faces/global?Adf-Window-Id=n5l17geg8&Adf-Page-Id=0"

From the list of entries look for the same time stamp: 07:25:10/11 (get more specific if needed). In our case it is 10.817

Since there is no exact match at the time. (but multiple matches, we look on the other server first as well)

There are 2 requests in that approximate time, so we look at 07:25:01 and then 07:25:11.

Right click, follow, tcp stream. Here we got lucky as the first request was not the failing one … but the next one was and they are both on the same stream.

However here we see that the connection was created at 07:20:55 and terminated at 07:25:32.

Check the stream using tcp.stream eq 25677 (you will have some number some like that in your dump details)

An open connection and

Provide item sequencing (Send in order order "1, 2", arrive in the order "1, 2" at the opposite end, error-free).

When Unix programs do any sort of I/O, they do it by reading or writing to a file descriptor. An example of this would be a call to the socket() system routine.

It returns the socket descriptor. This file descriptor is simply an integer associated with an open file (that can be a network connection or socket, among other obvious options).

Different structs that track TCP-specific info like Sequence numbers, Current window size etc

A Receive buffer (or “queue”) and

A Write buffer (or “queue”).

A new data packet comes arrives on the network interface (NIC) or vNIC,

By an interrupt from the NIC, or by

By Polling the NIC for data.

Decodes the packet and

Source IP, Port,

Destination IP, and port.

Using aforesaid information look up the "struct Sock" (for the socket) in memory associated with the TCP connection.

If there are NO sequencing issues, data payload from the request is copied into the Socket’s Receive buffer.

Blocking read/2, or

I/O multiplexing system call (select or epoll_wait) to wait on the socket.

Kernel to remove data from kernel's receive buffer

Copy above data into the buffer supplied to the read/2 system call.

If the receive buffer is empty and the user calls read, system call will block until data is available.

If the receive buffer is nonempty and user calls read, system call will immediately return with whatever data is available.

A partial read can happen if the amount of data ready in the read queue is less than the size of the user-supplied buffer. The caller can detect this by checking the return value of read

If the receive buffer is full and other end of the TCP connection tries to send additional data, the kernel will refuse to ACK the packets.

There is a timeout happening at the Weblogic level. This would require updates to the following entries. This could explain some of the 502s, but not all of them as the request timelines don't fit the descriptions of timeouts. That said, the following timeout settings will need to be increased to their max or 310 s.

Duration: Number of seconds to maintain HTTP keep-alive before timing out the request.

On Weblogic to read data from sockets MUXERS are used. We may need to tweak and use native muxers that use platform-specific native binaries to speed it up. But first attempt will be to confirm slow reads.

By default, health condition is not checked by cache before returning the cached connection to the client for use. If the network connection is unstable, the system will need to check the connection's health condition before returning it to the client.

To enable this behavior (checking the health condition), set http.keepAliveCache.socketHealthCheckTimeout system property to a value greater than 0 to tune how a socket connection is returned from cache when keep-alive is enabled.

Connection Backlog Buffering

Number of File Descriptors

Complete Message Timeout

File Descriptor: Within the context of this discussion, this refers to the new file descriptor made for an incoming connection as a part of the accept() call. Historically, TCP/IP sockets use file descriptors. A network connection can be seen as being analogous to a file - you can read, write, and close both, and fits well with the Unix idea of "everything is a file".

Command to extract monthly frequency of 502s Collapse source

cat access.log | grep "Dec/2018" | awk '($9 ~ /502/)' | awk '{print $4}' | awk -F '/' '{print $1}' | awk -F '[' '{print $2}' | sort | uniq -c

Bibliography:

See the implementation of socket structs in the Linux kernel’s net/sock.h.

http://beej.us/guide/bgnet/html/single/bgnet.html#socket

How TCP Sockets work:

How (Already) Established Connection Flow works

A few basics first:

Stream sockets (that TCP uses) are reliable two-way connected communication streams. They maintain

A socket can be seen as a way to speak to other programs using standard Unix file descriptors. In other words:

For each Socket or TCP file descriptor (tracked by the kernel), there are

Kernel is notified in 2 possible ways:

Kernel gets the packet from the NIC, it

Identifies the TCP connection packet is associated with. It uses

Kernel wakes/triggers processes doing/using

User Space process calls read/2 on the file descriptor. This causes

Read Semantics:

Possible Issues:

Socket Reads are too slow. If the Kernel senses that the kernel buffers are not clearing quickly, it may start sending RESET back to the client.

Connection's health condition is bad:

Other Parameters (probably not relevant in this case since the socket and the connection already exist):

Definitions:

This section explains how socket system calls interact with kernel data structures, and how kernel interacts with the actual network. It also explains how listen queue overflowswork.

1. HTTPS Duration:Number of seconds to maintain HTTPS keep-alive before timing out the request.
2. Complete Message Timeout: The maximum number of seconds that this server waits for a complete message to be received. If you configure network channels for this server, each channel can override this message timeout.
3. Idle Connection Timeout: The maximum number of seconds that a connection is allowed to be idle before it is closed by the server. The T3 and T3S protocols ignore this attribute. If you configure network channels for this server, each channel can override this idle connection message timeout.

TCP Details

Post a Comment

0 Comments

Search

Popular Posts

Oracle Index Fundamentals

Buffer Cache Fundamentals

Formulas to calculate Pacing

Java Commands and Tools

Handling Redirects-LR

TCP Details

Social Widget

TCP Details

You may like these posts

Post a Comment

0 Comments

Search

Popular Posts

Oracle Index Fundamentals

Buffer Cache Fundamentals

Formulas to calculate Pacing

Java Commands and Tools

Handling Redirects-LR

TCP Details

Social Widget