grid47 Exploring patterns and algorithms
Sep 28, 2024
6 min read
Solution to LeetCode 393: UTF-8 Validation Problem
You are given an array of integers data, where each integer represents one byte of data. Your task is to check whether this sequence of bytes forms a valid UTF-8 encoded string based on the UTF-8 encoding rules for 1 to 4 bytes characters.
Problem
Approach
Steps
Complexity
Input: The input consists of an integer array `data` where each integer represents one byte of data.
Example: Input: [197, 130, 1]
Constraints:
• 1 <= data.length <= 2 * 10^4
• 0 <= data[i] <= 255
Output: The output is a boolean indicating whether the input array `data` represents a valid UTF-8 encoding.
Example: Output: true
Constraints:
• The output should be true if the byte sequence represents a valid UTF-8 encoding, otherwise false.
Goal: The goal is to validate if the given byte sequence adheres to the rules of UTF-8 encoding.
Steps:
• Iterate through the array of bytes in `data`.
• Check the first bits of each byte to determine whether it's the start of a 1, 2, 3, or 4-byte character.
• For continuation bytes (those starting with `10`), ensure that the correct number of continuation bytes follows.
• Return true if the entire sequence is valid; otherwise, return false.
Goal: The solution should efficiently handle the input size within the given constraints.
Steps:
• The solution must handle arrays of length up to 2 * 10^4 efficiently.
Assumptions:
• Each integer in `data` represents one byte of data, and the byte values range from 0 to 255.
• Input: Input: [197, 130, 1]
• Explanation: The byte sequence '11000101 10000010 00000001' represents a valid UTF-8 encoding: a 2-byte character followed by a 1-byte character.
• Input: Input: [235, 140, 4]
• Explanation: The byte sequence '11101011 10001100 00000100' represents an invalid UTF-8 encoding because the second byte does not start with '10' as required for a continuation byte.
Approach: The approach involves iterating through the byte array and validating each byte according to the rules for UTF-8 encoding.
Observations:
• We need to check the starting bits of each byte to determine the number of bytes that form a character.
• For each continuation byte, we need to ensure it starts with '10'.
• The solution involves a simple traversal of the byte array and checking the bits, ensuring that the sequence follows the UTF-8 encoding rules.
Steps:
• Start by initializing a counter to track the number of continuation bytes expected for the current character.
• For each byte, check its most significant bits to determine if it's the start of a new character or a continuation byte.
• If it's a continuation byte, check if it starts with '10'. If it's the start byte, update the continuation byte counter.
• Return false if any byte doesn't match the expected pattern, and return true if the sequence is valid.
Empty Inputs:
• The input is guaranteed to be non-empty, so no need to handle empty arrays.
Large Inputs:
• The solution must be optimized to handle large inputs with up to 2 * 10^4 elements.
Special Values:
• A sequence with an invalid continuation byte (not starting with '10') should return false.
Constraints:
• The solution should efficiently check the validity of the UTF-8 encoding even for the largest inputs.
Define the function 'validUtf8' that takes a vector of integers, 'data', representing a sequence of bytes. It will return a boolean value indicating whether the sequence is a valid UTF-8 encoding.
2 : Variable Initialization
int count =0;
Initialize a variable 'count' to track the number of continuation bytes remaining for a valid UTF-8 sequence.
3 : For Loop
for(autoc : data) {
Start a loop to iterate through each byte 'c' in the 'data' vector.
4 : Main If Condition
if(count ==0) {
Check if no continuation bytes are expected, meaning we are starting a new character in the UTF-8 sequence.
5 : UTF-8 First Byte Check
if((c >>5) ==0b110) count =1;
Check if the byte 'c' represents a valid starting byte for a 2-byte UTF-8 character. The condition checks the first 3 bits to determine if it matches the '110' prefix.
6 : UTF-8 First Byte Check
elseif ((c >>4) ==0b1110) count =2;
Check if the byte 'c' represents a valid starting byte for a 3-byte UTF-8 character. The condition checks the first 4 bits to determine if it matches the '1110' prefix.
7 : UTF-8 First Byte Check
elseif ((c >>3) ==0b11110) count =3;
Check if the byte 'c' represents a valid starting byte for a 4-byte UTF-8 character. The condition checks the first 5 bits to determine if it matches the '11110' prefix.
8 : Invalid First Byte Check
elseif (c >>7) returnfalse;
If the byte does not match any valid starting byte prefix (i.e., the first bit is 1 but does not match any of the expected patterns), return false, indicating an invalid UTF-8 sequence.
9 : Else Block for Continuation Bytes
} else {
If we are expecting continuation bytes (i.e., 'count' is not 0), we check if the current byte follows the continuation pattern.
10 : UTF-8 Continuation Byte Check
if ((c >>6) !=0b10) returnfalse;
Check if the byte 'c' is a valid continuation byte. A continuation byte in UTF-8 should start with '10' in its most significant bits.
11 : Decrement Continuation Byte Count
count--;
Decrement the 'count' variable to indicate that one continuation byte has been successfully processed.
12 : Return Statement
return count ==0;
If 'count' is 0 at the end of the loop, it means all continuation bytes were matched correctly, and the UTF-8 sequence is valid. Otherwise, return false.
Best Case: O(n)
Average Case: O(n)
Worst Case: O(n)
Description: The time complexity is O(n) because we iterate through each byte in the array once.
Best Case: O(1)
Worst Case: O(1)
Description: The space complexity is O(1) as we only use a constant amount of extra space.