Lutan-Intangible interact
WEEK 1
The group project I completed with Liyah has been merged into her Notion workspace.
WEEK 2
Introduction:
PIR stands for Passive Infrared sensor. "Passive" means it doesn't emit any signals; it simply receives infrared radiation from the environment.
All objects with a temperature above absolute zero emit infrared radiation. The human body, with a temperature of approximately 36-37°C, emits infrared radiation with a wavelength of about 9-10 micrometers. When infrared radiation strikes the crystal inside a PIR sensor, the crystal heats up and undergoes a change in electrical charge.
If a person remains still, the infrared radiation received by the crystal is constant, its temperature will remain constant, and no further charge changes will be generated. Therefore, PIR sensors can only detect moving bodies.
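As a minimal illustration of this behavior, the PIR module's digital output can be treated as a stream of 0/1 readings, where a rising edge marks newly detected motion. This is a hypothetical sketch — no real GPIO library is used, and `samples` stands in for polled pin readings:

```python
def detect_motion_events(samples):
    """Return indices where the PIR output rises from 0 to 1.

    `samples` is a list of digital PIR readings taken at a fixed
    polling interval; a rising edge means motion just started.
    A person standing still produces a constant output, so no
    new rising edges appear -- matching the sensor's behavior.
    """
    events = []
    prev = 0
    for i, level in enumerate(samples):
        if level == 1 and prev == 0:  # rising edge: motion just began
            events.append(i)
        prev = level
    return events
```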
Structure:
White cover:
Typically made of polyethylene or silicone material, it only allows infrared radiation with wavelengths of 7-14 micrometers to pass through (the wavelength range of human body radiation), blocking visible light and other wavelengths.
Fresnel lens system:
The faceted (polygonal) structure on the outer casing divides the field of view into many zones, which amplifies the change in received radiation as a person moves from one zone to another.
Four-pin/Six-pin:
Determined by the number of internal crystals. Four-pin devices contain two crystals; six-pin devices contain four crystals.
Sensing angle:
Covers almost the entire front and sides, i.e., roughly 180 degrees.
Distance threshold:
Approximately 3 meters.
Advantages and disadvantages:
It's very sensitive; even localized movements such as a moving hand can be detected. The drawbacks are that it must face the right direction, and it cannot measure distance or changes in distance, only whether there is human movement.
Applications and ideas:
Burglar alarm
Infrared tracking
WEEK 3
Curiosity Cube:
I plan to create an embodied intelligent robot that can perceive the world environment and uses a Large Language Model (LLM) as its cognitive brain.
It doesn't have any specific application scenarios; it's simply intended as a desktop companion robot. It also serves as a way to verify and integrate the knowledge I've acquired. I've already built the hardware circuitry and designed the overall structure of the code and product. However, there is still a tremendous amount of work to be done.
At the code level, I have currently completed the reading of basic sensor values, TCP data transmission, and the acquisition and storage of raw audio and video data streams, as well as basic face detection. The entire basic perception layer is essentially complete. I will focus on the appearance and structural design, as well as the "decision-implementation" parts that follow the "perception-cognitive" layers. This will be the part where I will primarily apply the knowledge learned in this course.
This is the current hardware circuit design, and it's unlikely to undergo significant modifications in the future. The speaker, servo motors (which are inductive loads), and multiple power-consuming sensors may generate various frequencies of noise, as well as voltage drops or instability issues when operating simultaneously (especially the servo motors; I didn't use a driver board due to size constraints). Therefore, the circuit primarily utilizes filtering circuits composed of small-capacity ceramic and electrolytic capacitors (multi-stage decoupling), as well as large-capacity electrolytic capacitors to stabilize the voltage.
In terms of signal processing, due to the specific characteristics of the I2S digital microphone and the power amplifier module driving the speaker, I mainly employed damping resistors and AC coupling circuits.
In addition, there are self-resetting fuses to protect the servo motors, as well as small surface-mount components for electrostatic discharge and surge protection.
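As a rough sanity check on filtering choices like these, the corner frequency of a first-order RC low-pass can be computed from f_c = 1 / (2πRC). The resistor and capacitor values below are hypothetical examples for illustration, not the actual values on this board:

```python
import math

def rc_cutoff_hz(r_ohms, c_farads):
    """Corner frequency of a first-order RC low-pass: f_c = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Hypothetical decoupling stage: 10 ohm series resistance, 100 nF ceramic.
# Noise well above this corner frequency is attenuated.
print(rc_cutoff_hz(10, 100e-9))  # about 159 kHz
```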
I'm currently working on the ASR (speech recognition) part. I don't want to use offline recognition, as I feel it would lack realism, so I'm using real-time recognition. The previous `get_recent_pcm` function is no longer a good fit: it returns an excessively long, continuous stretch of PCM data. Therefore, I wrote a new function, `get_latest_chunk(self)`,
(Haha, the `get_latest_chunk` function shown in this image had an indentation problem, and when I ran it later, the system gave me an error. It specifically pointed out the indentation issue, which surprised me; I thought it would only give a rough error.)
to retrieve the latest piece of PCM data.
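The idea behind `get_latest_chunk` can be sketched as a small ring buffer. The internal details below (the deque, the lock, the capacity) are my assumptions for illustration, not the actual implementation:

```python
import collections
import threading

class AudioBuffer:
    """Minimal sketch of the audio buffer; internals are assumed."""

    def __init__(self, max_chunks=64):
        # Bounded deque: old chunks are dropped automatically.
        self._chunks = collections.deque(maxlen=max_chunks)
        self._lock = threading.Lock()

    def push(self, pcm_bytes):
        """Called by the audio capture thread for each new PCM chunk."""
        with self._lock:
            self._chunks.append(pcm_bytes)

    def get_latest_chunk(self):
        """Return only the newest PCM chunk -- unlike get_recent_pcm,
        which concatenated a long continuous stretch of history."""
        with self._lock:
            return self._chunks[-1] if self._chunks else None
```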
To better pinpoint the problem and verify the integrity and feasibility of the process step by step, I chose to start with the most basic path: first, perform simple audio energy differentiation (i.e., set up a state machine to distinguish whether speech is occurring), and then write the received audio data to a WAV file when speech is occurring. This helps me verify the feasibility of the basic path and troubleshoot underlying issues. Subsequent upgrades are also relatively simple; I can directly change the destination of the audio data to send it to the ASR recognition system. I can choose to deploy the ASR locally or use a cloud API. But overall, it's just one module, which can be connected and disconnected at any time and easily replaced. I really like this decoupled approach to building the system.
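The "audio energy" that the state machine works with is just the RMS of each PCM chunk. A self-contained sketch, assuming 16-bit little-endian mono PCM (the format is an assumption based on the raw PCM-16 data mentioned later):

```python
import math
import struct

def chunk_rms(pcm_bytes):
    """RMS energy of a 16-bit little-endian mono PCM chunk."""
    if not pcm_bytes:
        return 0.0
    n = len(pcm_bytes) // 2  # two bytes per 16-bit sample
    samples = struct.unpack("<%dh" % n, pcm_bytes[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)
```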
I've added three scripts. Following the old method, one is a buffer to store the output, which is also the module's only external interface; one is a function script (here, vad_segmenter.py is used to determine the state machine); and the other is the actual worker script that retrieves the data, calls the function, and writes the results to the buffer.
First, in the initialization function, to ensure the stability of the module, I store all parameters in an instance with a clearly specified format.
These four state variables are the core variables of this state machine. They are: whether speech has started, the noise floor, the start time of the current speech segment, and the continuously updated time of the last detected sound.
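A sketch of how these four state variables, alongside the pre-set parameters, might be stored in the initializer. All names and default values here are assumptions, not the actual code:

```python
class VadSegmenter:
    """Sketch of the VAD state machine's stored state (names assumed)."""

    def __init__(self, start_level=800.0, end_level=400.0,
                 nf_min=50.0, alpha=0.05):
        # Pre-set parameters, stored on the instance in a fixed format.
        self.start_level = start_level  # pre-set "start speaking" volume
        self.end_level = end_level      # pre-set "stop speaking" volume
        self.nf_min = nf_min            # initial / minimum noise floor
        self.alpha = alpha              # EMA smoothing factor

        # The four core state variables of the state machine:
        self.speaking = False        # whether speech has started
        self.noise_floor = nf_min    # continuously updated noise floor
        self.segment_start = None    # start time of the current segment
        self.last_voice_time = None  # time of the last detected sound
```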
This is a function that calculates thresholds based on noise. The start-of-speech threshold is the larger of a pre-set "start speaking volume" and 3 * the noise floor. The end-of-speech threshold is likewise obtained by comparing a pre-set value with the noise floor. The noise floor (`noise_floor`) is initialized to `nf_min` and then updated in real time from the RMS of the audio data, continuously refining the estimate through comparison.
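That thresholding logic can be sketched as two small helpers. The function names, the end-threshold factor `k`, and the example values are all assumptions:

```python
def start_threshold(preset_start, noise_floor):
    """Speech starts when RMS exceeds the larger of a pre-set level
    and 3x the current noise floor."""
    return max(preset_start, 3.0 * noise_floor)

def end_threshold(preset_end, noise_floor, k=2.0):
    """Speech ends when RMS stays below a similar max() of a pre-set
    level and a noise-floor multiple; k is a hypothetical factor."""
    return max(preset_end, k * noise_floor)
```

In a quiet room the pre-set value dominates; in a noisy room the noise-floor multiple takes over, so the threshold adapts to the environment.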
`new = (1 - a) * old + a * current`
This function is one I find very interesting. At the beginning of last semester, we did an exercise where the position of a graphic followed the mouse cursor. Among the many lines of code I didn't understand at the time was this one. Now I know it's the exponential moving average (EMA) formula. The characteristic of this formula is that the greater the difference, the faster the convergence, and vice versa.
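The formula above as a one-liner, plus a quick demonstration of the convergence behavior: each step closes a fixed fraction of the remaining gap, so larger gaps produce larger per-step moves.

```python
def ema(old, current, a=0.05):
    """Exponential moving average: new = (1 - a) * old + a * current.
    The farther `old` is from `current`, the bigger each step --
    which is why the mouse-following graphic moves fast when far
    from the cursor and settles gently when close."""
    return (1 - a) * old + a * current

# With a = 0.5 the estimate halves the gap to the target every step:
x = 0.0
for _ in range(3):
    x = ema(x, 100.0, a=0.5)  # 50.0, 75.0, 87.5
```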
This is the main function of the state machine. Its input is the RMS, which comes from the audio buffer.
The first thing to do is to continuously update the noise floor. This update must be done when no one is speaking; otherwise, the volume of speaking will continuously increase this threshold.
Then, based on the set and updated parameters, a judgment is made to distinguish whether or not the person is speaking.
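Putting the pieces together, one update step of such a state machine might look like the sketch below. The event names, constants, and the dict-based state are assumptions; `now` is injected rather than read from the clock so the logic stays testable:

```python
def process(state, rms, now, start_preset=800.0,
            min_silence=0.6, alpha=0.05):
    """One VAD state-machine step (a sketch; names are assumptions).

    `state` holds: speaking, noise_floor, segment_start, last_voice_time.
    Returns "speech_start", "speech_end", or None.
    """
    event = None
    if not state["speaking"]:
        # Only adapt the noise floor while nobody is speaking; otherwise
        # speech volume would keep pushing the threshold upward.
        state["noise_floor"] = (1 - alpha) * state["noise_floor"] + alpha * rms
        if rms > max(start_preset, 3.0 * state["noise_floor"]):
            state["speaking"] = True
            state["segment_start"] = now
            state["last_voice_time"] = now
            event = "speech_start"
    else:
        if rms > 2.0 * state["noise_floor"]:  # still loud: refresh timestamp
            state["last_voice_time"] = now
        elif now - state["last_voice_time"] > min_silence:
            state["speaking"] = False  # quiet long enough: segment over
            event = "speech_end"
    return event
```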
The buffer keeps the classic structure. One function lets the worker script update the dictionary of status data in the buffer after it processes each chunk; this works more like a heartbeat for monitoring whether the script is running normally.
The second function, `commit_vad`, publishes the returned text and other content after the worker calls the state-machine function in the vad_segmenter script.
The final function, `get_snapshot`, exposes the data for building the perception state.
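The three interfaces together might look like the sketch below. Field names, the lock, and the dict layout are assumptions, not the actual buffer script:

```python
import threading
import time

class VadBuffer:
    """Sketch of the VAD buffer module's three interfaces (names assumed)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}           # per-chunk status: the "heartbeat"
        self._latest_result = None  # last published VAD result

    def update_status(self, status_dict, now=None):
        """Worker calls this after every chunk; acts as a heartbeat."""
        with self._lock:
            self._status = dict(status_dict)
            self._status["updated_at"] = now if now is not None else time.time()

    def commit_vad(self, result):
        """Publish a finished segment's result (text, timestamps, ...)."""
        with self._lock:
            self._latest_result = result

    def get_snapshot(self):
        """Read-only view for building the perception state."""
        with self._lock:
            return {"status": dict(self._status),
                    "result": self._latest_result}
```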
Next, we'll build the main script—the worker script.
As you can see, the current code structure has shortcomings. At the beginning, I imported the `audio`'s `snapshot` and `chunk` scripts. The lightweight information in the `snapshot`, such as RMS, is used by the VAD state machine function to determine whether speech has occurred. However, the complete PCM-16-bit raw audio data in the `chunk` is not yet used. But this is only for this initial stage; after the ASR module is integrated, this data will be used by the ASR module.
There's not much to say about the initialization, startup, and shutdown functions. I'll only focus on two functions: one for saving the WAV file and the main `run` function.
After ensuring the directory exists, write the data to a WAV file. Name the file with the start/end timestamps.
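A sketch of that WAV-writing step using Python's standard `wave` module. The directory layout, filename format, and audio parameters (16 kHz, mono, 16-bit) are assumptions:

```python
import os
import wave

def save_wav(pcm_bytes, out_dir, start_ts, end_ts,
             sample_rate=16000, channels=1, sample_width=2):
    """Write a 16-bit PCM segment to a WAV file named by its timestamps."""
    os.makedirs(out_dir, exist_ok=True)  # ensure the directory exists
    path = os.path.join(out_dir, "%.2f_%.2f.wav" % (start_ts, end_ts))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return path
```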
My approach to the project's progress is to prioritize getting the most basic, simple interaction logic running first, setting aside the more complex cognitive parts for now.
My current plan is that when someone approaches (PIR triggers certain conditions) and someone speaks (the preparatory work I did this week for ASR), the robot will enter a search/tracking state, turning its head left, right, up, and down to capture the face, ensuring the face remains centered in its field of vision. It's like it senses your approach/activity and then searches for and watches you, haha. I think this perfectly aligns with the "curiosity" aspect of the Curiosity Cube! And technically and in terms of time, it's entirely achievable!
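One common way to do that face-centering is a simple proportional controller: nudge the pan and tilt servos in proportion to how far the face is from the frame center. This is a sketch under assumed values (frame size, gain, and sign convention all depend on the camera and servo mounting):

```python
def track_step(face_x, face_y, frame_w=640, frame_h=480, kp=0.05):
    """One proportional-control step toward centering a detected face.

    Returns (pan_delta, tilt_delta) in degrees for the neck gimbal.
    The sign convention assumes a positive pan turns the head right;
    flip the signs if the servos are mounted the other way.
    """
    err_x = face_x - frame_w / 2  # positive: face is right of center
    err_y = face_y - frame_h / 2  # positive: face is below center
    return -kp * err_x, -kp * err_y
```

Because the correction shrinks as the error shrinks, the head decelerates smoothly as the face approaches the center instead of overshooting.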
By the way, the gimbal for the neck has been modeled and printed, and it's working well!