Lutan-Code your way
WEEK 1
2.4 GHz only!!! (ESP32-S3)
When connecting to Wi-Fi, the ESP32 development board does not support 5 GHz networks; it must use the 2.4 GHz band, which is the one most IoT devices use. I only found the 2.4 GHz SSID and password after logging into my home router's administration panel.
I noticed that my home router's 2.4 GHz and 5 GHz networks use completely different SSIDs and passwords. My Linux system is on the 5 GHz network while the MCU is on the 2.4 GHz one, yet they can communicate normally. This surprised me at first, since the two bands seemed like distinct networks, but it makes sense: the router bridges both bands into the same LAN, so devices on either band share one subnet.
Wi-Fi on the Raspberry Pi:
I couldn't find the standard network configuration file on my Raspberry Pi. I had initially configured it to connect to my home Wi-Fi when I flashed the Ubuntu Linux system onto the SD card using the Raspberry Pi's dedicated flashing software.
I took the Pi to school and used it there once, changing my phone's hotspot name and password to match my home Wi-Fi to trick the Pi into connecting. This worked, and the Pi connected, but its IP address changed. I didn't pay attention to this at first, but when I got home, I found that the Pi couldn't connect to my home Wi-Fi anymore. I suspect the configuration was messed up; it seems the two Wi-Fi connections with the same name and password but different IP addresses confused it. I had to connect the Pi directly to my home router with an Ethernet cable to get it working again.
I built a Wi-Fi connection function that stores several SSID/password pairs and tries them one by one using if/elif statements.
```cpp
// ---- Pack the sensor data and send it over TCP ----
if (millis() - lastSend > interval) {
  StaticJsonDocument<512> doc;
  doc["type"] = "perception";
  doc["device"] = "Lutan-ESPS3";
  doc["ts"] = millis();

  JsonObject sensors = doc.createNestedObject("sensors");
  JsonObject th = sensors.createNestedObject("temp_humi");
  th["temp"] = temperature;
  th["humi"] = humidity;
  JsonObject pir = sensors.createNestedObject("pir");
  pir["motion"] = PIR_motion;
  JsonObject mic = sensors.createNestedObject("mic");
  mic["audio"] = sample;

  // Serialize and send (one JSON object per line)
  serializeJson(doc, client);
  client.print("\n");
  lastSend = millis();
}
```
I have an I2C temperature and humidity sensor, an I2S digital microphone, and a PIR sensor – these are my three inputs. I package them into a JSON dictionary format and send them via TCP to my Linux system, which is a Raspberry Pi. I like the dictionary format because it allows me to easily access the corresponding values using their keys.
The `nc` (netcat) command is a bare-bones TCP tool that is handy for quickly verifying that a connection works, but it only runs as a foreground or background shell process. That doesn't fit my vision of building a systematic, stable, independent program.
Furthermore, I will definitely create consumer processes (even an HTTP server) to consume this data later.
Therefore, I need to convert it into a .py script and encapsulate it as a daemon process.
I placed two files in the `project/perception/get_data/` directory: `receive_tcp.py` and `sensor_buffer.py`.
I cannot directly use `return` in the `receive_tcp` script to return the received data dictionary, because TCP is a continuous, always-on channel. If I use `return`, calling this function would only work once and would block the program.
Therefore, a buffer function is needed to store the messages received from the TCP channel, allowing the main program to freely and easily retrieve the data dictionaries from it.
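A minimal sketch of what the receiving side could look like (the names `handle_stream` and `serve` are my own illustration, not necessarily the actual `receive_tcp.py`): it reads the newline-delimited JSON the MCU sends and hands each decoded dict to a callback, skipping blank or corrupt lines instead of crashing.

```python
import json
import socket

def handle_stream(rfile, on_message):
    """Read newline-delimited JSON from a file-like object and
    pass each decoded dict to on_message. Returns message count."""
    count = 0
    for line in rfile:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial/corrupt lines instead of crashing
        on_message(msg)
        count += 1
    return count

def serve(host="0.0.0.0", port=8080, on_message=print):
    """Accept one MCU connection at a time and consume its JSON lines."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _addr = srv.accept()
            with conn, conn.makefile("rb") as rfile:
                handle_stream(rfile, on_message)
```

In practice `on_message` would be the buffer's update function rather than `print`.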
| buffer |
The buffer contains two functions: one is called in the `receive_tcp` script to update the value dictionary;
the second function is called in `build_perception` to retrieve the latest values and build the `perceptionState`.
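The two-function buffer described here might be sketched like this (the names `update_latest` and `get_latest` are assumptions; the real `sensor_buffer.py` may differ). A freshness flag lets the main program tell live data from a stalled connection.

```python
import threading
import time

_lock = threading.Lock()
_latest = {}          # most recent sensor dict
_latest_ts = 0.0      # wall-clock time of the last update

def update_latest(msg):
    """Called from receive_tcp for every decoded JSON message."""
    global _latest, _latest_ts
    with _lock:
        _latest = msg
        _latest_ts = time.time()

def get_latest(max_age=5.0):
    """Called from build_perception; returns (dict, is_fresh)."""
    with _lock:
        fresh = (time.time() - _latest_ts) <= max_age
        return dict(_latest), fresh
```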
| receive tcp |
| build perception state |
The camera is not connected to the MCU, but to the Raspberry Pi, so I've set it up as a separate module, but still within the "perception" directory. `get_video` handles camera initialization and frame acquisition, while `buffer` stores simple camera information, not the frames themselves. `drawframe` is simply the drawing function.
`camera_buffer` will be called in `build_perception_state` to construct the `perceptionState`. In the `mjpeg_http` script responsible for the streaming service, the frames obtained from `get_video`, the `perceptionState` (which includes camera information), and the `drawframe` function will all be imported to render the complete visual output.
I added a script for face detection, but it only serves the basic `perceptionState`, so it detects presence and location only, not identity.
I manually created a structural diagram of the current perception system, and I feel I now have a clearer understanding of the structure and the relationships between its components.
I also discovered a pattern: `main.py` imports only scripts that, besides defining their own functions, themselves import smaller scripts containing nothing but self-contained functions.
Those self-contained scripts act as pure, minimal functional modules; they are imported into scripts responsible for larger functional blocks, and those larger scripts are in turn imported by the `main` program.
So I think this structure can be divided into: minimal function modules – complete functional modules (loops/services, etc.) – main program (scheduling & orchestration).
Draw a box in the `drawframe` function.
Different internal and external calls of the perception State:
In `__init__`, all dictionary structures are flattened, so internal code can access elements simply as `state.count` instead of `state['camera']['face']['count']`.
At the same time, to ensure a well-structured output for external use, the complete dictionary hierarchy is reconstructed in the `to_dict` function.
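A condensed illustration of that pattern, using only the camera/face fields (the attribute names here are hypothetical, not the full class):

```python
class PerceptionState:
    """Flat attributes internally, nested dict externally."""
    def __init__(self, snapshot):
        cam = snapshot.get("camera", {})
        face = cam.get("face", {})
        # internal code uses state.count, not state['camera']['face']['count']
        self.count = face.get("count", 0)
        self.boxes = face.get("boxes", [])
        self.fps = cam.get("fps", 0.0)

    def to_dict(self):
        # Rebuild the full hierarchy for external consumers
        return {"camera": {"fps": self.fps,
                           "face": {"count": self.count,
                                    "boxes": self.boxes}}}
```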
The audio data for "perception" needs to be upgraded. Previously, I only received volume data in 2-second intervals; now I need to receive the complete audio stream.
This mirrors the structure and relationship of "sensor buffer & receive TCP," but because the audio stream is binary rather than JSON, and its data volume and transmission frequency are much higher, the AUDIO part needs its own TCP connection. Port 8080 is taken by the sensor TCP listener and 8081 by the MJPEG HTTP server, so I bound the AUDIO TCP to port 8082.
The AUDIO BUFFER has three functional components: one for writing data from TCP, one for reading by perceptionstate, and one for reading by a future ASR (Automatic Speech Recognition) script.
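Those three access paths could be sketched as a small ring buffer (class and method names are my own illustration; the real audio buffer may differ). Sequence numbers let the future ASR reader fetch only what it hasn't seen.

```python
import threading
from collections import deque

class AudioBuffer:
    """Ring buffer for raw PCM: one writer, two kinds of readers."""
    def __init__(self, max_chunks=256):
        self._lock = threading.Lock()
        self._chunks = deque(maxlen=max_chunks)  # (seq, pcm_bytes)
        self._seq = 0
        self.last_rms = 0.0

    def write(self, pcm, rms):
        """Called by the audio TCP receiver (port 8082)."""
        with self._lock:
            self._seq += 1
            self._chunks.append((self._seq, pcm))
            self.last_rms = rms

    def snapshot(self):
        """Lightweight view for perceptionState (no PCM copied)."""
        with self._lock:
            return {"rms": self.last_rms, "seq": self._seq}

    def read_since(self, last_seq):
        """For ASR: chunks newer than last_seq, plus the new cursor."""
        with self._lock:
            new = [c for s, c in self._chunks if s > last_seq]
            return b"".join(new), self._seq
```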
At this point, the data collection for the perception component is complete. However, we cannot directly proceed to the "cognitive" stage yet. Instead, we should use the audio and video data to perform three pre-recognition steps: face recognition, action recognition, and speech recognition, to determine "who it is," "what action is being performed," and "what was said." This will yield the final, most complete PERCEPTION SNAPSHOT, which can then be directly fed into the COGNITIVE module for the LLM to build a world model.
WEEK 3
This week I learned some basics about facial recognition and ASR voice recognition. Aside from that, I focused on the casing and structural design. I think it would be more comfortable to move on to the next steps once it has a physical form. This part isn't completely finished yet, but it should be completed this week.
WEEK 4
I'm currently working on the ASR speech-recognition part. I don't want to use offline batch recognition, as I feel it would lack realism, so I'm using real-time recognition. The previous `get_recent_pcm` function is no longer a good fit: it returns an excessively long, continuous stretch of PCM data. So I created a new function, `get_latest_chunk(self)`, to retrieve just the latest piece of PCM data.
(Haha, the `get_latest_chunk` function shown in this image had an indentation problem, and when I ran it later the system gave me an error. It specifically pointed out the indentation issue, which surprised me; I thought it would only give a vague error.)
To better pinpoint the problem and verify the integrity and feasibility of the process step by step, I chose to start with the most basic path: first, perform simple audio energy differentiation (i.e., set up a state machine to distinguish whether speech is occurring), and then write the received audio data to a WAV file when speech is occurring. This helps me verify the feasibility of the basic path and troubleshoot underlying issues. Subsequent upgrades are also relatively simple; I can directly change the destination of the audio data to send it to the ASR recognition system. I can choose to deploy the ASR locally or use a cloud API. But overall, it's just one module, which can be connected and disconnected at any time and easily replaced. I really like this decoupled approach to building the system.
I've added three scripts. Following the old method, one is a buffer to store the output, which is also the module's only external interface; one is a function script (here, vad_segmenter.py is used to determine the state machine); and the other is the actual worker script that retrieves the data, calls the function, and writes the results to the buffer.
First, in the initialization function, to ensure the stability of the module, I store all parameters in an instance with a clearly specified format.
These four state variables are the core variables of this state machine. They are: whether speech has started, the noise floor, the start time of the current speech segment, and the continuously updated time of the last detected sound.
This function calculates thresholds from the noise floor. The threshold for starting to speak is the larger of a preset "start speaking" level and 3 × the noise floor; the threshold for ending speech is likewise the larger of a preset value and a multiple of the noise floor. The noise floor is initialized to `nf_min` and then updated in real time from the RMS of the incoming audio, continuously tracking the background level.
`new = (1 - a) * old + a * current`
This function is one I find very interesting. At the beginning of last semester, we did an exercise where the position of a graphic followed the mouse cursor. Among the many lines of code I didn't understand at the time was this one. Now I know it's the exponential moving average (EMA) formula. The characteristic of this formula is that the greater the difference, the faster the convergence, and vice versa.
This is the main function of the state machine. Its input is the RMS, which comes from the audio buffer.
The first thing to do is to continuously update the noise floor. This update must be done when no one is speaking; otherwise, the volume of speaking will continuously increase this threshold.
Then, based on the set and updated parameters, a judgment is made to distinguish whether or not the person is speaking.
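Putting the pieces above together, the state machine could look roughly like this (all parameter values and names are illustrative, not the actual `vad_segmenter.py`): the noise floor adapts by EMA only while nobody is speaking, and hysteresis plus a hangover time decide segment start/end.

```python
import time

class VadSegmenter:
    """Hysteresis VAD over per-chunk RMS (a sketch of the logic above)."""
    def __init__(self, start_level=500.0, end_level=300.0,
                 nf_min=50.0, alpha=0.05, hang=0.4):
        self.speaking = False
        self.noise_floor = nf_min
        self.seg_start = None
        self.last_voice = None
        self.start_level = start_level
        self.end_level = end_level
        self.alpha = alpha
        self.hang = hang          # seconds of silence before a segment ends

    def update(self, rms, now=None):
        now = time.time() if now is None else now
        if not self.speaking:
            # EMA update of the noise floor, only while nobody is speaking
            self.noise_floor = (1 - self.alpha) * self.noise_floor + self.alpha * rms
        start_thr = max(self.start_level, 3.0 * self.noise_floor)
        end_thr = max(self.end_level, 2.0 * self.noise_floor)

        if not self.speaking and rms > start_thr:
            self.speaking = True
            self.seg_start = now
            self.last_voice = now
            return ("start", self.seg_start)
        if self.speaking:
            if rms > end_thr:
                self.last_voice = now
            elif now - self.last_voice > self.hang:
                self.speaking = False
                return ("end", self.seg_start, now)
        return None
```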
The buffer still follows the classic structure, with three functions. The first lets the worker script update a status dictionary in the buffer after each chunk; this acts more like a heartbeat for monitoring whether the script is running normally.
The second, `commit_vad`, publishes the returned text and related fields after the worker calls the functional code in the VAD segmenter script.
The final one, `get_snapshot`, is consumed by `build_perception_state`.
Next, we'll build the main script—the worker script.
As you can see, the current code structure has shortcomings. At the beginning, I imported the `audio`'s `snapshot` and `chunk` scripts. The lightweight information in the `snapshot`, such as RMS, is used by the VAD state machine function to determine whether speech has occurred. However, the complete PCM-16-bit raw audio data in the `chunk` is not yet used. But this is only for this initial stage; after the ASR module is integrated, this data will be used by the ASR module.
There's not much to say about the initialization, startup, and shutdown functions. I'll only focus on two functions: one for saving the WAV file and the main `run` function.
After ensuring the directory exists, write the data to a WAV file. Name the file with the start/end timestamps.
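With the standard-library `wave` module, that step could look like this (16 kHz / 16-bit mono and the parameter names are assumptions; the file name follows the `seg_<start>_<end>.wav` pattern):

```python
import os
import wave

def save_wav(pcm, start_ms, end_ms, out_dir="tmp_asr_segments",
             rate=16000, width=2, channels=1):
    """Write one speech segment to <out_dir>/seg_<start>_<end>.wav."""
    os.makedirs(out_dir, exist_ok=True)   # ensure the directory exists
    path = os.path.join(out_dir, f"seg_{start_ms}_{end_ms}.wav")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(width)      # 2 bytes = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return path
```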
WEEK 5
Now I'm officially starting the ASR integration process. I want to proceed in the most controllable way, so my initial plan is to use the previously saved WAV files for offline, local ASR recognition.
Between local and cloud deployment, I ultimately decided on local deployment. This not only avoids maintenance hassles like incurring fees but also allows me to learn more, such as model deployment.
I finally chose the Whisper model; its size allows it to run smoothly on the Pi. I'm using the multilingual small model, which can automatically recognize both Chinese and English and generate the corresponding text.
The speed isn't particularly fast, taking about 10 seconds, just for this short sentence.
I also tried the tiny and base models, which took about one or two seconds to recognize, much faster than the small model.
Finally, to balance the requirements of accuracy in recognizing long sentences and rapid response, I decided to use the base model.
The next step is to integrate the model into the `commit_segment` function of the speech worker script (because that function is responsible for publishing the recognition results). Meanwhile, since ASR responds very slowly, multithreading is an excellent way to keep the slow call from blocking the main loop.
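A minimal worker-thread pattern for this (the `transcribe` callable stands in for the actual Whisper call; the class and method names are illustrative): jobs go in a queue, results come back through another, and the main loop only ever polls.

```python
import queue
import threading

class AsrWorker:
    """Run slow ASR off the main loop: feed WAV paths in, poll text out."""
    def __init__(self, transcribe):
        self._jobs = queue.Queue()
        self._results = queue.Queue()
        self._transcribe = transcribe   # e.g. a Whisper wrapper (assumed)
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, wav_path):
        self._jobs.put(wav_path)        # returns immediately, never blocks

    def poll(self):
        """Non-blocking: latest finished (path, text) or None."""
        try:
            return self._results.get_nowait()
        except queue.Empty:
            return None

    def _run(self):
        while True:
            path = self._jobs.get()
            text = self._transcribe(path)   # the slow part
            self._results.put((path, text))
```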
Firstly, once a result is identified, if no new results appear, the ASR in the perception state will remain the previous text. However, I don't think this is a major problem. We can simply check if the ASR is the same as before when retrieving the state in the subsequent decision loop; if it is, we can assume there's no new dialogue.
Then the next two problems are more serious. One is:
```
[ASR] tmp_asr_segments/seg_1772221239922_1772221240285.wav -> [BLANK_AUDIO] (7269 ms)
[PERCEPTION] <PerceptionState T=-50 H=0 motion=False audio=0 fps=10.40581730852724 face=0 aud_alive=True aud_rms=13.3 vad_speaking=False asr='[BLANK_AUDIO]'>
[PERCEPTION_FACE] {'count': 0, 'boxes': []}
[PERCEPTION] <PerceptionState T=-50 H=0 motion=False audio=0 fps=10.50285339389857 face=0 aud_alive=True aud_rms=25.0 vad_speaking=False asr='[BLANK_AUDIO]'>
[PERCEPTION_FACE] {'count': 0, 'boxes': []}
[PERCEPTION] <PerceptionState T=-50 H=0 motion=True audio=0 fps=10.30769457619621 face=0 aud_alive=True aud_rms=19.5 vad_speaking=False asr='[BLANK_AUDIO]'>
[PERCEPTION_FACE] {'count': 0, 'boxes': []}
[PERCEPTION] <PerceptionState T=-50 H=0 motion=True audio=0 fps=10.449239906526689 face=0 aud_alive=True aud_rms=23.5 vad_speaking=False asr='[BLANK_AUDIO]'>
```
The blank audio (`[BLANK_AUDIO]`) is occurring very frequently, and speech cannot be properly recognized.
Secondly, I have to be extremely close to the microphone for it to start recording WAV files, but this distance is far beyond the normal interaction distance. I can't possibly be right next to the microphone every time I speak. I believe my hardware connection is fine; the problem must be in the Pi's code, perhaps in the sound amplification, noise floor, or the on/off threshold settings.
I listened to the audio, and it seems the slicing is too fast. My words weren't even finished before the slice was cut, and each WAV file ended up only about a second long. I think this is a problem with the VAD segmentation function.
Also, the audio volume is indeed a bit low, although it's very clear.
Therefore, I think the main problems lie in the VAD segmentation and the trigger threshold.
I made some parameter adjustments, and now it can basically trigger normally even at a distance of 60-70 centimeters, but its accuracy is only so-so.
It might be because my English pronunciation isn't very good, but the recognition accuracy is indeed much higher when I speak Chinese.
After increasing the gain of the original audio file, the effect was somewhat better.
During discussions with GPT, I gradually realized something: the perception state is a persistent, constantly updating "STATE." The ASR language recognition results are incompatible with it. Like my previous problem, the state is always updating, but the ASR text remains stuck in the previous recognition result. The system cannot know its sequence number, whether it's old or new, whether it has been consumed or used, whether it's valid, etc. These questions cannot be answered by a text result itself; it's a single-trigger "EVENT", unlike a "STATE." Therefore, my next step should be to build a speech event.
WEEK 6
So now we start building the "speech event". Let me reorganize the current script structure:
1. `perception/precog/speech_buffer` stores vad/latest/debug data and is read by `PerceptionState`; it is the "warehouse".
2. `perception/precog/vad_segmenter` is the functional script, responsible for the VAD state machine and start/end segmentation.
3. `perception/precog/speech_worker` is the actual worker script, responsible for fetching audio chunks, segmenting them, running ASR, and calling `commit_segment(text=...)`.
4. `perception/build_perception/build_perception_state.py` is responsible for integrating `get_speech_snapshot()` into `state["precog"]["asr"]`.
To introduce an abstraction layer like EVENT, I only need to add one script, the "event bus," and change the speech buffer's output from a STATE to an EVENT.
It has two functions: one publishes and writes events to the bus, and the other is the output interface (`get`), used by the cognitive module and other parts to fetch the latest events.
I added a monitoring thread in the main script to watch whether the bus is working normally; later it can be converted directly into the LLM's input interface.
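The bus itself can be very small. A sketch with the two functions just described, plus a monotonically increasing id so consumers can tell old events from new and track what they have already consumed (all names here are my own illustration):

```python
import itertools
import threading
import time

class SpeechEventBus:
    """Single-slot event bus: publish assigns an id; consumers pass the last id they saw."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counter = itertools.count(1)
        self._latest = None

    def publish(self, text):
        with self._lock:
            event = {"id": next(self._counter), "text": text,
                     "ts": time.time()}
            self._latest = event
            return event["id"]

    def get_latest(self, last_seen_id=0):
        """Return the newest event only if the caller hasn't seen it yet."""
        with self._lock:
            if self._latest is None or self._latest["id"] <= last_seen_id:
                return None
            return dict(self._latest)
```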
You can see the two outputs, from the speech-event publish side and the speech-event consumer side, and the id and text are correct.
So the speech event can now be considered a simple end-to-end version that runs through.
Two options are now before me. I have completed the state flow (`perceptionState`) and I also have an event flow (the speech event), which should make for a relatively complete architecture.
The first option is to continue abstracting the speech event, for example
"turn on the light"
Intent:
type=command
action = turn_on
target = light
This is called Speech Understanding Layer.
The second option is to go directly down and enter the cognitive stage, which is to connect to LLM.
After some consideration, I decided on the second option. The reason is that I noticed a problem that seems inherent to the ASR model: many things I say get converted into sentences with the same pronunciation but completely wrong semantics. It transcribes by sound alone, without much consideration of whether the sentence is reasonable. I think this text can go straight into the LLM, which can then make corrections. Performing intent extraction (understanding) on such unreliable text would just be re-structuring an already wrong output, effectively confirming and solidifying the error. So for now, I'm putting that part on hold.
Regarding the cognitive part, I have some questions. I'm using OpenAI's API, and from earlier tests each call takes about two to three seconds; I'm also not sure what it costs. So I have two questions.
First, at what frequency should I call the API, or under what triggering conditions should I call it? Second, if the conditions for calling the API are triggered again during this cognitive event, what logic should I use: override/wait and execute/interrupt/ignore?
-------------------------------------------------------------------
LLM calls are likely an expensive and slow resource, so I think fixed-frequency polling is definitely not feasible; it's wasteful and blocking. I believe the best trigger is a speech event. However, speech events have a significant chance of being triggered accidentally because, unless LLM is used for semantic understanding, it doesn't know whether I'm talking to it or someone else.
The best approach seems to be using a specific wake word as a gating mechanism. (I used to think it was silly for some smart products to keep calling out like that, but now I realize it seems to be a very low-cost optimal solution, haha). This wake word can be directly matched using ASR results. However, the problem remains because, as mentioned earlier, ASR results can contain homophones, resulting in a high error rate for fixed text matching. I think there are three options to choose from, with increasing accuracy and cost:
1. Set a wake word that is very difficult to mistranslate (e.g., "hello" is very difficult to mistranslate in both Chinese and English, in my experiments), and then perform single text matching.
2. Approximate text matching: list all possible homophones of the wake word, and then use any word that meets this condition.
3. Use the original audio data and a dedicated wake word detection model for training and configuration.
I haven't decided which one to use yet.
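Option 2 can be prototyped in a few lines with the standard library's `difflib` (the homophone list here is hypothetical; note that Chinese ASR output has no spaces, so a real version would also need substring matching for Chinese wake words):

```python
import difflib

# Hypothetical wake word plus likely ASR homophones/mistranscriptions
WAKE_WORDS = ["hello", "hallo", "hella", "你好"]

def is_wake(asr_text, threshold=0.8):
    """Approximate match of any ASR word against the wake-word list."""
    for word in asr_text.lower().split():
        for wake in WAKE_WORDS:
            ratio = difflib.SequenceMatcher(None, word, wake).ratio()
            if ratio >= threshold:
                return True
    return False
```

The threshold trades false wakes against missed wakes and would need tuning against real ASR output.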
My idea for an interruption mechanism is to avoid disrupting the existing process, instead queuing new requests and using the latest/most important event in the next API call.
However, this still presents me with challenges.
--------------------------------------------------------------------
This also made me realize that not all reaction mechanisms need to use the cognitive advice generated by LLM. Like humans, some reactions are unconditional, subconscious. In my project, I believe only the parts involving language dialogue need to use LLM. (There might be facial recognition later, but I increasingly feel that expanding too much would overwhelm me.)
Just like us humans, when we sense someone approaching through vibrations/sounds/peripheral vision, our subconscious action is to turn our heads left and right to locate the person, then focus our gaze on their face to identify them, and only then do we engage in conversation, recall memories related to that person, etc.
I want to complete this subconscious "search & observe" phase first. I want to postpone the cognitive part of LLM; I'm a bit tired of it now and somewhat reluctant to face it, haha. When the PIR sensor detects someone approaching, it enters the "search" state, turning its head up, down, left, and right until it finds a face, and then adjusts the servo angle so that the face remains centered in the frame (meaning the little robot will keep looking at your face, which is quite interesting).
While this part doesn't require entering the cognitive step, I think there are some key points worth noting.
First, the execution command should be issued only after a decision-making process inside the decision loop. That decision involves more than the instantaneous values from sensors like the PIR; it should also consider historical states and include a duration-based assessment (I don't know the technical term for this). For example, even if a person stays in front of the robot, the PIR may occasionally drop out, and that shouldn't knock it out of the "observation state." Furthermore, if the observation state lasts a certain time with no speech event, it should enter a "waiting standby state"; if a speech event does occur, the observation state becomes the gateway for invoking the LLM's cognitive phase, leading to the "interaction state." My explanation may be a bit disorganized, but the point is that the decision loop should contain a state machine and some state memory, so that the continuation and transition of states feels logical and lifelike rather than like a silly, twitchy bit of code.
The "result" and "state" scripts are for formatting constraints and only contain class definitions. The "context" script is responsible for remembering historical states. The "loop" script is the actual worker, performing conditional checks to determine the current state.
Oh, by the way, just a quick aside: the assembly of the outer shell and structure is now complete.
Excellent! As you can see, initially the DECISION field showed STANDBY. When the [sensor][motion] (PIR) field in perception became TRUE, DECISION immediately changed to SEARCH, meaning someone was detected and the search for a face began. Once a face was found (i.e., once [face][count] and [face][boxes] appeared), the DECISION state immediately changed to OBSERVE. This perfectly matches my expectations.
The code for the decision loop is too long to show here. I'll just briefly describe my STATE switching logic.
When someone approaches, the PIR is triggered. If no face is detected at this time, the state switches from STANDBY to SEARCH. The corresponding action in this state is the servo motor moving up, down, left, and right. There are two paths: if no face is found after the search time, it returns to STANDBY; if a face is found, it enters the OBSERVE state.
In the OBSERVE state, the action is to track the position of the face's boxes, keeping the face as centered in the frame as possible, meaning the robot is always looking at you. OBSERVE also has two possible outcomes. If a speech event is triggered during OBSERVE, the system enters the INTERACT state, where the LLM is invoked to generate cognitive advice. This advice should contain three types of action suggestions: a positive response (a nod on the servo), a negative response (a left-right head shake), and a voice response for questions that cannot be answered with a simple yes/no, in which case the speaker is activated for a spoken reply. Afterwards, the system returns to the OBSERVE state.
The second possible outcome is that if no speech event is detected within the waiting time, the system returns to the STANDBY state.
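The switching logic just described can be condensed into a single pure transition function (state names follow the text; the timeout values are made up for illustration):

```python
import time

STANDBY, SEARCH, OBSERVE, INTERACT = "STANDBY", "SEARCH", "OBSERVE", "INTERACT"

def next_state(state, entered_at, motion, face_count, speech_event,
               now=None, search_timeout=8.0, observe_timeout=20.0):
    """One tick of the state machine; returns the new state."""
    now = time.time() if now is None else now
    elapsed = now - entered_at
    if state == STANDBY:
        return SEARCH if motion else STANDBY        # PIR triggered -> search
    if state == SEARCH:
        if face_count > 0:
            return OBSERVE                          # found a face
        return STANDBY if elapsed > search_timeout else SEARCH
    if state == OBSERVE:
        if speech_event:
            return INTERACT                         # someone spoke
        return STANDBY if elapsed > observe_timeout else OBSERVE
    if state == INTERACT:
        return OBSERVE    # after the LLM-driven response, fall back to OBSERVE
    return STANDBY
```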
The INTERACT state, its subordinate states, and the actual servo control implemented by specific actions are not yet implemented. The test procedure described above only tests the switching relationships between the STANDBY, SEARCH, and OBSERVE states.
I plan to first demonstrate the specific control of the servo motors on the MCU side for the corresponding ACTIONs of these three states.
My current port 8080 is used for sensor TCP between the MCU and the Pi, port 8081 serves the MJPEG HTTP stream, and port 8082 carries the audio-only TCP (also between the MCU and the Pi).
For the sake of channel cleanliness and readability, I don't intend to reuse an existing port; instead I plan to open a new one, 8083, dedicated to the Pi sending current action commands to the MCU.
So what I need to do next is first translate the decision action into a command dictionary that I'm ready to send to the MCU.
no problem
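The decision-to-command translation step could be as simple as a dict mapping (the command schema with a `"cmd"` key and the field names are my own assumptions):

```python
def translate(decision):
    """Map a DecisionResult-like dict to a command dict for the MCU."""
    state = decision["state"]
    if state == "SEARCH":
        return {"cmd": "search"}                        # sweep the head
    if state == "OBSERVE":
        dx, dy = decision.get("face_offset", (0, 0))    # face offset from frame centre
        return {"cmd": "track", "dx": dx, "dy": dy}
    if state == "INTERACT":
        return {"cmd": decision.get("answer", "speak")} # "yes" -> nod, "no" -> shake
    return {"cmd": "standby"}
```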
So now we've completed the entire perception-decision-translate-action process. The next step is to use the port we just confirmed to send the command to the MCU via TCP.
When designing the process for the Pi to send commands to the MCU, I encountered a problem. TCP commands essentially involve the client remembering the server's IP address and port, and then actively initiating the connection.
If I make the Pi the client, the situation in my system becomes: the MCU is the client, sending messages to the Pi's ports 8080 and 8082; the Pi is also a client, sending messages to the MCU's port 8083. If the Wi-Fi network changes, the IP addresses of both the Pi and the MCU will change, requiring me to modify the IP address on both sides' code, which is cumbersome.
Therefore, I plan to leverage the bidirectional data transmission capabilities of TCP. In the communication line where the Pi sends commands to the MCU, the MCU will still act as the client, actively connecting to the Pi, and then the Pi will send the commands back to the MCU. This way, there is only one variable: the Pi's IP address.
WEEK 7
Since the Pi is now a server, the previously written sender script is no longer suitable: it should listen on port 8083 instead of actively connecting. Therefore, a new action server script needs to be created.
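A sketch of such an action server: the Pi listens on 8083, the MCU dials in as before, and commands flow back over that same connection as JSON lines (class and method names are illustrative, not the actual script):

```python
import json
import socket
import threading

class ActionServer:
    """Pi-side server on port 8083: the MCU connects in, commands flow out."""
    def __init__(self, host="0.0.0.0", port=8083):
        self._srv = socket.create_server((host, port))
        self.port = self._srv.getsockname()[1]
        self._conn = None
        self._lock = threading.Lock()
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            conn, _addr = self._srv.accept()
            with self._lock:
                if self._conn:
                    self._conn.close()
                self._conn = conn   # keep only the newest MCU connection

    def send(self, command):
        """Serialize one command dict as a JSON line; False if no MCU yet."""
        with self._lock:
            if self._conn is None:
                return False
            try:
                self._conn.sendall((json.dumps(command) + "\n").encode())
                return True
            except OSError:
                self._conn = None   # MCU dropped; wait for it to reconnect
                return False
```

Because the MCU initiates the connection, only the Pi's IP address ever needs to be known, which is exactly the single-variable setup described above.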
The action was successfully sent to the MCU and received by the MCU. There are some logical issues, such as in the track mode. The face box coordinates should be continuously updated in real-time to allow the servo motors to constantly adjust their angles.
Additionally, if in observe (cmd=track) mode there is no command for an extended period (i.e., after a timeout), it should enter standby mode. However, since the user is likely still present at this time, search and observe will be triggered again, resulting in unnecessary repeated jitter.
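One common way to suppress this kind of jitter is to steer on the face's offset from the frame centre and ignore small offsets entirely; a sketch (the frame size, gain, and dead-zone width are assumed values):

```python
def track_step(face_cx, face_cy, frame_w=640, frame_h=480,
               dead_zone=30, gain=0.05):
    """Convert a face-box centre into servo angle deltas.
    Inside the dead zone the head stays still, which kills the jitter."""
    dx = face_cx - frame_w // 2     # offset from frame centre, not raw position
    dy = face_cy - frame_h // 2
    pan = -gain * dx if abs(dx) > dead_zone else 0.0
    tilt = -gain * dy if abs(dy) > dead_zone else 0.0
    return pan, tilt
```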
To improve servo-motor face tracking, anti-shake measures were added to the track mode: the raw x and y positions were replaced with the offsets of x and y from the frame centre, and a dead zone was added.

WEEK 8
My design objective is as follows: what we have currently completed constitutes instinctive behaviors—analogous to biological "unconditioned reflexes"—which are managed within the `decisionloop` using `if/else` statements to determine the appropriate state (STANDBY, SEARCH, or OBSERVE).
The cognitive component we are now integrating is intended to represent processes involving higher-level "brain-like" reasoning; crucially, this cognitive function should have the capacity to override and control those instinctive behaviors. Our previously established logic dictates the following flow: a PIR sensor trigger initiates the SEARCH state; upon detecting a human face, the system transitions to the OBSERVE state; subsequently, if ASR (Automatic Speech Recognition) text is recognized while the system is in the standby/waiting phase of the OBSERVE state, both the recognized text and the current `perceptionstate` are sent to an LLM to generate a piece of `cognitiveadvice`.
My vision is for the small robot to answer questions using only "yes" or "no" responses. Therefore, the advice generated by the LLM should take the form of `interact-answer yes` or `interact-answer no`; these commands would then be translated into corresponding actions sent to the MCU to control the robot's head movements—either nodding up and down or shaking left and right.
Furthermore—as I mentioned earlier—this cognitive layer must be capable of overriding and controlling the instinctive responses. Consequently, the `cognitiveadvice` should also have the power to dictate state transitions. For instance, if a user says something like, "Stop looking at me," the advice should recommend reverting to the STANDBY mode. In such a scenario, even if the conditions within the `decisionloop` do not technically meet the criteria for entering STANDBY, the corresponding action command for STANDBY must still be sent to the MCU. Additionally, this directive should be accompanied by a "memory" function; for example, after a specific duration (e.g., one minute), the memory of this command would fade, allowing the `decisionloop` to resume its instinctive decision-making processes.
This script defines and enforces the structure of LLM outputs across my entire system. It includes:
1. Enum definitions
2. Dataclasses
3. `from_dict()` methods
4. `to_dict()` methods
5. Basic validation logic
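Since the actual script isn't shown here, the following is only a sketch of what such a schema module could look like; the class and field names (`AdviceType`, `CognitiveAdvice`, `ttl_seconds`) are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class AdviceType(Enum):
    INTERACT_ANSWER = "interact-answer"
    STATE_OVERRIDE = "state-override"

class RobotState(Enum):
    STANDBY = "STANDBY"
    SEARCH = "SEARCH"
    OBSERVE = "OBSERVE"

@dataclass
class CognitiveAdvice:
    advice_type: AdviceType
    payload: str                 # e.g. "yes"/"no", or a target state name
    ttl_seconds: float = 60.0    # how long the advice stays in "memory"

    @classmethod
    def from_dict(cls, d: dict) -> "CognitiveAdvice":
        # Basic validation: reject unknown advice types and states early
        advice_type = AdviceType(d["type"])
        payload = str(d.get("payload", ""))
        if advice_type is AdviceType.STATE_OVERRIDE:
            RobotState(payload)  # raises ValueError on an invalid state
        return cls(advice_type, payload, float(d.get("ttl", 60.0)))

    def to_dict(self) -> dict:
        return {"type": self.advice_type.value,
                "payload": self.payload,
                "ttl": self.ttl_seconds}
```

Validating inside `from_dict()` keeps malformed LLM output from propagating deeper into the system.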
Additionally, there is a cognitive control script designed to record and manage the highest-priority instructions issued by the cognitive advice module.
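The "memory fade" described above (an override that expires after, say, one minute) could be kept in a small class like this. This is a hypothetical sketch; the names and the 60-second default are assumptions:

```python
import time

class CognitiveControl:
    """Holds the most recent high-priority override and lets it fade."""

    def __init__(self):
        self._override = None       # e.g. "STANDBY"
        self._expires_at = 0.0

    def set_override(self, state: str, duration: float = 60.0) -> None:
        """Record an override that stays valid for `duration` seconds."""
        self._override = state
        self._expires_at = time.monotonic() + duration

    def get_override(self):
        """Return the active override, or None once the memory has faded."""
        if self._override is not None and time.monotonic() < self._expires_at:
            return self._override
        self._override = None
        return None
```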
Next, we will create a "Cognitive Arbitrator" script. In my current system, the decision loop is responsible for computing instinctual outcomes, while the cognitive module handles the computation of cognitive outcomes. The Arbitrator script acts as the mediator, determining which result—instinctual or cognitive—should ultimately prevail; it also serves as the interface through which cognitive advice is integrated into the decision loop.
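The arbitration itself can be a single pure function. A sketch under assumptions: the instinctual result is a dict with `"state"` and `"action"` keys, and the control object exposes a `get_override()` returning a state name or `None` (neither shape is confirmed by the source):

```python
def arbitrate_decision(instinct_result: dict, control) -> dict:
    """Pick between the instinctual result and an active cognitive override."""
    override_state = control.get_override()
    if override_state is not None:
        # Cognition wins: replace the instinctual state and action
        return {"state": override_state,
                "action": f"enter-{override_state.lower()}",
                "source": "cognitive"}
    # No active override: the instinctual result passes through unchanged
    return {**instinct_result, "source": "instinct"}
```

Keeping the arbitrator free of side effects makes it trivial to test in isolation.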
Now, let's truly integrate the "Cognitive Override" into the main loop.
My current system already includes:
`update_decision(...)`: Generates the instinctual `DecisionResult`.
`CognitiveControl`: Stores the temporary override.
`arbitrate_decision(...)`: Merges the instinctual result with the override.
The only missing piece now is the final "wiring":
Inside `decision_monitor_thread()`, first compute the `instinct_result`.
Then, execute `arbitrate_decision(...)`.
Obtain the `final_result`.
Use `final_result.action` to send commands to the MCU.
Additionally, whenever an override becomes active, synchronize `decision_context.current_state`.
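The wiring steps above might look roughly like this inside the monitor thread. The dependency-injected parameters, the dict shape of the results, and `send_to_mcu` are assumptions standing in for the real interfaces:

```python
import threading
import time

def decision_monitor_thread(decision_context, cognitive_control,
                            update_decision, arbitrate_decision, send_to_mcu,
                            stop_event, period=0.1):
    """Hypothetical wiring: instinct -> arbitration -> MCU command."""
    while not stop_event.is_set():
        instinct_result = update_decision(decision_context)   # 1. instinct
        final_result = arbitrate_decision(instinct_result,    # 2. arbitrate
                                          cognitive_control)
        if final_result["source"] == "cognitive":
            # Keep the context in sync while an override is active
            decision_context.current_state = final_result["state"]
        send_to_mcu(final_result["action"])                   # 3. act
        time.sleep(period)
```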
Consequently, I proceeded to create a new "cognitive loop" script. This script encapsulates the calls to the LLM within a dedicated thread, allowing it to be launched directly from the main program. It then retrieves the "advice" generated by this thread and feeds it into the "cognitive arbitrator" to determine the final decision.
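One common shape for such a loop is a queue-consuming worker thread. This is a sketch under assumptions: speech events arrive as `(text, perception_state)` tuples, and `ask_llm` returns a dict like `{"override": "STANDBY"}` (or an empty dict when no override is needed):

```python
import queue
import threading

def cognitive_loop(speech_events: queue.Queue, cognitive_control,
                   ask_llm, stop_event):
    """Hypothetical sketch: consume ASR text, ask the LLM, store the advice."""
    while not stop_event.is_set():
        try:
            text, perception_state = speech_events.get(timeout=0.5)
        except queue.Empty:
            continue  # no speech yet; check the stop flag and wait again
        # The slow LLM call stays off the main decision loop
        advice = ask_llm(text, perception_state)
        if advice.get("override"):
            cognitive_control.set_override(advice["override"])
```

Because the LLM call runs in its own thread, the decision loop keeps ticking at sensor rate while cognition catches up.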
I have now identified another issue: since I am utilizing a local ASR model, there appear to be some performance limitations—specifically, low recognition accuracy and excessive processing times—which make it extremely difficult to trigger the cognitive layer. I therefore plan to upgrade to a cloud-based ASR solution.
It has been successfully implemented: when I instructed it not to look at me, the LLM was successfully invoked, returned an override suggestion, and the robot subsequently entered STANDBY mode and began the countdown.
As is evident, this particular ASR recognition took 20 seconds to complete, which is remarkably slow.
Upon reviewing the logs, I determined that the issue stemmed from the VAD (Voice Activity Detection) segmentation, which was occasionally prone to false triggers. This resulted in the generation of numerous "EMPTY" or "LOW ENERGY" audio segments; these segments were subsequently forwarded to the LLM, causing a task backlog on the LLM side and consequently slowing down the overall response time.
To address this, I implemented a filtering mechanism for the conditions that trigger the "publish speech event" action. This prevents segments classified as "EMPTY"—or those that are excessively short in duration—from being published as speech events and initiating calls to the LLM.
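The filter can be a single predicate checked before publishing. The segment labels come from the log descriptions above, but the 0.3-second minimum duration and the function name are assumptions:

```python
# Hypothetical gate applied before a VAD segment becomes a speech event.
MIN_SEGMENT_SECONDS = 0.3
SKIP_LABELS = {"EMPTY", "LOW ENERGY"}

def should_publish(segment_label: str, duration_seconds: float) -> bool:
    """Drop false-trigger VAD segments so they never reach the LLM."""
    if segment_label in SKIP_LABELS:
        return False
    if duration_seconds < MIN_SEGMENT_SECONDS:
        return False
    return True
```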
This optimization has significantly reduced the response time, bringing it down to approximately 1 to 2 seconds.
WEEK 9
At this point, I find it difficult to make further progress, as I feel I have already achieved my intended outcomes and acquired the knowledge I set out to learn.
This may be attributed to the extra studying I undertook during the winter break, as well as the significant amount of time I dedicated daily to advancing my project during the first half of the semester. Consequently—even though the semester has not yet officially concluded—it feels as though I have already accomplished everything I set out to do.
My current plan is to finalize and refine the project description and documentation as soon as possible. Since I approached this as a comprehensive and complex undertaking, I believe it would be highly beneficial to give it a proper conclusion by compiling it into a portfolio-style showcase or project dossier—especially as I intend to use it when applying for internships.
Secondly, I would like to broaden my horizons by exploring additional concepts, theories, and areas of knowledge. I feel somewhat adrift right now; I am unsure of my ultimate career path or exactly where I currently stand in my academic progression. I am also struggling with course selection for the upcoming semester; the available options seem to lack a certain spark—that sense of novelty, intellectual excitement, and discovery that truly captivates one's interest.
Since the midterm exams, I feel as though I have slipped into a state of low spirits. This stems primarily from uncertainties regarding my future choices. I am currently quite lost, unsure of what to study or what activities to pursue; looking further ahead, I am equally uncertain about which professional field to enter or how to make a meaningful impact within it.
I continue to maintain my daily routine of studying and exercising—I do not feel as though I have stalled in my efforts. Yet, mentally and emotionally, I feel trapped.