Flaneur Creation Notes

TL;DR#

Our team took part in Hack Engine, an event organized by Jike Company, from April 8 to April 10. This article describes how our project Flaneur came about: a website that uses artificial intelligence to generate narration, music, and sound, aiming to make users feel like they are walking with an old friend. The author also discusses other projects showcased at Hack Engine, including travel planning and general knowledge management.

What is Hack Engine#

Here is the official introduction to Hack Engine:

Q: What is the difference between Hack Engine and a Hackathon?
A: There is no difference. Besides being a hackathon, we are also an incubator, a fund, and an alumni entrepreneur network.
Q: Isn't that Y Combinator?
A: Yes.

In short, Hack Engine was an AI-themed hackathon in which each team of up to 5 people had 48 hours to build a small product and demo it.

Our team is also very focused on the development and application of generative AI, and since I am a long-time Jike user, we quickly decided to participate. We were curious about how everyone would use AI, and we wanted to meet the real people behind familiar IDs.

Our team has 5 members: backend engineer @Xiao, frontend engineer @Edison, designer @Brant, and general helper @Jason, plus algorithm engineer @York, whom we brought in from the skiing group (?) as outside help. We had almost no hackathon experience, so we consulted @Junyu before setting off. Pea Pod (Wandoujia) was probably the earliest company in China to hold hackathons, usually building a small application within 24 hours. When I asked about the most memorable experience, @Junyu looked up at the sky (calling cloud computing?) and recalled a year when the hackathon coincided with a heavy rainstorm in Beijing: everyone stayed up all night and even considered going out to rescue people. @Junyu gave me three pieces of advice: the most important thing is to finish development, the second is to make it interesting, and the third... that's it.

Since the topic would only be announced on kickoff day, we didn't do much other preparation. On the engineering side, we deployed a server capable of running Stable Diffusion and prepared two separate OpenAI API accounts in case one got blocked. Drawing on @Junyu's advice, we also settled on a few principles for choosing a topic:

Fun and interesting, small enough / vertical
Can be completed in two days
Or it can't be completed in two days but is sufficiently striking (can we just show a video as the demo? Fake it till you make it)

Copilot for ?#

At 9:30 AM on Saturday, the topic was announced: Copilot for X.

Copilot for X, X = anything, so it told us nothing at all! To avoid narrowing our ideas too early, we decided to think individually first and then meet to discuss. I had originally wanted to use this time to socialize, but I was surprised to see other teams already discussing enthusiastically and even starting to work, wow. By the time we were supposed to meet, I was already hungry, so we decided to fill our stomachs first.

The event was held in Wujiaochang, and we were unfamiliar with the area and didn’t know where to eat, so we casually strolled while discussing, and inspiration struck!

Shanghai is a city made for walking. I remembered a time many years ago when I was walking in Shanghai, probably on a summer evening, strolling alone on Hengshan Road, enjoying the evening breeze and listening to music, feeling very comfortable. The song I was listening to was Shu Qi's Tram, from the Hong Kong album in the LV SoundWalk series (yes, it's Hong Kong). LV selected some iconic locations in Hong Kong, invited local musicians to compose, and had Shu Qi voice the introductions and weave in stories, which made for very pleasant listening. However, the series has only three albums in China (Beijing, Shanghai, and Hong Kong), and I finished them all quickly. The Beijing album is voiced by Gong Li and the Shanghai album by Chen Chong. I highly recommend walking through the locations in album order while listening.

Tram - Shu Qi

Among the entire series, my favorite is still this Tram; Shu Qi's voice is just too beautiful! So I thought, if we could use AI to generate similar content and add AI-generated background music, it should be quite nice.

We discussed it, and it was indeed feasible, and the idea extended a bit: for example, we could add some more timely information, such as the current weather, the user's movement status, and step frequency, so that even if you open it in the same location, the content you hear each time would be different; we could also gather more information to introduce nearby landmark buildings.

We sorted out the requirements, and this small product would have the following characteristics:

No operation needed, just open it to use
Generate background music that matches the walking pace based on current location, weather, movement status, etc.
Use a pleasant female voice to introduce the history and stories of nearby neighborhoods, as if a real person is accompanying you on a walk
Pre-generate content for the neighborhood and nearby neighborhoods, so you can keep walking without interruption

The final effect is a simplified version of LV SoundWalk. Or from another perspective, LV SoundWalk is too elite, with only a limited number of locations. In fact, every inch of land we live on has its own story, and every place deserves its own SoundWalk, so it can also be understood as the democratization of SoundWalk.

At 2 PM on Saturday, we completed our brainstorming and division of labor, and got to work!

The Birth of Flaneur#

@Junyu: The first step in making a product is to buy a domain name

First, we needed to name this product.

Shanghai is a very fashionable city. On the day I arrived in Shanghai, I took the subway from the airport to the city center, and as I walked out of the subway station, I saw a very elegantly dressed young woman holding a bouquet of flowers wrapped in an English newspaper. Shanghai is really fashionable! I couldn't help but marvel. Upon closer inspection, I realized I was wrong; it wasn't an English newspaper, it was a French newspaper. Shanghai is just too fashionable! I couldn't help but marvel.

Since the inspiration came from walking in Shanghai, and we started creating this small product in Shanghai, we definitely needed a fashionable name!

So we named this product Flaneur, from the French flâneur: a stroller, someone who wanders without any particular aim. Given that Flaneur has no interactive features, the name is indeed very fitting.

The implementation of Flaneur can be summarized in the following steps:

1. Obtain user status information, such as geographical location and movement status
2. Gather information related to that geographical location, such as current weather, Wikipedia introductions, and POIs
3. Use GPT to generate a description that covers the information from #2
4. Convert the generated description from #3 into a human voice (Shu Qi) using TTS
5. Generate suitable BGM based on the location, weather, movement status, etc., from #1 & #2; if the user is walking, it should be soothing, and if the user is running, it can be more upbeat
6. Merge the audio tracks from #4 & #5 for playback
7. For demonstration purposes, we still need an interface to scroll through the description from #3
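The steps above can be read as a small pipeline. The following is only a minimal sketch of that flow, not Flaneur's actual backend: every name in it (`UserStatus`, `Segment`, `build_segment`, and the stand-in stage functions in `demo_segment`) is hypothetical, the coordinates are placeholders, and the real stages would call weather, Wikipedia, GPT, and TTS services instead of the lambdas used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserStatus:
    latitude: float
    longitude: float
    movement: str  # hypothetical labels such as "walking" or "running"


@dataclass
class Segment:
    script: str       # step 3: narration text
    voice_track: str  # step 4: identifier of the synthesized speech
    bgm_track: str    # step 5: identifier of the background music
    mixed_track: str  # step 6: identifier of the merged audio


def build_segment(
    status: UserStatus,
    gather_context: Callable[[UserStatus], List[str]],
    write_script: Callable[[List[str]], str],
    synthesize_voice: Callable[[str], str],
    pick_bgm: Callable[[UserStatus, List[str]], str],
    mix: Callable[[str, str], str],
) -> Segment:
    """Run steps 2 to 6 in order; each stage is passed in as a function."""
    facts = gather_context(status)    # step 2: weather, Wikipedia, POIs
    script = write_script(facts)      # step 3: text generation
    voice = synthesize_voice(script)  # step 4: text-to-speech
    bgm = pick_bgm(status, facts)     # step 5: background music
    return Segment(script, voice, bgm, mix(voice, bgm))  # step 6: merge


def demo_segment() -> Segment:
    """Wire the pipeline with stand-in stages that only return labels."""
    status = UserStatus(31.30, 121.51, "walking")  # placeholder coordinates
    return build_segment(
        status,
        gather_context=lambda s: ["weather: sunny", "poi: Wujiaochang"],
        write_script=lambda facts: "You are near " + facts[1].split(": ")[1] + ".",
        synthesize_voice=lambda text: f"voice({len(text)} chars)",
        pick_bgm=lambda s, facts: "soothing" if s.movement == "walking" else "upbeat",
        mix=lambda voice, bgm: f"{voice}+{bgm}",
    )
```

Passing the stages in as functions keeps the order of the steps in one place while letting each stage be swapped out, for example replacing on-the-fly music generation with pre-generated tracks.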

There wasn't much difficulty on the engineering side; the only issue we encountered was that the webpage couldn't obtain the user's movement status, so we decided to abandon that information. The interesting part was the AI-related implementation, which also inspired us a lot. The AI part can be divided into three sections: text generation, text-to-speech (TTS), and music generation.

The narration in the LV SoundWalk series is full of character, weaving in local history and color, so getting GPT to generate text in a similar style was crucial. Tuning GPT was mainly my job. Taking Wujiaochang as an example, I fed GPT some information from Wikipedia as part of the prompt and asked it to act as a "tour guide" introducing Wujiaochang, but the generated text was very "tour guide-like." I tried adding some "on-the-spot" descriptions, such as "you just passed an ancient door," which helped a little, but GPT still couldn't resist saying things like "Welcome to Wujiaochang" or "Hello, old friend." What I wanted was a gentle female voice "breaking" straight into your ear and starting the conversation without pleasantries (otherwise I would feel awkward, since it's such a pleasant female voice).

Later, it occurred to me to have GPT play the role of someone "introducing the nearby neighborhood to a blind friend," and this setting produced very good results! But GPT could never resist adding a consoling line at the end, such as "Even though you can't see, ..." Following the same idea, I adjusted the prompt, and the final prompt and effect were as follows:
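The sketch below is not that final prompt. It is a hypothetical illustration of how a prompt along these lines could be assembled from the pieces described above: the "blind friend" framing, Wikipedia-style facts, "on-the-spot" details, and explicit bans on greetings and consolation. The function name `build_narration_prompt` and all of the wording inside it are assumptions made for illustration.

```python
from typing import List


def build_narration_prompt(place: str, facts: List[str], scene_details: List[str]) -> str:
    """Assemble a narration prompt for a chat model (illustrative wording only)."""
    lines = [
        "You are walking beside a close friend who is blind.",
        f"Describe the neighborhood around {place} to them as you walk.",
        "Start talking right away: no greeting, no welcome, no sign-off.",
        "Do not mention or console them about their blindness.",
        "",
        "Background facts (use only these):",
    ]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", "Things you are passing right now:"]
    lines += [f"- {detail}" for detail in scene_details]
    return "\n".join(lines)
```

For example, `build_narration_prompt("Wujiaochang", ["Five roads meet here."], ["an old doorway"])` returns a multi-line string that could be sent as the user message of a chat request. Turning the unwanted behaviors (greetings, comforting remarks) into explicit instructions mirrors the trial-and-error described above.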

TTS was the most challenging part of the entire process! TTS, or text-to-speech, actually has many mature solutions, such as the phrases often heard on Douyin like "Family, who understands?" and "Pay attention, this man is called Xiaomei," but Flaneur clearly couldn't use such unrefined voices; it should at least be Gao Yuanyuan! So we researched customized TTS and found two solutions:

MockingBird: An open-source model that can generate readings of any text with just a few seconds of source audio, but requires self-deployment, and the demonstration effect is acceptable.
11Labs (ElevenLabs): Upload a 10-minute audio file and it can generate readings of any text, and the effect is stunning! Paid (it doesn't seem expensive). The downside is that it only supports English.

There are also some domestic vendors offering voice customization solutions, but they require 15+ working days and costs in the tens of thousands... It seems they are using very outdated technology, so the costs are high.

@York put a lot of effort into deploying and optimizing the MockingBird model, but the final effect was still mediocre. We looked into the technology behind it, and as far as we could tell MockingBird is based on an older, GAN-based generation of models, which might explain its average performance.

While @York was struggling with the model, I started playing with 11Labs. I first tried Shu Qi's voice + Chinese reading, and the result sounded like a foreigner who had just taken the HSK. Shu Qi's voice + English was a bit lacking, not as stunning. So what if we used a familiar foreign actress? My first thought was Scarlett Johansson and the movie Her.

The result was stunning! Although I couldn't "have" Shu Qi, I unexpectedly got Samantha; what more could I ask for!

Family, who understands!

Music generation was the most ordinary part. Many years ago there was already software that generated BGM with a BPM matching your step frequency, so there wasn't much room for imagination. And with limited time, I didn't think background music deserved too much effort, so I decided to use AI to pre-generate a batch of different BGM tracks and simply play the ready-made ones during the demo.
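Matching music to step frequency can be as simple as picking, from a set of pre-generated tracks, the one whose BPM is closest to the walker's steps per minute. The demo did not do this (it just played ready-made tracks), so the following is only a minimal sketch of the idea; the function name `pick_track`, the file names, and the BPM values are all made up for illustration.

```python
from typing import Dict


def pick_track(steps_per_minute: float, tracks_by_bpm: Dict[int, str]) -> str:
    """Return the pre-generated track whose BPM is closest to the step rate."""
    if not tracks_by_bpm:
        raise ValueError("no pre-generated tracks available")
    closest_bpm = min(tracks_by_bpm, key=lambda bpm: abs(bpm - steps_per_minute))
    return tracks_by_bpm[closest_bpm]


# Hypothetical library of pre-generated tracks, keyed by BPM.
LIBRARY = {90: "calm.mp3", 110: "stroll.mp3", 160: "run.mp3"}
```

With this library, a step rate of 105 selects `stroll.mp3` and a step rate of 170 selects `run.mp3`.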

Since the interface was very simple, we finished the interface development and API integration on the first day; the most time-consuming parts were probably the backend handled by @Xiao and the TTS handled by @York. By the evening of the second day, everything was working end to end. On both days we left the venue right at midnight, while many teams were still hard at work.

Time to leave

Are you there? Check out the effect#

Please see our demo video below:

Pretend there is a video

A brief explanation:

Just open the webpage to use, no operation needed
Everything is AI-generated, including the text, the music, and Samantha's sexy voice (the music is pre-generated, but it is AI-generated too)
We even embedded an advertisement as an Easter egg (who knows, it might really be commercialized)

Experience address: https://flaneur.polytimeapp.com/
You can also open it by reading the original text at the end; please open it on a mobile device.

After opening, you need to click some bubbles to start playback; the loading is still a bit slow, and the generated information is somewhat monotonous, so please give Flaneur some patience.

When we first conceived it, we were inspired entirely by LV SoundWalk. But when I actually used it and heard Samantha's voice introducing the neighborhood, I really wanted to chat with her!

I love walking; sometimes it's for thinking, sometimes it's with friends. What I enjoy most is going to an unfamiliar neighborhood and walking with "a very familiar friend." I often come up with strange associations and lame jokes; unfamiliar surroundings give me more inspiration and more cues, and having someone respond as we walk and talk is the most comfortable state of all.

Furthermore, if we added access to the phone's camera, using the CLIP model to understand images and feeding the result into the GPT prompt, Samantha (no, Flaneur) could really see what you see. She would truly be like an old friend walking with you, listening to your ramblings, accompanying you down one street after another. Just like in the movie Her. Her is already ten years old, and some of it happens to have been filmed in Shanghai.

It feels like a dream coming true, wow.

Demo Day!#

Demo Day is when a hundred teams showcase their results in one day, very cool! Each team only has 5 minutes, and going over time will result in a ruthless interruption, very harsh! I can't wait to see everyone's projects, one by one!

Our presentation was scheduled fourth from last, and by that time I was actually quite tired... However, it went very smoothly, and I managed to say everything I wanted to say, so there isn't much to write about.

I listened carefully to almost all the projects and took notes on the ones I liked, found interesting, or found impressive. I still can't match the diligence of my seniors; @Junyu took detailed notes on every project. Since the organizers might have some confidentiality considerations, I'll just write down some abstract thoughts.

I saw several projects combining AI with technology for good, which I really liked. The recent boom in generative AI has made many friends worry about whether they will be replaced by AI in the future (especially professions like lawyers, programmers, and investment researchers). Coincidentally, I talked with a friend yesterday about technological changes in human history; in fact, each time has been a liberation of humanity itself—short-term, some people may be affected and lose their jobs, but soon it will be discovered that people are actually liberated from "less human jobs" to do "jobs more suited for humans." AI can help marketing accounts generate content and also help visually impaired groups interact with the world more seamlessly.

Many projects focused on travel planning. We also considered this theme but realized we might have no data to work with. Dynamic flight pricing, hotel room rates, and even map routes are all constraints, and all of this data is controlled by OTA service providers with strict anti-scraping measures, so we could easily end up with a beautiful product and no usable data. Since the start of the mobile internet era, data has been tightly held by large companies and trapped in isolated app islands, leading users to form habits like "take a taxi with Didi," "watch videos on Douyin," and "search for boredom on Baidu." Yet all of this is essentially just "information," rather than "video / text / voice / map" or "notes / emails / schedules / TODOs," so it is hard to say this hasn't been a "detour of the internet."

The theme that most teams at Hack Engine were interested in, and which I am also most interested in, is "general knowledge management." GPT itself is a language model and lacks logical reasoning ability, while human knowledge actually exists in various logical relationships. The proposition "the Earth is round" is not important; the proposition "gravity causes the Earth's matter to gather towards the center, thus forming an approximately spherical Earth" is important. Epistemology defines knowledge as Justified True Belief (JTB), meaning knowledge must meet the following three conditions:

Someone believes something;
This belief is indeed true;
This belief is justified.

None of the three conditions can be missing. Here are two counterexamples: gravity causes the Earth's matter to gather towards the center, thus the Earth becomes a bagel (the belief is false); because there is a hamster running on a wheel inside the Earth, the Earth is round (the justification for this belief is wrong).

GPT stands for Generative Pre-trained Transformer, and it is a "large language model." The launch of ChatGPT seems more like a tactical move to capture user attention and data, and it doesn't necessarily mean that the best form or application of GPT is "conversational." Now everyone is building conversation products, and I think we have been led astray by OpenAI. Additionally, as mentioned earlier, GPT lacks logical reasoning ability, so asking GPT knowledge-based questions is unwise. I'm sure everyone has seen GPT spout nonsense with a straight face (hence GPT is also known as the "nonsense machine").

On the other hand, I personally believe GPT is very well suited to "content processing and generation under limited information." For example: Flaneur, where all the source information is provided by us; filtering my read-it-later list; summarizing an article (the TL;DR at the beginning of this article was written by GPT); automatically establishing associations in my knowledge base (which actually only needs embeddings); or generating a new article from fragmentary notes I have written.
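To illustrate the remark that associations in a knowledge base only need embeddings: once each note has an embedding vector, related notes can be found by cosine similarity, with no text generation involved. The sketch below assumes the vectors already exist. The three-dimensional vectors, the note titles, and the function names `cosine` and `related_notes` are toy values invented for illustration; real embeddings from an embedding model would have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_notes(
    target: str, vectors: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank every other note by similarity to the target note, best first."""
    scores = [
        (name, cosine(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


# Toy embeddings standing in for the output of a real embedding model.
NOTES = {
    "walking in Shanghai": [0.9, 0.1, 0.0],
    "LV SoundWalk": [0.8, 0.2, 0.1],
    "tax receipts": [0.0, 0.1, 0.9],
}
```

Here `related_notes("walking in Shanghai", NOTES, top_k=1)` ranks "LV SoundWalk" first, because its vector points in nearly the same direction as the target note's vector.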

I call these "general knowledge management," a theme I am very interested in and passionate about. Friends who are also interested in this theme are welcome to communicate!

[Image: Pieter Bruegel the Elder, The Tower of Babel (Vienna), via Google Art Project]

Building a giant spirit for humanity

Ending#

The whole of Hack Engine was tightly run, and the demos finished on time or even ahead of schedule, with results announced on Monday night. Flaneur was not selected, which still feels a bit regrettable. However, we thoroughly enjoyed the process itself and spent an unforgettable weekend. The weather in Shanghai was great these days too: the forecast had predicted rain, but *** it turned out to be sunny!

Shanghai always gives me a similar feeling: a beautiful beginning and process, leaving some regrets at the end.

Anyway, I am very grateful to my team, and extra thanks to @York for coming from Hangzhou to participate (we actually forgot to take a group photo TAT)

I would like to especially thank the organizing team at Jike; we had an almost perfect experience on-site, encountering no issues at all, completely unlike a first-time event. Hack Engine also paid great attention to details, such as the competition certificates being specially designed; such details, wow.

More advanced than Byte's employee badges, wishing Jike Company a speedy acquisition by ByteDance.

Although we were not selected, Flaneur was still widely liked, and many friends asked me if Flaneur would continue to be developed, which made us very happy. To be honest, we haven't figured it out; making a demo and making a real product are quite different, and whether the existing technology can meet our expected effects also needs further research. Our team is also facing some difficulties, making it hard to spare energy and resources to create a new product.

Anyway, if you also like Flaneur, please don't hesitate to let us know with your compliments!