In this post I'll explain how I built a fully personalized TikTok feed using data from Kaggle, powered by Streamlit + Pinecone + FirstBatch: a mini TikTok that tailors content to your tastes from the get-go.
Try it out here:
From curating the perfect dataset to deploying an intuitive web app with Streamlit, this journey is a testament to the power of embeddings & vector databases. Let's embark on this adventure, step by step:
I’ve used the TikTok Trending Videos, a collection of 1000 trending TikTok videos from 2020. The original author provides some context:
“I scraped the first 1000 trending videos on TikTok, using an unofficial TikTok web-scraper. Note to mention I had to provide my user information to scrape the trending information, so trending might be a personalized page. But that doesn't change the fact that certain people and videos got a certain amount of likes and comments.”
LangChain recently released a powerful Multi-Vector Retriever for RAG on tables, text, and images, exploring ideas of generating embeddings for distinct types of data, and methods for storing and retrieving them efficiently.
One significant insight I've noticed (which has also been highlighted in various other posts) is the potential of leveraging LLMs on multi-modal outputs to produce a "summary" that can then be used to create an embedding for contextual representation.
Recent tests have shown that embeddings derived from extensive passages tend to be less effective than those produced from a concise "summary" of the same content, especially in the context of retrieval and RAG.
“So why not do it here?” I asked myself.
Producing embeddings from the TikTok clips hinged on combining metadata with video frames. I began by extracting frames from each TikTok video, with videos averaging around ~200 frames. I used Llava (specifically llava-13b) to generate textual descriptions from these frames. After experimenting, I settled on using up to 6 frames per video, starting with the first frame and ending with the last, to craft each description.
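The frame-selection step can be sketched as a small helper that picks up to 6 evenly spaced frame indices, always including the first and last frame (an illustrative sketch, not the exact code I used):

```python
def sample_frame_indices(n_frames: int, max_frames: int = 6) -> list[int]:
    """Pick up to `max_frames` evenly spaced frame indices from a video,
    always including the first (0) and the last (n_frames - 1) frame."""
    if n_frames <= max_frames:
        return list(range(n_frames))
    # Evenly spaced positions across [0, n_frames - 1], endpoints included
    step = (n_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```

For a typical ~200-frame video this yields 6 indices spanning the whole clip, which keeps the Llava captioning cost fixed per video.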
Each set of frame descriptions is then fed into GPT-4, together with the post description (user text) and the music metadata, to produce a concise representation of the TikTok video.
Though a good portion of the music metadata was custom audio, GPT-4 was able to recognize published songs and incorporate their effect into the video description. These summaries are fed to FlagEmbedding ("BAAI/bge-small-en-v1.5") to generate 384-dimensional embeddings. Flag embeddings perform remarkably well for their size.
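Assembling the summarization input can look roughly like this (a minimal sketch; the prompt wording is my assumption, not the exact prompt I used):

```python
def build_summary_prompt(frame_descriptions: list[str],
                         post_text: str,
                         music_metadata: str) -> str:
    """Combine Llava frame descriptions, the post caption, and music
    metadata into a single summarization prompt for GPT-4."""
    frames = "\n".join(f"- {d}" for d in frame_descriptions)
    return (
        "Summarize this TikTok video in 2-3 sentences for retrieval.\n"
        f"Frame descriptions:\n{frames}\n"
        f"Post caption: {post_text}\n"
        f"Music: {music_metadata}\n"
    )
```

The resulting summary string is what gets encoded into a 384-dimensional vector.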
I created a Pinecone index with 384 dimensions, and upserting the vectors was rather easy:

```python
import itertools
from tqdm import tqdm

def chunks(iterable, batch_size):
    """Yield successive fixed-size chunks of (id, vector) pairs."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, batch_size)):
        yield chunk

pbar = tqdm(total=batch_size)  # batch_size: total number of vectors to upsert
for ids_vectors_chunk in chunks(gen, batch_size=bs):
    index.upsert(vectors=list(ids_vectors_chunk))
    pbar.update(bs)
pbar.close()
```
I used the User Embeddings dashboard to build a custom algorithm that turns embeddings stored in a vector database into personalized experiences.
My idea was to build an app that lets users navigate a large dataset using only their attention, meaning their watch time on each video shapes their feed. The Streamlit app measures the time spent on each video and sends one of 3 signals to FirstBatch based on that time: HIGH_ATTN (more than 15 seconds), MID_ATTN (12–15 seconds), or LOW_ATTN (9–12 seconds).
If a video is watched for less than 9 seconds, no signal is sent, based on the assumption that it typically takes 8-9 seconds for a user to determine a video's appeal. These settings can be adjusted via the dashboard, and any changes will immediately be reflected in the deployed app.
The algorithm presents the closest matches based on user embeddings derived from their watch time signals. Furthermore, it adjusts the results based on user satisfaction with new content batches. For instance, if users consistently display high levels of engagement, the content is deemed appropriate. However, if they begin to skip videos rapidly, the content becomes more diverse and randomized.
The algorithm has 5 states: 0, H, M, L, L-R.
0 is the initial state, where a random TikTok video is shown to users.
H is the state designated for a high attention signal, delivering content with almost no exploration and randomness. Uses only the last 3 signals.
The degree of randomness and expansion escalates from H→M→L→L-R. Simultaneously, the number of signals used for deriving user embeddings also grows. This is because the algorithm aims to highlight items that previously captivated the user to see if they still hold the user's interest.
I've shared the full code in this repo. There are three main components in the Streamlit app:
```python
# 1) Initialize the FirstBatch client, Pinecone index, and session state
if 'personalized' not in st.session_state:
    config = Config(batch_size=5, verbose=True, enable_history=True, embedding_size=384)
    personalized = FirstBatch(api_key=FIRST_BATCH_API_KEY, config=config)
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
    index = pinecone.Index("tiktok")
    personalized.add_vdb("tiktok_db_pinecone", Pinecone(index))
    st.session_state.personalized = personalized
    st.session_state.session = st.session_state.personalized.session(
        AlgorithmLabel.CUSTOM, vdbid="tiktok_db_pinecone", custom_id=CUSTOM_ALGO_ID)
    st.session_state.batches = []
    st.session_state.ids = []
    st.session_state.likes = []
    st.session_state.current_idx = 0
    st.session_state.watch_time = []
    ids, batch = st.session_state.personalized.batch(st.session_state.session)
    st.session_state.batches += batch
    st.session_state.ids += ids
    st.session_state.stamp = time.time()
```
```python
# 2) Render the current video with its caption and stats
def display_video_with_size(width=640, height=360, avatar_url=""):
    b = st.session_state.batches[st.session_state.current_idx]
    _id = b.data["id"]
    text = b.data["text"]
    username = b.data["username"]
    play_count = b.data["play_count"]
    likes = b.data["likes"]
    url = cdn_url.format(_id)

    padding_horizontal = 20
    adjusted_width = width - 2 * padding_horizontal
    lines_of_caption = max(1, len(text) // 40)
    caption_height = lines_of_caption * 20
    video_embed_code = "full code at github"  # HTML embed omitted here; see the repo
    total_height = height + 150 + caption_height + 40
    st.components.v1.html(video_embed_code, height=total_height)
```
```python
# 3) On "Next", measure watch time and send the matching attention signal
if st.button("Next"):
    cid = st.session_state.ids[st.session_state.current_idx]
    t2 = time.time()
    time_passed = t2 - st.session_state.stamp
    if time_passed > 5:  # if time spent on the post is more than 5 seconds, send signals
        if time_passed > 15:
            ua = UserAction(Signal.HIGH_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
        elif time_passed > 12:
            ua = UserAction(Signal.MID_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
        elif time_passed > 9:
            ua = UserAction(Signal.LOW_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
```
That's the essence of it. With around 140 lines of code, you can craft your own personalized content platform using Streamlit, FirstBatch, and a vector database of your preference.
Thanks for your time!