I'll try to explain how I built a fully personalized TikTok feed using data from Kaggle, powered by Streamlit + Pinecone + FirstBatch. The result is a mini TikTok that tailors content to your tastes from the get-go.
Try it out here:
From curating the perfect dataset to deploying an intuitive web app with Streamlit, this journey is a testament to the power of embeddings & vector databases. Let's embark on this adventure, step by step:
I’ve used the TikTok Trending Videos dataset, a collection of 1,000 trending TikTok videos from 2020. The original author provides some context:
“I scraped the first 1000 trending videos on TikTok, using an unofficial TikTok web-scraper. Note to mention I had to provide my user information to scrape the trending information, so trending might be a personalized page. But that doesn't change the fact that certain people and videos got a certain amount of likes and comments.”
LangChain recently released a powerful Multi-Vector Retriever for RAG on tables, text, and images, exploring ideas of generating embeddings for distinct types of data, and methods for storing and retrieving them efficiently.
One significant insight I've noticed (which has also been highlighted in various other posts) is the potential of using an LLM to condense multi-modal model outputs into a "summary" that can then be embedded for contextual representation.
Recent tests have shown that embeddings derived from extensive passages tend to be less effective than those produced from a concise "summary" of the same content, especially in the context of retrieval and RAG.
“So why not do it here?” I asked.
The core of producing embeddings from TikTok clips was combining metadata with video frames. I began by extracting frames from each TikTok video, with videos averaging around ~200 frames, and used Llava (specifically llava-13b) to create textual descriptions from those frames. After some experimentation, I settled on using up to 6 frames per video, spanning from the first frame to the last, to craft each description.
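Here is a minimal sketch of the frame-sampling step. It assumes OpenCV for decoding, evenly spaced sampling between the first and last frame, and a hypothetical describe_frames helper standing in for the llava-13b call (how you run Llava depends on where you host it):

import cv2
import numpy as np

def sample_frames(video_path: str, max_frames: int = 6) -> list:
    # Sample up to `max_frames` evenly spaced frames, from the first frame to the last.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=min(max_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# describe_frames() is a hypothetical helper wrapping the llava-13b call that
# turns the sampled frames into a textual description of the clip.
frames = sample_frames("some_tiktok_video.mp4")
description = describe_frames(frames)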
Each description is then fed into GPT-4, together with the post description (user text) and the music metadata, to produce a concise representation of the TikTok video. Although a good portion of the music metadata was custom audio, GPT-4 was able to reason about published songs and fold their influence into the video description. These summaries are passed to FlagEmbedding ("BAAI/bge-small-en-v1.5") to generate 384-dimensional embeddings; Flag embeddings are remarkably good for their size.
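Roughly, the summarize-then-embed step looks like the sketch below. The prompt wording, variable names, and example values are illustrative, but the embedding model is the one named above:

from openai import OpenAI
from FlagEmbedding import FlagModel

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = FlagModel("BAAI/bge-small-en-v1.5")  # produces 384-dimensional embeddings

def summarize_video(frame_descriptions, post_text, music_metadata):
    # Condense the Llava frame descriptions, post text, and music metadata into one summary.
    prompt = (
        "Write a concise summary of this TikTok video.\n"
        f"Frame descriptions: {frame_descriptions}\n"
        f"Post text: {post_text}\n"
        f"Music metadata: {music_metadata}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

summary = summarize_video(["a dog running on a beach", "the dog jumps into the water"],
                          "beach day with my pup", "original sound - user123")
vector = embedder.encode(summary)  # numpy array of shape (384,)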
I’ve created a Pinecone index with 384 dimensions, and upserting vectors was straightforward:
from tqdm import tqdm

# `gen` yields (id, vector) pairs; `chunks` splits it into batches of `bs` vectors.
# `batch_size` here is the total number of vectors, so total chunks = batch_size / bs.
pbar = tqdm(total=int(batch_size / bs))
for ids_vectors_chunk in chunks(gen, batch_size=bs):
    index.upsert(vectors=list(ids_vectors_chunk))
    pbar.update(1)  # one tick per upserted chunk
pbar.close()
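For completeness, creating the index itself with the v2-style pinecone client used throughout looks roughly like this; the dimension matches the bge-small-en-v1.5 embeddings, while the cosine metric is my assumption:

import pinecone

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

# 384 dimensions to match bge-small-en-v1.5; "cosine" is an assumption, not confirmed by the repo
if "tiktok" not in pinecone.list_indexes():
    pinecone.create_index("tiktok", dimension=384, metric="cosine")

index = pinecone.Index("tiktok")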
I’ve used the User Embeddings dashboard to build a custom algorithm that can use embeddings stored in a vector database to generate personalized experiences.
My idea was to build an app that lets users navigate a large set of data using only their attention, meaning their watch time on each video shapes their feed. The Streamlit app measures the time spent on each video and, based on it, sends one of 3 signals to FirstBatch: HIGH_ATTN (watched for more than 15 seconds), MID_ATTN (more than 12 seconds), or LOW_ATTN (more than 9 seconds).
If a video is watched for less than 9 seconds, no signal is sent, based on the assumption that it typically takes 8-9 seconds for a user to determine a video's appeal. These settings can be adjusted via the dashboard, and any changes will immediately be reflected in the deployed app.
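Boiled down, the watch-time-to-signal mapping (shown in full in the button handler later) is roughly this:

from firstbatch import Signal  # assuming the SDK import path used in the app code below

def signal_for_watch_time(seconds: float):
    # Under ~9 seconds no signal is sent; the thresholds are configurable from the dashboard.
    if seconds > 15:
        return Signal.HIGH_ATTN
    if seconds > 12:
        return Signal.MID_ATTN
    if seconds > 9:
        return Signal.LOW_ATTN
    return None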
The algorithm presents the closest matches based on user embeddings derived from their watch time signals. Furthermore, it adjusts the results based on user satisfaction with new content batches. For instance, if users consistently display high levels of engagement, the content is deemed appropriate. However, if they begin to skip videos rapidly, the content becomes more diverse and randomized.
The algorithm has 5 states: 0, H, M, L, L-R.
0 is the initial state, where random TikTok videos are shown to the user.
H is the state designated for a high attention signal, delivering content with almost no exploration or randomness. It uses only the last 3 signals.
The degree of randomness and expansion escalates from H→M→L→L-R. Simultaneously, the number of signals used for deriving user embeddings also grows. This is because the algorithm aims to highlight items that previously captivated the user to see if they still hold the user's interest.
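Since these states are configured in the FirstBatch dashboard rather than in application code, the following is only a rough, purely illustrative summary of the progression:

# Purely illustrative: the real state definitions live in the User Embeddings dashboard.
STATE_BEHAVIOR = {
    "0":   "cold start: serve random videos",
    "H":   "high attention: minimal exploration, embedding built from the last 3 signals",
    "M":   "medium attention: more exploration, larger signal window",
    "L":   "low attention: high exploration, even larger signal window",
    "L-R": "low attention + randomness: maximum exploration and randomization",
}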
I’ve shared the full code in this repo. There are three main components in the Streamlit app:
import time
import streamlit as st
import pinecone
# Assuming the firstbatch SDK exposes these names, as used in the original repo
from firstbatch import FirstBatch, Config, Pinecone, AlgorithmLabel, UserAction, Signal

if 'personalized' not in st.session_state:
    # Configure FirstBatch to match the 384-dim bge-small-en-v1.5 embeddings
    config = Config(batch_size=5, verbose=True, enable_history=True, embedding_size=384)
    personalized = FirstBatch(api_key=FIRST_BATCH_API_KEY, config=config)
    # Connect to the Pinecone index holding the TikTok vectors
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
    index = pinecone.Index("tiktok")
    personalized.add_vdb("tiktok_db_pinecone", Pinecone(index))
    st.session_state.personalized = personalized
    # Start a session with the custom algorithm built in the dashboard
    st.session_state.session = st.session_state.personalized.session(
        AlgorithmLabel.CUSTOM, vdbid="tiktok_db_pinecone", custom_id=CUSTOM_ALGO_ID)
    st.session_state.batches = []
    st.session_state.ids = []
    st.session_state.likes = []
    st.session_state.current_idx = 0
    st.session_state.watch_time = []
    # Fetch the first personalized batch and start the watch-time clock
    ids, batch = st.session_state.personalized.batch(st.session_state.session)
    st.session_state.batches += batch
    st.session_state.ids += ids
    st.session_state.stamp = time.time()
def display_video_with_size(width=640, height=360, avatar_url=""):
    # Pull the current video's metadata out of the active batch
    b = st.session_state.batches[st.session_state.current_idx]
    _id = b.data["id"]
    text = b.data["text"]
    username = b.data["username"]
    play_count = b.data["play_count"]
    likes = b.data["likes"]
    url = cdn_url.format(_id)  # cdn_url is a module-level template defined in the repo
    # Estimate how much vertical space the caption will need
    padding_horizontal = 20
    adjusted_width = width - 2 * padding_horizontal
    lines_of_caption = max(1, len(text) // 40)
    caption_height = lines_of_caption * 20
    video_embed_code = "full code at github"  # the actual HTML template lives in the repo
    total_height = height + 150 + caption_height + 40
    st.components.v1.html(video_embed_code, height=total_height)
if st.button("Next"):
    cid = st.session_state.ids[st.session_state.current_idx]
    t2 = time.time()
    time_passed = t2 - st.session_state.stamp
    if time_passed > 5:
        # If more than 5 seconds were spent on the post, map watch time to an attention signal
        if time_passed > 15:
            ua = UserAction(Signal.HIGH_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
        elif time_passed > 12:
            ua = UserAction(Signal.MID_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
        elif time_passed > 9:
            ua = UserAction(Signal.LOW_ATTN)
            st.session_state.personalized.add_signal(st.session_state.session, ua, cid)
That's the essence of it. With around 140 lines of code, you can craft your own personalized content platform using Streamlit, FirstBatch, and a vector database of your choice.
Thanks for your time!