Running teloxide bots cheaply on Fly.io

I run a lot of Telegram bots, most of them using Teloxide. I’ve been using my own server to run them, but I decided to move some of them to their own infra for extra resilience.

Fly.io can start the bot only when requests come in, and it's fine with the bot shutting itself down after a bit of inactivity (as judged by the app itself). This is an article about how I do it, with code snippets and advice, plus a few notes on how I organize the codebase. It might save you an hour, and it will certainly save me some time on my next bot. Maybe it'll even inspire you to make a nicer, more resilient bot with a good webapp interface for the advanced features.

First of all, here's what has to change in your already-working bot for this setup to work:

  1. Use webhook rather than long-polling to receive updates
  2. Implement the inactivity-shutdown logic
  3. Configure fly.io to start the bot on requests, and not restart it when it shuts down

And some other tips that are optional and a matter of taste, but that work well for me and my bots:

  1. Webapp crates and tips
  2. Notes on passing parameters to the webapps
  3. Use tracing and Sentry (unfortunately, still far from perfect)
  4. Use the app events bus
  5. Use pg_notify to get real-time pushes from the database

Bot configuration

Let’s start with the most essential parts. If you want Fly.io to be able to wake your bot up on updates, you have to use a webhook rather than long-polling.

Configuring the URL programmatically lets you set it from environment variables, and also test everything locally (I use Cloudflare’s tunnel for that).
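
For example, a minimal sketch; the WEBHOOK_URL and BIND_ADDRESS variable names are my own placeholders, not anything Fly.io or teloxide requires:

use std::net::SocketAddr;

// hypothetical env var names; pick whatever fits your deployment
let url: url::Url = std::env::var("WEBHOOK_URL")?.parse()?;
let address: SocketAddr = std::env::var("BIND_ADDRESS")
    .unwrap_or_else(|_| "0.0.0.0:8080".to_string())
    .parse()?;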

use teloxide::{
    prelude::*, // Dispatcher, LoggingErrorHandler, Update, ...
    update_listeners::webhooks::{axum_no_setup, Options},
};

let mut disp = Dispatcher::...;

// I didn't find shutdown_future very useful (nor did I understand
// what it does); the most it ever did was hang the process
let (listener, _shutdown_future, routes) = axum_no_setup(Options::new(address, url.clone()));

// we'll use this to shut the bot down
let bot_shutdown_token = disp.shutdown_token();

// we'll await this in the end to make sure the bot is done
let bot_dispatcher_task = tokio::spawn(async move {
    disp.dispatch_with_listener(
        listener,
        LoggingErrorHandler::with_custom_text("An error from the Telegram update listener"),
    )
    .await
});

// configure the webhook address
bot.set_webhook(url).send().await?;
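
By the way, the Dispatcher construction elided above looks roughly like this (a sketch, not the exact code; handler is the dptree handler we'll get to below, and conf is your BotConfig with whatever dependencies your handlers need, such as the inactivity monitor):

// rough sketch of the elided part; adjust the dependencies to your bot
let mut disp = Dispatcher::builder(bot.clone(), handler)
    .dependencies(dptree::deps![conf])
    .build();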

Inactivity-shutdown logic

Here’s my inactivity-shutdown monitor. You can pass it around or you can store it globally, I won’t judge.

Basically, every time something happens, call monitor.notify_activity(). If nothing happens for INACTIVITY_THRESHOLD_SECONDS, the bot shuts down. If monitor.notify_activity() returns an error, the app is already shutting down, so call it early to avoid leaving work half-done. If you return an error to Telegram, it will call your bot again in a bit.

This handles inactivity, Ctrl+C, and SIGTERM. There are a couple of obvious ways to improve it.

use anyhow::{anyhow, Result};
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, SystemTime, UNIX_EPOCH};
use tokio::task::JoinHandle;
use tokio::{signal, time};

#[derive(Clone, Debug)]
pub struct InactivityMonitor {
    last_activity: Arc<AtomicU64>,
    is_shutting_down: Arc<AtomicBool>,
}

// tweak it. In most cases 1 minute is enough IMO
const INACTIVITY_THRESHOLD_SECONDS: u64 = 60 * 8;

impl InactivityMonitor {
    /// Creates a new instance of InactivityMonitor
    pub fn new() -> Self {
        let current_time = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();

        Self {
            last_activity: Arc::new(AtomicU64::new(current_time)),
            is_shutting_down: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Records the current time as the last activity time.
    /// Returns an error if the system is already shutting down.
    pub fn notify_activity(&self) -> Result<()> {
        // Check if we're already shutting down
        if self.is_shutting_down.load(Ordering::Acquire) {
            return Err(anyhow!("Already shutting down"));
        }

        let current_time = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();

        self.last_activity.store(current_time, Ordering::Release);
        Ok(())
    }

    /// Returns a JoinHandle that completes after INACTIVITY_THRESHOLD_SECONDS of inactivity.
    /// The handle can be used to await the shutdown condition.
    pub fn inactivity_shutdown_handle(&self) -> tokio::task::JoinHandle<()> {
        let last_activity = self.last_activity.clone();
        let is_shutting_down = self.is_shutting_down.clone();

        tokio::spawn(async move {
            loop {
                time::sleep(Duration::from_secs(1)).await;

                let current_time = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap()
                    .as_secs();

                let last_activity_time = last_activity.load(Ordering::Acquire);
                let inactive_duration = current_time.saturating_sub(last_activity_time);

                if inactive_duration >= INACTIVITY_THRESHOLD_SECONDS {
                    // Mark as shutting down before breaking the loop
                    is_shutting_down.store(true, Ordering::Release);
                    break;
                }
            }
        })
    }
}

impl Default for InactivityMonitor {
    fn default() -> Self {
        Self::new()
    }
}

pub async fn shutdown_signal(
    shutdown_tg: impl FnOnce() -> Result<()>,
    inac_monitor_handle: JoinHandle<()>,
) {
    let ctrl_c = async {
        signal::ctrl_c()
            .await
            .expect("failed to install Ctrl+C handler");
    };

    #[cfg(unix)]
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("failed to install signal handler")
            .recv()
            .await;
    };

    #[cfg(not(unix))]
    let terminate = std::future::pending::<()>();

    tokio::select! {
        _ = inac_monitor_handle => {
            println!("Shutting down because of inactivity...");
        },
        _ = ctrl_c => {},
        _ = terminate => {},
    }

    println!("Shutting down...");

    shutdown_tg().unwrap();
}

Now let’s use it. First of all, in your main function:

let inactivity_monitor = InactivityMonitor::new();

// set everything up, then...

axum::serve(
        todo!(), // your listener
        todo!(), // your routes
    )
    .with_graceful_shutdown(shutdown_signal(
        move || {
            // start shutting down the bot, but return synchronously;
            // dropping the returned future is fine, it only waits for
            // the shutdown to complete
            std::mem::drop(bot_shutdown_token.shutdown()?);

            Ok(())
        },
        inactivity_monitor.inactivity_shutdown_handle(),
    ))
    .await?;

println!("Webserver is done... waiting for the bot...");

bot_dispatcher_task.await?;

println!("Bot is done... bye!");

And everywhere anything ever happens:

// if we unwrap or ?-return here, before doing anything,
// we'll basically error out if the bot is shutting down
// so e.g. telegram can retry the request
inactivity_monitor.notify_activity().unwrap();

For example, in your teloxide dptree handler (BotConfig here is my own dependencies struct, injected via dptree::deps!):

let handler = dptree::entry()
        .map(move |conf: BotConfig, _update: Update| async move {
            // errors out early while the bot is shutting down
            conf.inactivity_monitor.notify_activity()?;
            anyhow::Ok(())
        })
        // ...all your bot logic...
        .branch(todo!())
        .branch(todo!());

As an axum middleware:

// clone the monitor for the closure, and attach this builder to your router with .layer(...)
let monitor = inactivity_monitor.clone();

ServiceBuilder::new()
    .map_request(move |req: Request<axum::body::Body>| {
        if let Err(err) = monitor.notify_activity() {
            eprintln!("Error in HTTP: {:?}", err);
        }

        req
    })

And if you have any other queue or long-running task, don't forget to call notify_activity there as well.
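
For instance, a hypothetical background worker might look like this (job_receiver is assumed to be a mutable tokio mpsc receiver of jobs and process_job does the actual work; both are placeholders, not from my bots):

// hypothetical worker: keeps activity fresh only while jobs keep coming
let monitor = inactivity_monitor.clone();
tokio::spawn(async move {
    while let Some(job) = job_receiver.recv().await {
        // refuse new work once the app has started shutting down
        if monitor.notify_activity().is_err() {
            break;
        }
        process_job(job).await;
    }
});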

Fly.io configuration

# in your fly.toml
[http_service]
  # never let the Fly proxy stop the machine; the bot stops itself
  auto_stop_machines   = "off"
  # but do start a machine when a webhook request comes in
  auto_start_machines  = true
  min_machines_running = 0

Crates and webapps

Now let’s move on to the optional extra advice.

Telegram webapps are nice but a bit limited. Currently all of my webapp-powered bots use maud and htmx for a very simple ajaxy UI.

I considered dioxus-liveview for more complicated UIs, but I’m a bit worried about its limitations, and about the fact that it hangs when the network drops (a problem on mobile, where these bots are used most of the time).

I’m currently trying out dioxus-fullstack, but I keep hitting one problem after another, and will probably wait until 0.6 stabilizes.

Given that we’re already using axum, it’s easy to attach a bunch of additional routes.

As a note, I split the bots into several crates (e.g. web, tg, global, and the binary serve), to help with compile times and make rust-analyzer’s life a little easier.

Unfortunately, the only way I could find to configure the base webapp URL was by talking to BotFather.

This is my base HTML:

pub fn base_html(body: Markup) -> Markup {
    html! {
        (DOCTYPE)

        head {
            // without this, the app looks bad on iOS
            meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0";

            script src="https://cdn.tailwindcss.com" {}
            script src="https://unpkg.com/[email protected]" {}
        }

        (body)
    }
}

Closing the page is very useful in some cases:

fn close_tab() -> Markup {
    html! {
        head {
            script src="https://telegram.org/js/telegram-web-app.js" { }
        }
        body {
            script {
                "window.Telegram.WebApp.close();"
            }
        }
    }
}

Passing parameters to the webapp

I don’t use the “normal” webapp methods, nor Telegram’s frontend user-identification features, because they don’t work with groups. To reliably figure out which group opened the webapp, I use Telegram’s special URLs instead:

InlineKeyboardButton::url(
    "app",
    format!("https://t.me/{bot_username}/food?startapp={}", chat_identifier).parse()?,
)

Feel free to sign chat_identifier or do something fancier if you feel that this is not secure enough for you. Note that startapp only allows up to 80 characters.

In some cases I need to link straight to a specific page, so I abuse startapp to carry paths, replacing / with a run of underscores (___ below). I don’t use base64 to avoid running afoul of the 80-character limit. Here’s the axum middleware that makes it work:

fn rewrite_request_uri(mut req: Request<axum::body::Body>) -> Request<axum::body::Body> {
    tracing::info!("{} {}", req.method(), req.uri());

    let query = qstring::QString::from(req.uri().query().unwrap_or_default());

    if let Some(startapp) = query.get("tgWebAppStartParam") {
        let route = startapp.replace("___", "/");

        // REWRITE_PATH is an app-specific Option holding the route prefix
        *req.uri_mut() = format!("{}/{}", REWRITE_PATH.unwrap_or_default(), route)
            .parse()
            .unwrap();

        tracing::info!("Rewritten to {}", req.uri());
    }

    req
}

// note: calling .into_make_service() on a plain service (not a Router) needs axum's ServiceExt in scope
let rewrite_request_uri_middleware = tower::util::MapRequestLayer::new(rewrite_request_uri);
axum::serve(
        listener,
        rewrite_request_uri_middleware
            .layer(final_app)
            .into_make_service(),
    )...

Tracing and Sentry integration

tracing is a great logging library for async apps, and Sentry is very useful for monitoring errors and performance. When you use both, you can easily see which specific spans are slow, and what’s going on in general.

Even though Sentry is pretty good at tracking your tracing spans, none of the current libraries I could find can attach the current span to an error, so you won’t get a reasonable async stack trace this way. Maybe one day I’ll find the time to fix that.
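
For reference, wiring the two up looks roughly like this (a sketch, assuming the sentry crate with its tracing integration feature plus tracing-subscriber; the DSN and the sample rate are placeholders):

use tracing_subscriber::prelude::*;

// placeholder DSN; traces_sample_rate enables performance monitoring
let _sentry_guard = sentry::init((
    "https://examplePublicKey@o0.ingest.sentry.io/0",
    sentry::ClientOptions {
        traces_sample_rate: 1.0,
        ..Default::default()
    },
));

// send tracing events and spans both to stdout and to Sentry
tracing_subscriber::registry()
    .with(tracing_subscriber::fmt::layer())
    .with(sentry::integrations::tracing::layer())
    .init();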

To configure tracing with teloxide:

// Got this from here https://github.com/teloxide/teloxide/discussions/872
fn spanned<E: 'static>(handler: UpdateHandler<E>) -> UpdateHandler<E> {
    use teloxide::{
        dispatching::DpHandlerDescription,
        dptree::{from_fn_with_description, HandlerDescription as _},
    };
    use tracing::{info_span, Instrument};

    // This is a hacky replacement for `handler.description().clone()`.
    // Ideally cloning `DpHandlerDescription` would be supported by `teloxide`.
    let description = DpHandlerDescription::entry().merge_chain(handler.description());

    from_fn_with_description(description, move |deps, cont| {
        handler.clone().execute(deps, cont).instrument(info_span!("update handling"))
    })
}

Use sentry_tower to make it work with axum as well.
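
Something along these lines (a sketch based on sentry_tower's docs; app is your router, and the exact layers depend on which features you enable):

use axum::{body::Body, http::Request};
use tower::ServiceBuilder;

// bind a Sentry hub per request and start a performance transaction for each one
let app = app.layer(
    ServiceBuilder::new()
        .layer(sentry_tower::NewSentryLayer::<Request<Body>>::new_from_top())
        .layer(sentry_tower::SentryHttpLayer::with_transaction()),
);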

Use an app-events bus

Nothing complicated, but very useful:

#[derive(Debug, Clone)]
pub enum AppEvent {
    DoStuff(Uuid),
    OtherStuff(Uuid),

    // useful for debugging:
    RecreateWebhook,
    SendAdminMessage(String),
}

// I don't know why 3
let (events_sender, events_receiver) = tokio::sync::mpsc::channel(3);

Now both the webapp and the bot can push things to do onto the same queue.
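
The consuming side is a simple loop; a rough sketch (what each event actually does is up to you):

// spawn a single consumer for app events
tokio::spawn(async move {
    // rebind mutably so we can call recv() on the moved receiver
    let mut rx = events_receiver;
    while let Some(event) = rx.recv().await {
        match event {
            AppEvent::RecreateWebhook => { /* re-run set_webhook here */ }
            AppEvent::SendAdminMessage(text) => tracing::info!("admin message: {text}"),
            other => tracing::info!("app event: {:?}", other),
        }
    }
});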

pg_notify

Some of my bots work with a Postgres database and keep some config there that needs to stay up to date. I started out by caching and re-requesting the config all the time, but pg_notify turned out to be nicer and more fun to use. Since most of the bots shut down quickly, I’m not worried about keeping the DB busy with useless subscribers.

I looked at the code I have for pg_notify and I’m too ashamed to post it right now. After a while, I might decide it can make for a nice crate. For now - you’re on your own, but it’s not complicated.
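
If you want a starting point anyway, the general shape is something like this (a rough sketch using sqlx's PgListener, not the code from my bots; database_url, the channel name, and the reload step are placeholders):

use sqlx::postgres::PgListener;

// subscribe to a channel your triggers NOTIFY on (channel name is made up)
let mut listener = PgListener::connect(&database_url).await?;
listener.listen("config_updated").await?;

tokio::spawn(async move {
    loop {
        match listener.recv().await {
            Ok(notification) => {
                tracing::info!("config changed: {}", notification.payload());
                // reload or patch the cached config here
            }
            Err(err) => {
                tracing::warn!("pg_notify listener error: {:?}", err);
                break;
            }
        }
    }
});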