System Overview
TimeLens is composed of several core subsystems that together enable the end-to-end experience of historical reconstruction in AR. Each subsystem is designed with modular “agentic” principles: different AI components operate like cooperative agents, each handling a specific task (data processing, image generation, etc.) and communicating through defined protocols. This design is conceptually similar to the emerging Model Context Protocol (MCP) standards in agentic AI, where multiple AI agents share context and coordinate tasks.
Below, we outline the five main components of TimeLens and how they interact:
- 1. Data Ingestion Layer: This is the foundation where all historical information is collected and prepared. The system ingests:
- Historical Maps & Atlases: e.g. high-resolution digitized maps from the New York Public Library and David Rumsey collection. These maps provide street layouts and building footprints from different eras.
- Photographs & Blueprints: Archived photographs, paintings, and building plans are used to understand architectural styles and details.
- Urban Planning Records: Old city planning documents, land use records, and even textual descriptions help determine building functions, materials, and heights.
- Geospatial Data: Modern GIS data (e.g. elevation models, current coordinates) to accurately align historical features with today’s geography.
- The ingestion layer includes data cleaning and alignment routines. For example, an old map might be georeferenced onto modern coordinates (using tools like QGIS), so that the AI knows exactly where a 1900s building corresponds on today’s map. The output of this layer is a unified, time-indexed geospatial database of each location.
- 2. Topological Abstraction Engine: Using the collected geospatial data, TimeLens constructs an abstract topological model of the scene for each historical era. Essentially, this engine converts raw maps and points into a mathematical graph enriched with shapes:
- It identifies landmarks and intersections as points (which we treat as 0-dimensional simplices, or 0-simplices in topology).
- It identifies roads or paths connecting those points as edges (1-dimensional simplices, or 1-simplices).
- It identifies larger surface elements like city blocks or building footprints as filled areas (2-dimensional simplices, or 2-simplices).
- Together, these points, edges, and faces form a simplicial complex, a structure from algebraic topology that represents a space by breaking it into simple pieces. In simple terms, a simplicial complex is a set of vertices, line segments, triangles, etc., glued together along their faces. More formally, any face of any simplex in the complex is also included in the complex, and any two simplices meet along shared faces. This ensures the representation is closed under intersections (if two historical buildings share a wall, that wall is itself part of the complex).
Example: Consider a small intersection as it was in 1920. The engine might produce a simplicial complex K_1920 like:
```python
K_1920 = {
    "0-simplices": ["v0 (Times Square)", "v1 (Empire State)", "v2 (Library)"],
    "1-simplices": [("v0", "v1"), ("v0", "v2")],   # roads like Broadway, 5th Ave
    "2-simplices": [("v0", "v1", "v2")]            # an area (e.g., a triangular city block) connecting those landmarks
}
```
Here v0, v1, v2 are key points (0-simplices). ("v0","v1") is a road segment between two landmarks, etc. This is a toy illustration – real city complexes have thousands of simplices.
- 3. Cochain Feature Assignment: On top of the simplicial complex, TimeLens assigns rich attributes to each element via cochains. In algebraic topology, a cochain is essentially a function that assigns a value (from some group or field) to each simplex. In our case, we use cochains to assign descriptive features:
- For each 0-simplex (point/landmark): attributes might include name of building, construction date, architectural style (e.g. Gothic, Art Deco), etc.
- For each 1-simplex (road): attributes like road width, paving material (cobblestone vs asphalt), traffic density of the era (horse carriages vs cars).
- For each 2-simplex (zone/block): attributes like land use (market, residential), average building height, material (wood, brick, stone) of buildings in that block, etc.
- These features form vectors attached to simplices. The collection of all these assignments is a cochain on the simplicial complex. By capturing such data, we introduce domain knowledge into the model – for instance, knowing a district was mostly wooden houses can influence the visuals (the AI might depict a fire hazard in 1666 London’s wooden city). We also leverage the coboundary operator δ from cohomology, which maps a k-cochain to a (k+1)-cochain, to model temporal changes. Intuitively, δ here measures how attributes change over time or across adjacent simplices. For example, if a building (0-simplex) exists in 1900 but not in 1950, the “difference” (a 1-cochain spanning the time gap) flags a demolition. In this way, temporal deltas (like construction or destruction events) are encoded, helping the AI avoid anachronisms (e.g., not showing a building before it was built or after it was demolished).
- 4. Latent Scene Encoding Engine: This subsystem converts the structured topological representation (graph + features) into a numeric latent vector that the generative model can understand. We employ a combination of graph neural networks (GNNs) and other neural encoders:
- For each category of object, we use a specialized encoder. For instance, we use a PointMLP (a small multi-layer perceptron) to embed 0-simplices (landmark points) into a feature vector capturing that landmark’s properties. Likewise, an EdgeCNN could encode 1-simplices (roads) – for example, by taking a sequence of coordinates that define the road’s shape and applying a 1D convolution to capture its geometry. A FaceTransformer might encode 2-simplices (zones) by attending to the properties of all buildings in a block.
- Once every vertex, edge, and face has its own feature vector (say in a 512-dimensional space), we apply a Graph Neural Network to allow them to inform each other. GNNs are neural networks tailored for graph-structured data. They work by iterative message passing: each node (or higher-order simplex) exchanges information with its neighbors and updates its embedding. In our pipeline we use a Graph Attention Network (GAT), which is a kind of GNN that learns to weight the importance of neighbors via an attention mechanism. This is useful because not all neighboring information is equally relevant – e.g., a landmark might pay more “attention” to a major road next to it than to a small alley. The GAT layer produces refined embeddings $h_i'$ for each element by aggregating neighbors $j$ with learned attention weights $\alpha_{ij}$. In code, using PyTorch Geometric’s API:
```python
from torch_geometric.nn import GATConv

gat = GATConv(in_channels=512, out_channels=512, heads=4)  # 4-head attention
h_prime = gat(x, edge_index)  # x = initial node features, edge_index = graph connectivity
```
Here h_prime would be the updated set of features for all nodes after one round of message passing.
- Global Scene Vector: After several GNN layers, we obtain a set of enriched feature vectors for the entire complex (each corresponding to a point, road, or area). We then need to summarize the entire scene (the whole city snapshot at time t) into one latent representation. We achieve this via a graph pooling or readout operation – essentially a weighted sum or average of all feature vectors, sometimes with an attention mechanism to focus on the most salient features. We denote the final scene encoding as $Z_t$. For example, we can compute:
$$Z_t = \sum_{i \in I} \alpha_i h_i,$$
where $h_i$ are the feature vectors of all simplices $i$ in the complex, and $\alpha_i$ are learned weights that determine each feature’s importance in the scene (one can imagine $\alpha_i$ might be higher for large landmarks or central districts). This single vector $Z_t$ condenses “all the information about the city at time t” – the geometry, the styles, the materials, etc. – into a form the image generator can condition on.
- 5. Diffusion-Based Image Generator: This is the heart of TimeLens’s visual reconstruction – a generative AI model that produces the actual image of the historical scene, given the condition vector $Z_t$ and the user’s current view. We use a state-of-the-art denoising diffusion probabilistic model (DDPM) as our image generator, enhanced with a transformer-based architecture for flexibility. Diffusion models have emerged in recent years as powerful generative models for images. They work by starting with pure noise and gradually refining it into a coherent image, effectively learning the reverse process of noise diffusion. Our generator has the following features:
- Architecture: We use a U-Net backbone with Transformer cross-attention layers, similar to the design of models like Stable Diffusion. The U-Net provides the hierarchical multi-scale image refinement, while the Transformer layers allow the model to condition on external inputs (in our case, the scene encoding and view information) via cross-attention. Cross-attention is a mechanism where the model learns to align or “pay attention” to relevant parts of the conditioning vector when generating each part of the image. In Stable Diffusion for example, the latent image features query the text embeddings through cross-attention to ensure the output matches the prompt. By analogy, in TimeLens our latent image features will query $Z_t$ (and possibly a textual label of the era) so that the output image aligns with the historical context. This way, the model knows to draw, say, horses not cars if $Z_t$ represents nineteenth-century traffic.
- View/Positional Conditioning: We incorporate the user’s camera viewpoint as part of the conditioning. When the AR interface sends the user’s perspective (device GPS position, orientation, camera intrinsics), we encode that too (using a simple positional encoding or 6-DoF pose encoding) and include it in the conditioning vector. This ensures the generated image is rendered from the correct perspective, matching the angle at which the user is looking. (For instance, viewing a building from the north versus the south side produces a different image.)
- Training Objective: The diffusion model is trained with a combination of losses:
- Denoising Loss: At its core, the diffusion model is trained to predict and remove noise. We use the standard L2 reconstruction loss between the model’s output and the original image data during training. Since we often don’t have actual photographs for every viewpoint in the past, we train on proxy tasks (like rendering synthetic data or partial comparisons to known photos). The model learns to go from a noisy image $x_t$ to a slightly less noisy $x_{t-1}$ iteratively.
- Style and Content Loss: To ensure the generated imagery is not just structurally correct but also stylistically authentic, we employ perceptual losses. We utilize a pretrained visual model (like VGG or CLIP) to compute a style/content loss. For example, a CLIP-based loss can ensure the image looks like a 1910 photograph if the prompt is "New York 1910" – CLIP embeddings of the generated image should be close to embeddings of real historical images or the textual descriptions. Likewise, a VGG-based style loss ensures texture consistency: if historical records say buildings were made of red brick, the neural style loss penalizes textures that diverge from that.
- Temporal/Spatial Coherence Penalty: A novel aspect of TimeLens is enforcing coherence across frames and viewpoints. If a user slightly moves their camera, or if we generate the same scene twice, the results should not wildly differ (no flickering or object jumps). We introduce a penalty that measures differences between images generated for close-by viewpoints or time steps. Techniques from video diffusion models are relevant here – for instance, adding a temporal discriminator or temporal attention layer that tries to maintain consistency across frames. In practice, during training we sometimes input two very close camera poses and encourage the model to produce similar outputs except for the necessary parallax shift. This significantly reduces jitter and reinforces the notion that there is a single consistent 3D scene being rendered.
- Sampling (Image Generation Process): When it’s time to actually generate an image for the user:
- We start with a random noise latent image (a tensor) and the conditioning pair { $Z_t$, camera pose }.
- The diffusion model runs its reverse diffusion process: a series of steps where it predicts noise and subtracts it, gradually transforming the noise into an image. Each step is guided by the conditioning – through cross-attention the model “knows” what should appear where (e.g., if $Z_t$ encodes a street layout, the model will preferentially form streets in those locations).
- After say 50–100 diffusion steps, we obtain a final latent image, which is then decoded (via a decoder or VAE) into a full-resolution image. We often generate at a moderately high resolution (for example, 512×512 or 1024×1024).
- Super-Resolution: To achieve the level of detail and clarity needed for AR (possibly the equivalent of 4K resolution in the user’s view), we pass the output through a super-resolution GAN. We utilize Enhanced SRGAN (ESRGAN) for this purpose, as it is known to produce realistic high-frequency details without creating artifacts. The super-resolution model upscales the image (e.g. 4× or 8× enlargement) so that even fine details like windows, textures of walls, and people’s clothing appear sharp when overlaid on the real world. The final output is a high-res image of the historical scene from the user’s perspective.
- 6. Real-Time AR Interface: The last subsystem deals with delivering the experience to the user in real time:
- We build the AR application using frameworks like ARKit (for iOS) or ARCore (Android), and a rendering engine (Unity with Metal on Apple devices, for example). These handle the 6-DoF tracking, meaning the device continuously knows its position and orientation in space.
- When the user points the device, ARKit provides a real-world coordinate frame. We anchor a virtual image plane in the 3D world at the correct location and orientation corresponding to the real scene. The TimeLens generated image is textured onto this plane. Essentially, the generated historic view is fixed in space such that it aligns with the real structures. This overlay is then rendered on top of the camera feed.
- The heavy image generation is done on the cloud (or a powerful edge server). The device streams the necessary input (its GPS, compass, and an initial camera frame for context) to the cloud. A lightweight on-device module might do some preprocessing (e.g., recognizing which city or landmark the user is looking at, to fetch the right model data).
- The cloud inference server (e.g., an AWS Lambda or a dedicated GPU server) runs the diffusion model with the provided $Z_t$ (fetched from the precomputed database for that location and era) and viewpoint. Thanks to optimization and model distillation, this inference is fast, returning an image in under ~2 seconds.
- The AR app then takes the returned image, and because it knows the calibration (field of view etc.), it composites the image precisely over the camera feed. We double-buffer and use motion prediction such that even within the 2s generation time, minor device movements are compensated (for instance, we might slightly crop or warp the generated image if the user moved a bit, to avoid sudden jumps, while the next frame is being generated).
- The result is a smooth AR visualization. Although the AI doesn’t regenerate at the full 60 frames per second of the display, the static overlay moves with the device (as a fixed object in AR space) at 60 FPS, giving a stable illusion. When the user moves significantly or switches era, a new image is generated. Our pipeline aims for <2 seconds latency for a new generation, which is fast enough that even if a user takes a step and the scene updates, it feels like a brief camera effect. (Future optimizations may further reduce this to sub-second.)
Integration of Subsystems: All these components are orchestrated so that when a user selects a time period and points their camera, the system retrieves the appropriate pre-trained model components (graph and cochain data for that city and year), feeds them into the generative pipeline, and streams the output to AR. The design is scalable – the heavy lifting is offloaded to the cloud, meaning even a smartphone can experience high-quality graphics. Moreover, the use of standard protocols and agent-based modular design ensures that each piece (data ingestion, generation, AR rendering) can be improved or replaced independently without affecting the others, which is a key software architecture benefit.
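For illustration only, the following sketch shows how a single request could flow through these subsystems end to end. The helper functions (fetch_scene_encoding, encode_pose, run_diffusion, upscale) are hypothetical placeholders standing in for the components described above, not the production implementation.

```python
import numpy as np

# Hypothetical placeholder implementations; the real subsystems are described above.
def fetch_scene_encoding(city, year):
    return np.zeros(512, dtype=np.float32)            # precomputed Z_t lookup (components 1-4)

def encode_pose(pose):
    return np.asarray(pose, dtype=np.float32)         # 6-DoF viewpoint encoding

def run_diffusion(condition, steps=50):
    return np.zeros((512, 512, 3), dtype=np.uint8)    # conditioned image generation (component 5)

def upscale(image, factor=4):
    return np.kron(image, np.ones((factor, factor, 1), dtype=image.dtype))  # ESRGAN stand-in

def handle_ar_request(city, year, camera_pose):
    z_t = fetch_scene_encoding(city, year)
    pose_emb = encode_pose(camera_pose)
    image = run_diffusion(condition=(z_t, pose_emb))
    return upscale(image)                             # AR client composites this over the feed (component 6)

frame = handle_ar_request("New York", 1920, [0.0, 0.0, 1.6, 0.0, 0.0, 0.0])
print(frame.shape)                                    # (2048, 2048, 3)
```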
Section 3: Mathematical & Topological Foundations
In this section, we dive deeper into the mathematical concepts under the hood of TimeLens – primarily topological representations and graph-based modeling. Our approach is inspired by the field of Topological Data Analysis (TDA) and Geometric Deep Learning, blending classical math with modern AI.
3.1 Simplicial Complexes for Urban Modeling
A central representation we use is the simplicial complex, which provides a flexible way to model not just pairwise connections like graphs, but also higher-order relationships (triangles, volumes, etc.). Formally, an abstract simplicial complex $\mathcal{K}$ is a set of elements such that:
- Every element of $\mathcal{K}$ (called a simplex) is a set of vertices $\{v_0, v_1, \dots, v_k\}$.
- If a set of vertices is in $\mathcal{K}$, then all of its subsets are also in $\mathcal{K}$. (This is the closure property: any face of a simplex is also a simplex in the complex.)
- Any two simplices intersect in at most a common face (they either don’t intersect or share a lower-dimensional simplex).
In simpler terms, a simplicial complex is built out of points, line segments, triangles, tetrahedra, etc., such that whenever you include a higher-dimensional piece, you also include all the pieces on its boundary. For example, if we include a triangular area (2-simplex) formed by vertices A, B, C, then the edges (A,B), (B,C), (A,C) and the vertices themselves A, B, C must all be in the complex.
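The closure property can be checked mechanically. Below is a minimal illustrative sketch (not from the TimeLens codebase) that verifies whether a collection of simplices, given as vertex tuples, is closed under taking faces:

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check the closure property: every non-empty proper face of every
    simplex must itself be a member of the complex."""
    complex_set = {frozenset(s) for s in simplices}
    for simplex in complex_set:
        for k in range(1, len(simplex)):               # faces of every lower dimension
            for face in combinations(simplex, k):
                if frozenset(face) not in complex_set:
                    return False
    return True

# Toy example mirroring the triangle (A, B, C) discussed above.
triangle = [("A",), ("B",), ("C",), ("A", "B"), ("B", "C"), ("A", "C"), ("A", "B", "C")]
print(is_simplicial_complex(triangle))                  # True
print(is_simplicial_complex([("A", "B", "C")]))         # False: edges and vertices missing
```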
Why simplicial complexes for cities? Cities naturally have entities of various dimensions:
- 0D: Points (landmarks, points of interest, intersections).
- 1D: Lines (roads, railway lines).
- 2D: Areas (city blocks, parks, building footprints).
- (Even 3D volumes could be considered if modeling building interiors, though we largely use 2D surfaces for exteriors).
Using a simplicial complex allows us to encode a city map in a single structure where, say, a building’s footprint (2D) is directly connected to its boundary streets (1D) and corner points (0D). This is more expressive than a plain graph which only has vertices and edges. For instance, consider two buildings on opposite sides of a street: in a graph you might represent building nodes connected via a street node, but in a simplicial complex, the street and buildings can jointly form a 2D face if needed (representing, say, an urban block including that street as a boundary).
Moreover, simplicial complexes are well-studied in algebraic topology, and powerful invariants like homology can be computed to characterize their structure. While TimeLens does not explicitly compute homology, the theory gives us confidence that the representation can capture complex connectivity (like “holes” in the layout – e.g., a city square enclosed by buildings would be a 1-hole in homology terms).
3.2 Chains, Cochains, and Coboundaries
To attach data to a simplicial complex, we use the notions of chains and cochains. A chain is a formal linear combination of simplices (often used to define homology), whereas a cochain is a function on simplices (used in cohomology). In practical terms:
- A 0-chain could be thought of as marking a multiset of points.
- A 0-cochain is like assigning a number or label to every point (vertex) in the complex.
- Similarly, a 1-cochain assigns a value to every edge, and a 2-cochain to every face.
We utilize cochains to store attributes: for example, a 2-cochain might assign “building material = stone” to a particular face (denoting an area in which buildings are stone). These assignments can be numeric (like a scalar or a feature vector) rather than just categorical, enabling the learning algorithms to use them.
The coboundary operator $\delta$ is a linear operator that maps $k$-cochains to $(k+1)$-cochains. Intuitively, it measures how the values on $k$-simplices change around $(k+1)$-simplices. In algebraic topology, $\delta$ is the dual of the boundary operator on chains, and $\delta^2 = 0$ (the coboundary of a coboundary is zero), which corresponds to the idea that if you go around a closed loop, the net change is zero. In TimeLens, we co-opt this concept for the time dimension: we consider time as an extra dimension and treat the difference between year $t$ and $t+\Delta t$ as something like a coboundary. For instance, if $f$ is a 0-cochain giving the height of a building (attached to a vertex representing that building), then $\delta f$ on a 1-simplex connecting that building’s state between 1900 and 1950 would give the change (perhaps the building was demolished, so the height went from 50 m to 0, a significant change that $\delta f$ captures). This is an unconventional use of $\delta$, but it provides a principled way to incorporate temporal transitions into our model’s knowledge: effectively a differencing operation that the neural network can learn to interpret (such as “if $\delta$ of the existence attribute is nonzero, don’t render the building in the later scene”).
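As a toy illustration of this temporal-differencing idea (vertex names and heights are made up), a 0-cochain can be stored as a dictionary from vertices to attribute values, and the delta across two eras flags construction or demolition events:

```python
# Sketch: 0-cochains assign an attribute (building height in meters) to each
# vertex at a given year; differencing across two years flags demolitions.
height_1900 = {"v0": 50.0, "v1": 12.0, "v2": 30.0}   # 0-cochain at t = 1900
height_1950 = {"v0": 0.0,  "v1": 12.0, "v2": 45.0}   # 0-cochain at t = 1950

def temporal_delta(cochain_a, cochain_b):
    """Difference of two 0-cochains over the same vertex set, used here as a
    coboundary-like operator across the time gap."""
    return {v: cochain_b[v] - cochain_a[v] for v in cochain_a}

delta = temporal_delta(height_1900, height_1950)
demolished = [v for v, d in delta.items() if height_1900[v] > 0 and height_1950[v] == 0]
print(delta)        # {'v0': -50.0, 'v1': 0.0, 'v2': 15.0}
print(demolished)   # ['v0'] -> do not render this building in the 1950 scene
```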
3.3 Graph Representation and Heterogeneous Networks
While simplicial complexes are the richer structure, for computational purposes we often break the problem down to graphs when feeding into neural networks. Specifically, we extract a heterogeneous graph from the complex:
- We have node types: landmark, intersection, district center, etc. (0-simplices).
- We have edge types: road, railway, river, etc. (1-simplices connecting points).
- We might also introduce a node type to represent 2-simplices (treating each area as a node connected to its boundary nodes or edges – this is like the bipartite graph representation of a planar complex, or the dual graph in planar graphs).
Graph Neural Networks can operate on such heterogeneous graphs by either one-hot encoding the type of each node and edge or by using separate weight matrices per type (this is known as relational GNN or heterogeneous GNN). In TimeLens, we implement a simplified approach:
- We create a graph where nodes represent all simplices of all dimensions. For example, we can have a node for each landmark, and a node for each road segment (thus treating roads as nodes as well), and connect a landmark-node to a road-node if the landmark lies on that road. This is akin to the concept of a factor graph or bipartite node-incidence graph.
- Alternatively, we maintain separate adjacency relations: e.g., a building node connected to all its corner intersection nodes, etc.
The key is that by structuring the data as a graph (with possibly millions of nodes for a large city), we can apply scalable GNN techniques to compute embeddings. GNNs are well-suited because they naturally respect the locality and connectivity of spatial data. For example, a GNN will learn that a building’s features should be influenced by the street it’s on and the neighboring buildings, rather than something far away, unless there’s a path in the graph connecting them (which might happen through the road network).
Spatial Reasoning via GNN: Graph Neural Networks have been successfully applied to spatial problems like traffic prediction, as they capture the network effects (a road’s traffic depends on connected roads). In TimeLens, the GNN’s role is to perform spatial reasoning – for instance, infer that “this street is wide and in a wealthy district, so buildings along it were likely stone mansions” or “this alley is adjacent to a market, likely filled with vendor stalls.” Such inferences are possible because during training the GNN can learn correlations from data (if such data is available, e.g., labeled maps or described scenes). Even without explicit labels, the GNN helps diffuse information: the historical attribute that a certain block was an industrial area will flow to the roads and buildings in that block, influencing the visual output (e.g., more factories, smokestacks in that part of the image).
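A minimal sketch of the node-incidence construction described above, reusing the toy K_1920 example and assuming PyTorch-style edge_index connectivity (indices and names are purely illustrative):

```python
import torch

# Landmarks and road segments both become nodes; an edge connects a landmark
# to every road it lies on (the bipartite node-incidence graph).
landmarks = ["v0 (Times Square)", "v1 (Empire State)", "v2 (Library)"]
roads = [("v0", "v1"), ("v0", "v2")]                   # 1-simplices from K_1920

num_landmarks = len(landmarks)
edges = []
for r_idx, (a, b) in enumerate(roads):
    road_node = num_landmarks + r_idx                  # road nodes follow landmark nodes
    for name in (a, b):
        lm_idx = int(name[1:])                         # "v0" -> 0
        edges.append((lm_idx, road_node))
        edges.append((road_node, lm_idx))              # undirected: add both directions

edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
print(edge_index.shape)                                # torch.Size([2, 8])
```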
By the end of this phase, mathematically we have a function:
$$f_{\text{enc}} : (\text{Complex at time } t) \;\longrightarrow\; Z_t \in \mathbb{R}^d,$$
where $d$ is the dimension of the latent vector (we often use $d=512$ or $1024$). This function $f_{\text{enc}}$ is implemented by the combination of MLPs + GNN described, and it’s trained as part of the overall system so that $Z_t$ is a useful conditioning for the image generator.
Section 4: Embedding & Latent Vector Construction
Now we focus on the technical mechanics of turning our rich structured data into the latent vector $Z_t$. This involves neural network architectures and a bit of math on how features are aggregated.
4.1 Feature Embeddings for Different Entities
Each entity (point, edge, area) in the city is characterized by various features from the data ingestion stage. We design separate embedder networks for each, acknowledging that their data is of different natures:
- Landmark Embedding (PointMLP): Each 0D point may have features like coordinates (x,y), type of landmark (church, house, etc.), and any numeric attributes (height, year built). We feed these into a multi-layer perceptron. For example, a simple MLP might take a 3D input (e.g., x,y coordinates plus perhaps an elevation or importance score) and output a 128D hidden representation, then another layer to 512D:
```python
import torch.nn as nn

class PointMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(3, 128),   # 3 input features per point (e.g., x, y, elevation or importance)
            nn.ReLU(),
            nn.Linear(128, 512)  # 512-dimensional landmark embedding
        )

    def forward(self, x):
        return self.model(x)
```
This is toy code (the actual input dimension would be larger if we include more features per point). The idea is to encode point features into a 512-dimensional vector $h_{\text{point}} \in \mathbb{R}^{512}$.
- Road Embedding (EdgeCNN): Roads or edges can be represented as polyline geometry or as a sequence of connected points. We can use a 1D convolution or recurrent network that goes along the sequence of points defining the road’s shape (a sketch of such an encoder appears below). Additionally, roads have attributes (e.g., name, length, type). We design a small CNN that, for example, takes an ordered sequence of turning angles or offsets that describe the polyline and produces an embedding $h_{\text{road}}$. Another approach is to sample points along the road and use a PointNet-like embedding of those sample points.
- Zone/Area Embedding (FaceTransformer): For 2D areas (like a city block), one approach is to treat the set of boundary vertices (or boundary roads) as a sequence or set and use an attention mechanism to encode it. A transformer encoder that attends over all corner points of a polygon could capture the shape of the polygon (convex, size, etc.) as well as any attributes uniform to that area (e.g., land use). We dub this the FaceTransformer – effectively a transformer that produces $h_{\text{area}}$ given the sequence of vertices of that area.
Each of these embedding networks is trained to produce features that are useful for the final image generation. We might initialize them with heuristics – e.g., the PointMLP could be initialized to roughly encode coordinates (so that closeness in space means closeness in embedding space), but ultimately they are learned.
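As an illustration of the road encoder mentioned above, the following sketch assumes each road polyline is resampled to a fixed 32 points with (x, y) coordinates; the layer sizes are illustrative rather than the final architecture:

```python
import torch
import torch.nn as nn

class EdgeCNN(nn.Module):
    """Sketch: encode a road polyline, resampled to 32 (x, y) points,
    into a 512-D embedding via 1-D convolutions and global pooling."""
    def __init__(self, num_points=32, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 64, kernel_size=3, padding=1),   # channels = (x, y)
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # pool over the point dimension
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, polyline):                           # polyline: (batch, 32, 2)
        x = polyline.transpose(1, 2)                       # -> (batch, 2, 32)
        x = self.conv(x).squeeze(-1)                       # -> (batch, 128)
        return self.proj(x)                                # -> (batch, 512)

h_road = EdgeCNN()(torch.randn(4, 32, 2))                  # 4 roads -> (4, 512)
```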
4.2 Graph-Based Feature Aggregation (GAT Layer Details)
After obtaining initial embeddings for all entities, we construct a unified graph of relationships and apply graph neural network layers. As mentioned, we specifically leverage Graph Attention Network (GAT) layers because of their ability to handle varying degrees and focus on relevant neighbors. Let’s denote by $h_i^{(0)}$ the initial embedding of node (or simplex) $i$ after the above step (PointMLP, etc.). A GAT layer will produce an updated embedding:
$$h_i^{(1)} = \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, W \, h_j^{(0)} \right),$$
where $\mathcal{N}(i)$ denotes the neighbors of $i$ in the graph (e.g., roads connected to a landmark, landmarks on a road, etc.), $W$ is a learnable weight matrix (the linear transformation for the layer), and $\alpha_{ij}$ are attention coefficients that depend on $h_i^{(0)}$ and $h_j^{(0)}$. The attention mechanism (from Veličković et al., 2018) computes something like:
$$e_{ij} = \mathrm{LeakyReLU}\!\left( a^{\top} \left[ W h_i \,\|\, W h_j \right] \right),$$
then normalizes these to $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$. Intuitively, the network learns a weight $e_{ij}$ for each neighbor $j$ of $i$ based on their features, and then does a weighted sum. This is more flexible than a plain graph convolution (which would just average or sum).
We may use multi-head attention to stabilize training (the heads get concatenated or averaged). The PyTorch Geometric code snippet given earlier shows a 4-head GAT; each head produces a 512-dim output and we either concat to 2048 or average to 512.
We stack multiple GAT (or other GNN) layers to propagate information over larger distances. For example, after 2 layers, a node’s embedding $h_i^{(2)}$ will incorporate information from its neighbors’ neighbors (2-hop neighborhood), etc. This is important in a city graph: something two streets over might influence the scene (e.g., a fire in one block might spill smoke to the next in 1666 London).
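A minimal sketch of such a stacked GAT encoder in PyTorch Geometric follows; with concat=False the attention heads are averaged so the embedding stays 512-dimensional, and the random graph at the bottom is only for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class SceneGNN(nn.Module):
    """Sketch: two GAT layers so each simplex node sees its 2-hop neighborhood."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.gat1 = GATConv(dim, dim, heads=heads, concat=False)  # average heads -> dim
        self.gat2 = GATConv(dim, dim, heads=heads, concat=False)

    def forward(self, x, edge_index):
        h = F.elu(self.gat1(x, edge_index))
        return self.gat2(h, edge_index)        # h_i^(2): 2-hop aggregated embeddings

x = torch.randn(100, 512)                      # initial embeddings for 100 simplices
edge_index = torch.randint(0, 100, (2, 400))   # random connectivity for illustration
h_final = SceneGNN()(x, edge_index)            # -> (100, 512)
```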
4.3 Computing the Final Scene Vector $Z_t$
After $L$ layers of GNN, we have a final set $\{h_i^{(L)}\}$ for all elements $i$ in the complex. Now we need to pool them into one vector $Z_t$. Several strategies are possible:
- Simple average or sum: $Z_t = \frac{1}{N}\sum_i h_i^{(L)}$ (with $N$ the number of nodes). This treats everything equally, which might not be ideal (small alley contributes as much as main cathedral).
- Weighted sum with learned weights: as shown earlier, learn parameters $\alpha_i$ for each node (could be a linear projection of the feature to a scalar, then softmax across nodes).
- Attention pooling: introduce a context vector or query that attends to all node features. For instance, use an attention mechanism where $Z_t = \sum_i \beta_i h_i^{(L)}$ and $\beta_i = \text{softmax}(q^\top h_i^{(L)})$ for some learnable query vector $q$.
- Set-to-sequence model: sometimes a transformer decoder can be used to iteratively “read out” a set of features to a fixed-size sequence or single vector.
In our design, a simple yet effective method was to add a special “graph context node” that is connected to all others (a bit like a virtual master node). This node has no initial features, but after the GNN layers, it accumulates information from the whole graph (since we connect it to every other node, it attends to all of them). Its embedding $h_{\text{ctx}}^{(L)}$ can serve as $Z_t$. This is analogous to how transformer models often use a [CLS] token to aggregate sequence information.
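For concreteness, the attention-pooling variant can be sketched as follows (the virtual context-node variant behaves similarly, except that $Z_t$ is read off the extra node's embedding after message passing):

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Sketch: pool per-simplex embeddings h_i into a single scene vector Z_t."""
    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))    # learnable query q

    def forward(self, h):                              # h: (num_simplices, dim)
        scores = h @ self.query                        # unnormalized beta_i
        beta = torch.softmax(scores, dim=0)            # attention weights over nodes
        return (beta.unsqueeze(-1) * h).sum(dim=0)     # Z_t: (dim,)

z_t = AttentionReadout()(torch.randn(100, 512))        # scene vector for 100 simplices
```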
Regardless of method, the result is that $Z_t$ captures a holistic representation of the scene at time t. For example, if t = “London in 1666”, $Z_t$ would encode things like “mostly wooden buildings, narrow streets, dry summer (if such meta info is given), high risk of fire, medieval architecture”. The diffusion model will condition on $Z_t$ to know what to draw.
One more component of $Z_t$ is the time information itself – we usually append or include the era/year as part of the encoding (perhaps as a one-hot year or a Fourier time embedding). This way, the generative model can distinguish 1910 vs 1940 scenes if both share structural similarities. Think of $Z_t$ as encapsulating “City X – year Y – structural context”.
We also embed the camera pose in a similar way (via a small MLP or sinusoidal positional encoding for angles); this is not folded into $Z_t$ directly but concatenated to the conditioning fed into the diffusion model’s cross-attention. Typically, in diffusion models like Stable Diffusion, the conditioning (text embedding) is combined with a time-step embedding for the diffusion step and fed into the U-Net via cross-attention layers. In our case we have a composite conditioning: [scene context vector; camera pose; (optionally) textual label]. The textual label could be something simple like “London, 1666”, which we embed with a language model or CLIP encoder; this helps in guiding style (for instance, the word “1666” might bias the style toward sepia or engravings, whereas “1970s” might bias toward color film style). Using text is not necessary, but we consider it an auxiliary input that can help the model latch onto known visual tropes of an era (we leverage pretrained CLIP encoders for that knowledge of historical imagery, which is part of their training data).
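A minimal sketch of assembling this composite conditioning, with an illustrative MLP pose encoder and a random tensor standing in for the CLIP text embedding:

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Sketch: embed a 6-DoF camera pose (x, y, z, yaw, pitch, roll) into 512-D."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, pose):                    # pose: (batch, 6)
        return self.mlp(pose)

z_t = torch.randn(1, 512)                       # scene encoding from the GNN readout
e_pose = PoseEncoder()(torch.randn(1, 6))       # viewpoint embedding
e_text = torch.randn(1, 512)                    # placeholder for a CLIP "London, 1666" embedding

# Composite conditioning: a short sequence of context tokens for cross-attention.
conditioning = torch.stack([z_t, e_pose, e_text], dim=1)   # (batch, 3 tokens, 512)
```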
In summary, the creation of $Z_t$ is a pipeline of embedding networks and graph neural nets that translate raw historical data into a learned representation. This representation is patentably unique in how it encodes an entire city’s historical state using a combination of topology (simplicial complexes) and learned embeddings. Traditional AR systems might have used just a database of 3D models or images, but our approach encodes the scene abstractly, allowing the generative model to imagine details that were never explicitly recorded, guided by these embeddings.
(We have now prepared the stage for the generative model. Next, we describe the architecture and training of the diffusion model that takes $Z_t$ and produces images.)
Section 5: Image Generation with Diffusion Transformers
This section provides the technical deep-dive into the generative image model used by TimeLens, which we’ve built to be patent application-worthy in its own right. It combines the latest in diffusion probabilistic models with a transformer-based conditioning mechanism and custom additions for historical image generation.
5.1 Diffusion Model Architecture
Base Model – DDPM: We use a Denoising Diffusion Probabilistic Model (DDPM) as our base, following Ho et al. (2020). To briefly recap, a DDPM defines a forward process where an image $x_0$ is gradually noised over $T$ steps to $x_T$ (which is basically pure Gaussian noise), and a learnable reverse process that starts from noise and iteratively denoises it to recover an image sample. The model learns to predict the noise $\epsilon_\theta(x_t, t)$ or directly the denoised image at each step.
Our implementation uses a U-Net convolutional architecture for the core denoising network, as is common in diffusion models. The U-Net has encoder and decoder CNN blocks with skip connections, allowing multi-scale processing of images (important for capturing both local texture and global layout).
Transformer for Conditioning: The novel part is how we integrate the conditioning vector $Z_t$ (and other context) into the model. Instead of simply concatenating $Z_t$ to some CNN feature maps, we employ cross-attention layers in the U-Net. Concretely:
- Each U-Net residual block is augmented with a self-attention mechanism (as done in Imagen, Stable Diffusion, etc. for better long-range coherence).
- We replace or augment these with a cross-attention block that takes as queries the feature map of the image (from the U-Net hidden layers) and as keys+values the conditioning embeddings (which can be $Z_t$, text embeddings, and pose embeddings).
- Essentially, at various depths of the U-Net, the model can “lookup” the context. For example, imagine a U-Net layer responsible for composing the overall scene layout: it might query $Z_t$ to decide where major structures go. A later layer focusing on texture might query a part of $Z_t$ that encodes building material to choose brick vs stone texture.
This design is inspired by multimodal transformers where cross-attention is used to inject one modality into another. In Stable Diffusion, for instance, the image latent features form queries and the text embedding forms keys/values in cross-attention, ensuring the image aligns to the prompt. We do the same but with our scene encoding: TimeLens’s diffusion model leverages cross-attention within its U-Net to condition the denoising of latent images on the historical scene context, analogous to how Stable Diffusion conditions on text.
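Conceptually, one such cross-attention block can be sketched with standard PyTorch primitives as below; the flattened U-Net feature map provides the queries and the conditioning tokens provide keys and values (a simplified stand-in for the full U-Net integration):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch: image latent features query the conditioning tokens (Z_t, pose, text)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, context_tokens):
        # image_tokens: (batch, H*W, dim) flattened U-Net feature map
        # context_tokens: (batch, num_context, dim), e.g. [Z_t, E_pose, E_text]
        attended, _ = self.attn(query=image_tokens, key=context_tokens, value=context_tokens)
        return self.norm(image_tokens + attended)      # residual connection

feats = torch.randn(1, 16 * 16, 512)                    # 16x16 latent feature map
context = torch.randn(1, 3, 512)                        # three conditioning tokens
out = CrossAttentionBlock()(feats, context)             # (1, 256, 512)
```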
We illustrate the model structure as follows (in a conceptual diagram):
Figure: Diffusion Transformer architecture (conceptual). The image’s latent representation is iteratively denoised. At each step, U-Net convolutional layers (blue) are interleaved with self-attention (gray) and cross-attention (green) blocks. The cross-attention allows the model to attend to context vectors like $Z_t$ (historical scene encoding) and $E_{\text{text}}$ (text embeddings of the era), which guide the generation. Positional encodings for the diffusion timestep $t$ and for the camera viewpoint are added to the appropriate layers. This architecture ensures the generated image is both structurally correct and contextually faithful to the input.
(The figure depicts the U-Net shaped network with arrows indicating cross-modal attention from context to image feature maps. It's a high-level schematic showing how conditioning enters the network.)
Positional/Diffusion Embeddings: Following standard practice, we embed the timestep $t$ of the diffusion process (not to be confused with historical time) with sinusoidal embeddings and feed that into each U-Net block (often through adaptive normalization like FiLM or as additional channels). We also include the camera pose embedding in a similar way so that the model “knows” the viewpoint at every step of denoising.
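For reference, the standard sinusoidal timestep embedding can be sketched as follows (the camera pose embedding is handled analogously with its own small encoder):

```python
import math
import torch

def timestep_embedding(t, dim=512):
    """Sketch: standard sinusoidal embedding of diffusion timestep t (batch of ints)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(-1) * freqs.unsqueeze(0)           # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([10, 500]))                 # (2, 512)
```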
5.2 Training Loss Functions and Objectives
Training the diffusion model requires a careful mix of objectives to get realism, faithfulness, and temporal stability. Our total loss $\mathcal{L}$ is a weighted sum of several terms:
- Diffusion Reconstruction Loss: This is the core mean squared error (MSE) between the model’s predicted noise and the actual noise added, at randomly sampled timesteps. In formulation:
$$\mathcal{L}_{\text{diffusion}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, Z_t, \text{pose}) \right\|^2 \right],$$
where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ as per DDPM’s noising process. This trains the model to perform denoising. It is essentially equivalent to maximizing a variational bound on the data likelihood.
- Image Fidelity Losses: While diffusion MSE alone can produce sharp images, we add:
- Adversarial Loss (optional): We experimented with a GAN discriminator that tries to distinguish generated historic images from real historic photos (where available). However, since training data of real images for every precise angle is scarce, this is used sparingly to fine-tune style.
- Perceptual Loss: Using a pretrained CNN ($\phi$, e.g. VGG-19’s conv layers), we enforce that the generated image $I_{\text{gen}}$ has similar feature maps to a target $I_{\text{target}}$. The target could be a real photograph of the scene if available (for some known viewpoints in historical archives), or a synthetic rendered image if we have any 3D model proxies. When no direct target exists, we use pseudo-targets: e.g., the same scene generated by a previous iteration of the model or a lower-resolution model, to enforce consistency. This loss is:
$$\mathcal{L}_{\text{perc}} = \sum_l \left\| \phi_l(I_{\text{gen}}) - \phi_l(I_{\text{target}}) \right\|_1,$$
summing differences at feature layer $l$. It helps preserve fine details and style.
- Style Loss: In classical neural style transfer, one matches Gram matrices of features to enforce texture statistics. We use a similar approach to ensure the output image has the “look” of the era. For example, images representing the 1860s might have a certain sepia tone and grain, whereas 1970s Kodachrome has a distinct color palette. We compute Gram matrix differences on select layers or use CLIP’s image embeddings to measure high-level style similarity to reference images from that era. If $I_{\text{ref}}$ is a collage or set of reference images from the target era:
$$\mathcal{L}_{\text{style}} = \sum_l \left\| \mathrm{Gram}\big(\phi_l(I_{\text{gen}})\big) - \mathrm{Gram}\big(\phi_l(I_{\text{ref}})\big) \right\|_F .$$
- Temporal Coherence Loss: This is crucial for AR use. If we have two images $I_1, I_2$ generated for slightly different camera poses (or consecutive frames as a user moves), they should be consistent. We simulate this during training by taking an image, applying a small random camera transformation (e.g., a few degrees or a meter shift), and generating both, then penalizing differences:
- Pixel-level coherence: $|I_1 - \text{warp}(I_2)|^2$, where $\text{warp}(I_2)$ is $I_2$ projected to $I_1$’s viewpoint via known depth or homography. However, we often don’t have depth info for generated images. As an alternative, we encourage feature-level coherence:
- Feature coherence: Extract features (e.g., CLIP image embeddings or lower-layer CNN features) for $I_1$ and $I_2$ and penalize their differences. Also, a temporal discriminator can be used: a 3D CNN that looks at $I_1$ and $I_2$ and tries to classify if they are a consistent pair or not. The generator then tries to fool this discriminator by producing consistent pairs.
- Ping-pong loss: A known technique in video GANs is to generate forward then backward and ensure you come back to the same frame (ping-pong). We analogously generate an image, move the camera, then move it back, and require the image to return to the original, enforcing stability.
Our final loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diffusion}} + \lambda_{\text{perc}}\mathcal{L}_{\text{perc}} + \lambda_{\text{style}}\mathcal{L}_{\text{style}} + \lambda_{\text{coh}}\mathcal{L}_{\text{temporal}},$$
with weights $\lambda$ tuned empirically. During training, all conditioning ($Z_t$, etc.) is provided, and we train the model to minimize this loss.
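A simplified training-step sketch for $\mathcal{L}_{\text{total}}$ is given below; the auxiliary perceptual, style, and coherence terms are passed in as callables since they depend on pretrained feature extractors, and the dummy model and noise schedule are for illustration only:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, cond, alphas_bar, lambdas, aux_losses):
    """Sketch of one optimization step for L_total."""
    batch = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (batch,), device=x0.device)  # random diffusion step
    eps = torch.randn_like(x0)                                             # target noise
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps                     # DDPM forward noising

    eps_pred = model(x_t, t, cond)                                         # U-Net noise prediction
    loss = F.mse_loss(eps_pred, eps)                                       # L_diffusion
    for name, fn in aux_losses.items():                                    # perceptual / style / temporal
        loss = loss + lambdas[name] * fn()
    return loss

# Illustration with dummy components.
dummy_model = lambda x, t, c: torch.zeros_like(x)
alphas_bar = torch.linspace(0.999, 0.01, 1000)
loss = training_step(dummy_model, torch.randn(2, 4, 64, 64), None, alphas_bar,
                     {"perc": 0.1}, {"perc": lambda: torch.tensor(0.0)})
```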
5.3 Sampling and Inference Process
When deploying TimeLens in the real world, the sampling (image generation) process is a streamlined version of the training loop:
- Condition Retrieval: For the user’s selected time and location, fetch or compute $Z_t$. (We often precompute $Z_t$ for many locations/eras and store them, so this is just a lookup to avoid running the GNN in real time).
- Initial Noise & Setup: Initialize a latent image $x_T \sim \mathcal{N}(0, I)$ of appropriate dimension (we use a latent resolution, e.g. $64\times64$ if using latent diffusion with a VAE that will decode to $512\times512$). Set the conditioning: $c = [Z_t, E_{\text{pose}}, E_{\text{text}}]$ where $E_{\text{pose}}$ is pose embedding, $E_{\text{text}}$ is text embedding if used. Choose the number of denoising steps (we typically use $50$ for realtime vs $100-200$ for offline quality).
- Iterative Denoising: For $t = T$ down to $1$:
- Compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \text{noise}_t$, following the posterior mean formula with optional added stochastic noise (or use the simplified DDIM deterministic update for speed).
- Each step calls our U-Net with cross-attention that injects $c$. This gradually “pulls” the sample towards one that matches the context. For example, at high noise levels, it might just enforce broad color scheme and composition, and at low noise levels, it adds fine details like windows aligned just right with the building positions from $Z_t$.
- Decode Latent: Once we reach $x_0$, we feed it through the decoder (VAEs are trained beforehand on natural images and possibly fine-tuned on historical photos, to ensure the latent space properly represents images). This yields an image $I_{\text{gen}}$.
- Super-Resolution: We apply the ESRGAN-based upsampler. This could be a sin