A deep understanding of urban mobility is important for many real-world applications, such as urban traffic management and autonomous driving. Analyzing traffic from city webcam video, however, poses several challenges:



1, Low frame rate. The time interval between two consecutive frames of a webcam video typically ranges from 1s to 3s, resulting in large vehicle displacement between frames. Some vehicles appear in only one frame of the video sequence.


2, Low resolution. Citycam video resolutions vary, for example 352x240, 320x240, or 704x480 pixels. A vehicle near the top of a frame can be as small as 5x5 pixels, and image compression introduces additional artifacts.


3, High occlusion. Cameras installed at urban intersections often capture heavily congested traffic, especially during rush hours, resulting in a high rate of vehicle occlusion.


4, Large perspective. Cameras are installed at the top of tall poles, where a high point of view captures more of the scene, resulting in videos with large perspective differences. Vehicle scale varies dramatically with distance from the camera.


5, Variable environmental conditions. Different cameras have different positions, scenes, and perspectives. Even for the same camera, weather and illumination change significantly over time.


6, Different traffic patterns. Traffic patterns change significantly, from sparse to heavy, across times of day, weeks, months, and seasons, as well as from street to street.


To overcome these challenges, we collected a large-scale dataset and developed several machine-learning-based methods. We achieved strong results on monitoring city traffic from many different cameras across different time spans. Please see our publication page for more details.