

2022, 4(2): 115-131

Published Date: 2022-04-20; DOI: 10.1016/j.vrih.2021.12.002


The combination of an augmented reality (AR) headset and a smartphone can simultaneously provide a wider display and precise touch input, and could redefine the way we use applications today. However, because the two devices operate independently, users cannot yet enjoy these benefits: intuitive and direct interactions between the devices are lacking.
In this study, we conduct a formative investigation to understand the window management requirements and interaction preferences of people using an AR headset and a smartphone simultaneously, and we report the insights gained. In addition, we introduce an example vocabulary of window management operations for the AR headset and smartphone interface.
This vocabulary allows users to manipulate windows in virtual space and to shift windows between devices efficiently and seamlessly.


1 Introduction
Recently, AR headsets have become increasingly accessible for personal use owing to their commercial development. Compared with smartphones, they offer a virtually unlimited display area, and the use of AR headsets in daily life is expected to grow in the near future. However, AR headsets do not perform as well as smartphones in tasks such as text entry, which require precise operation and tactile feedback. Therefore, we expect that, instead of replacing smartphones, a fusion interface combining AR headsets and smartphones will emerge.
We envision that users can handle multiple tasks simultaneously by distributing various windows across the AR virtual space and the smartphone, exploiting both the wide display of AR and the precise input of the smartphone. For example, a user can open several applications in AR, such as video players, web browsers, and social applications, and keep a chat window open on the smartphone, as shown in Figure 1a. They can search for information, leave a comment, or reply to a message by dragging the desired window onto the smartphone and operating it there, or they can intuitively expand the current phone view by dragging it out into the virtual space. Without complex navigation, users can extend the familiar hierarchical operations of the phone into a wider space while retaining its precise input. Moreover, the fusion interface is portable: users can use it while traveling or walking down the street, much as they use mobile phones today, without worrying about finding a flat surface for external input devices such as keyboards and mice.
However, there is a lack of direct and efficient interaction methods for achieving this vision. Currently, AR headsets and smartphones are independent of each other, and there is no association between applications running on the two devices. Even if there are two similar applications on these devices (e.g., two WeChat apps on an AR headset and a smartphone), a user must keep both applications active on each device to synchronize data and information, one for display and one for input, which is redundant. Moreover, users must perform tasks by employing different types of interaction methods, namely, hand-gesture-based or controller-based interaction in AR and touch-based interaction on smartphones. This frequent switching consumes time and energy and is unnatural. In summary, applications and interactions are separated between AR and smartphones, and cross-device window management requires further investigation.
We introduce a series of window management operations in the AR+smartphone interface, taking advantage of the display capability of the AR headset and the precise touch input capability of the smartphone. Our goal is to explore the window management requirements in the fusion interface and offer a feasible option consisting of intuitive interaction methods for next-generation mobile interfaces.
Our window system provides various window types (Figure 1b) to perform different tasks (Figure 1c). It enables users to manage windows across the AR space and the phone display via a combined input method that includes phone touch and phone posture. This input method uses the 3D-tracking information of the phone to create a 3D-tracked mixed interaction, which leverages the familiarity of screen touch and the intuitiveness of gesture interaction. We believe that 3D-tracked touch and posture input on smartphones can be efficient for window management operations in the combined interface, allowing users to switch window types seamlessly and manipulate windows efficiently. We refer to this as "phone-aware interaction" in the remainder of the paper.
We address the problem of window management in the AR headset+smartphone interface and make the following contributions.
(1) We highlight the concept of window management operations in the AR headset+smartphone interface to better leverage the display and precise input capabilities of the two devices.
(2) Adopting a user-centered design perspective, we introduce a set of interactions to deal with window management operations based on the collection of user preferences.
(3) We conduct a qualitative user study to validate the usability of our window system and obtain a positive result.
In the remainder of this paper, we first present our formative interview, which explores users' opinions on the window management requirements of this fusion interface and the interaction methods they suggest. After presenting the design factors derived from our observations, we describe how our interaction methods map to window management operations and evaluate the overall usability of our example window system. We believe that Fusion Window illustrates how fusion interactions across smart devices can deliver a better experience and greater efficiency in the future.
2 Related work
The combination of AR headsets and smartphones can be traced back to the earlier concept of combining palmtop computers with 3D graphics[1]; however, neither smartphones nor AR headsets were available to consumers at that time. Recent research on combining AR headsets and smartphones has focused on using phones to manipulate virtual objects. In such work, the phone serves merely as a controller rather than as an input device for applications, which limits the functionality of the smartphone.
With the multitasking[2] requirements of everyday use, methods for interacting with windows play an increasingly important role. Research on managing windows in AR environments has recently become popular, and researchers have explored design frameworks for window types and window management operations in AR[3,4]. Similar to their work, we explore the design framework of window types and management operations, specifically for the AR headset+smartphone interface. Other studies have proposed combining the display capabilities of AR headsets and smartphones for information display and navigation[5]. This resembles window management, but these studies focused on integrating display capabilities to improve display efficiency. In contrast, we regard the window as the basic unit and study the operations required to transform window states, together with the corresponding interactions, across the entire AR headset and smartphone interface. We divide prior research into three categories: using phones to manipulate AR objects, window management in AR space, and cross-device information display.
2.1 Using phone to manipulate AR objects
Recently, Millette et al. explored bimanual interactions for 3D object manipulation, and proposed draw-and-drop and touch-and-draw interactions in a 3D modeling scenario[6]. Grandi et al. proposed a cooperative object manipulation method that uses a phone for 3D virtual object manipulation[7]. Wang et al. used a tablet and headset for 3D object manipulation in virtual space[8]. They focused primarily on 3D modeling and geometry education scenarios[9]. Researchers have also used touch panels and pens for manipulation tasks in virtual environments[10,11]. They found that using phones was faster and more efficient.
These studies explored a subset of the window management operations in the AR+phone interface. Other window-related operations, such as closing AR content and transferring windows between the phone and AR, were rarely addressed.
2.2 Window management in AR space
Windows arranged around users in virtual space are widely used in settings such as conferencing[12] and multitasking[3]. Prior research has explored suitable window modes for performing tasks[13] and displaying information[14]. Researchers have also considered how an appropriate layout can improve efficiency in multitasking scenarios[3]. Ens et al. introduced the Ethereal Planes framework, which informs the design of user interfaces in virtual space[4]. Beyond window modes, interaction methods for managing windows in AR environments have received increasing attention. Hand gestures are the primary interaction technique in practice. Lee et al. proposed projected windows that users manage in space with their fingertips[15]. Earlier, Pierce et al. proposed specific hand gestures for manipulating virtual objects[16]. Valkov et al. studied touch as an interaction modality in a tangible projected virtual space[17]. In summary, window display and interaction in AR have been widely studied, but few studies have addressed window management in the AR+smartphone interface.
2.3 Cross device information display
Displaying information on both AR headsets and smartphones has been considered a way to improve efficiency[18]. To support further interaction in this fusion interface, prior research explored the feasibility of dynamically distributing input across multiple displays to compensate for the inaccuracy of AR input methods[19-21]. Direct operation on the phone could provide many more interaction capabilities, including text entry and force touch; therefore, transferring windows between AR and the phone was considered indispensable. Prior research has also explored cross-device window transfer between virtual and real states. Grubert et al. proposed transferring content shown in AR to the phone, but did not discuss interaction methods[5]. Similarly, Yee et al. introduced peephole interaction, in which moving the phone to different positions shows different content on its screen[22]. Chen et al. introduced body-centric interaction, which uses the phone's location relative to body parts as a cue for displaying content on the phone screen[23]. Because of the precise input capabilities of smartphones, previous research has explored transferring virtual content to phones for further operation. However, given the powerful display capability of the AR headset, it is also necessary to transfer content from the smartphone to the AR headset. As users tend to distribute tasks across devices, transferring windows between states is an important task. Prior research has addressed parts of this problem; in our work, we systematically discuss the full set of transfer operations and other window management operations in the AR headset+smartphone interface.
In contrast, we mainly explore which window states should be included and which operations are needed to manage windows across these states in the AR headset+smartphone interface. We propose using phone-aware interactions to manage windows across different states. Similar to previous work by Chan et al., who used fingers for direct touch on intangible displays, we use a smartphone to manage windows directly[24].
3 Formative interview
The goal of this formative interview is to better understand users' opinions about window management requirements in the AR headset + smartphone interface. It covers three aspects: (1) which window states should be included, (2) which operations are required to manage windows dynamically across different tasks, and (3) which interaction methods users prefer for performing these operations.
3.1 Participants
Ten participants (two females and eight males, aged 20-26 years) were recruited for this interview. Four of them were quite familiar with AR headsets and had AR programming experience, one had no prior experience with an AR headset, and the others had used an AR headset once. All participants used smartphones proficiently.
3.2 Procedure
In the interview, we presented participants with three scenarios: watching videos, surfing the internet, and chatting. In the video scenario, users were shown a video image and asked to describe how they would perform common operations, such as pausing and moving windows, with the new interface. In the internet scenario, users were shown a search page and asked to describe their preferred interaction methods for common operations, such as browsing and text input, in the new fusion interface. In the chatting scenario, users were shown a one-on-one chat and asked to describe how they would use the new interface for common operations, such as browsing and sending pictures. We provided participants with an AR headset and a smartphone throughout the process. Six example windows, composed from three factors (spatial reference frame, distance, and size), were introduced to the participants, who could summon different windows at any time by pressing keys on a keyboard. Participants imagined the process within each scenario and were encouraged to describe the operations they would perform and to propose new windows beyond those we supplied. We observed their behavior during the interviews and recorded data using notes and videos. Based on these results, we built our design space and proposed a set of window management operations. We also conducted a user evaluation to validate our design.
3.3 Observation
We divide the observations from the formative interviews into four categories: (1) window status, (2) operation requirements, (3) interaction preferences, and (4) other concerns. We introduce these observations in the remainder of this section; our design space builds on them.
3.3.1 What window properties are in consideration?
O1: Proximity. Proximity is the distance between the window and the user in space. At different proximities, users tend to perform different types of tasks with different interaction methods. All participants (10/10) described using windows at different proximities for different tasks. Two kinds of proximity emerged from the interviews: (1) out-of-reach windows (often attached to walls or other flat surfaces) and (2) windows within arm's reach, and participants adopted different interaction strategies for each. Participants also tended to place windows with few input requirements (such as videos) far away, and windows with more input requirements (e.g., chatting) close by.
O2: Following attributes. Following attributes indicate the coordinate frame relative to which a window is positioned. All participants mentioned windows fixed in space while they remained in place. Some also referred to windows moving along with them when they moved, following the smartphone, head, or hand. Six participants (6/10) mentioned windows following their phones. Regarding the head, two participants (2/10) suggested that some global application icons should be bound to the head and move with it. Four participants (4/10) mentioned that windows could also follow the hand during interaction; for example, windows could appear when the palm turns up and disappear when it turns down.
O3: Attachment attributes. Attachment attributes indicate whether a window is attached to the surface of a real object. Some participants showed strong interest in unfolding windows on the surfaces of real objects. They mentioned attaching windows to a wall or a desk, which would let them manipulate the content inside the windows through a tangible interface and make the experience feel more natural and immersive. They also reported that attachment could help reduce visual interference.
3.3.2 How can windows be managed?
O4: Transfer. Most participants (8/10) mentioned the transfer operation, choosing to move windows between states for various purposes. For example: "when watching a video, I would like to search the target video on the phone because it is convenient and time-saving. But I would like to let the video display in AR for viewing because then it is wide". They also expected to transfer windows freely and quickly. Participants cared considerably about assigning each task to a suitable device, which they saw as a way to improve efficiency and reduce errors: "text entry in AR is so tiring and time-consuming. If I can transfer windows to the phone screen, it will be much more convenient".
O5: Discrete operation. Participants also reported other operations, including closing, minimizing, maximizing, and navigating. Among them, "Close" was mentioned most often (10/10): "When the movie ends, I need to shut it down." Because of the headset's small field of view (FoV), half of the participants (5/10) described a navigation requirement: a quick overview of all windows displayed in AR from which they could navigate to the desired one. To reduce the uncertainty caused by out-of-sight windows, two participants (2/10) introduced minimization and maximization operations. They explained that minimizing would bring all applications into their field of view, from which they could quickly magnify any of them when necessary. This gave them more confidence, and they considered it useful when a new message arrived.
O6: Selection and manipulation. All participants reported the need to select and manipulate the windows in the AR space. They wanted to select a window for further operations. Participants also recommended arranging windows according to importance and relevance dynamically, "I think sometimes I need to put some windows close. Maybe when I am reading, a website or notebook can appear as well for searching information and taking notes". In addition to adjusting the location, rotating and scaling abilities were expected.
3.3.3 What interaction method is preferred?
O7: Phone and hand gestures. Phone-based interactions and bare-hand gestures were the methods participants reported most often. When interacting with windows in AR, participants preferred to interact directly through hand gestures and phone-based interactions, and they chose the phone when the interaction involved converting a window between the phone and AR. Phone-based interactions fall into three main categories: touch, location and orientation, and posture. For touch, long press, slide, and multi-touch were preferred. For posture, participants showed great interest in interacting through the posture of the smartphone, such as a throwing motion or shaking.
O8: Phone controller. When interacting with a remote window in AR, the participants subconsciously regarded the phone as a remote control. When interacting with a close-up AR window, the phone was seen as a tool to directly control the AR content, similar to a pen in hand. However, regardless of the distance, participants wanted to map the location of the AR window to the location of the smartphone. Owing to the tangible feeling, participants tended to use phones to directly touch the virtual window. One user said "With a phone, my fingers can touch the screen, which makes me feel more confident that I can touch the virtual window".
O9: Distinguishing Touch. When interacting with virtual windows through the touch input of a smartphone, participants did not want these touches to be confused with ordinary phone operations. They therefore suggested multi-touch interactions with two fingers or the palm, as well as long presses, to distinguish these touches from the smartphone's original touch input. They often remarked during the experiment, "what if it is mistaken with the original touch, I would like to use two fingers to do the operation". Distinguishing touch was a top priority for all participants in our experiment. In addition, they reported that using touch as confirmation gave them more confidence and certainty.
O10: Interacting with certainty. Participants felt that in the AR headset + smartphone interface, they had to switch attention between AR and the smartphone. In this situation, when touch was used as the input mode, participants felt considerable uncertainty, because they did not know whether the current operation was acting on the phone or on AR. They assumed that the object their eyes were looking at or their head was facing was the one they were interacting with, but they still worried that they might enter input on the wrong device by mistake. In summary, participants wanted more deterministic interaction in the AR+phone interface, without unnecessary confusion between AR and the phone.
O11: Interaction consistency. Participants believed that all interactions should be related or share similar input modalities. Across window management tasks, the interaction style should be uniform, without significant differences between operations. Meanwhile, for operations with opposite semantics, they strongly suggested opposite interactions. For example, if sliding upward transfers a window from the smartphone screen to the AR space, then sliding downward should transfer a window in the opposite direction.
3.3.4 What else are users concerned about?
O12: Occlusion. In an AR environment, occlusion occurs not only between real and virtual scenes but also between the virtual scene and the self. In our interview, participants showed great concern about the potential conflicts between the smartphone and AR windows and the AR windows themselves. They mentioned that, to avoid occlusion, what they focus on should be displayed, and others should vanish. For example: "If I am looking at the following window around the phone, the fixed windows that caused occlusion should disappear until I look up to them so that I can get access to the information efficiently".
O13: Adsorption. When describing how to manage windows in a fusion interface, users mentioned "adsorption" several times. Because windows in AR are virtual and provide no tactile feedback, participants wanted a seamless visual connection between a virtual window and their fingers to compensate for the missing sense of touch. They therefore described managing windows by adsorption, in which a window attaches to the finger once touched: "By adsorption, I think I come across this window and then am able to move it".
O14: Move or Still. Our experiment was carried out in a sitting position; however, participants expressed that these interactions should also suit moving scenarios. They assumed that one-handed, tangible interaction would be better, because they were concerned about the accuracy and efficiency of hand gestures while moving and feared that they would not have a second hand free. This suggests that the system must work equally well in mobile and stationary scenarios.
O15: Default Settings. Even though the window state in AR is flexible, participants wanted a default setting for each operation. For example, when transferring a window from the phone screen to a fixed window, they would like it placed at a default distance instead of specifying the location themselves. In this situation, a discrete operation is preferable to a continuous one. Other operations are similar: "If a window can directly follow the phone, I hope it will be in a fixed position, otherwise I need to determine the location every time, too much trouble for me".
4 Design space
We built our design space from the observations obtained in the formative interviews. It comprises ten factors, each a guide that interaction designers can consider when designing applications for a fusion interface. Across diverse scenarios, these factors can help designers assign window management operations appropriately, and they may help shape future AR headset and smartphone interfaces.
D1: Input and Output. Smartphones excel at input and AR headsets at output. Users have particular input and output preferences for different tasks and accordingly tend to assign different tasks to the two devices. Given the diversity of inputs and outputs, users need intuitive interaction methods that keep memory cost and inconvenience low. Which input and output properties suit which applications? How can they be combined intuitively?
D2: Continuity. Window management operations can be performed between and within tasks. Some operations are performed more frequently than others. The continuity of the task should not be interrupted by window management operations. To reduce the risk of disturbances, there is a need for interaction similarity, precision, and efficiency. There should not be an obvious distinction that interrupts task continuity. How should diverse interaction methods be balanced in order to achieve continuity? How should users interact to maintain similarity to the original task without seams?
D3: Default or Flexible. Windows in space have many degrees of freedom in their attributes. Flexible operations give users more freedom but at a higher operation cost. Default settings let users manage windows quickly but may require further adjustment afterwards. Balancing user-defined requirements against default settings, how many degrees of freedom should each task allow? What should the default settings be?
D4: Dominant or nondominant. In diverse situations, the users' hands are occupied to varying degrees. The one-handed interaction method should be routinely used. Considering the intuition of hand gestures in the AR space, users are also accustomed to interacting through the hand. The use of a dominant hand or non-dominant hand differs in terms of accuracy and fatigue level. What is a reasonable method to combine these interactions? Is the dominant hand or non-dominant hand more suitable for interaction? When performing window management operations, how should this hand be applied in different scenarios?
D5: Unimodal or Multimodal. Smartphones are touch-sensitive and have abundant sensing abilities. The AR headset has hand gesture, gaze, and head movement interaction capabilities. Different input modalities represent various aspects of the user intentions. An adequate combination of multiple input modalities could be more accurate but may increase memory costs. How can they be applied in a uniform manner to improve efficiency and overall satisfaction?
D6: Interleaved or Simultaneous. Various interaction modalities coexist in the AR headset and smartphone interface, which may leave users uncertain about which object their interactions apply to. Simultaneous inputs can increase certainty, accuracy, and even intuitiveness, but they may also demand more attention. Are interleaved or simultaneous interactions more feasible? How can simultaneous multimodal interactions be made to feel natural and intuitive?
D7: Tangible or Intangible. Patricia et al. recently found that haptic feedback gives users a greater sense of agency than visual feedback does in a touchless interface[25]. Our window system includes both real and virtual windows, and tangible input originally belongs to the smartphone. Performing tangible input in the combined interface should not interrupt the original smartphone operations, yet applying a mode switch would reduce interaction consistency. What, then, is the proper way to apply tangible input? Is intangible input accurate enough?
D8: Direct or indirect. In close proximity, users tend to apply direct inputs, and at a distance, indirect inputs. The combination of AR headsets and smartphones provides a wide space. To reach and manipulate windows at different proximities, is direct or indirect interaction more adequate? Direct interaction in virtual space is intuitive but causes more fatigue; indirect interaction is more relaxed but requires an additional selection step. Direct and indirect interactions should therefore be applied appropriately to different window management operations.
D9: On the go or In-situ. Window management interactions on the go and in situ differ in user cognition. In addition, the users' ability to interact in the two scenarios differs as well. When users are in situ, it is more appropriate to show spatially referenced windows, and their actions are more accurate. When on the go, it is more reasonable to display windows in a user-referenced frame, and the users' actions are less precise. In different scenarios, what window management operations are in demand? How can we ensure stability and accuracy while interacting using different methods?
D10: Context-aware. Users find it more realistic when virtual windows are attached to real surfaces. When designing window management interactions, we should consider whether context information is involved; if it is, the fusion interface must respect physical laws and maintain appropriate occlusion relationships. In which scenarios should context-aware window display be provided?
5 Interaction design
We organize our window management operations into five categories: transfer, selection, transformation, close, and navigation. Each category covers several situations, giving 14 operations in total, as shown in Figure 2.
We propose an example interaction set that allows users to interact through the touch and posture of a 3D-tracked smartphone, which we call phone-aware interactions. There are three main advantages of phone-aware interactions: (1) they provide users with an intuitive and direct interaction method, (2) they provide users with appropriate tactile feedback, and (3) users can interact with one hand.
5.1 Window status
In the AR headset + smartphone interface, users are surrounded by multiple 2D planes in virtual and real states. This resembles Ens et al.'s work[4], which provides a design framework for 2D planar interfaces in 3D space. Based on the seven dimensions of that framework and our observations, we introduce three types of windows, shown in different colors in Figure 3: windows on the smartphone screen, windows fixed in AR space, and windows following the phone. A fixed window is a static application window suspended in virtual space. A following window is a dynamic application window that follows the smartphone's position and orientation. A window on the smartphone is an ordinary phone application window of the kind we use every day.
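The three window states, and the way a following window stays attached to the phone, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the names `WindowState`, `Pose`, and `following_pose` are hypothetical, orientation is reduced to a single yaw angle about the vertical axis (the text describes tracking full position and orientation), and the offsets are arbitrary example values.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto


class WindowState(Enum):
    SCREEN = auto()     # ordinary application window on the phone display
    FIXED = auto()      # static window suspended in AR space
    FOLLOWING = auto()  # window that tracks the phone's position and orientation


@dataclass
class Pose:
    x: float    # metres, world frame
    y: float    # height, metres
    z: float    # metres, world frame
    yaw: float  # rotation about the vertical axis, radians (simplifying assumption)


def following_pose(phone: Pose, side_offset: float, depth_offset: float = 0.0) -> Pose:
    """World pose of a following window that keeps a constant offset in the phone's frame.

    The sign of side_offset chooses which side of the phone the window sits on.
    The window keeps the phone's height and yaw, so it moves and turns with the phone.
    """
    c, s = math.cos(phone.yaw), math.sin(phone.yaw)
    # Rotate the local (side, depth) offset by the phone's yaw, then translate by the phone position.
    x = phone.x + c * side_offset + s * depth_offset
    z = phone.z - s * side_offset + c * depth_offset
    return Pose(x, phone.y, z, phone.yaw)
```

Because the window pose is recomputed from the phone's tracked pose on every update, moving or rotating the phone carries the window along while its offset relative to the phone stays constant.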
5.2 Transfer
The most important operation in our window management system is converting a window to a different state, because interaction capabilities are distributed across both devices when users interact with multiple windows simultaneously. In this study, we propose six interactions, one for each directed conversion between the three window states.
(1) Screen to fixed. To transfer a window from the smartphone screen to a fixed position in virtual space (Figure 2a), users press a blank space and slide up with one finger (O4, O7, O9). The window then appears in the air at the same height and angle as the AR headset; by default, our system places it 50 cm in front of the user (O15). The smartphone screen touch provides tangible input, and the slide direction hints at the window's transition, so users can manage windows intuitively.
(2) Fixed to screen. Users move their phone to the location of the fixed window, long-press, and slide down with one finger, as shown in Figure 2b (O4, O9, O11). The window then becomes accessible on the phone screen, where further operations can be performed with precise touch input. Following the preference for consistency observed in the formative interviews, this operation is designed as the exact inverse of the previous one.
(3) Screen to following. To transfer a window from the phone screen to the following state (Figure 2c), users press a blank area and slide left or right (O4, O9). After the finger is released, the window appears next to the phone at the same height as the phone screen. Whenever users move or rotate the phone, following windows track it and maintain their relative position. In this way, extended information can be displayed to supplement the window on the smartphone screen (O2).
(4) Following to screen. As the counterpart of the previous operation, users long-press the edge and slide with one finger in the opposite direction to bring the following window onto the smartphone screen, as shown in Figure 2d (O9, O11). This symmetric design reduces memory load and lets users operate following windows on the smartphone with a quick slide.
(5) Fixed to following. To make a fixed window follow the phone (Figure 2e), users move their phone to the fixed window's location, long-press the edge of the phone, and slide down (O9). The fixed window then follows the phone seamlessly, positioned beside it. We designed this interaction around a pull-back metaphor so that users can easily perform and remember it.
(6) Following to fixed. As the inverse of the previous operation, users long-press the edge and slide up to fix the following window in space (O9, O11). When the finger is released, the window appears at the same height as the smartphone, as shown in Figure 2f.
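Taken together, the six transfers above form a small state machine over the three window states, keyed by where the finger presses and which way it slides. The Python sketch below encodes that mapping as a lookup table and computes the default placement for transfer (1). The region labels ("screen", "edge"), direction labels, and all function names are hypothetical. The placement assumes a y-up frame in which yaw rotates about the vertical axis and yaw 0 faces +z (as in Unity), and it keeps only the headset's height and heading, ignoring pitch. Transfer (6) instead places the window at the phone's height, which this sketch does not cover.

```python
import math
from enum import Enum
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


class State(Enum):
    SCREEN = "screen"
    FIXED = "fixed"
    FOLLOWING = "following"


# (source state, where the finger presses, slide direction) -> target state.
# "screen" = a blank area of the phone screen; "edge" = the phone's edge.
# For FIXED sources, the phone must first be moved to the window's location.
TRANSITIONS: Dict[Tuple[State, str, str], State] = {
    (State.SCREEN, "screen", "up"): State.FIXED,        # (1)
    (State.FIXED, "screen", "down"): State.SCREEN,      # (2) long-press
    (State.SCREEN, "screen", "side"): State.FOLLOWING,  # (3) left or right
    (State.FOLLOWING, "edge", "inward"): State.SCREEN,  # (4) opposite of (3)
    (State.FIXED, "edge", "down"): State.FOLLOWING,     # (5) pull back
    (State.FOLLOWING, "edge", "up"): State.FIXED,       # (6)
}


def next_state(source: State, region: str, direction: str) -> Optional[State]:
    """Return the target state, or None if the gesture triggers no transfer."""
    return TRANSITIONS.get((source, region, direction))


DEFAULT_DISTANCE_M = 0.5  # the 50 cm default stated for transfer (1)


def spawn_fixed_window(head_pos: Vec3, head_yaw_deg: float,
                       distance: float = DEFAULT_DISTANCE_M) -> Tuple[Vec3, float]:
    """Position and yaw for a window sent from the screen to the fixed state.

    The window is placed `distance` metres along the headset's horizontal
    viewing direction, at head height, and takes the headset's yaw.
    """
    yaw = math.radians(head_yaw_deg)
    x, y, z = head_pos
    position = (x + distance * math.sin(yaw), y, z + distance * math.cos(yaw))
    return position, head_yaw_deg
```

Keeping the mapping in one table makes the inverse pairs, such as (1) and (2), easy to check and to extend with further gestures.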
5.3 Selection
Selection is a brief operation that typically precedes other operations, such as transformation. To select a following window, users press the left or right edge of the smartphone, depending on which side the window is on (O6, O10). To select a fixed window, users move their smartphone to the window's location and press a blank area of the screen (O9, O11). We combined touch with the 3D-tracked smartphone position to improve selection accuracy.
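Selecting a fixed window amounts to a hit test between the tracked phone position and the window rectangles, gated by a screen press. A minimal Python sketch follows, assuming windows rotate only about the vertical axis (the same y-up, yaw-0-faces-+z convention as Unity) and assuming a hypothetical 5 cm tolerance perpendicular to the window plane. The class and function names are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FixedWindow:
    name: str
    center: Vec3
    yaw_deg: float  # rotation about the vertical (y) axis only
    width: float    # metres
    height: float   # metres


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def to_window_frame(window: FixedWindow, p: Vec3) -> Vec3:
    """Express point p in the window's frame: x right, y up, z along the normal."""
    yaw = math.radians(window.yaw_deg)
    d = (p[0] - window.center[0], p[1] - window.center[1], p[2] - window.center[2])
    right = (math.cos(yaw), 0.0, -math.sin(yaw))
    normal = (math.sin(yaw), 0.0, math.cos(yaw))
    return _dot(d, right), d[1], _dot(d, normal)


def pick_fixed_window(windows: List[FixedWindow], phone_pos: Vec3,
                      screen_pressed: bool,
                      depth_tolerance: float = 0.05) -> Optional[FixedWindow]:
    """Return the fixed window the phone is held against while the screen is pressed."""
    if not screen_pressed:
        return None
    best: Optional[Tuple[float, FixedWindow]] = None
    for w in windows:
        x, y, z = to_window_frame(w, phone_pos)
        inside = abs(x) <= w.width / 2 and abs(y) <= w.height / 2
        if inside and abs(z) <= depth_tolerance:
            # If windows overlap, prefer the one whose plane is nearest the phone.
            if best is None or abs(z) < best[0]:
                best = (abs(z), w)
    return best[1] if best is not None else None
```

Requiring the press as well as the spatial overlap means that simply sweeping the phone through a window does not select it.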
5.4 Transform-move, rotate, scale
Transformation is available only for windows fixed in AR space. The other two types follow the phone at all times and remain within the user's sight whenever necessary, so they require little adjustment. To transform a window (Figures 2i, 2j, and 2k), users first select it using the selection interaction described in Section 5.3. They then move the smartphone while keeping a finger pressed on the screen (O6, O10), and the fixed window follows the phone's movement along all three axes. When the finger is lifted, the selected window stays fixed in mid-air. If users rotate the phone, the selected window rotates simultaneously by the same angle as the smartphone about all three axes. Selecting a corner of the window with the smartphone activates scaling, after which moving the smartphone scales the window. Note that fixed windows do not move or rotate while being scaled.
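One way to express the move/rotate and scale modes is relative to the poses captured at the moment of the press. The Python sketch below assumes exactly that: while the finger is down, the window translates by the phone's displacement and rotates in place by the phone's change in orientation; in scale mode, the window size is multiplied by the ratio of the phone's current and initial distances from the window centre. This scaling rule and the `min_size` clamp are assumptions, not the prototype's documented behavior, and all names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), unit length


def q_mul(a: Quat, b: Quat) -> Quat:
    """Hamilton product a * b (apply b first, then a)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def q_conj(q: Quat) -> Quat:
    """Conjugate, which is the inverse for a unit quaternion."""
    return (q[0], -q[1], -q[2], -q[3])


@dataclass
class Pose:
    pos: Vec3
    rot: Quat


def drag_window(window_at_press: Pose, phone_at_press: Pose, phone_now: Pose) -> Pose:
    """Move/rotate mode: apply the phone's motion since the press to the window."""
    new_pos = tuple(window_at_press.pos[i] + phone_now.pos[i] - phone_at_press.pos[i]
                    for i in range(3))
    # World-frame rotation the phone has undergone since the press.
    delta_rot = q_mul(phone_now.rot, q_conj(phone_at_press.rot))
    return Pose(new_pos, q_mul(delta_rot, window_at_press.rot))


def scale_window(size_at_press: Tuple[float, float], center: Vec3,
                 phone_pos_at_press: Vec3, phone_pos_now: Vec3,
                 min_size: float = 0.05) -> Tuple[float, float]:
    """Scale mode (corner grabbed): size follows the phone's distance from the centre.

    Position and orientation are not changed. The aspect ratio is preserved,
    and the shorter side is never smaller than min_size. Assumes the corner
    grab starts away from the centre, so the initial distance is non-zero.
    """
    r0 = math.dist(phone_pos_at_press, center)
    r1 = math.dist(phone_pos_now, center)
    w, h = size_at_press
    factor = max(r1 / r0, min_size / min(w, h))
    return w * factor, h * factor
```

Working from the press-time poses, rather than accumulating per-frame increments, avoids drift from tracking noise during long drags.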
5.5 Close
Closing a window on the smartphone works exactly as in ordinary smartphone use. In this section, we introduce how users close windows in the following and fixed states:
(1) Close fixed window. To close a fixed window in AR, the user taps the close button in the upper-right corner of the window with the smartphone (O7, O8), and the window vanishes (Figure 2g). Tapping is an intuitive interaction that participants mentioned for interacting with virtual content, and it is convenient to perform with one hand.
(2) Close following window. To close a following window, the user quickly shakes the smartphone to either side (O7; Figure 2h), and the window shuts almost immediately. This posture-based gesture is easy to perform and feels natural.
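The text does not state how the shake is recognized. A minimal sketch, assuming the phone streams lateral acceleration with gravity removed, is to count direction reversals of strong acceleration peaks inside a short time window. The class name and all thresholds below (12 m/s², two reversals, 0.5 s) are hypothetical values that would need tuning, not parameters from the prototype.

```python
from collections import deque
from typing import Deque, Tuple


class ShakeDetector:
    """Detect a quick side-to-side shake from lateral acceleration samples.

    Hypothetical rule: |a| must exceed `threshold` (m/s^2, gravity removed)
    with alternating sign at least `min_reversals` times within `window_s`.
    """

    def __init__(self, threshold: float = 12.0, min_reversals: int = 2,
                 window_s: float = 0.5) -> None:
        self.threshold = threshold
        self.min_reversals = min_reversals
        self.window_s = window_s
        self._peaks: Deque[Tuple[float, int]] = deque()  # (time, sign)

    def update(self, t: float, lateral_accel: float) -> bool:
        """Feed one sample; return True when a shake is recognized."""
        # Forget peaks that are older than the time window.
        while self._peaks and t - self._peaks[0][0] > self.window_s:
            self._peaks.popleft()
        if abs(lateral_accel) >= self.threshold:
            sign = 1 if lateral_accel > 0 else -1
            # Record only direction changes, so one long push counts once.
            if not self._peaks or self._peaks[-1][1] != sign:
                self._peaks.append((t, sign))
        # n alternating peaks contain n - 1 reversals.
        if len(self._peaks) - 1 >= self.min_reversals:
            self._peaks.clear()
            return True
        return False
```

Requiring alternating peaks, rather than a single spike, helps distinguish a deliberate shake from simply moving the phone quickly to a new position.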
5.6 Navigation
To navigate among all windows in the AR headset and on the smartphone, a user presses the smartphone screen and flips the phone over, as shown in Figure 2l. The windows then gather around the smartphone on a curved (cambered) surface and move with it. When the finger is lifted, the windows are displayed in front of the user (O7).
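The curved layout used during navigation can be approximated by spacing window slots evenly along a horizontal arc centred on the phone and recomputing the slots every frame so that they move with the phone. The Python sketch below assumes a 0.3 m radius and a 120° span, which are hypothetical values since the text gives no dimensions, and uses the same y-up, yaw-0-faces-+z convention as Unity. The function name `arc_layout` is illustrative only.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def arc_layout(n: int, phone_pos: Vec3, phone_yaw_deg: float,
               radius: float = 0.3, span_deg: float = 120.0
               ) -> List[Tuple[Vec3, float]]:
    """Place n window slots evenly on a horizontal arc centred on the phone.

    Returns (position, yaw_deg) per slot. Each slot's yaw equals its angle on
    the arc, so the windows lie tangent to the curved surface. The arc is
    centred on the phone's heading, so the middle slot sits straight ahead.
    """
    if n <= 0:
        return []
    step = span_deg / (n - 1) if n > 1 else 0.0
    start = -span_deg / 2 if n > 1 else 0.0
    x, y, z = phone_pos
    slots = []
    for i in range(n):
        yaw_deg = phone_yaw_deg + start + i * step
        yaw = math.radians(yaw_deg)
        pos = (x + radius * math.sin(yaw), y, z + radius * math.cos(yaw))
        slots.append((pos, yaw_deg))
    return slots
```

For three windows with a 90° span, the slots sit at yaw -45°, 0°, and 45° relative to the phone's heading.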
6 Implementation
We prototyped our system on Magic Leap One AR glasses (FOV: 40° horizontal, 30° vertical; RAM: 8 GB) running Lumin OS and a Huawei P20 smartphone running Android. The smartphone and the AR headset were connected over a wireless network (Figure 4).
Our system supports rendering two types of virtual windows. Fixed (float) windows are displayed on a headset-centered sphere, and following windows are displayed next to the phone. To render following windows accurately, we used OptiTrack, a motion capture and 3D tracking system, to track the smartphone's position and orientation. Because OptiTrack reports coordinates in its own world coordinate system, we apply a coordinate-system transformation to obtain the phone's coordinates in Unity. The transformation matrix can be calculated from several pairs of corresponding coordinates in the two coordinate systems.
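The text states that the matrix is computed from corresponding point pairs but does not name the solver. A minimal sketch, assuming an ordinary least-squares fit of a general 3×4 affine matrix, is shown below in dependency-free Python. Unlike a pure rotation-plus-translation fit, an affine fit can also absorb a change of handedness; Unity uses a left-handed coordinate system. The function names are hypothetical. The fit needs at least four non-coplanar pairs, and additional pairs average out tracking noise.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _solve(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Solve a @ x = b (a: n x n, b: n x m) by Gauss-Jordan elimination with partial pivoting."""
    n, m = len(a), len(b[0])
    aug = [a[i][:] + b[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set (need >= 4 non-coplanar points)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [aug[r][k] - f * aug[col][k] for k in range(n + m)]
    return [[aug[i][n + j] / aug[i][i] for j in range(m)] for i in range(n)]


def fit_affine(src: Sequence[Vec3], dst: Sequence[Vec3]) -> List[List[float]]:
    """Least-squares 3x4 matrix M with dst ~= M @ [x, y, z, 1] for each pair."""
    rows = [[x, y, z, 1.0] for x, y, z in src]
    # Normal equations: (A^T A) X = A^T B, with X = M transposed (4 x 3).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    atb = [[sum(r[i] * q[j] for r, q in zip(rows, dst)) for j in range(3)]
           for i in range(4)]
    x = _solve(ata, atb)
    return [[x[i][j] for i in range(4)] for j in range(3)]


def apply_affine(m: List[List[float]], p: Vec3) -> Vec3:
    """Map a tracker-space point into the target (e.g., Unity) space."""
    x, y, z = p
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in m)
```

In practice, the calibration pairs could be collected by placing a tracked marker at several known positions in the scene. For large or noisy data sets, a numerically sturdier solver (QR or SVD) would be preferable to the normal equations used here.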
Currently, users can open various applications (such as video players, web browsers, photo editors, calendars, and maps) on the smartphone and then transfer their windows to the virtual space in AR for browsing (Figure 5b), which lets them benefit from the wide display of the AR headset. Users can also intuitively adjust these virtual windows to the desired size, location, and orientation (Figure 5c) through the touch and posture input of the 3D-tracked smartphone. A window on the smartphone screen can also be transferred to the following state (Figure 5a), so that it is always available and can supplement the content on the smartphone screen. In addition, users can retrieve fixed and following windows from AR back to the smartphone at any time. With multiple windows displayed in space, users can navigate to find the target window. Finally, a fixed window can be closed by tapping its upper-right corner with the smartphone (Figure 5d), and a following window can be closed by shaking the smartphone. Examples of the window management operations in our prototype are shown in Figure 5.
7 User evaluation
The goal of this evaluation was to validate the overall usability of the proposed window management operations and to collect users' subjective feedback on the interactions.
7.1 Participants
We recruited 10 participants (seven males and three females) aged 20 to 24 years from the campus. All participants had experience using an AR headset and a smartphone. Three of them were AR experts with development experience on AR headsets (HoloLens and Magic Leap), and all participants were expert smartphone users.
7.2 Procedure
The experiment consisted of two parts: a warm-up and a window placement task. The entire procedure took approximately 40 minutes. Participants were encouraged to express their opinions about each interaction throughout the process.
Step one: warm-up (10–15 minutes). We introduced all of the interactions to the participants: transfer, navigation, selection, transformation, and closing. They could practice each interaction for as long as they wanted until they became familiar with it.
Step two: window placement task (20–25 minutes). In this step, we simulated a practical window placement task to validate whether our window management operations were useful and flexible. The system provided seven windows: video, book, photo, calendar, map, website, and chat. First, participants were required to place five arbitrary windows into the three states: phone screen, following window, and fixed window. They were then asked to place the windows in a predefined order. Finally, they were required to close all the windows.
After the experiment, participants rated the interactions on five criteria: (1) ease of learning, (2) ease of memory, (3) ease of use, (4) fatigue level, and (5) willingness to use.
7.3 Results
Overall, participants' feedback was positive. The results suggest that our interactions are natural, efficient, and fun to use, and that participants are interested in using AR headset + smartphone interfaces for mobile use in the future.
Figure 9 shows users' subjective ratings of the 14 interactions, grouped into five interaction types. All 14 interactions were well received: participants rated them as easy to learn (score = 6.4), easy to remember (score = 6.1), convenient to use (score = 6.2), causing little fatigue (score = 6.1), and desirable to use (score = 6.1). These results also suggest the usability and potential of the AR headset + smartphone interface for future mobile device use. In addition, we analyzed user feedback to better understand the strengths and limitations of the window system.
(1) Transfer. Transfers between the smartphone window and fixed windows were the most satisfying operations. Observations and participants' comments indicated that these interactions are intuitive, easy, and convenient to perform ("This is amazing, I think I can fully understand and master this operation after using it for the first time"). However, transfers between following and fixed windows, and the transfer of a following window to the smartphone, were less preferred because the edge of the phone is difficult to reach ("It's a little bit hard to reach the edge"). Overall, users considered transfer a useful operation. They suggested transferring a video or photo to a fixed position in space for better viewing, and making maps or books follow the smartphone so that they could take notes or search on the phone at the same time.
(2) Selection. Participants described selection as a basic window management interaction. Most participants found selecting both types of windows easy and intuitive to understand. However, selecting following windows was less preferred because the edge was inconvenient to reach. Comments also revealed that participants were willing to select fixed windows with the smartphone, which they found intuitive and efficient. This is consistent with our observations from the interviews (O5 and O7). Specifically, some participants stated that using a phone for selection gave them more confidence. Consistent with O4, we speculate that the feeling of "touch" is stronger with a smartphone than with a mid-air gesture ("I think this is great. I prefer this to the hand gestures. It seems more accurate").
(3) Transform. Participants greatly appreciated the transform operations and considered them useful in the combined interface. They also indicated that adjusting the window layout helped them fit it to their personal work environment. They considered the operations intuitive and more efficient than bare-hand gestures ("I think it's better than using the hand directly. It seems more accurate"). Reaching out within arm's reach is acceptable because window management operations are temporary, frequent, and quick to perform. Participants also found the operations easy to learn and perform ("It's quite easy to understand and the operation is very simple").
(4) Close. Close operations use the smartphone as the input device. The results showed that participants were highly satisfied with this interaction. They found shaking the smartphone to close a following window particularly interesting, simple, and convenient ("I like this operation a lot, it's really fun. I feel the virtual window is closely connected with my smartphone, so shake to close is really natural and magical"). They also found it easy and intuitive to close a fixed window by tapping its upper-right corner with the smartphone; the operation is direct enough to perform without having to recall it. These results suggest considerable potential for using smartphone gestures to interact with content closely connected to the phone, because such gestures give users a greater sense of control. Overall, participants were positive about the close interactions and considered closing a necessary operation for quickly shutting down application windows that are no longer needed.
(5) Navigation. Navigation was intended to help users quickly locate a target window within the limited FOV. However, participants were less interested in this operation and preferred turning their heads to using the phone ("I would like to turn my head even though this operation is simple. It's less laborious"). Nevertheless, most participants found it easy to learn and perform. Some participants reported that flipping over a smartphone was interesting and considered such phone gestures fun when connected with AR content.
8 Discussion
In this study, we emphasized the lack of associations between applications running on the two devices. Note that we were not concerned with data synchronization; rather, we focused on enabling users to assign the diverse inputs and outputs of an application to a suitable device. In our window management system, users can transfer and manipulate windows distributed across the virtual space and the smartphone. The system serves as an effective supplement for improving efficiency when AR headsets and smartphones are combined for general use. Participants' feedback also highlighted the importance and potential of the AR headset + smartphone interface for future use.
The interactions we proposed were designed based on users' preferences and suggestions. Using the touch and posture of a 3D-tracked smartphone as an interaction method proved efficient in our study, enabling users to directly manage windows distributed across both devices with minimal seams. The results indicate good efficiency and feasibility, even though we provide only a small example set of possible interactions. Further window management operations remain to be explored. We believe that our design factors could help designers create more operations and alternative interactions for diverse scenarios.
Our study has three limitations. First, we used OptiTrack equipment to track the position and orientation of the smartphone. Although this reduces the time required to identify the smartphone, it limits the mobility of the system. Moreover, markers must be attached to the device, which affects the smartphone's appearance and may even occlude part of the user interface. Second, in our prototype, there is occlusion between the smartphone and the virtual windows when users perform interactions, especially transform operations, which affects users' spatial perception. Third, the evaluation was subjective: users assessed the system only through descriptions and scores. In the future, we plan to measure the user experience quantitatively, including speed and error rate.
9 Conclusion
We are the first to design and evaluate window management operations in an AR headset+smartphone interface to better leverage the wide display and precise input capabilities. We introduced a series of window management operations to address the absence of an association between the applications running on these two devices. We proposed design factors that informed the design of our interactions, based on a formative interview. Our design factors can also benefit researchers and designers in designing alternative interactions. We validated the usability of our window management system by implementing a prototype that supported all the operations we introduced, and we conducted a user study to demonstrate the feasibility and potential of our window management system for use in the future.



Fitzmaurice G W. Situated information spaces and spatially aware palmtop computers. Communications of the ACM, 1993, 36(7): 39–49 DOI:10.1145/159544.159566


Wallis C. The multitasking generation. Time, 2006, 167(13): 48–55


Ens B M, Finnegan R, Irani P P. The personal cockpit: a spatial interface for effective task switching on head-worn displays. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2014, 3171–3180 DOI:10.1145/2556288.2557058


Ens B, Hincapié-Ramos J D, Irani P. Ethereal planes: a design framework for 2D information space in 3D mixed reality environments. In: Proceedings of the 2nd ACM symposium on Spatial user interaction. 2014, 2–12 DOI:10.1145/2659766.2659769


Grubert J, Heinisch M, Quigley A, Schmalstieg D. MultiFi: multi fidelity interaction with displays on and around the body. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 2015, 3933–3942 DOI:10.1145/2702123.2702331


Millette A, McGuffin M J. DualCAD: integrating augmented reality with a desktop GUI and smartphone interaction. In: 2016 IEEE International Symposium on Mixed and Augmented Reality. Merida, Mexico, IEEE, 2016, 21–26 DOI:10.1109/ismar-adjunct.2016.0030


Grandi J G, Debarba H G, Nedel L, Maciel A. Design and evaluation of a handheld-based 3D user interface for collaborative object manipulation. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2017, 5881–5891 DOI:10.1145/3025453.3025935


Wang J, Lindeman R. Coordinated 3D interaction in tablet- and HMD-based hybrid virtual environments. In: Proceedings of the 2nd ACM symposium on Spatial user interaction. 2014, 70–79 DOI:10.1145/2659766.2659777


Kaufmann H, Schmalstieg D. Mathematics and geometry education with collaborative augmented reality. Computers & Graphics, 2003, 27(3): 339–345 DOI:10.1016/s0097-8493(03)00028-1


Bowman D A, Wineman J, Hodges L F, Allison D. Designing animal habitats within an immersive VE. IEEE Computer Graphics and Applications, 1998, 18(5): 9–13 DOI:10.1109/38.708555


Szalavári Z, Gervautz M. The personal interaction panel: a two-handed interface for augmented reality. Computer Graphics Forum, 1997, 16(3): C335–C346 DOI:10.1111/1467-8659.00137


Billinghurst M, Bowskill J, Jessop M, Morphett J. A wearable spatial conferencing space. In: Digest of Papers. Second International Symposium on Wearable Computers. Pittsburgh, PA, USA, IEEE, 1998, 76–83 DOI:10.1109/iswc.1998.729532


Feiner S, MacIntyre B, Haupt M, Solomon E. Windows on the world: 2D windows for 3D augmented reality. In: Proceedings of the 6th annual ACM symposium on User interface software and technology. Atlanta, Georgia, USA, New York, ACM Press, 1993, 145–155 DOI:10.1145/168642.168657


Billinghurst M, Bowskill J, Dyer N, Morphett J. An evaluation of wearable information spaces. In: Proceedings of IEEE 1998 Virtual Reality Annual International Symposium. Atlanta, GA, USA, IEEE, 1998, 20–27 DOI:10.1109/vrais.1998.658418


Lee J H, An S G, Kim Y, Bae S H. Projective windows: bringing windows in space to the fingertip. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Montreal QC Canada, New York, NY, USA, ACM, 2018 DOI:10.1145/3173574.3173792


Pierce J S, Forsberg A S, Conway M J, Hong S, Zeleznik R C, Mine M R. Image plane interaction techniques in 3D immersive environments. In: Proceedings of the 1997 symposium on Interactive 3D graphics. Providence, Rhode Island, USA, New York, ACM Press, 1997, 39 DOI:10.1145/253284.253303


Valkov D, Steinicke F, Bruder G, Hinrichs K. 2D touching of 3D stereoscopic objects. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2011, 1353–1362 DOI:10.1145/1978942.1979142


Henderson S J, Feiner S. Evaluating the benefits of augmented reality for task localization in maintenance of an armored personnel carrier turret. In: 2009 8th IEEE International Symposium on Mixed and Augmented Reality. Orlando, FL, USA, IEEE, 2009, 135–144 DOI:10.1109/ismar.2009.5336486


Al-Sada M, Ishizawa F, Tsurukawa J, Nakajima T. Input Forager: a user-driven interaction adaptation approach for head worn displays. In: Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia. 2016, 115–122 DOI:10.1145/3012709.3012719


Arora R, Kazi R H, Grossman T, Fitzmaurice G, Singh K. SymbiosisSketch: combining 2D & 3D sketching for designing detailed 3D objects in situ. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 2018, 1–15 DOI:10.1145/3173574.3173759


Wang J, Lindeman R W. Object impersonation: towards effective interaction in tablet- and HMD-based hybrid virtual environments. 2015 IEEE Virtual Reality (VR), 2015, 111–118 DOI:10.1109/vr.2015.7223332


Yee K P. Peephole displays: pen interaction on spatially aware handheld computers. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2003, 1–8 DOI:10.1145/642611.642613


Chen X, Marquardt N, Tang A, Boring S, Greenberg S. Extending a mobile device's interaction space through body-centric interaction. In: Proceedings of the 14th international conference on Human-computer interaction with mobile devices and services. 2012, 151–160 DOI:10.1145/2371574.2371599


Chan L W, Kao H S, Chen M Y, Lee M S, Hsu J, Hung Y P. Touching the void: direct-touch interaction for intangible displays. In: Proceedings of the 28th international conference on Human factors in computing systems. Atlanta, Georgia, USA, New York, ACM Press, 2010, 2625–2634 DOI:10.1145/1753326.1753725


Cornelio Martinez P I, de Pirro S, Vi C T, Subramanian S. Agency in mid-air interfaces. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. Denver Colorado USA, New York, NY, USA, ACM, 2017, 2426–2439 DOI:10.1145/3025453.3025457