The Icon Bar: General: Website
 
  Website
 
Adrian Lees Message #123504, posted by adrianl at 14:38, 10/2/2015
Member
Posts: 1565
Hi, I can no longer access and update the drobe account where I used to make available my non-commercial software, so I thought it worth mentioning that I have a new home:
http://sendiri.co.uk
I shall release any updates to Geminus, the Cortex build of Aemulor, and any other free software there.

Best wishes,

A
 
Lee Shepherd Message #123514, posted by leeshep at 22:40, 25/2/2015, in reply to message #123504
Member
Posts: 23
Thanks for all your great work Adrian.
Do you think Cino will see a release for the new machines?

Lee
 
Adrian Lees Message #123515, posted by adrianl at 18:01, 26/2/2015, in reply to message #123514
Member
Posts: 1565
Possibly. It would be a shame not to see it one day reach fruition now that there are machines available that have sufficiently capable memory systems, and the advantage of the NEON SIMD extension. That said, it remains a lot of work, not least because Cino was never developed beyond the prototype stage once the IOP321's memory latency was found to be a major impediment, and it is unlikely to be commercially viable given the size of the market nowadays.

So, if it does, I can offer no timescale etc; it would likely have to be a labour of love conducted in spare time outside of sustainable, paid-for work.
 
Adrian Lees Message #123529, posted by adrianl at 03:32, 22/3/2015, in reply to message #123504
Member
Posts: 1565
Spent a few hours converting and uploading, amongst other things, a couple of articles about Aemulor that I wrote way back, for anyone who's interested and hasn't seen them. They describe the internal technical details of its operation, and what additions were made for Aemulor Pro. Originally written for Foundation RISC User magazine, and since published by CJE, they are accessible under the Aemulor section of the above site.
 
Jon Abbott Message #123530, posted by sirbod at 08:30, 23/3/2015, in reply to message #123529
Member
Posts: 563
Thanks for taking the time to do that, I'm off to have a read now, as it's an area you're aware I'm very involved in.

If you don't mind, I'll post any questions publicly on here, should anyone else also be interested in the technical side of what you've achieved.
 
Adrian Lees Message #123531, posted by adrianl at 04:46, 24/3/2015, in reply to message #123530
Member
Posts: 1565
Thanks for taking the time to do that, I'm off to have a read now, as it's an area you're aware I'm very involved in.

If you don't mind, I'll post any questions publicly on here, should anyone else also be interested in the technical side of what you've achieved.
No problem. May I ask, what are your goals?
 
Jon Abbott Message #123535, posted by sirbod at 03:49, 26/3/2015, in reply to message #123531
Member
Posts: 563
May I ask, what are your goals?
My goal is simple, get every piece of software/game ever written for Arthur up running seamlessly on RISCOS 5 and hopefully leave something of long term use to the community to ensure legacy software isn't lost.

By way of response, I've combined some questions I have around design decisions and general interest in your implementation with some information on how I implemented them in ADFFS.

I've read through both Technical explanation of the operation of Aemulor and Introduction to Aemulor Pro, which I hope are the documents you were referring to, and commented in chronological order...

the reader should understand that the development of Aemulor has been split almost 50/50 between the RO4 API emulation and the 26-bit CPU emulation. Contrary perhaps to the expectations of most people, emulating a 26-bit CPU on the 32-bit XScale has not been the most difficult part of developing Aemulor.
I agree with you about CPU emulation not being the most difficult part of development. In my case, I spent 6 months designing the JIT and codelets on paper and then coded the JIT core over a two week period. Since then, it's just had a few tweaks and the odd bug fix, but essentially it's remained unchanged.

I'd say my split is probably 5/10/85: 5% being the CPU core, 10% being IOC machine emulation or translation and 85% on RISCOS interaction.

SWIs and CPU flags / SWI API changes
This is an area I've really struggled with, I've been unable to find the kind of detail I require to implement this to date. I've reached out to the community for assistance, but have sadly had no responses. If you have any documentation on the subject you could share, I'd really appreciate it.

SWI flag preservation to date hasn't been a problem for games, but I know it will be an issue when I implement WIMP interaction and 26bit module support, so plan to implement in parallel.

SWI API changes have been a problem for a few games. It's been a case of trial and error running them on an A305/RO2 and gradually working my way up through hardware and OS revision until they break. I then have to pick the code apart until I find the offending SWI and start looking at the RISCOS source to see what's changed. It's a very time consuming process that would be so much easier with a concise list of changes between the various RISCOS versions.

ADFFS will eventually cover all RISCOS versions from Arthur right up to 4.x, so it's a big task in the long run. To date I've concentrated on providing a RO3.11 VMM as that covers the bulk of all games.

Internally, ADFFS acts as an IOMD machine, translating IOC/MEMC/VIDC to their IOMD equivalents. I took this approach to both allow physical IOMD machines to run IOC games natively and to save coding separate core VMMs later. I've yet to add any specific IOMD additions though, it's only translating what's required for IOC.

The 32-bit version supplied with the IYONIX pc does not provide support for the 26-bit procedure calling standard (APCS-R) used by 26-bit C programs. Aemulor therefore loads the 26-bit version that is supplied with the Castle C/C++ Development Suite, and hides the 32-bit one so that 26-bit applications see only the 26-bit SharedCLibrary.
Had I implemented 26bit module support up front, I probably would have taken the same approach. It's an area I'm going to revisit later as I want to provide APCS-A. For the time being, I translate calls to/from APCS-R / APCS-32.

Aemulor RMA
...
From RISC OS's perspective, the Aemulor RMA is a normal dynamic area, but Aemulor remaps the memory at an address below 64MB so that it becomes addressable within the 26-bit environment. Because this emulated RMA is visible to all applications, native 32-bit applications are also restricted to a maximum size of 28MB each (as per RISC OS 4) whilst Aemulor is running. It is hoped that this limitation can be removed with a later version.
This is an area I plan to resolve from the outset. It's a particular challenge in my case as the Modules are running natively on the CPU and not under CPU emulation.

The JIT has been designed to allow code to be relocated; at the minute that's fixed, with 0 being translated to 1200000, but I'll be making it dynamic for RMA support and will run Modules in a DA up high with stubs in the RMA for interaction with the system.

As you point out, due to the 26-bit barrier, its implementation is tied in with WIMP Application support, which breaks it when task switching.

A Simple Interpreter
The circular buffer is a very neat idea, how did you determine 32 instructions as being an optimal size?

Also, in practice, most self-modifying code actually modifies code that follows the current position (PC) rather than preceding it.
Very true, ADFFS isn't as optimal here as it scans and translates ahead up to 128 instructions, walking branches where possible. I figured that was a better approach as in the main, very few instructions actually self-modify so the benefits of running the translated code natively far outweighed the cost of having to re-encode the odd self-modifying code sequence.
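As a rough illustration of the scan-ahead idea (not ADFFS's actual code), the Python sketch below walks forward from a start address for up to 128 instructions and follows unconditional B branches. The `memory` dictionary, the function names and the choice to step over BL rather than follow it are all assumptions made for the example; only the 128-instruction limit comes from the post.

```python
MAX_SCAN = 128  # scan limit quoted in the post


def branch_target(addr, instr):
    """Return the target of an ARM B/BL at addr, or None if instr is not a branch."""
    if (instr >> 25) & 0b111 != 0b101:
        return None
    offset = instr & 0x00FFFFFF
    if offset & 0x00800000:          # sign-extend the 24-bit word offset
        offset -= 0x01000000
    # PC reads as addr + 8 on ARM; wrap within the 64MB 26-bit address space
    return (addr + 8 + (offset << 2)) & 0x03FFFFFF


def scan_ahead(memory, start, limit=MAX_SCAN):
    """List the addresses a translator would visit, following unconditional B."""
    visited = []
    seen = set()
    addr = start
    while len(visited) < limit and addr in memory and addr not in seen:
        instr = memory[addr]
        visited.append(addr)
        seen.add(addr)
        target = branch_target(addr, instr)
        always = (instr >> 28) == 0xE        # condition code AL
        link = bool(instr & (1 << 24))       # BL: assumed to be stepped over here
        if target is not None and always and not link:
            addr = target
        else:
            addr += 4
    return visited
```

Conditional branches are simply stepped over here; a real translator would have to decide which side of each one to walk.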

Legacy C code is a fine example here, where SWI instructions are written to memory prior to execution as CallASWI didn't exist back then.

A Faster Interpreter
A very neat solution to multiple loop sequences.

Hardware Breakpoints
Do you cache the breakpoint locations so it's not continually scanning ahead?

Avoiding the cache flush is a stroke of genius. ADFFS has to flush as I've opted to make no processor specific optimizations past SA, so it's future proofed. I've yet to get the Iyonix to flush correctly though, even calling OS_SynchroniseCodeAreas results in it writing utter rubbish to memory.

Just In Time Compilation
Do you cache any JIT sequences? Or are they disposed of once executed? I took the approach in ADFFS to leave them resident and only remove if the original instruction is self-modified or overwritten by OS_File etc.

Sound & Filing Systems
...
Emulating 26-bit voice generators poses some interesting technical challenges because Aemulor itself must be capable of running in IRQ mode with IRQs enabled
What were the challenges? I managed to avoid this issue as the voice code is running natively under ADFFS. The two challenges I did face were:

1. Ensuring no register corruption between the interaction of SoundDMA and the 26bit code and tied in with this potential IRQ stack corruption
2. Sample rate/buffer size mis-matches

For 1 I had to resort to studying the RISCOS source to determine which registers it was preserving and which were corruptable. It doesn't appear this was ever documented fully, so I took the opportunity to update the Wiki in parallel to coding/testing.

Issue 2 was simple to resolve by implementing a circular audio buffer that's topped up at a faster rate where required. Where I came unstuck was in not realising the tuning in RO5 was seriously breaking legacy audio. Once I'd established that, it was just a case of turning Auto Tuning off when providing a legacy environment.
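A minimal sketch of the circular-buffer idea in Python, assuming a producer (the legacy voice code) that tops the buffer up in bursts and a consumer (the sound system) that drains fixed-size blocks. The class and method names are invented for illustration and say nothing about how ADFFS lays out its buffer.

```python
class RingBuffer:
    """Fixed-capacity circular buffer of audio samples."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.capacity = capacity
        self.read = 0     # index of the oldest unread sample
        self.count = 0    # number of unread samples

    def write(self, samples):
        """Append as many samples as fit; return the number written."""
        written = 0
        for sample in samples:
            if self.count == self.capacity:
                break
            self.data[(self.read + self.count) % self.capacity] = sample
            self.count += 1
            written += 1
        return written

    def read_block(self, n):
        """Remove up to n samples, padding with silence (0) on underrun."""
        out = []
        for _ in range(n):
            if self.count:
                out.append(self.data[self.read])
                self.read = (self.read + 1) % self.capacity
                self.count -= 1
            else:
                out.append(0)
        return out
```

Topping up "at a faster rate" then amounts to calling `write` more often than `read_block`, so the consumer rarely sees an underrun.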

Screen mode emulation
I'm interested to understand why you went with 8bpp. I initially did the same but later switched to 24bit to provide T1 based palette swapping support. I did code a very neat 8bpp routine that changed the 8bpp palette in realtime, but due to overheads in the mailbox on the Pi had to drop it. I may yet have to go back to this approach if I'm to get Iyonix support working, but from the tests I've done the 24bpp routine seems to happily run at 120fps on the Iyonix.

Until I get a game running on Iyonix (having resolved the cache flush issue), I'm unsure of the performance impact.

The emulated screen modes are implemented by telling RISC OS to use a different area of memory for the screen image, so that the full graphics capabilities of the RISC OS kernel are still available, including scrolling and bankswitching. This ensures much better compatibility than would be achieved by intercepting all graphics operations and reimplementing them ourselves.
..
This conversion is done up to 15 times per second to ensure a smooth update, which is particularly important for games. A future version of Aemulor Pro may use a different approach to screen updating but the idea is currently unproven so I won't describe it here.
GraphicsV was being implemented by Jeffrey whilst I was planning this, so I had the luxury of working with Jeffrey to resolve some of the issues around palette swapping and mailbox delays. For the buffer I opted to leave RISCOS using DA2 and simply blit it to the GPU at the frame rate the game was running at. GraphicsV was a godsend here as it easily allows VSyncs and DAG sets to be intercepted, making RISCOS believe it's running on a machine that uses DA2.

To determine the framerate, I used a combination of counting physical and virtual VSyncs and tracking OS_Byte 19 / 112 calls. This was implemented right up front in the very early versions of ADFFS, before I even considered Pi support, so only needed a few tweaks to handle double buffering on the GPU as well as DA2. It updates in a PLL fashion to avoid tearing and drops frames where required. In reality you end up with microstutter if the monitor isn't running at 50Hz, but it's a small price to pay considering the alternative tearing or slow game update - most require 50Hz.

Hardware emulation
We've probably taken the same approach here. I'm interested to know how much you ended up emulating: did you go for full emulation or just what was required? Do you, for example, provide T0, T2 and T3?

I don't yet have the detail to provide some of IOMD, but I'm fairly confident I won't need to. I don't expect any software written for the RiscPC to be touching hardware directly, games written for RiscPC seem to only touch VIDC20 and T1 directly which is already covered by ADFFS.

What was your experience in this area?

One hardware feature currently not emulated by Aemulor Pro is the screen configuration registers of the VIDC and VIDC20 devices. These are manipulated by a small number of games to achieve some scrolling effects, so it's anticipated that they too will be emulated in a future release of Aemulor Pro.
This was designed into ADFFS early on to allow RiscPCs to run IOC games natively. The only difference on the Iyonix/Pi is that instead of writing the value to VIDC20, it writes it to a VIDC20 register soft-copy. The blitter then uses these to determine the screen geometry and issues GraphicsV Set Mode to something suitable (ie 32 pixel aligned), fixing up odd-width screens within the blitter.

It's fairly light weight and doesn't really impact system performance at all; it's still work in progress, with the next release of ADFFS containing a large update. I've yet to implement hardware cursor support but for 99.9% of games it's accurate.

For hardware cursor support, I'm considering options around including it in the blitter or making use of the actual hardware. I've yet to look in any detail to see if the Iyonix and Pi can for example handle full screen height cursors.
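To make the "32 pixel aligned" step concrete, here is a small Python sketch that rounds a game's requested width up to the next multiple of 32 and reports how many padding pixels the blitter would have to account for. The function name and the idea of returning the padding are assumptions for illustration only.

```python
ALIGNMENT = 32  # pixel alignment mentioned in the post


def aligned_mode_width(requested_width, alignment=ALIGNMENT):
    """Return (mode_width, padding) for a requested width in pixels."""
    mode_width = (requested_width + alignment - 1) // alignment * alignment
    return mode_width, mode_width - requested_width
```

A 320-pixel mode needs no fix-up, whereas an odd width such as 360 pixels would be placed in a 384-pixel mode with 24 pixels of padding.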

Just In Time Compilation
..
The inclusion of compilation in Aemulor Pro is particularly exciting because it opens up the possibility of many more optimisations such as substituting XScale-tuned equivalent code or increasing the performance of floating point maths by avoiding the repeated decoding of instructions. Whilst it should be stated that JIT compilation will not work for all applications, it is expected to work for the vast majority of StrongARM-aware programs, and the other emulation engines will always be available for those applications which cannot be used with the compilation.
How far did you get with this and how were you planning to resolve the self-modifying code problem?

The route I took was to change memory pages to Read only if they contained an instruction and handle writes/self-modifying code in an Abort handler. It's fairly impressive to watch the Pi handling millions of Aborts a second - although obviously not the most optimal approach, it doesn't seem to impact game speed. Zarch for example will run at 700fps unthrottled and trigger 1.1m Abort/sec.
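The write-protect scheme can be modelled in a few lines of Python. This is only a simulation of the control flow, assuming a 4KB page size and invented names throughout; on real hardware the "abort" is a data abort raised by the MMU, not a method call.

```python
PAGE_SIZE = 4096  # assumed page size for the model


class ProtectedMemory:
    """Model of pages made read-only once they hold translated code."""

    def __init__(self):
        self.words = {}
        self.code_pages = set()   # page numbers currently marked read-only
        self.translated = set()   # addresses that have a JIT translation
        self.aborts = 0

    def mark_translated(self, addr):
        """Record a translation and write-protect the page containing it."""
        self.translated.add(addr)
        self.code_pages.add(addr // PAGE_SIZE)

    def write(self, addr, value):
        """Store a word, taking the abort path if the page is protected."""
        if addr // PAGE_SIZE in self.code_pages:
            self._abort_handler(addr, value)
        else:
            self.words[addr] = value

    def _abort_handler(self, addr, value):
        self.aborts += 1
        if addr in self.translated:
            # Self-modifying code: the old translation is now stale
            self.translated.discard(addr)
        self.words[addr] = value  # proxy the write on the program's behalf
```

Most aborts in this model are plain data writes that happen to share a page with code, which matches the observation that true self-modification is rare even when the abort rate is high.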

Ironically, the JIT is so fast I need to look at throttling it back somehow. Stalling on VSync is not an option for some games, although it does allow me to drop the CPU into a low power state for large chunks of the time.

I'm going to look at locking the Abort handler core into the cache, but it requires RISCOS extensions to do that as I don't want to get into CPU specific optimizations that will change with later chips. If it's in RISCOS, its future proofed. The Abort handler isn't particularly big, I only need to lock in the code that determines if it's self-modifying code, so it can proxy any writes quickly. Self-modifying support can remain outside of the cache as it's unlikely to occur in most programs.

Task display
What was the deciding factor of using a separate Task display vs letting RISCOS display them in the regular Task manager?

Improved handling of low-resolution screen modes
..
To solve these problems, Aemulor Pro will in future be capable of rescaling the screen image in software so that the monitor can be operated in a suitable higher-resolution, higher frame-rate mode, whilst the game sees the resolution and frame rate that it expects. The user interface will be extended to allow the user to specify which native screen mode should be used for each emulated mode.
Steve coded a very good blitter to do exactly this, we've yet to use it though as Iyonix support is only now getting any real focus. With the Pi we have the luxury of a GPU to handle the scaling and I've utilized that fully to avoid any kind of scaling done in software.

Control of the emulation speed
I'm interested to know what methods you used here. As mentioned above, ADFFS in the most part stalls at OS_Byte 19 / 112 and OS_Word 21, 0 although I'm looking at other techniques for games that rely on slow processor speed alone.

Windowed emulation of single-tasking applications
This is something I plan to implement via a RISCOS Hypervisor. It's actually a fairly trivial piece to implement as I just need to put a sprite header on the front of the GPU frame and can leave the OS to handle the rest. Provided RISCOS is eventually extended to use the GPU for scaling sprites its performance won't be an issue, scaling will be smoothed and it will be future proofed.
 
Adrian Lees Message #123541, posted by adrianl at 10:48, 29/3/2015, in reply to message #123535
Member
Posts: 1565
Bearing in mind that Aemulor is now 12 years old, and that I've written an awful lot of other code since then, some of the details, particularly about the rationale for design decisions, may be lost in the mists of time, but I hope the following is of some use:


SWIs and CPU flags / SWI API changes
This is an area I've really struggled with, I've been unable to find the kind of detail I require to implement this to date. I've reached out to the community for assistance, but have sadly had no responses. If you have any documentation on the subject you could share, I'd really appreciate it.
I'm afraid I don't think there really is a definitive list anywhere; some changes were, I'm sure, not listed even in the pretty minimal documents that Castle Technology did supply. There was some documentation, but I had to discover quite a lot by simply testing and reverse-engineering the software.

Also, I rather suspect that your requirements will be somewhat different to those of Aemulor, because one key design decision in the current versions of Aemulor was never to change an address; all code and data is deliberately kept at the address for which it was built. This both speeds up the emulation, and sidesteps the massive undertaking of having to relocate all addresses passed to 32-bit code, including those contained within data structures.

If it's any help, SWIs that Aemulor treats specially may be grouped into the following categories:

- SWIs which need special handling of CPU flags upon return, because they actually need to return information via the flags (I ascertained this list by reading through the PRMs from start to finish!)

- SWIs which have had their API redefined because they did not accept a full 32-bit address in one or more registers; rather they used some bits of the register for another purpose (OS_ReadLine, OS_HeapSort, OS_SubstituteArgs, OS_File)

- SWIs which accept the address of some code, and thus must be intercepted by Aemulor (so that it can intercept callbacks into 26-bit code; may not affect ADFFS). I think there are far more SWIs in this category.
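For the second category in the list above, the underlying problem is a register that carries both an address and some flag bits. The Python sketch below splits such a value into its two parts. The layout used (flags in the top two bits) is purely illustrative and the function name is invented; the real bit assignments differ per SWI and should be taken from the PRMs.

```python
def split_packed(reg, flag_bits=2):
    """Split a 32-bit register value into (address, flags).

    Assumes, for illustration only, that the flags occupy the top
    `flag_bits` bits and the address occupies the remainder.
    """
    reg &= 0xFFFFFFFF
    shift = 32 - flag_bits
    flags = reg >> shift
    address = reg & ((1 << shift) - 1)
    return address, flags
```

Once addresses may legitimately use all 32 bits there is no room left for the flags, which is why such calls needed a redefined API.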

The circular buffer is a very neat idea, how did you determine 32 instructions as being an optimal size?
Simple performance measurements on some key test applications, combined with the typical engineer's bias for powers of two ;-)
(Actually there is good reason for using a power-of-two, because it makes the buffer size more easily represented as a rotated ARM constant, and aligns all the cache sets nicely.)
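Both halves of that remark can be checked with a short Python sketch. An ARM data-processing immediate is an 8-bit value rotated right by an even amount, so the first function searches for such an encoding; the second shows the mask-based wrap that a power-of-two buffer size allows. The function names are invented for the example.

```python
def encode_arm_immediate(value):
    """Return (rot, imm8) such that imm8 rotated right by 2*rot equals value.

    Returns None if the value cannot be encoded as an ARM immediate.
    """
    value &= 0xFFFFFFFF
    for rot in range(16):
        shift = 2 * rot
        # Rotating left by `shift` undoes the CPU's rotate-right
        imm8 = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if imm8 < 256:
            return rot, imm8
    return None


def wrap_index(index, size=32):
    """Wrap an index into a power-of-two sized circular buffer."""
    return index & (size - 1)
```

Any single power of two is encodable because it has only one set bit, whereas a value such as 0x101 is not, since its set bits do not fit within any 8-bit window.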

Legacy C code is a fine example here, where SWI instructions are written to memory prior to execution as CallASWI didn't exist back then.
True, but I presume that you've realised the SWI value is picked up from the instruction by a data read, so there's no need to clean caches in this case, even on the StrongARM/XScale Harvard architecture. This is also true for FP instructions, provided that all the FP maths is being performed by the FPEmulator (ie. no FPA), but you must ensure that the slot is pre-filled with an instruction that is guaranteed to raise an Undefined Instruction exception.
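To illustrate why the data read matters: the SWI number lives in the low 24 bits of the instruction word, and the handler recovers it by loading that word from memory and masking it, so it sees whatever was last written there. The Python sketch below performs the same decode; the function name is invented, and bit 17 is the RISC OS "X" (error-returning) bit.

```python
SWI_NUMBER_MASK = 0x00FFFFFF  # low 24 bits hold the SWI number
X_BIT = 1 << 17               # RISC OS error-returning ("X") form


def decode_swi(instr):
    """Return (swi_number_without_x, is_x_form), or None if not a SWI."""
    if (instr >> 24) & 0xF != 0xF:
        return None
    number = instr & SWI_NUMBER_MASK
    return number & ~X_BIT, bool(number & X_BIT)
```

For example, the word 0xEF020006 decodes as the X form of SWI 6 (OS_Byte).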

Do you cache the breakpoint locations so it's not continually scanning ahead?
Aemulor has 3 engines; the ARM610-compatible interpreter engine and the StrongARM-compatible breakpointing approach both make very heavy use of caching to improve performance and avoid instruction scanning/decoding. In fact I recently measured the instruction execution:decode ratio as 3400:1. The third engine, labelled the 'ARM3' engine, uses minimal caching to achieve maximum compatibility with older, less well-behaved software.

Avoiding the cache flush is a stroke of genius. ADFFS has to flush as I've opted to make no processor specific optimizations past SA, so it's future proofed. I've yet to get the Iyonix to flush correctly though, even calling OS_SynchroniseCodeAreas results in it writing utter rubbish to memory.
Avoiding the expensive cache clean (technically different from a flush, in ARM terminology) was another key design decision, for performance reasons.

I'm unaware of any problems with the OS_SynchroniseCodeAreas SWI, and I think a lot of stuff would be broken if there were an implementation fault, so - with respect - it's probably something in your code. One oddity of this API call is that the end address is inclusive.
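Because the end address is inclusive, an off-by-one-word error is easy to make when converting a (start, length) pair. The Python sketch below computes the register values for a ranged call, assuming the layout I believe the PRM gives (R0 bit 0 set for a range, R1 the low address, R2 the word-aligned address of the last word); that layout and the function name are assumptions to check against the PRM before use.

```python
def sync_range_registers(start, length):
    """Return (r0, r1, r2) for a ranged OS_SynchroniseCodeAreas call.

    Assumed layout: r0 bit 0 set selects a range, r1 is the low address
    and r2 is the inclusive, word-aligned address of the last word.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    low = start & ~3
    high = (start + length - 1) & ~3  # last word touched, not one past it
    return 1, low, high
```

For four instructions at 0x8000 this gives a high address of 0x800C rather than 0x8010.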

Do you cache any JIT sequences? Or are they disposed of once executed?
I don't really have a current JIT engine. To be honest the prototype that I wrote early in Aemulor's development was found to have disappointing performance, and it was then superseded by the breakpointing approach. It's something that I may revisit for later CPU architectures that do not make available the debug facilities that Aemulor requires.

Re IRQ handlers:
What were the challenges?
The chief complication here is handling reentrancy; if you have even a single instruction out of place and don't consider that your engine can be interrupted and re-entered at any point, then you will get crashes, and they will not be fun to track down. There is the additional complication that IRQ mode register(s) are corrupted by the execution of interrupt handlers on RISC OS; SPSR_irq certainly, and you may want to check R14_irq as well (I forget about that one). Your reply, however, suggests that you may already be aware of the details.

I'm interested to understand why you went with 8bpp.
Quite simply, performance: pushing less data out to the graphics card, and the 256-entry palette can obviously represent all the colours required for lower-depth modes. The main requirement for low-bpp modes in Aemulor was supporting Sibelius, which maintains the current desktop mode rather than using lower-resolution custom modes as games do. This means that the amount of data being shuffled around can be enormous. (This monitor is running at 2048 x 1280, for example!)

I'm interested to know how much you ended up emulating, did you go for full emulation or just what was required. Do you for example provide T0, T2 and T3
I think I'm right in saying that the IYONIX pc offers only two timers via its HAL_ interface. A quick perusal of the source shows that I only support timers 0 and 1, but that has never been a problem.

I don't expect any software written for the RiscPC to be touching hardware directly, games written for RiscPC seem to only touch VIDC20 and T1 directly which is already covered by ADFFS.
I do implement emulation of most IOMD/IOC registers; they are very similar devices, but remember that my main focus was RiscPC-era software, so I considered IOMD more important, and IOC compatibility a 'bonus.'

Useful resources, by the way, which you may be able to pick up on ebay, for example. I'm sure that they're all out of print now:

Acorn RISC Machine Family Data Manual (VLSI Technology Inc, Prentice Hall, ISBN 0-13-781618-9)

Technical Reference Manuals for A3000 and RiscPC.

How far did you get with this and how were you planning to resolve the self-modifying code problem?
Self-modifying code isn't all that common in the applications that people want to run under Aemulor, because they have for the main part been updated for StrongARM-compatibility at least. It would be supported, but I do not expect it to be inexpensive in terms of performance.

The route I took was to change memory pages to Read only if they contained an instruction and handle writes/self-modifying code in an Abort handler. It's fairly impressive to watch the Pi handling millions of Aborts a second - although obviously not the most optimal approach, it doesn't seem to impact game speed. Zarch for example will run at 700fps unthrottled and trigger 1.1m Abort/sec.
Am I to presume that you then have a problem with self-modifying privileged-mode code? If you're performing the write by proxy in an abort handler, that is presumably only for USR mode aborts. If you are actually changing the page attributes and then performing the write, you'll find a limited-range cache clean of the instruction(s) is surely cheaper, because otherwise the OS will have to clean the page tables out to memory, as well as flushing the TLBs.

What was the deciding factor of using a seperate Task display vs letting RISCOS display them in the regular Task manager?
Well, it just provides a bit more information about the 26-bit tasks, showing the per-application breakdown of the memory being used by Aemulor (which with a JIT could be variable), confirmation of the emulation engine being used for each application, and the status of the 26-bit RMA.

Control of the emulation speed
I'm interested to know what methods you used here. As mentioned above, ADFFS in the most part stalls at OS_Byte 19 / 112 and OS_Word 21, 0 although I'm looking at other techniques for games that rely on slow processor speed alone.
To be honest, I don't think I put too much effort into that, adopting the attitude that well-written games would be VSync-driven using those OS_Byte calls, and that provided suitable (/close-enough) screen modes are employed, with the appropriate frame rate, most games should be okay.

This is something I plan to implement via a RISCOS Hypervisor.
I wish you luck and hope that you have the time. I, alas, do not have the luxury of lavishing extended amounts of time on RISC OS software development, although I too have an interest in hypervisors, having just been reading up on the features of the later ARM architecture versions, including their VM support.
 
Jon Abbott Message #123543, posted by sirbod at 00:20, 1/4/2015, in reply to message #123541
Member
Posts: 563
I rather suspect that your requirements will be somewhat different to those of Aemulor, because one key design decision in the current versions of Aemulor was never to change an address; all code and data is deliberately kept at the address for which it was built. This both speeds up the emulation, and sidesteps the massive undertaking of having to relocate all addresses passed to 32-bit code, including those contained within data structures.
I spent the best part of a year pondering over the best route to take. Having decided early on to go with a JIT I had to devise a way of allowing self-modifying code to continue to work and still allow the JIT to recode instructions.

The only viable solution that allowed both of these was to split data and code into separate memory regions. The data remains at its original address, along with the original code, and the JIT code is located high. This also neatly avoids having to flush two regions of memory where instructions self-modify, as the original code is always treated as data from its original location and can remain in the D cache without a clean.

The only changes the JIT makes within its version of the code is to alter PC whenever it's touched so it remains within JIT appspace. R14 etc all remain pointing at the original addresses with the PSR in the registers.
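For readers unfamiliar with the 26-bit layout, "the PSR in the registers" refers to R15 (and any R14 copied from it) holding the program counter in bits 2 to 25, the NZCVIF flags in bits 26 to 31 and the processor mode in bits 0 to 1. The Python sketch below separates the three fields; the function name is invented for illustration.

```python
PC_MASK = 0x03FFFFFC    # bits 2-25: word-aligned program counter
MODE_MASK = 0x00000003  # bits 0-1: processor mode (0=USR ... 3=SVC)


def split_r15(r15):
    """Return (pc, flags, mode) from a combined 26-bit PC/PSR value.

    `flags` is the six-bit NZCVIF field, with N as the top bit.
    """
    r15 &= 0xFFFFFFFF
    return r15 & PC_MASK, r15 >> 26, r15 & MODE_MASK
```

Keeping return addresses in this form means 26-bit code that reads flags out of R14 continues to see what it expects, while only the live PC is redirected.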

The requirements are however identical to Aemulor on the SWI front as it needs to provide what is effectively a paravirtualised RISCOS so it looks like RISCOS x.y - which is configurable on an app by app basis.
If it's any help, SWIs that Aemulor treats specially may be grouped into the following categories:

- SWIs which need special handling of CPU flags upon return, because they actually need to return information via the flags (I ascertained this list by reading through the PRMs from start to finish!)
Same here - arduously reading the PRM from start to finish.
- SWIs which have had their API redefined because they did not accept a full 32-bit address in one or more registers; rather they used some bits of the register for another purpose (OS_ReadLine, OS_HeapSort, OS_SubstituteArgs, OS_File)
This bit I'm currently missing. I've been unable to find any concise documentation of what changed when RISCOS went 32bit.
- SWIs which accept the address of some code, and thus must be intercepted by Aemulor (so that it can intercept callbacks into 26-bit code; may not affect ADFFS). I think there are far more SWIs in this category.
ADFFS is probably identical to Aemulor in this respect. All entry/exit points are managed, so all vector claims, callbacks, service entries, environment handlers, CLib etc. CLib is treated as one big codelet which translates APCS-X into APCS-32
Legacy C code is a fine example here, where SWI instructions are written to memory prior to execution as CallASWI didn't exist back then.
True, but I presume that you've realised the SWI value is picked up from the instruction by a data read, so there's no need to clean caches in this case, even on the StrongARM/XScale Harvard architecture. This is also true for FP instructions, provided that all the FP maths is being performed by the FPEmulator (ie. no FPA), but you must ensure that the slot is pre-filled with an instruction that is guaranteed to raise an Undefined Instruction exception.
I hadn't actually considered that, but now you mention it SWIs are a special case. Provided there's an SWI there to start with there's no requirement to flush the I cache. Ironically though, it's probably quicker to invalidate one I cache line instead of tracking all instructions in the cache line for changes. Admittedly, on StrongARM and Iyonix where there isn't the luxury of individual I invalidation, examining the instruction would be the better route in the Abort handler which deals with self-modifying code.

In the next release of ADFFS, I've got the cache clearing/invalidation quite optimal. With the exception of long code runs where a full invalidation is quicker, it rarely invalidates the cache and just invalidates a few I lines. Taking Zarch as an example, it doesn't invalidate the whole I cache at all.
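To illustrate the point about the SWI number being fetched by a data read, here is a minimal sketch (in Python for readability, not how either Aemulor or ADFFS does it) of decoding the instruction word that a SWI handler would load from the return address minus 4. The function names are made up for the example; the masks follow the ARM instruction encoding and the RISC OS convention that bit 17 of the SWI number selects the error-returning X form.

```python
# Illustrative only: decode an ARM-state SWI instruction word.
SWI_OPCODE_MASK = 0x0F000000  # bits 24-27 are all ones for a SWI
SWI_NUMBER_MASK = 0x00FFFFFF  # 24-bit field holding the SWI number
X_BIT = 0x00020000            # RISC OS convention: bit 17 = X (error-returning) form


def is_swi(instr):
    """True if the 32-bit word encodes a SWI, whatever its condition code."""
    return (instr & SWI_OPCODE_MASK) == SWI_OPCODE_MASK


def swi_number(instr):
    """Return the SWI number exactly as a handler would read it from memory."""
    if not is_swi(instr):
        raise ValueError("not a SWI instruction: %#010x" % instr)
    return instr & SWI_NUMBER_MASK


def strip_x_bit(number):
    """Map an X-form SWI number onto its base SWI number."""
    return number & ~X_BIT
```

Because the number comes from an ordinary load, whatever was last written to that word is what the handler sees (subject to the usual data-side write ordering), which is why no instruction cache maintenance is needed when only the SWI number changes.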

Do you cache the breakpoint locations so it's not continually scanning ahead?
Aemulor has 3 engines; the ARM610-compatible interpreter engine and the StrongARM-compatible breakpointing approach both make very heavy use of caching to improve performance and avoid instruction scanning/decoding. In fact I recently measured the instruction execution:decode ratio as 3400:1. The third engine, labelled the 'ARM3' engine, uses minimal caching to achieve maximum compatibility with older, less well-behaved software.
I've not measured the execution:decode ratio of ADFFS, as it's a one-time occurrence. I'm working on improving the number of instructions decoded per JIT entry. It was 1:4 and is now up to 1:10. The figure is somewhat skewed though, as large chunks of code are initially decoded (up to 128) and the figure is then pulled down by conditional branches. To improve the ratio I've implemented branch prediction of up to 8 branches.

Essentially, the more instructions the JIT can decode in the initial run, the more efficient it is.

I've yet to get the Iyonix to flush correctly though; even calling OS_SynchroniseCodeAreas results in it writing utter rubbish to memory.
Avoiding the expensive cache clean (technically different from a flush, in ARM terminology) was another key design decision, for performance reasons.

I'm unaware of any problems with the OS_SynchroniseCodeAreas SWI, and I think a lot of stuff would be broken if there were an implementation fault, so - with respect - it's probably something in your code. One oddity of this API call is that the end address is inclusive.
It's undoubtedly my code, but I can't fathom why it only affects the Iyonix. It's the same codebase across StrongARM, Iyonix and Pi, so they should all act the same.
The decoder reads the instruction correctly, writes it correctly and then flushes the cache via OS_SynchroniseCodeAreas (previously I had my own flush routines and switched to rule them out), but the result is total rubbish in RAM when the write buffer is flushed. Well... I say total rubbish, it looks more like bit-rot as random bits are altered. I'm beginning to suspect an erratum.
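Since the inclusive end address is easy to get wrong, here is a small sketch of computing the register values for a ranged OS_SynchroniseCodeAreas call. It assumes the convention as I understand it from the documentation (R0 bit 0 set selects a range, R1 is the word-aligned low address, R2 is the word-aligned high address, inclusive); the helper name is made up for the example and nothing here is taken from ADFFS.

```python
def sync_code_range(start, length):
    """Return (r0, r1, r2) covering the bytes [start, start + length).

    Assumed convention: r0 bit 0 set = ranged call, r1 = low address,
    r2 = high address, both word aligned, with r2 inclusive (it is the
    address of the last word to synchronise, not one past it).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    low = start & ~3                  # round down to a word boundary
    high = (start + length - 1) & ~3  # word holding the last modified byte
    return (1, low, high)
```

For a single rewritten instruction at &8000 this gives R1 = R2 = &8000; passing &8004 as the end would synchronise one word more than intended, which is harmless but worth knowing when comparing against hand-rolled flush routines.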

Do you cache any JIT sequences? Or are they disposed of once executed?
I don't really have a current JIT engine. To be honest, the prototype that I wrote early in Aemulor's development was found to have disappointing performance, and it was then superseded by the breakpointing approach. It's something that I may revisit for later CPU architectures that do not make available the debug facilities that Aemulor requires.
The breakpoint approach is a good compromise between full emulation and a JIT as you get the best of both worlds, namely native execution of the bulk of instructions and emulation of the problematic ones.

The ratio of native:problematic instructions I've measured seems to be pretty stable at around 9:1 across all the software I've tested, so 90% of the instructions remain untouched.

Re IRQ handlers:
What were the challenges?
The chief complications here are handling reentrancy; if you have even a single instruction out of place and don't consider that your engine can be interrupted and re-entered at any point, then you will get crashes, and they will not be fun to track down. There is the additional complication that IRQ mode register(s) are corrupted by the execution of interrupt handlers on RISC OS; SPSR_irq certainly, and you may want to check R14_irq as well (I forget about that one). Your reply, however, suggests that you may already be aware of the details.
Yes...many hours of instruction tracing to find one offending instruction. Probably identical to yourself. I never figured out how to use JTAG so did it the hard way, which wasn't fun.
The solution I came up with was somewhat novel. I agonised over how to handle re-entrancy for the best part of a year, considering modifying every codelet to handle re-entrancy, implementing new instructions to turn IRQs on/off around sensitive instructions etc. The solution I ended up with was to simply sit on the IRQ vector and examine the address being interrupted. If it's within the JIT codelet area it takes a copy of the codelet local variables and lets the IRQ proceed. On return, it puts the codelet variables back.

99.9% of IRQs then go through unhindered, with the odd one triggering the re-entrancy handler. This avoided adding any re-entrancy handling into the codelets, which would slow them down by a large factor as they'd have to either enable/disable IRQs and wait for any pending to complete or have their own stacks - both of which would have a dramatic impact on JIT translated code.
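The save/restore idea described above can be modelled in a few lines. This is only a sketch of the control flow in Python: the address bounds, the class and the dictionary of "codelet locals" are all invented for illustration, and a real handler would of course be ARM code sitting on the IRQ vector.

```python
# Hypothetical bounds of the JIT codelet area (not ADFFS's real addresses).
CODELET_BASE = 0x00100000
CODELET_END = 0x00110000


class ReentrancyGuard:
    """Model of an IRQ-vector claimant that protects codelet scratch state."""

    def __init__(self, codelet_locals):
        self.codelet_locals = codelet_locals  # shared scratch variables
        self.saves = 0                        # how often the slow path ran

    def on_irq(self, interrupted_pc, run_irq_handlers):
        # Fast path: the interrupted code was not a codelet, pass straight on.
        if not (CODELET_BASE <= interrupted_pc < CODELET_END):
            run_irq_handlers()
            return
        # Slow path: snapshot the codelet variables, let the IRQ run (it may
        # re-enter the JIT and overwrite them), then put them back.
        snapshot = dict(self.codelet_locals)
        self.saves += 1
        try:
            run_irq_handlers()
        finally:
            self.codelet_locals.clear()
            self.codelet_locals.update(snapshot)
```

The attraction of this design is that the cost is paid only on the rare interrupt that lands inside a codelet, rather than in every codelet on every execution.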

I'm interested to understand why you went with 8bpp.
Quite simply performance; pushing less data out to the graphics card, and the 256-entry palette can obviously represent all the colours required for lower-depth modes. The main requirement for low-bpp modes in Aemulor was supporting Sibelius, which maintains the current desktop mode rather than using lower-resolution custom modes as games do. This means that the amount of data being shuffled around can be enormous. (This monitor is running at 2048 x 1280, for example!)
That's a very valid point, at the moment ADFFS only supports the original Archimedes MODE's as I've concentrated on gaming - apart from Sibelius, all the requests I've received have been to get games working on the Pi. I figured there's a working solution in Aemulor anyhow, so why re-invent the wheel.

I'm considering options for how to handle things under the WIMP. The front runner at the minute is to use Sprites in the GPU buffer and let the OS handle it when Windowed and the GPU handle it when full screen, but I suspect performance may be an issue on the Sprite front.
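On the 8bpp point, a quick worked example of the data volumes involved, plus a sketch of expanding a packed 4bpp row into one byte per pixel so that each pixel value indexes the first 16 entries of a 256-entry palette. The function names are invented, the low-nibble-first pixel order is an assumption, and this is not taken from Aemulor's source.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Bytes needed for one full frame at the given depth."""
    return width * height * bits_per_pixel // 8


def expand_4bpp_to_8bpp(packed):
    """Expand packed 4bpp pixels to one byte per pixel.

    Assumes the leftmost pixel of each pair is held in the low nibble.
    """
    out = bytearray()
    for byte in packed:
        out.append(byte & 0x0F)  # first pixel of the pair
        out.append(byte >> 4)    # second pixel of the pair
    return bytes(out)


# One 2048 x 1280 frame: 2.5 MiB at 8bpp against 10 MiB at 32bpp.
assert frame_bytes(2048, 1280, 8) == 2621440
assert frame_bytes(2048, 1280, 32) == 10485760
```

At the desktop resolution quoted above, the 8bpp target moves a quarter of the data of a 32bpp one for every full-screen update.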

I'm interested to know how much you ended up emulating: did you go for full emulation or just what was required? Do you, for example, provide T0, T2 and T3?
I think I'm right in saying that the IYONIX pc offers only two timers via its HAL_ interface. A quick perusal of the source shows that I only support timers 0 and 1, but that has never been a problem.
I'm only handling T1 currently. Although it has support for T0 it's turned off until I see something requiring it. T2/T3 are also supportable although will only be required for a full Hypervisor so I've left them out.

I don't expect any software written for the RiscPC to be touching hardware directly, games written for RiscPC seem to only touch VIDC20 and T1 directly which is already covered by ADFFS.
I do implement emulation of most IOMD/IOC registers; they are very similar devices, but remember that my main focus was RiscPC-era software, so I considered IOMD more important, and IOC compatibility a 'bonus.'
I started on the premise that as Aemulor had IOMD covered, I should start on IOC / Arthur and work up. I get a lot of requests for IOMD support though, so am looking to implement the bare minimum required to get software working. As you point out, IOC/IOMD overlap so I probably don't need to add much specific IOMD support to get later software running.

Useful resources, by the way, which you may be able to pick up on ebay, for example. I'm sure that they're all out of print now:

Acorn RISC Machine Family Data Manual (VLSI Technology Inc, Prentice Hall, ISBN 0-13-781618-9)

Technical Reference Manuals for A3000 and RiscPC.
I've managed to collect most if not all TRM's. Back in 1988/89 I received a lot of assistance from Acorn/VLSI whilst coding Podule drivers and sample playback routines, so have an original 1st print of the VLSI manual. Absolute gold dust, but there are errata in it I have to be careful of!

Self-modifying code isn't all that common in the applications that people want to run under Aemulor, because they have for the main part been updated for StrongARM-compatibility at least. It would be supported, but I do not expect it to be inexpensive in terms of performance.
Self-modifying code was the largest factor in my JIT design, and the reason it took six months to define. Just about every other game from 1987-89 uses it in one form or other, so I had to come up with a solution that wasn't a major performance impact when running the code natively.

Emulation was ruled out immediately due to the high ratio of decode:execute, I wanted a solution that allowed all code to run natively but still support self-modifying.

The route I took was to change memory pages to read-only if they contained an instruction and handle writes/self-modifying code in an Abort handler. It's fairly impressive to watch the Pi handling millions of Aborts a second. Although obviously not the most optimal approach, it doesn't seem to impact game speed. Zarch, for example, will run at 700fps unthrottled and trigger 1.1m Aborts/sec.
Am I to presume that you then have a problem with self-modifying privileged-mode code? If you're performing the write by proxy in an abort handler, that is presumably only for USR mode aborts. If you are actually changing the page attributes and then performing the write, you'll find a limited-range cache clean of the instruction(s) is surely cheaper, because otherwise the OS will have to clean the page tables out to memory, as well as flushing the TLBs.
Privileged-mode code should be an issue, but it's not proved to be yet. Out of 100+ games I've tested only two switched to SVC and remained in it; the rest only switch to SVC to write to either the IRQ vector, IOC, VIDC or MEMC. IOC/VIDC/MEMC I can ignore: as there's no memory at those locations, the write triggers an Abort. The IRQ vector (the whole of page zero actually) is captured by code analysis on 1st pass and writes are passed to a page zero translator. Once we have vectors high by default, I plan to strip all this code out and use the Abort handler to translate reads - which is an order of magnitude quicker.

The TLB entries are only changed once...when the JIT sees an instruction in the page. From there on, they remain read-only to USR and RW to privileged code. I wasn't clear enough on that point previously. This allows ADFFS to simply perform the write and the original to trigger an Abort.
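The read-only page scheme described above can be modelled roughly as follows. This is a toy Python simulation of the flow only (mark the page once when the JIT first sees an instruction in it, let USR-mode writes abort, perform the write by proxy and discard any translation for that address); every name in it is invented and the real thing lives in page tables and an ARM Abort handler.

```python
PAGE_SIZE = 4096


class GuestMemory:
    """Toy model of write-protected code pages with a write-by-proxy abort path."""

    def __init__(self):
        self.words = {}           # address -> value
        self.code_pages = set()   # pages made read-only to USR mode
        self.aborts = 0           # number of data aborts taken
        self.invalidated = []     # addresses whose translations were dropped

    def jit_saw_instruction(self, addr):
        # Done once per page: from here on the page stays USR read-only.
        self.code_pages.add(addr // PAGE_SIZE)

    def user_write(self, addr, value):
        if addr // PAGE_SIZE in self.code_pages:
            self._abort_handler(addr, value)  # the MMU would raise the abort
        else:
            self.words[addr] = value

    def _abort_handler(self, addr, value):
        self.aborts += 1
        self.words[addr] = value       # privileged code performs the write
        self.invalidated.append(addr)  # stale JIT output must not be reused
```

Note that the page attributes never change back in this model, which matches the description: the cost per self-modifying write is one abort and a targeted invalidation, not a page table update.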

To work around privileged code writing to page zero etc, I'm planning to paravirtualize the CPU mode so all code runs in User and page tables/registers are switched as the guest app switches CPU mode. I probably won't implement this until there are some additions to RISCOS though, as I need to future-proof switching L1/L2PT in a way that won't break the host RISCOS.

Control of the emulation speed - I'm interested to know what methods you used here. As mentioned above, ADFFS in the most part stalls at OS_Byte 19 / 112 and OS_Word 21, 0 although I'm looking at other techniques for games that rely on slow processor speed alone.
To be honest, I don't think I put too much effort into that, adopting the attitude that well-written games would be VSync-driven using those OS_Byte calls, and that provided suitable (/close-enough) screen modes are employed, with the appropriate frame rate, most games should be okay.
Approaching from Arthur/ARM2 based games I got a nasty shock here, so had to design it in from day 1. ARM3 games do tend to use VSync but a lot of ARM2 ones didn't bother because the CPU ran at a known speed and wasn't particularly fast.

All the games I've yet to slow down are pre-ARM3 games from 1987/88.

This is something I plan to implement via a RISCOS Hypervisor.
I wish you luck and hope that you have the time. I, alas, do not have the luxury of lavishing extended amounts of time on RISC OS software development, although I too have an interest in hypervisors, having just been reading up on the features of the later ARM architecture versions, including their VM support.
I can't say I'm that enthusiastic about ARM's virtualization efforts to date. It's been somewhat of a scatter-shot approach, with features being added then deprecated in successive ARM revisions. v8.1 sounds like it may finally be a viable solution, but it doesn't really help RISCOS as there's unlikely to be an ARMv8.1 machine any time soon.

My target from day one has always been the Pi - purely because it's cheap, so ARMv7 is the best I can hope for at the top end. Realistically though, I'll target ARMv4 and optimize for ARMv7 where possible.

If a Pi comes out with ARMv8.1 - then I'll look at upping the minimum requirement.
 
Adrian Lees Message #123544, posted by adrianl at 08:56, 1/4/2015, in reply to message #123543
Member
Posts: 1565
It's undoubtedly my code, but I can't fathom why it only affects the Iyonix. It's the same codebase across StrongARM, Iyonix and Pi, so they should all act the same.
The XScale CPU has 32KB each for I and D cache, and the other two have only 16KB each. The cache lines may also be longer, but I don't recall off-hand. Is there anything in your code that assumes the cache properties such as rounding of addresses?

One other potential pitfall related to self-modifying code is that the Intel-designed XScale CPU has a much longer pipeline than ARM-designed CPUs.

As you point out, IOC/IOMD overlap so I probably don't need to add much specific IOMD support to get later software running.
That's understating it somewhat. IOMD is custom logic that implemented IOC and the DRAM-related MEMC functionality (which the ARM610/other does not provide) but not the logical->physical address mapping of MEMC. Thus the required software effort was reduced, for RISC OS and podule drivers alike.

Back in 1988/89 I received a lot of assistance from Acorn/VLSI whilst coding Podule drivers and sample playback routines, so have an original 1st print of the VLSI manual.
That predates me even encountering an Archimedes machine :-) I think I briefly encountered one machine running Arthur, but I never really used anything before RISC OS 2, and only got my A3000 in 1990.

If a Pi comes out with ARMv8.1 - then I'll look at upping the minimum requirement.
I wouldn't hang around for that; that would seem to be serious overkill for the current applications of the Pi, and I think the required development work would be prohibitive. You are probably aware that the majority of the engineers behind the VideoCore IV GPU software/hardware have dispersed to other companies.
 
Jon Abbott Message #123549, posted by sirbod at 06:20, 6/4/2015, in reply to message #123544
Member
Posts: 563
The XScale CPU has 32KB each for I and D cache, and the other two have only 16KB each. The cache lines may also be longer, but I don't recall off-hand. Is there anything in your code that assumes the cache properties such as rounding of addresses?
At the minute, it's hardcoded to 32 byte cache lines. I'll redress this once RISCOS has been extended to provide it via an SWI.

StrongARM/80321/ARM11 are all 32 byte lines in the Acorn world and I've not looked at the Pi2 yet to see if it's changed, but suspect not. I have one here which I purchased a few weeks ago specifically to look at supporting it, but I need to code up the mis-aligned Abort handler before I can test it properly.
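For what it's worth, the line size can be kept as a parameter rather than a constant with very little effort. A minimal sketch of rounding a modified byte range to whole cache lines follows; the function name is made up, and the default of 32 simply mirrors the hardcoded value mentioned above rather than anything queried from the OS.

```python
def cache_lines(start, length, line_size=32):
    """Return the base address of every cache line touching [start, start + length)."""
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    if length <= 0:
        return []
    mask = ~(line_size - 1)
    first = start & mask                # line holding the first byte
    last = (start + length - 1) & mask  # line holding the last byte
    return list(range(first, last + 1, line_size))
```

A write of 8 bytes at &801C straddles two 32-byte lines (&8000 and &8020) but only one 64-byte line, so code that assumes the smaller size would still be safe on a longer line, whereas the reverse assumption would miss a line.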

As you point out, IOC/IOMD overlap so I probably don't need to add much specific IOMD support to get later software running.
That's understating it somewhat. IOMD is custom logic that implemented IOC and the DRAM-related MEMC functionality (which the ARM610/other does not provide) but not the logical->physical address mapping of MEMC. Thus the required software effort was reduced
I was referring specifically to the IOC overlap with IOMD here: provided RiscPC software doesn't use IRQC/D etc, the IOC registers match.

I've yet to implement L2PT translation for MEMC as I've yet to encounter any software that modifies it. The only translation happening at the minute is for DA2, which is always at physical 0 so doesn't really require any effort to translate.

If a Pi comes out with ARMv8.1 - then I'll look at upping the minimum requirement.
I wouldn't hang around for that; that would seem to be serious overkill for the current applications of the Pi, and I think the required development work would be prohibitive.
I'll probably be grey before the Pi catches up with ARM. Everything I've coded to date is targeted at the StrongARM and then has minor additions for post StrongARM. ADFFS itself has four builds: IOC, IOMD 26bit, IOMD 32bit and non-IOMD. They're all the same code base, differences really only being the JIT and IOC/VIDC/MEMC translation being in IOMD up and the blitter in non-IOMD.

So far as CPU-specific optimizations go, I've only used two: invalidating individual I lines in the Abort handler and pre-loading the D cache on ARMv4+.

[Edited by sirbod at 01:29, 11/4/2015]
 
